System Monitoring – Part 2 : Basic Architecture

This is the second article in a series on application systems monitoring for software developers. In the first article I discussed the basic ideas and concepts around systems monitoring. In this second article I will go over the basic architecture of how an application monitor can work. The system will start out simple and be extended over time. This iterative approach to building up the design mirrors how I implemented this type of system for my current employer. In true agile style I wanted to get something basic working as quickly as possible so we could start getting the benefit from it early on. Once the system was out there and working, I built upon the basic idea with new features.

Use Case

Use Case Diagram for a Basic System Monitor.


First of all, let’s look at the basic use case diagram above. This diagram shows two subsystems: the monitor sensors and the monitor dashboard. This article will focus on the first subsystem, Monitor Sensors. At this stage in the system’s evolution the main actors are Developers and Technical Support.

Roles may differ in your own organisation, but because I work in Financial Services we have to have a clear separation of concerns between the development infrastructure and the production infrastructure. This means developers cannot directly deploy to or modify a production environment. This is for good reason; we are very good at breaking stuff! This is why we, like many other large organisations, have a separate technical services team that maintains the production environment.

That means both developers and technical support have a vested interest in the results of the monitor. Just because developers can’t make changes to a production environment, there is no reason why they can’t see the telemetry data collected by the monitoring system. In fact, it would be a very bad idea if they couldn’t. By seeing how the production systems are running directly from a snapshot data stream, developers get a good impression of how their systems perform in a real production environment. This may well influence future design decisions when the systems are extended. It also gives developers a realistic impression of the volumes their systems cope with in real life. As we all know, a development test environment, no matter how hard we try, never matches what we see in a real production scenario with real users and data.

Layer Diagram

Layer Diagram for a System Monitor


Another view of this system can be expressed as a basic layer diagram. In the bottom half of the layer diagram we have the Data Collectors. This comprises a system monitor layer with dependencies on multiple sensors or monitors (I sometimes use the names sensor and monitor interchangeably, but I am referring to the same thing). This layer diagram represents the type of system that I built, which is a single process with multiple sensors executed serially. I will cover monitor types shortly.

The top half of the diagram represents the dashboard layer. The dashboard is an application or website that reads the telemetry data produced by the monitor. I will cover the dashboard in a future article.

Sensor Architecture

I am going to describe two basic architectures that you can use for your monitoring system. There are most probably lots more; there are, of course, a hundred different ways to skin a cat, but I like to keep things simple. The first option below is the option I went with, One Process – Multiple Monitors. After that I make things even easier and describe One Process – Single Monitor.

One Process – Multiple Monitors

Class Diagram for One Process – Multiple Monitors


The class diagram above is a simple example of what this system looks like. Each monitor derives from a common interface, IMonitor. In this example the monitor interface has two methods, ProcessLogs() and GetXmlElement(), and there are two concrete classes that implement IMonitor. The main Program object registers the two objects in a Dictionary. Once they are registered, Program calls RunMonitors() on the MonitorRegister object. This iterates through the monitors registered in the Dictionary and calls ProcessLogs() on each object. It is this method that initiates the monitoring for that particular sensor, which may mean parsing logs or extracting information from a database. Once ProcessLogs() has finished running, the MonitorRegister calls GetXmlElement() on the monitor object. This returns an XML fragment containing the data for that monitor.
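The original system is a C#/.NET application, but the shape of the interface can be sketched in Python. The class and attribute names here are illustrative assumptions, not the author's actual code:

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET

class IMonitor(ABC):
    """Common interface for all sensors (mirrors the IMonitor interface above)."""

    @abstractmethod
    def process_logs(self) -> None:
        """Do the work for this sensor, e.g. parse logs or query a database."""

    @abstractmethod
    def get_xml_element(self) -> ET.Element:
        """Return an XML fragment containing this sensor's snapshot data."""

class LoansImporterMonitor(IMonitor):
    """A hypothetical concrete monitor, standing in for one of the classes in the diagram."""

    def process_logs(self) -> None:
        # A real sensor would parse the importer's log files here.
        self.loans_transferred = 100

    def get_xml_element(self) -> ET.Element:
        root = ET.Element("LoansImporter")
        ET.SubElement(root, "NumberOfLoans").text = str(self.loans_transferred)
        return root
```

Each new sensor is then just another small class implementing the same two-method contract.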

Once the MonitorRegister object has finished executing all the monitors, it calls SendEmail() to send a summary report to a chosen distribution list and WriteXmlLog() to output the collected XML fragments as a single XML file.
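A minimal Python sketch of what that MonitorRegister loop might look like, with a stub monitor so it can run standalone. The method names, the "SystemSnapshot" root element, and the stub are all illustrative assumptions:

```python
import xml.etree.ElementTree as ET

class HeartbeatMonitor:
    """Tiny stand-in sensor so the register can be demonstrated."""
    def process_logs(self):
        self.status = "OK"
    def get_xml_element(self):
        el = ET.Element("Heartbeat")
        el.text = self.status
        return el

class MonitorRegister:
    def __init__(self):
        self._monitors = {}  # the Dictionary of registered monitors

    def register(self, name, monitor):
        self._monitors[name] = monitor

    def run_monitors(self):
        """Serially run each registered monitor, then collect its XML fragment."""
        fragments = []
        for monitor in self._monitors.values():
            monitor.process_logs()
            fragments.append(monitor.get_xml_element())
        return fragments

    def write_xml_log(self, fragments, path):
        """Wrap all fragments in a single root element and write one XML file."""
        root = ET.Element("SystemSnapshot")  # root element name is an assumption
        root.extend(fragments)
        ET.ElementTree(root).write(path)
```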

This can all be summed up in the following sequence diagram.

One Process – Multiple Monitors Sequence Diagram


For my own particular implementation, I kept the whole thing single threaded. All of the monitors I have implemented take less than a minute to fully execute and send out the summaries, so I didn’t see the point in over-complicating things, but running these monitors in concurrent threads is an option should you require it. I think the following image sums up the intention of the architecture perfectly.
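If your sensors are slow and independent of each other, the concurrent variant could look something like this Python sketch (the stub monitor and function name are illustrative):

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

class StubMonitor:
    """Tiny stand-in sensor used to demonstrate the threaded variant."""
    def __init__(self, tag):
        self.tag = tag
    def process_logs(self):
        self.done = True  # a real sensor would parse logs here
    def get_xml_element(self):
        return ET.Element(self.tag)

def run_monitors_concurrently(monitors):
    """Run each monitor's log processing in its own worker thread, then
    gather the XML fragments once all have finished."""
    with ThreadPoolExecutor(max_workers=max(len(monitors), 1)) as pool:
        futures = [pool.submit(m.process_logs) for m in monitors]
        for future in futures:
            future.result()  # re-raises any exception from inside a sensor
    return [m.get_xml_element() for m in monitors]
```

For monitors that finish in under a minute total, the serial version is simpler and the results are identical.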

System Monitoring - Information Overload

Taken by Marina Noordegraaf

At the top of the sieve we have lots of information flowing in from each of our system monitors and at the bottom we have one stream of information flowing out.

One Process – Single Monitor

One Process – Single Monitor


Another way of writing the sensors is to have one process with a single monitor, which basically means each monitor is a separate executable. I didn’t really see the need for this in my own system, but it could make sense for you. I preferred having one application handle the execution of all the monitor objects and then collate the information together.

One way this may make sense is if the systems you are monitoring have to be physically separated on different networks, and the system that controls the monitors cannot physically access the logs for another system. You could either use this simpler model, or multiple instances of the single process, multiple monitor approach. The rest of this series will be based around the single process, multiple monitor model.

Scheduling the Monitors

When you have a basic set of monitors developed, you need to decide how and when they are going to run. You could write scheduling into the monitor process itself, but I don’t see the point in re-inventing the wheel. Windows has a perfectly good task scheduler built in, and this is what I used for my own monitoring solution.

Windows Task Scheduler for Executing a System Monitor


As I mentioned previously, I opted for the single process, multiple monitors model. This gives me a nice, neat executable that I can plug into the task scheduler. The task scheduler can also pass command line arguments into the application, and this enables you to have different modes of operation. The modes I have are:

  • Silent: This is a silent running mode. The monitor will not send out an email summary report, but it will write out the snapshot XML file. I currently have this mode set to run every 15 minutes, 24 hours a day.
  • Email Report: This mode does the same as the silent mode above, except that after it has finished running it sends out the summary report to key stakeholders. I currently have this running every hour between 5:30am and 7:30pm, 7 days a week.
  • End of Day Run: This is a special mode that is triggered to run at 23:55 every day. This end of day run appends to a set of CSV files which I use to drive some of the charting in the dashboard. I will be covering the dashboard in a future article.
  • Send Alerts: This mode runs the monitor, but also sends out alerts to different stakeholders if certain thresholds are breached. I will be covering alerting in more detail in a future article.
  • Get Replicated Data: This mode is specific to the legacy architecture at the company I work for. Essentially it gets product performance metrics from the previous day and is used as the basis for a management information summary report. The replicated data part isn’t relevant for this set of articles, but the idea of producing management reports is quite relevant and will be covered in a future article.
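Mapping a scheduler argument onto a run mode is a one-liner with standard argument parsing. This Python sketch uses illustrative flag spellings; the original application's actual arguments are not shown in the article:

```python
import argparse

def parse_mode(argv):
    """Map a command line argument from the task scheduler onto a run mode."""
    parser = argparse.ArgumentParser(description="System monitor")
    parser.add_argument(
        "--mode",
        choices=["silent", "email-report", "end-of-day", "send-alerts"],
        default="silent",
        help="How this scheduled run should behave",
    )
    return parser.parse_args(argv).mode

# Each scheduled task then passes a different argument, e.g.:
#   monitor --mode silent          (every 15 minutes)
#   monitor --mode email-report    (hourly, 5:30am to 7:30pm)
#   monitor --mode end-of-day      (23:55 daily)
```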

The beauty of using a task scheduler is you have complete control over how and when your monitors run, and you can tweak the timings as much as necessary.

Monitor Outputs

The final piece I wanted to cover in this article was an overview of the output from the monitor. As mentioned above in the running modes, the system sends out an email summary report and writes out an XML snapshot file.

Summary Report

When the report is run in ‘Email Report’ mode, a concise summary of the key facts from the monitor gets emailed out to a set distribution list. The majority of the people on that list check the report on a BlackBerry smartphone, so the formatting is kept simple. In this file you can see an example report that is sent out.

Disclaimer: This example report doesn’t contain details of all my monitors. Some of the monitor names have been changed and the values in the report are made up. This is just an example to give you an idea.

The key with a summary report is not to drown the recipient in information. You just want to cover key facts and errors. If the recipient sees anything odd, then they can contact the relevant people. If the report is too long people will stop reading, so you want to make it easy enough to glance over quickly. In the past I have spotted many oddities in our system by just glancing over the report when it comes through every hour and then firing off an email to someone in my team to check something out. It really does work well as an early warning system.

Another benefit is that as you glance over these reports you will start to get a good feel for how your systems perform and work each day. You will start to recognise patterns in their operation. This works as a great tool for getting people in your team and business to appreciate how each system in your production environment fits together in the grand scheme of things.

For example, take the following report:

Loan Importer Report
————
Start Time: 03/01/2013 00:05:09
Stop Time: 03/01/2013 00:08:57
Elapsed Time in Minutes: 3.8
Number of Loans Transferred: 100
Amount of Loans Transferred: £100000.00

Get Loans Errors
   No Errors Reported.

Insert Debt Errors
   No Errors Reported.

From this, over time, you would be able to recognise certain patterns. If the ‘Elapsed Time in Minutes’ averages 5 minutes on a normal day, you would think it odd if it was 45 minutes one day. It would make you ask yourself questions. Why is it 45 minutes? Is it due to the number of loans being imported? Is the load high enough to warrant 45 minutes? If not, could the server be struggling? This might prompt you to talk to your technical services team. This is just a simple example, but it gives you an idea of what I am talking about.
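That kind of gut-feel sanity check can even be automated. A tiny Python sketch; the threshold factor of 3 is an arbitrary illustration, not something from the original system:

```python
def elapsed_looks_odd(elapsed_minutes, recent_elapsed, factor=3.0):
    """Flag a run whose elapsed time is well above the recent average.

    recent_elapsed is a list of elapsed times from recent normal runs;
    a factor of 3 is an arbitrary illustrative threshold.
    """
    baseline = sum(recent_elapsed) / len(recent_elapsed)
    return elapsed_minutes > baseline * factor
```

Checks like this are the seed of the alerting mode mentioned above: the same comparison, but with an email or page attached to the True branch.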

This is just a summary report of course; if you want more detailed information then you would go and look at the dashboard (covered in a different article).

Xml Snapshot File

The other type of file that the monitor process outputs is an XML snapshot file. Think back to the class diagram for the single process, multiple monitor design.

System Monitor: one process – multiple monitors


Each monitor that implements the IMonitor interface provides two methods, ProcessLogs() and GetXmlElement(). ProcessLogs() is the method that does all the work of parsing log files and so on. GetXmlElement() returns an XML fragment containing a snapshot of the data for that particular sensor.

If we think of our loan importer example from above, that may look something like the following:

  <LoansImporter>
    <StartTime>02/01/2013 00:05:43</StartTime>
    <StopTime>02/01/2013 00:28:12</StopTime>
    <ElapsedTimeInMinutes>22.48</ElapsedTimeInMinutes>
    <NumberOfLoans>1000</NumberOfLoans>
    <AmountOfLoans>100000.00</AmountOfLoans>
    <GetLoansErrors />
    <InsertDebtsErrors />
  </LoansImporter>

Each XML element fragment from each monitor is collected together and put into a master XML file. This XML file represents a complete snapshot of your system at that particular time. Here is a more complete example XML file.

Disclaimer: the same caveats apply as before; the monitor names have been changed and the values are made up.
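Once the fragments are combined into a master snapshot, a dashboard or script can read values straight back out of it. A Python sketch over a hypothetical snapshot shaped like the loan importer fragment:

```python
import xml.etree.ElementTree as ET

# A hypothetical master snapshot, shaped like the loan importer fragment.
SNAPSHOT = """
<SystemSnapshot>
  <LoansImporter>
    <ElapsedTimeInMinutes>22.48</ElapsedTimeInMinutes>
    <NumberOfLoans>1000</NumberOfLoans>
  </LoansImporter>
</SystemSnapshot>
"""

def read_elapsed_minutes(snapshot_xml):
    """Pull one value back out of the master snapshot, as a dashboard might."""
    root = ET.fromstring(snapshot_xml)
    return float(root.find("LoansImporter/ElapsedTimeInMinutes").text)
```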

How you manage your snapshots is up to you. You may only have one snapshot per day, replaced each time the monitor is run (this is currently how my system works). This may be fine for most situations. If you have a dashboard (covered in another article) then that dashboard may only need the latest data. This means that at the end of each day you have a complete snapshot of that day.

Your requirements may call for more granularity, in that you require a new file to be written each time the monitor runs so you can see how the data builds up over the day. This can create a lot of data though. If, for example, you run the monitor every 15 minutes (as I do), that would result in 96 files per day. It all depends on your requirements.
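The two approaches differ only in how the snapshot file is named. A small Python sketch; the naming convention is illustrative:

```python
from datetime import datetime

def snapshot_filename(now=None, granular=False):
    """One file per day that is overwritten on each run, or a new
    timestamped file per run (up to 96 per day at 15-minute intervals)."""
    now = now or datetime.now()
    if granular:
        return now.strftime("snapshot_%Y%m%d_%H%M.xml")  # one per run
    return now.strftime("snapshot_%Y%m%d.xml")           # replaced each run
```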

You may decide that you do not want to work with XML files at all and instead want to place the data into a database. This is of course perfectly fine. It is actually a future upgrade I plan to make to my own system: I am planning on going from XML files to storing the data as JSON documents in a NoSQL document store like RavenDB or MongoDB. I will write about that nearer the time when I start this piece of work.

That concludes this article in my system monitoring series. I hope you have found it useful. If you have any comments / suggestions, then please do leave a comment on this post. I don’t consider these articles as the finished piece of work. They are a starting point and it would be great for people to collaborate on them.

Part 3 of this series on systems monitoring is about how to deploy your monitoring system.
