This is the first part of a series of articles about systems monitoring for software developers. I am writing about this as it is something I have been working with a fair amount over the last six months in my current role here in the UK.
The company I work for is pretty typical for a large firm. We have a mix of different legacy systems that we have to keep running while we try to update them and integrate new systems. These systems vary in quality, and the majority of the original developers no longer work for the company. When any of these systems goes wrong, it can be quite difficult to diagnose the problem.
As an example, we have one particular system that handles file synchronisation between multiple sites and runs overnight. This system does write out log files, but they are the most unfriendly log files I have ever seen. For a start, they are so large that people no longer bother to look at them. Even if you do open these files, the formatting is so strange and complicated that you really struggle to see what is going on, which only reinforces the habit of ignoring them. This particular system is one of the more extreme cases. Other systems operate a mix of writing data to log files and adding huge amounts of data to a logging database. In some cases the amount of data logged is quite extreme, which makes searching for meaningful information in the event of a system failure extremely difficult. Even more so if you are under pressure during a system outage that is having a direct revenue impact on the business.
In the rest of this series, I will talk about the system I designed and developed, covering both the design and the implementation. I will also talk about some of the problems I faced, and how this system has helped avert a few major live incidents. For obvious reasons I can't discuss actual systems and internal processes at my company, as I don't think they would like that too much, so some details have had to be changed or omitted, but that really doesn't matter for the purpose of this article. The intention is to discuss the architecture of the system. I hope this series will help you think about the monitoring needs of your own organisation.
Why Have Monitoring?
Having the ability to monitor your systems is very important. I personally feel quite uncomfortable if I can't get an instant insight into what a particular system is doing. The system I developed takes its data from two sources. Where possible, if there are log files for a system, I will parse data for the current day from the logs. When log files are not available, I will take data from a database. In some cases I do both.
When writing a monitor for a particular system, especially a legacy system that already exists, you try to use any information you can. For example, you may want to know about any errors and exceptions, or you may want to parse out statistical information: how many successful payments have been taken today, and for how much? How many customers' loans have fallen into arrears? How many credit scores have been requested?
System errors and exceptions are an obvious place to start. Gather all the error information and then notify someone. How many errors happen on each day? Do certain days have more errors than others? Can you spot any trends? You can only answer these questions when you have the data laid out in front of you in a sensible, clean format, as opposed to a huge 200 MB log file split across nine servers.
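The per-day error counting described above can be sketched in a few lines. This is a minimal illustration, not the system from this series, and the log line format it assumes (a timestamp followed by a severity level and a message) is hypothetical:

```python
import re
from collections import Counter

# Assumed (hypothetical) log line format:
#   "02/01/2013 08:02:57 ERROR Something went wrong"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}) \d{2}:\d{2}:\d{2} (\w+) (.*)$")

def errors_per_day(lines):
    """Count ERROR-level entries per day across any iterable of log lines.

    Because it takes an iterable, you can chain together line streams
    from many files (or many servers) and get one combined count.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "ERROR":
            counts[match.group(1)] += 1
    return counts
```

Pointing this at a day's worth of files immediately answers "do certain days have more errors than others?" without anyone opening a 200 MB file by hand.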
When you have been running a monitoring solution over a set of systems for a while, the information you record gives you valuable insight into the business and how your systems affect it. I am not saying this would replace any existing business intelligence tools that you use, but it gives you an alternative view of what your critical systems are trying to tell you.
Once I had developed monitors for all of our company's key systems, I could see how different systems affect each other. My company is a consumer finance company, so part of our business is lending money to people. We have IT solutions that cover the entire business process, from money being loaned out through to customers entering a debt collection process. As the monitors collect data over a period of time, you can see how a change in a system at the front end of the process affects systems at the other end. This ended up becoming an invaluable tool for my team, as hardly anyone had a complete, joined-up view of what all the systems did and how they affected each other.
If, by parsing log files and mining information out of the database, you can capture key business metrics like revenue, costs, and account processes, then you can even start to use this information to map out your company's system value stream, but that is a topic for another article.
3rd Party Systems
There are numerous 3rd party systems out there for monitoring; we even use some ourselves at work, but these tend to monitor at the infrastructure level. They may, for example, check network traffic, health-check particular servers, or look for CPU spikes. The monitors I am talking about in this article deal with the applications that run on those servers, applications that would most likely have been developed in-house. There may well be 3rd party systems on the market that you can use for log file monitoring, but I think it is actually quite beneficial if you and your team develop your own as part of your application development process. This helps drive out issues like the quality of your log files and the type of data that is logged. If you are trying to write a parser for a log file and you are finding it really difficult, that is telling you that your log format is too complicated and could be simplified.
Legacy versus New Systems
Most companies, especially if they have been around for a while, will have old legacy systems, perhaps written in some arcane language like Delphi or Visual Basic 6. It is normally these legacy systems that are the backbone of the company, even though developers and tech support analysts seem to be terrified of them. I have some legacy systems myself whose very names give my developers heart palpitations. When I started building a monitoring solution for my company, it was some of these legacy systems that I decided to tackle first. Some systems were easier than others to write monitors for, and they actually output half-decent log files, which made my life easier. Some of the other systems were an absolute nightmare. One particular system, which shall remain nameless, replicates information from our large retail store network to head office. Its log file required me to chop backwards and forwards through the file, matching up information to get a more complete picture. As you can imagine, these files were not particularly useful, especially when the system went wrong and we ended up in a support scenario, so getting a monitor around them was very useful, especially as the appetite to change or update the system was minimal!
The picture is a little different for new systems that you are developing. As the code is new, or at least relatively up to date, you are more in control of what you log and in what format. This will be the subject of another article on logging formats and standards.
Monitoring as Part of your Development Process
I mentioned earlier that I think it's important to include the development of system monitors as part of your development process. The reason is to get your developers to understand the importance of accurately reporting information in a clear and intuitive format. Log files are not only used for monitoring purposes. When there is a system outage or incident, the log files are normally your first port of call for working out what is going on. Some poor engineer, maybe even you, will have to look at these log files whilst under pressure at 3 in the morning (why do these incidents always seem to happen at 3am?).
As you work on new systems or maintain older systems, think about logging as part of the development process. As you design how the system will hang together via object diagrams, deployment diagrams, and so on, also spend time thinking about how you will log and report information for the poor sod who will need it if there is an incident. You need to attack this from two angles:
- Are the log files easy to read by a person?
- Are the log files easy to parse by an application?
The file has to be readable by a person: the format needs to be consistent and uncluttered. If I need to chop through log files by hand, I tend to use a tool like Notepad++. The file also needs to be easy to parse by code. File parsing should not be difficult; if you find the file is getting complicated to parse, then the format is wrong. Having data points that are easily identifiable and extractable makes your life a lot easier.
Here is an example of a log entry that is easy to parse:
Timestamp: 02/01/2013 08:02:57
Message: PerformCreditSearchOperation request: CustomerCode: 0123456789B, CustomerId: 123456, ProductId: InstantLoan
Service Category: CreditScoringService
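Because each line of the entry above is a simple "Field: value" pair, a parser for it stays trivial. A minimal sketch (the function and field names are my own, not part of any real monitoring system):

```python
# Fields we expect in an entry; anything else is ignored.
ENTRY_FIELDS = ("Timestamp", "Message", "Service Category")

def parse_entry(entry_text):
    """Parse a multi-line 'Field: value' log entry into a dict.

    Splitting on the FIRST ': ' only means values may themselves
    contain colons (as the Message field above does).
    """
    record = {}
    for line in entry_text.splitlines():
        field, sep, value = line.partition(": ")
        if sep and field in ENTRY_FIELDS:
            record[field] = value.strip()
    return record
```

If the format ever forces this function to grow branches and special cases, that is the signal mentioned earlier: the log format, not the parser, is what needs fixing.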
This first article has been an introduction to the series on system monitoring. In the next article I will go into more details about an initial design, deployments and scheduling.
Part 2 of this series on systems monitoring is about the basic architecture.