This is the first part of a series of articles about systems monitoring for software developers. I am writing about this because it is something I have worked with a fair amount over the last six months in my current role here in the UK.
The company I work for is pretty typical for a large firm. We have a mix of legacy systems that we have to keep running while we update them and integrate new systems. These systems vary in quality, and the majority of the original developers no longer work for the company. When any of these systems goes wrong, it can be quite difficult to diagnose what the problem is.
As an example, we have one particular system that runs overnight and handles file synchronisation between multiple sites. This system does write out log files, but they are the most unfriendly log files I have ever seen. For a start, they are so large that people do not even bother to look at them anymore. Even if you do open them, the formatting is so strange and complicated that you really struggle to see what is going on, which is another reason people don’t bother. This particular system is one of the more extreme cases. Other systems write to log files, add huge amounts of data to a logging database, or both. In some cases the amount of data logged is so large that searching for meaningful information after a system failure is extremely difficult, even more so if you are under pressure from an outage that is having a direct revenue impact on the business.
In the rest of this series, I will talk about the monitoring system I designed and developed. I will cover its design and implementation, some of the problems I faced, and how the system has helped avert a few major live incidents. For obvious reasons I can’t discuss the actual systems and internal processes at my company, so some details have been changed or omitted, but that doesn’t matter for the purpose of this series. The intention is to discuss the architecture of the system. I hope this series will help you think about the monitoring needs in your own organisation.