networked day to day technical issues


What to monitor on a (Linux) server

It is surprisingly how many articles are out there about server monitoring, referring to how to use a specific tool, and the lack of sources of documentation regarding what you actually need to monitor from a best practices point of view.
A well monitored server allows to fix possible issues proactively or solve service interruptions a lot faster as the problem can be located faster and solved.

So here goes my list of things I always monitor, independent of actually what the specific purpose of the server is.

  • hardware status - if fans are spinning, cpu temperature, mainboard temperature, environment temperature, physical memory status, power source status, cpu's online. Most of the well know vendors (Dell, HP, IBM) provide tools to check the hardware for the above list of items
  • disk drive S.M.A.R.T. status - you can find out things like if the hdd is starting to count bad blocks or if the bad blocks are increasing fast which will give you a heads up that you need to prepare to replace the disk. Also most of the times you can monitor the HDD's temperature
  • hardware raid array status / software raid status - you really want to know when an array is degraded. Unfortunately most of the organization's don't actually monitor this
  • file system space available - I start with a warning when usage is at 80% and a critical alarm if usage is above 90%. For big filesystems ( >= 100G) of course this needs to be customized as 20% means at least 20G
  • inodes available on the file system - again I use the 80% warning, 90% critical . This is something which isn't always obvious (when you run out of inodes) and can create a whole of other problems. Of course it applies only to file systems which have a finite amount of inodes like ext2,3,4