networked day to day technical issues

3 Dec 2011

What to monitor on a (Linux) server

It is surprising how many articles about server monitoring focus on how to use a specific tool, and how few sources document what you actually need to monitor from a best-practices point of view.
A well-monitored server lets you fix potential issues proactively and resolve service interruptions much faster, since the problem can be located and addressed quickly.

So here is my list of things I always monitor, regardless of the server's specific purpose.

  • hardware status - whether fans are spinning, CPU temperature, mainboard temperature, ambient temperature, physical memory status, power supply status, CPUs online. Most well-known vendors (Dell, HP, IBM) provide tools to check the hardware for the above list of items
  • disk drive S.M.A.R.T. status - you can find out things like whether the HDD is starting to accumulate bad blocks, or whether the bad block count is increasing fast, which gives you a heads-up that you need to prepare to replace the disk. Most of the time you can also monitor the HDD's temperature
  • hardware RAID array status / software RAID status - you really want to know when an array is degraded. Unfortunately, most organizations don't actually monitor this
  • file system space available - I start with a warning when usage is at 80% and a critical alarm when usage is above 90%. For big file systems (>= 100G) this of course needs to be customized, as 20% free means at least 20G (a sketch of this kind of threshold check follows the list)
  • inodes available on the file system - again I use an 80% warning and a 90% critical alarm. Running out of inodes isn't always obvious and can create a whole set of other problems. Of course this only applies to file systems with a finite number of inodes, like ext2/3/4
  • system load average - as a rule of thumb I set a warning alarm at 1.5 x the number of CPU threads on the system and a critical alarm at 2 x CPU threads. Depending on the server's purpose this may need to be customized
  • swap usage - warning at 50% usage, critical at 70%
  • memory usage - I don't monitor this by default, as it is highly dependent on the server's purpose. If you do monitor it, be sure not to count memory used for disk caching (the kernel frees it automatically when memory is needed)
  • uptime lower than a day - this is a great indication that the system rebooted; otherwise you risk not noticing an unscheduled system restart, especially with VMs, which boot really fast as there is no actual POST to go through
  • network interface resets, errors, packet collisions, up/down changes, interface speed and duplex - any change in this list may be a good signal of trouble ahead. For example, servers mounting NFS-exported file systems have a hard time when interfaces flap
  • total number of processes and threads - this is dependent on your system (application, number of CPU cores, etc.) but definitely worth monitoring, as you want to know when the process count rises above a limit. Generally I start with a warning at 150 processes and a critical alarm at 180 for systems with up to 4 CPU threads
  • number of zombie processes - warning at 1, critical alarm at 5. Something is always wrong if you end up with zombie processes
  • check if syslog is running - just a simple check that the process is there, as it is really bad not to have it running
  • check if crond is running - again, things will slowly but surely start to go wrong if cron is stopped and regular maintenance tasks like logrotate and tmpwatch/tmpreaper don't run when scheduled
  • the number of running cron processes - warning if more than 5 are running at the same time, critical alarm if more than 10. This generally signals cron jobs that never finish running, due to badly written scripts or system issues
  • check if the ntp client is running - while this is not mandatory, it is generally a best practice to have a synchronized/accurate clock
  • out of band management running and reachable - this refers to things like HP's iLO, Dell's DRAC, Sun's LOM/ALOM/ILOM, IBM's RSA. It is really bad to discover during a server outage that you can no longer reach the server's out-of-band management because it is frozen, it is unreachable (network issues), or you don't even know how to reach it
  • smtp daemon running - if you have one (I always recommend one bound to the loopback interface on all servers, even if they don't provide email services), you should have a check that it is running and accepting connections on the loopback interface
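
The thresholds above are easy to turn into scripted checks. Below is a minimal sketch in Python of how a few of them (file system space and inodes, load average relative to CPU threads, zombie processes, uptime under a day) could be wired together; the paths, exact thresholds and output format are just assumptions for illustration, not a ready-made monitoring plugin.

    #!/usr/bin/env python
    # Minimal sketch of a few of the basic threshold checks described above.
    # Paths and thresholds follow the article's rules of thumb and are only
    # illustrative assumptions, not a finished monitoring plugin.
    import os
    import multiprocessing

    def pct_used(total, free):
        return 100.0 * (total - free) / total if total else 0.0

    def check_filesystem(path, warn=80, crit=90):
        """Warn at 80% usage, critical above 90%, for both space and inodes."""
        st = os.statvfs(path)
        space = pct_used(st.f_blocks, st.f_bavail)
        inodes = pct_used(st.f_files, st.f_favail)
        for name, used in (("space", space), ("inodes", inodes)):
            if used >= crit:
                print("CRITICAL: %s %s at %.0f%%" % (path, name, used))
            elif used >= warn:
                print("WARNING: %s %s at %.0f%%" % (path, name, used))

    def check_load():
        """Warning at 1.5 x CPU threads, critical at 2 x CPU threads (1 min avg)."""
        threads = multiprocessing.cpu_count()
        load1 = os.getloadavg()[0]
        if load1 >= 2 * threads:
            print("CRITICAL: load %.2f on %d CPU threads" % (load1, threads))
        elif load1 >= 1.5 * threads:
            print("WARNING: load %.2f on %d CPU threads" % (load1, threads))

    def check_zombies(warn=1, crit=5):
        """Count processes in state Z by reading /proc/<pid>/stat."""
        zombies = 0
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/stat" % pid) as f:
                    if f.read().split(")")[-1].split()[0] == "Z":
                        zombies += 1
            except IOError:
                pass  # process exited while we were looking
        if zombies >= crit:
            print("CRITICAL: %d zombie processes" % zombies)
        elif zombies >= warn:
            print("WARNING: %d zombie processes" % zombies)

    def check_uptime():
        """Alert if the machine has been up less than a day (likely reboot)."""
        with open("/proc/uptime") as f:
            up = float(f.read().split()[0])
        if up < 86400:
            print("WARNING: uptime is only %.1f hours" % (up / 3600))

    if __name__ == "__main__":
        check_filesystem("/")
        check_load()
        check_zombies()
        check_uptime()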

Once you monitor the basics, you need to check that the applications related to the server's specific purpose are running. So:

  1. check that the application's processes are running - for example Apache, MySQL, Memcached, Postfix, etc.
  2. check that you can connect to the service - for example, on an SMTP server check with an SMTP client that you can connect
  3. check that network-based resources are reachable - if, for example, your web server needs to connect to MySQL on another system, then check from the server running Apache that you can connect to MySQL on the other system. Checking MySQL locally on the other system doesn't guarantee Apache can reach it, as there can be configuration or firewall issues (a minimal connect check is sketched below)
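
As a rough illustration of points 2 and 3, here is a minimal TCP connect check in Python. The hostnames and ports are placeholder assumptions; a real check for MySQL or SMTP would ideally also speak enough of the protocol to confirm the daemon actually answers, not just that the port is open.

    # Minimal sketch of a remote connectivity check: verify from the Apache host
    # that MySQL on the database server (and the local SMTP daemon on loopback)
    # accepts TCP connections. Hostnames and ports are placeholder assumptions.
    import socket

    def can_connect(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            sock = socket.create_connection((host, port), timeout)
            sock.close()
            return True
        except (socket.error, socket.timeout):
            return False

    if __name__ == "__main__":
        checks = (("db.example.com", 3306, "MySQL from the web server"),
                  ("127.0.0.1", 25, "SMTP on loopback"))
        for host, port, label in checks:
            status = "OK" if can_connect(host, port) else "CRITICAL"
            print("%s: %s (%s:%d)" % (status, label, host, port))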

The last thing would be the "advanced" section, which is always hard to achieve. Here you need to check, using customized tools, that the application logic is working as expected. It is also worth monitoring things like I/O wait time, network latency, I/O throughput and network throughput. These are hard to monitor well, as you need working knowledge of the specific system and how it behaves under heavy load.
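
For I/O wait specifically, one rough starting point is the aggregate iowait counter in /proc/stat. The sketch below samples it over an interval and reports it as a percentage; the one-second interval and the warning threshold are arbitrary assumptions you would tune to the system in question.

    # Rough sketch of measuring CPU I/O wait from /proc/stat: sample twice and
    # report iowait as a percentage of total CPU time over the interval.
    import time

    def cpu_times():
        """Return the aggregate 'cpu' line from /proc/stat as a list of integers."""
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    def iowait_percent(interval=1.0):
        before = cpu_times()
        time.sleep(interval)
        after = cpu_times()
        delta = [a - b for a, b in zip(after, before)]
        total = sum(delta)
        # Field 5 of the cpu line (index 4 after the 'cpu' label) is iowait.
        return 100.0 * delta[4] / total if total else 0.0

    if __name__ == "__main__":
        pct = iowait_percent()
        print("I/O wait: %.1f%%" % pct)
        if pct > 20:  # arbitrary example threshold
            print("WARNING: high I/O wait")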