networked day to day technical issues


What to monitor on a (Linux) server

It is surprising how many articles are out there about server monitoring that explain how to use a specific tool, and how few sources document what you actually need to monitor from a best-practices point of view.
A well-monitored server lets you fix potential issues proactively, or resolve service interruptions much faster, because the problem can be located and solved quickly.

So here goes my list of things I always monitor, regardless of the server's specific purpose.

  • hardware status - whether fans are spinning, CPU temperature, mainboard temperature, ambient temperature, physical memory status, power supply status, CPUs online. Most of the well-known vendors (Dell, HP, IBM) provide tools to check the hardware for the above items
  • disk drive S.M.A.R.T. status - you can find out, for example, whether the HDD is starting to count bad blocks, or whether the bad block count is increasing fast, which gives you a heads-up that you need to prepare to replace the disk. Most of the time you can also monitor the HDD's temperature
  • hardware RAID array status / software RAID status - you really want to know when an array is degraded. Unfortunately, most organizations don't actually monitor this
  • file system space available - I start with a warning when usage is at 80% and a critical alarm when usage is above 90%. For big file systems ( >= 100G) these thresholds of course need to be customized, since the remaining 20% means at least 20G
  • inodes available on the file system - again I use the 80% warning, 90% critical. Running out of inodes isn't always obvious and can create a whole set of other problems. Of course this applies only to file systems with a finite number of inodes, like ext2/3/4
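The space and inode checks above are easy to script. Here is a minimal sketch against POSIX `df -P` output; the function name check_df is made up for this example, and the thresholds are the 80%/90% values from the list:

```shell
# Minimal sketch: parse POSIX `df -P` output and flag any filesystem
# crossing the 80% (warning) / 90% (critical) thresholds.
# The check_df name is invented for this illustration.
check_df() {
  awk 'NR > 1 {
    use = $5; sub(/%/, "", use)              # strip the trailing %
    if (use + 0 >= 90)      printf "CRITICAL %s at %s%%\n", $6, use
    else if (use + 0 >= 80) printf "WARNING %s at %s%%\n", $6, use
  }'
}

# The same awk logic covers both checks:
#   df -P  | check_df    # space usage
#   df -Pi | check_df    # inode usage
```

Hooking a function like this into cron or your monitoring agent of choice gives you the two filesystem alarms with almost no code.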

Linux: realtime traffic monitoring and path determination

There are situations when one needs to answer questions like:

- a) - which application/process is listening for inbound connections
- b) - which application/process is generating network traffic
- c) - which hosts are currently exchanging traffic with our server
- d) - what the current rate of traffic through the network interfaces is
- e) - how much traffic each workstation/server directly connected to the Linux server is generating
- f) - which path an outgoing packet will take when you have multiple network cards and several routes (and more than one routing table)
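As a rough orientation, here is one plausible tool per question; this mapping is my own suggestion, the interface name eth0 is an example, and nethogs/iftop are separate installs that need root:

```shell
# One possible command per question above; not an exhaustive list.
ss -tlnp                      # a) processes listening for inbound TCP connections
# b) per-process bandwidth usage (run as root; eth0 is an example name):
#      nethogs eth0
ss -tn state established      # c) hosts currently exchanging TCP traffic with us
# d) current per-interface traffic rate, one-second samples (sysstat):
#      sar -n DEV 1
# e) per-remote-host traffic on an interface (run as root):
#      iftop -i eth0
ip route get 127.0.0.1        # f) route lookup: substitute any destination IP to
                              #    see which route and interface the kernel picks
```

The `ip route get` lookup in f) is particularly handy with multiple routing tables, since it shows the kernel's actual decision rather than what you infer from `ip route show`.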


Visualize sar reports with awk and gnuplot

On several systems where only sar (part of sysstat) was collecting and storing performance data, I needed to troubleshoot performance issues that had occurred several hours earlier. Sar is a great tool, but it is annoying that it has no option to output different reports (like CPU usage, memory usage and disk usage) at the same time, on the same page. If you request those three at once, it prints each report on its own page, and from there it is hard to visualize how each performance indicator evolved at a specific point in time. A solution would have been to load the data into a spreadsheet application and use the VLOOKUP function to group the data, but this is time-consuming and, with my spreadsheet skills, I don't think it can be automated.
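The grouping step itself can be sketched with awk. This is only an illustration of the idea, not my actual script: it assumes two pre-cleaned files (headers stripped, column 1 the timestamp, column 2 the value), and the join_reports name is invented here:

```shell
# Sketch: merge two time-indexed report files on their first column,
# producing one table that gnuplot can plot directly.
# Assumes headers are already stripped and column 1 is the timestamp.
join_reports() {
  awk 'NR == FNR { val[$1] = $2; next }        # first file: remember value per timestamp
       ($1 in val) { print $1, val[$1], $2 }   # second file: print matched rows
  ' "$1" "$2"
}

# Usage: join_reports cpu.txt mem.txt > merged.txt
```

The `NR == FNR` idiom is awk's standard way to treat the first input file as a lookup table, which is essentially VLOOKUP without the spreadsheet.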

I used awk in order to create a report from the sar output, choosing the fields I consider useful in 95% of cases. Because my display resolution width is 900, I managed to squeeze in a lot of fields. In order to get a report for the 18th from 10 AM to 6 PM I use: