networked - day to day technical issues

3Dec/11

What to monitor on a (Linux) server

It is surprising how many articles about server monitoring are out there explaining how to use a specific tool, and how few sources document what you actually need to monitor from a best-practices point of view.
A well-monitored server allows you to fix potential issues proactively, or to resolve service interruptions a lot faster, because the problem can be located and solved more quickly.

So here goes my list of things I always monitor, regardless of what the specific purpose of the server is.

  • hardware status - whether fans are spinning, CPU temperature, mainboard temperature, ambient temperature, physical memory status, power supply status, CPUs online. Most of the well-known vendors (Dell, HP, IBM) provide tools to check the hardware for the above list of items
  • disk drive S.M.A.R.T. status - you can find out things like whether the HDD is starting to count bad blocks, or whether the bad-block count is increasing fast, which gives you a heads-up that you need to prepare to replace the disk. Most of the time you can also monitor the HDD's temperature
  • hardware RAID array status / software RAID status - you really want to know when an array is degraded. Unfortunately most organizations don't actually monitor this
  • file system space available - I start with a warning when usage is at 80% and a critical alarm if usage is above 90%. For big filesystems (>= 100G) this of course needs to be customized, as 20% means at least 20G
  • inodes available on the file system - again I use the 80% warning and 90% critical thresholds. Running out of inodes isn't always an obvious problem and can create a whole set of other problems. Of course, this applies only to file systems which have a finite number of inodes, like ext2, ext3 and ext4 (a small shell sketch covering several of these checks follows this list)
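
To make the last few items concrete, here is a minimal shell sketch of how such checks could look. The 80%/90% thresholds match the ones above; the excluded filesystem types, the /dev/sda device and the alert wording are my own illustrative assumptions, not part of the original post.

#!/bin/bash
# Sketch: filesystem space and inode checks with 80% warning / 90% critical thresholds
WARN=80
CRIT=90

check_df() {    # $1 = extra df flag ("" for space, -i for inodes), $2 = label
        df -P $1 -x tmpfs -x devtmpfs | tail -n +2 | while read fs total used avail pct mount ; do
                usage=${pct%\%}
                case $usage in *[!0-9]*|"") continue ;; esac
                if [ "$usage" -ge "$CRIT" ] ; then
                        echo "CRITICAL: $2 usage on $mount is ${usage}%"
                elif [ "$usage" -ge "$WARN" ] ; then
                        echo "WARNING: $2 usage on $mount is ${usage}%"
                fi
        done
}

check_df ""  "file system space"
check_df -i  "inode"

# Software RAID: a degraded md array shows an underscore in its status field, e.g. [U_]
grep -q '\[.*_.*\]' /proc/mdstat 2>/dev/null && echo "CRITICAL: degraded software RAID array"

# S.M.A.R.T. overall health via smartmontools; /dev/sda is just an example device
smartctl -H /dev/sda | grep -Eq 'PASSED|OK' || echo "WARNING: S.M.A.R.T. health did not pass on /dev/sda"

In practice a monitoring system would run checks like these on a schedule and alert on the output; the sketch only illustrates the thresholds and data sources.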
13Aug/11

KSM (Kernel Samepage Merging) status

KSM allows physical memory de-duplication in Linux, so basically you can get a lot more out of your memory at the expense of some CPU usage (because there is a kernel thread which scans memory for duplicate pages). Typical usage is for servers running virtual machines on top of KVM, but applications aware of this capability can also use it, even on OS instances which aren't VMs running on KVM.
The requirements are a kernel version of at least 2.6.32 and CONFIG_KSM=y. For more details you can check the official documentation and a tutorial on how to enable it.
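
As a quick sketch of what enabling it looks like via sysfs (the kernel config path and the tuning values below are illustrative assumptions, not recommendations from the tutorial mentioned above):

# Check that the running kernel was built with KSM support
grep CONFIG_KSM /boot/config-$(uname -r)
# Start the ksmd scanning thread (as root); echo 0 stops it, echo 2 stops it and unmerges all pages
echo 1 > /sys/kernel/mm/ksm/run
# Optional tuning: pages scanned per wake-up and sleep time between scans
echo 100 > /sys/kernel/mm/ksm/pages_to_scan
echo 20 > /sys/kernel/mm/ksm/sleep_millisecs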

Below is a small script (called ksm_stat) which I wrote to see how much memory is "shared" and how much memory is actually being saved by this feature.

#!/bin/bash
# ksm_stat - show how much memory KSM is sharing and how much it is saving
if [ "$(cat /sys/kernel/mm/ksm/run)" -ne 1 ] ; then
       echo 'KSM is not enabled. Run "echo 1 > /sys/kernel/mm/ksm/run" to enable it.'
       exit 1
fi
# pages_shared = KSM pages in use; pages_sharing = process pages mapped onto them
page_size=$(getconf PAGE_SIZE)
pages_shared=$(cat /sys/kernel/mm/ksm/pages_shared)
pages_sharing=$(cat /sys/kernel/mm/ksm/pages_sharing)
pages_unshared=$(cat /sys/kernel/mm/ksm/pages_unshared)
echo "Shared memory is $((pages_shared*page_size/1024/1024)) MB"
echo "Saved memory is $((pages_sharing*page_size/1024/1024)) MB"
if ! type bc &>/dev/null ; then
        echo "bc is missing or not in path, skipping ratio calculation"
        exit 1
fi
if [ "$pages_sharing" -ne 0 ] ; then
        echo -n "Shared pages usage ratio is "; echo "scale=2;$pages_sharing/$pages_shared" | bc -q
        echo -n "Unshared pages usage ratio is "; echo "scale=2;$pages_unshared/$pages_sharing" | bc -q
fi

Example from a machine where KSM has just been enabled, so it takes a while until all pages are scanned:

# ksm_stat
Shared memory is 67 MB
Saved memory is 328 MB
Shared pages usage ratio is 4.87
Unshared pages usage ratio is 17.04
#

20Feb/11

Visualize sar reports with awk and gnuplot

On several systems where only sar (part of sysstat) was collecting and storing performance data, I needed to troubleshoot performance issues which had occurred several hours earlier. Sar is a great tool, but it is annoying that it has no option to output, at the same time and on the same page, data from different reports (like CPU usage, memory usage and disk usage). If you request those three at the same time, it prints each report on its own page, and from there it is hard to visualize how each performance indicator evolved at a specific point in time. A solution would have been to load the data into a spreadsheet application and use the VLOOKUP function to group it, but this is time consuming and, with my spreadsheet skills, I don't think it can be automated.

I used awk in order to create a report from the sar output, choosing the fields I consider useful 95% of the time. Because my display resolution width is 900, I managed to squeeze in a lot of fields. In order to get a report for the 18th, from 10 AM to 6 PM, I use:
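
As a reference point for the command above, a plain sar query for that window looks roughly like this, assuming the default /var/log/sa location for the collected data; this is only a sketch of the time-window selection, not the combined awk report itself:

# CPU usage for the 18th, restricted to 10:00-18:00
sar -u -f /var/log/sa/sa18 -s 10:00:00 -e 18:00:00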