networked day to day technical issues


A Tool to Backup Files to Amazon S3

For the past year I've been working on and off on a little project to create a tool which:

  • runs on at least Linux, MacOS X and FreeBSD
  • lets you back up your files to Amazon S3, with optional server-side encryption (AES-256)
  • is cost effective for large numbers of files (the problem with tools like s3cmd or aws s3 sync is that they compare local files against metadata retrieved on the fly from AWS, and those API calls can get expensive)
  • is easy to install
  • provides meaningful error messages and the possibility to debug

I've ended up creating a tool called S3backuptool (yeah, not that original), which does all of the above and requires Python 2.7, PyCrypto and the Boto library to run.

Details are available on the project's page, and it can be installed from prebuilt packages (deb or rpm) for several Linux distributions, or from Python's PyPI for many more Linux distributions and OSes.

So far it's been quite the educational enterprise while also catering to my needs.

Metadata about all backed-up files is stored locally in SQLite database(s), and in S3 as metadata attached to each uploaded file. When a backup job runs, it compares the current state of the files with the state stored in the local SQLite database(s); actual S3 API calls (which cost money) are performed only when something on S3 needs to change. If the local SQLite databases are lost, they can be reconstructed from the metadata stored in S3.
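The comparison idea can be sketched roughly as follows (this is my own illustration, not S3backuptool's actual code): record each file's mtime and size in an SQLite table, and report only the files whose current stat output differs from the stored row, so unchanged files cost no S3 API calls at all. Assumes GNU stat and the sqlite3 command line client.

```shell
# Create the local state database (one table mapping path -> mtime, size).
db=state.db
sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS files(path TEXT PRIMARY KEY, mtime INTEGER, size INTEGER);'

# Print every file under $1 whose mtime/size differs from the stored state
# (or which has no stored state yet) -- only these need S3 calls.
changed_files() {
    find "$1" -type f | while read -r f; do
        cur=$(stat -c '%Y %s' "$f")
        old=$(sqlite3 "$db" "SELECT mtime || ' ' || size FROM files WHERE path = '$f';")
        [ "$cur" = "$old" ] || echo "$f"
    done
}
```

After a successful upload the tool would then upsert the fresh mtime/size into the table, so the next run skips the file.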


Secure and Scalable WordPress In the Cloud (Amazon S3 for content delivery and EC2 for authoring)

Several months ago I decided to move all of the stuff running on my server (a Droplet on Digital Ocean) to various cloud providers. My main motivation was that I no longer had time to manage my email server, which was made up of Postfix + Zarafa + MailScanner + SpamAssassin + ClamAV + Pyzor/Razor/DCC + Apache2 + MySQL. On top of that I was also dealing with monitoring and backups.
Anyway, moving the mail was easy, as there are plenty of mature cloud solutions.

With my blog (which I hadn't posted to in a long time) I decided to try something a bit more interesting: moving it to Amazon S3 as a static website.
In order to achieve this I had to solve the following:

  • convert WordPress from dynamically generated pages to static ones. This was easy using the plugin "WP Static HTML Output" which does what it says
  • find a solution for comments, since a static page can't accept them. The solution was to start using Disqus: I installed the plugin "Disqus Comment System", created a Disqus account and then used the plugin to import all of the comments stored in WordPress' database
  • find a solution for search. Again this was not hard: I moved to Google Search (plugin "WP Google Search")
  • once I had the above, I generated a static release, which was a .zip file
  • I've created an S3 bucket named after the site. The bucket must be named exactly like your site/blog hostname, and bucket names are unique across all of AWS S3, which means that if someone else already has a bucket with that name you're out of luck; your remaining option then is to use CloudFront together with a differently named S3 bucket

Fast MySQL database restore / import from full dump files

With MySQL Community Edition, in most cases you have two ways of creating a full database backup:

  • using the command line utility mysqldump, which works with both MyISAM and InnoDB tables while the database server is running
  • shutting down the MySQL server and copying the full data dir in the case of InnoDB databases, or just the database folder inside the data dir in the case of MyISAM-based databases

The full list of backup methods is available on MySQL's site.

While a binary backup is the fastest to "restore", it has limitations: if you use the InnoDB storage engine you have to restore the whole MySQL instance, not just a specific database, and you can only safely restore on the same MySQL version (though it may work on newer ones too).
On the other hand, a db dump created with mysqldump lets you restore only the database you need (or all of them, if you have a full dump of all databases), lets you restore on different MySQL versions as long as the required features are supported (when restoring on an older MySQL version), and is also the most disk-space-efficient way to restore (see how MySQL manages disk space for InnoDB tables).

The problem lies in the details: restoring a large dump created with mysqldump can take days (I've seen it with a 30GB dump file, which isn't even that large). The cause is that the dump file is a series of SQL statements, and each INSERT triggers an index update.

To speed up a dump file import as much as possible, apply as many of the following as you can: defer index updates, disable uniqueness and foreign key checks for the duration of the import, and wrap the whole import in a single transaction.
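A minimal sketch of those settings (my own illustration, not from the original post): wrap the dump in session-level statements that defer index and constraint work, producing a faster-to-import file that you then feed to mysql as usual.

```shell
# make_fast_dump IN OUT -- prepend settings that defer constraint/index work
# and append a final COMMIT, writing the result to OUT.
make_fast_dump() {
    in="$1"; out="$2"
    {
        echo 'SET autocommit=0;'
        echo 'SET unique_checks=0;'
        echo 'SET foreign_key_checks=0;'
        cat "$in"
        echo 'COMMIT;'
    } > "$out"
}
# then restore with: mysql db_name < fast_dump.sql
```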

    Filed under: Linux, mysql

    MySQL backups using mysqldump

    I keep encountering all sorts of bad, or at least suboptimal, attempts at doing MySQL full db backups. While it looks like a trivial task using the mysqldump tool, there are several things one needs to take into account:

    • if you back up all databases into one file (--all-databases), then when you need to restore only one database from the backup you will be in trouble: you either restore all databases on a staging server and afterward dump just the needed one, or remove all the other databases (but then the "mysql" database has still been changed), or use some tool which extracts just the needed database from the full dump (it's basically a text file, so you could script around it). Update: you can use mysql (mysql -D db_name -o < dump_file.sql) to restore a particular db from a dump done with --all-databases; just take care to have db_name created before attempting the restore
    • if you back up each database separately to its own dump file, then you will quickly learn that you should also have backed up the usernames and passwords which are allowed to access/modify the database
    • if you run a nightly cron script which creates the dump(s) and overwrites the previous night's backup file(s), you may learn the hard way that a 0-byte or incomplete dump will leave you not only without today's backup but possibly also without yesterday's valid one. So the advice here is to keep more than tonight's and/or yesterday's backup; I generally keep at least 7 of them if done nightly, as by the time someone realizes they need a backup a day may have passed. It is also recommended to have a real backup infrastructure in place (with an associated retention policy)
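The retention step above can be sketched like this (my own illustration; filenames and paths are assumptions): only prune old dumps when a non-empty dump from today exists, so a failed or 0-byte dump never costs you yesterday's good backup.

```shell
# prune_old_dumps DIR -- delete *.sql.gz dumps older than 7 days, but only
# when a non-empty dump named YYYY-MM-DD.sql.gz exists for today.
prune_old_dumps() {
    dir="$1"
    today="$dir/$(date +%F).sql.gz"
    if [ ! -s "$today" ]; then
        echo "no valid dump for today, keeping old files"
        return 1
    fi
    find "$dir" -name '*.sql.gz' -mtime +7 -delete
}
# typical nightly cron job (paths are examples):
#   mysqldump --all-databases | gzip > /var/backups/mysql/$(date +%F).sql.gz && prune_old_dumps /var/backups/mysql
```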
    Filed under: Linux, mysql

    What to monitor on a (Linux) server

    It is surprising how many articles about server monitoring cover how to use a specific tool, and how few sources document what you actually need to monitor from a best-practices point of view.
    A well-monitored server lets you fix possible issues proactively, or resolve service interruptions a lot faster because the problem can be located sooner.

    So here is my list of things I always monitor, independent of the specific purpose of the server.

    • hardware status - whether fans are spinning, CPU temperature, mainboard temperature, environment temperature, physical memory status, power supply status, CPUs online. Most of the well-known vendors (Dell, HP, IBM) provide tools to check the hardware for the above list of items
    • disk drive S.M.A.R.T. status - you can find out, for example, whether the disk is starting to count bad blocks or whether the bad-block count is rising fast, which gives you a heads-up that you need to prepare to replace the disk. Most of the time you can also monitor the disk's temperature
    • hardware raid array status / software raid status - you really want to know when an array is degraded. Unfortunately most organizations don't actually monitor this
    • file system space available - I start with a warning when usage is at 80% and a critical alarm when usage is above 90%. For big filesystems ( >= 100G) this of course needs to be customized, as 20% then means at least 20G
    • inodes available on the file system - again I use 80% warning, 90% critical. Running out of inodes isn't always obvious and can create a whole set of other problems. Of course, it only applies to file systems with a finite number of inodes, like ext2/3/4
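The space and inode checks with the thresholds above can be sketched as a small shell function (my own illustration, using POSIX df output):

```shell
# fs_usage_check MOUNTPOINT -- print OK/WARNING/CRITICAL based on whichever of
# block usage or inode usage is worse (warning at 80%, critical at 90%).
fs_usage_check() {
    mnt="$1"
    blocks=$(df -P "$mnt" | awk 'NR==2 { sub("%","",$5); print $5 }')
    inodes=$(df -Pi "$mnt" | awk 'NR==2 { sub("%","",$5); print $5 }')
    worst=$blocks
    # some filesystems report "-" for inode usage; ignore errors in that case
    [ "$inodes" -gt "$worst" ] 2>/dev/null && worst=$inodes
    if   [ "$worst" -ge 90 ]; then echo "CRITICAL ${worst}%"; return 2
    elif [ "$worst" -ge 80 ]; then echo "WARNING ${worst}%";  return 1
    else echo "OK ${worst}%"; fi
}
```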

    Multiple domain self-signed SSL/TLS certificates for Apache (name-based SSL/TLS vhosts)

    This is an old problem: how to have SSL/TLS name-based virtual hosts with Apache.
    The issue is that the SSL/TLS connection is established before Apache even receives an HTTP request. By the time Apache receives the request, the SSL connection is already established with a particular hostname - IP & SSL certificate combination, which means Apache can serve name-based virtual hosts only for that particular SSL/TLS certificate.

    There are two possible solutions here:

    • Multi-domain or wildcard SSL/TLS certificates. Those are certificates configured with more than one name, so you can create virtual hosts (in Apache's case) for those domains. This is fairly easy to set up, and at least for me it has worked fine in the past.
    • Server Name Indication (SNI), an extension to the SSL/TLS protocol which allows the client to indicate the desired hostname early in the handshake, so the server can supply the correct SSL/TLS certificate for it. The problem is that SNI is fairly new: little server-side software supports it, and client-side software also needs to be fairly recent. In the long run this is going to be the best solution, as it was designed to overcome this exact problem
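For the first option, a multi-name self-signed certificate can be generated with openssl's subjectAltName extension. A minimal sketch, assuming openssl >= 1.1.1 (for -addext); the hostnames are examples:

```shell
# make_san_cert OUT NAME1 [NAME2 ...] -- create OUT.key/OUT.crt, a self-signed
# certificate valid for every hostname given, via subjectAltName.
make_san_cert() {
    out="$1"; shift
    names=$(printf 'DNS:%s,' "$@"); names=${names%,}
    openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -subj "/CN=$1" -addext "subjectAltName=$names" \
        -keyout "$out.key" -out "$out.crt" 2>/dev/null
}
# make_san_cert mysite www.example.org blog.example.org
```

Each Apache SSL vhost for those hostnames can then point at the same key/certificate pair.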
    Filed under: Apache, Linux, ssl, Web

    KSM (Kernel Samepage Merging) status

    KSM allows physical memory de-duplication in Linux, so basically you can get a lot more out of your memory at the expense of some CPU usage (a kernel thread scans memory for duplicate pages). Typical usage is on servers running virtual machines on top of KVM, but applications aware of this capability can also use it, even on OS instances which aren't KVM virtual machines.
    The requirements are a kernel version of at least 2.6.32, built with CONFIG_KSM=y. For more details you can check the official documentation and a tutorial on how to enable it.

    Below is a small script (called ksm_stat) which I wrote in order to see how much memory is "shared" and how much memory is actually being saved by using this feature.

    #!/bin/sh
    if [ "`cat /sys/kernel/mm/ksm/run`" -ne 1 ] ; then
        echo 'KSM is not enabled. Run "echo 1 > /sys/kernel/mm/ksm/run" to enable it.'
        exit 1
    fi
    echo Shared memory is $((`cat /sys/kernel/mm/ksm/pages_shared`*`getconf PAGE_SIZE`/1024/1024)) MB
    echo Saved memory is $((`cat /sys/kernel/mm/ksm/pages_sharing`*`getconf PAGE_SIZE`/1024/1024)) MB
    if ! type bc >/dev/null 2>&1 ; then
        echo "bc is missing or not in path, skipping ratio calculation"
        exit 1
    fi
    if [ "`cat /sys/kernel/mm/ksm/pages_sharing`" -ne 0 ] ; then
        echo -n "Shared pages usage ratio is "; echo "scale=2;`cat /sys/kernel/mm/ksm/pages_sharing`/`cat /sys/kernel/mm/ksm/pages_shared`" | bc -q
        echo -n "Unshared pages usage ratio is "; echo "scale=2;`cat /sys/kernel/mm/ksm/pages_unshared`/`cat /sys/kernel/mm/ksm/pages_sharing`" | bc -q
    fi

    Example output from a machine where KSM has just been enabled (it takes a while until all pages are scanned):

    # ksm_stat
    Shared memory is 67 MB
    Saved memory is 328 MB
    Shared pages usage ratio is 4.87
    Unshared pages usage ratio is 17.04


    Upstart (System-V init replacement on Ubuntu) tips

    Since Ubuntu Server 10.04 LTS (lucid), Canonical's System-V init replacement, Upstart, has most of the init scripts converted to Upstart jobs. Upstart is event-based and quite different from SysV init, so one needs to adjust to its config file structure and terminology. It has actually been present in the server release since 8.04 LTS, but back then the init scripts were not converted to its format, so on the server release it didn't really matter that it had taken over SysV init.

    Reading the documentation is mandatory, but here are some quick tips for things I at least found difficult to discover on the project's website or in the man pages:

    The default runlevel is defined in /etc/init/rc-sysinit.conf, and of course it can be overridden on the kernel command line. /etc/inittab is gone and everything moved to /etc/init/, while legacy init scripts (not yet converted to the Upstart format) can still be found in /etc/init.d/, together with symlinks for converted init jobs.

    Managing jobs:  initctl start <job> / initctl stop <job> / initctl restart <job> / initctl reload <job>  ; Listing all jobs and their status: initctl list

    Now here comes the horror story: there seems to be no cli tool which lists what Upstart jobs will start in a particular runlevel, or better, what Upstart and /etc/rc*.d jobs will start in a runlevel. There are two GUI tools (jobs-admin and Boot-Up Manager) but no cli ones, so you are left to use things like sysv-rc-conf / chkconfig / update-rc.d for the legacy System-V /etc/rc*.d folders, while for Upstart jobs you have to manually inspect the files in /etc/init/, which is cumbersome since besides the runlevel entry you also need to take into account events/dependencies like net-device-up.
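A crude cli workaround for the missing tool (my own sketch): print each job in /etc/init/ together with its "start on" stanza, so at least the runlevel entries are visible at a glance (events/dependencies still need reading by hand).

```shell
# list_start_on [DIR] -- for every Upstart job file in DIR (default /etc/init),
# print the job name and its first "start on" line, if any.
list_start_on() {
    dir="${1:-/etc/init}"
    for f in "$dir"/*.conf; do
        [ -e "$f" ] || continue
        printf '%s: ' "$(basename "$f" .conf)"
        grep -m1 '^start on' "$f" || echo '(no start on stanza)'
    done
}
```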

    It seems Canonical thinks that nowadays a server sysadmin must also install the GUI tools in order to manage basic things like which services start with the server.


    Running Linux on Sparc hardware

    I have a SunFire V210 lying around, which I bought in order to learn Solaris and get accustomed to its hardware platform.

    Now I need to move a personal project from its current server to another one, and I thought I might put the Sun server to use. It turns out that not that many people run Linux on the Sparc architecture, and some things I expected to just work (like software raid) turned out to be complicated to set up.

    Linux distribution - you basically have two options if you're looking for an up-to-date, maintained distro: Debian or Gentoo. When I first installed this server (a few months ago) Squeeze was not out, and because I was looking for something with newer software I decided to go with Gentoo, despite the fact that I don't like compiling everything over and over again.

    Linux software raid - there isn't one good place to find the relevant information, so I had to read a lot through discussion lists, forums and blogs until I got it working as expected; after I finished my setup Debian Squeeze was released, and its installer does help a lot.


    How to find out all of the IP addresses of a Europe-based ISP

    You may want to block IP traffic from a particular Internet Service Provider for various reasons, for example because a lot of crawlers and spammers are hosted there.
    For Europe-based providers this can be done by querying the RIPE NCC database: "The RIPE Database contains registration information for networks in the RIPE NCC service region and related contact details". Registration there is mandatory, so the data is genuine.

    To query it, either use the web interface or, better, the whois Linux/*nix command line client. You first need to know the provider's AS (Autonomous System) number, which can easily be established if you know an IP address belonging to that provider:

    $ whois yyy.yyy.yyy.yyy | grep -i '^origin:' | awk '{print $2}'
    $ whois -h whois.ripe.net -- '-i origin ASxxxx' | grep '^route:' | awk '{print $2}'