One job for a client involves managing terabytes of logfiles. Each logfile can be about 9 GB in size, and a new one is created every few minutes. Needless to say, however big your storage is, it will always be hard to keep up with such a rate of usage.
I alleviate this by compressing each logfile with bzip2 before sending it to the storage server.
One problem, though: compressing a 9 GB logfile can take 2.5 hours! gzip compresses faster, but the resulting file is also about 2x bigger than the bzip2 output.
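If you want to check the size trade-off on your own logs before committing to either tool, a quick comparison along these lines may help. This is only a sketch: the function name `compare_sizes` is made up for illustration, and it assumes both `gzip` and `bzip2` are installed. It streams the compressed output to `wc`, so the input file is left untouched.

```shell
#!/bin/sh
# Hypothetical helper (not from any tool): print the size in bytes of FILE
# as-is, gzip-compressed and bzip2-compressed. FILE itself is not modified.
compare_sizes() {
    file=$1
    printf 'original %s\n' "$(wc -c < "$file" | tr -d ' ')"
    printf 'gzip     %s\n' "$(gzip -c "$file" | wc -c | tr -d ' ')"
    printf 'bzip2    %s\n' "$(bzip2 -c "$file" | wc -c | tr -d ' ')"
}
```

Calling `compare_sizes /var/log/syslog` prints three lines, one per variant. Prefixing the individual `gzip -c` or `bzip2 -c` commands with `time` gives the speed side of the comparison.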
So I looked around for a solution, and found mpibzip2: http://compression.ca/mpibzip2/
Now a 9 GB logfile can be compressed in just 15 minutes. That's roughly 10x faster! Not bad 🙂
mpibzip2 achieves this by spreading the compression work across a cluster of machines. Yes, we can now use a cluster to speed up compression (insert Beowulf cluster joke here) :-)
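The reason this kind of splitting can work at all is that bzip2 will decompress several compressed streams stored back to back as one file. The sketch below shows that idea on a single machine: cut the file into chunks, compress the chunks in parallel, then concatenate the results. It is an illustration of the principle only, not mpibzip2's actual code; the function name `chunked_bzip2` and the 1 MiB chunk size are assumptions, and it starts one `bzip2` per chunk, so it is meant for small test files.

```shell
#!/bin/sh
# Illustration only (hypothetical helper): compress FILE as independent
# chunks in parallel on ONE machine and join the resulting bzip2 streams.
chunked_bzip2() {
    file=$1
    work=$(mktemp -d)
    # Cut the input into 1 MiB pieces named chunk.aaaa, chunk.aaab, ...
    split -a 4 -b 1048576 "$file" "$work/chunk."
    for c in "$work"/chunk.*; do
        bzip2 "$c" &              # each chunk becomes chunk.XXXX.bz2
    done
    wait                          # block until every background bzip2 ends
    # The glob expands in sorted order, so chunks are joined in file order.
    cat "$work"/chunk.*.bz2 > "$file.bz2"
    rm -r "$work"
}
```

After `chunked_bzip2 /tmp/syslog`, running `bzip2 -dc /tmp/syslog.bz2` should reproduce the original file. A cluster tool hands the chunks to other machines instead of local background jobs.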
An OpenMPI cluster enables us to run a program simultaneously on multiple machines. OpenMPI is an open-source implementation of MPI (Message Passing Interface): http://en.wikipedia.org/wiki/Message_Passing_Interface
Setting up an OpenMPI cluster may seem daunting at first.
It turned out to be quite easy on Debian (squeeze or newer). Here is how:
=====================
###### MASTER ######
cd /tmp
apt-get install openmpi-bin build-essential libbz2-dev libopenmpi-dev
wget http://compression.ca/mpibzip2/mpibzip2-0.6.tar.gz
tar xzvf mpibzip2-0.6.tar.gz
cd mpibzip2-0.6
# Need to edit Makefile
vi Makefile
## make sure the line with CC=c++ is changed into
# CC=mpic++
# Otherwise, we'll get the following error message:
# mpibzip2.cpp:72:17: fatal error: mpi.h: No such file or directory
make
make install
# let's test locally, on just this machine
# -n 2 = run 2 processes
mpirun -n 2 mpibzip2 /var/log/syslog
### OK, let's start setting up the cluster
# the master needs to be able to access the slaves without a password
# create SSH keys
ssh-keygen -t rsa -b 4096
# when asked for a passphrase, just press Enter twice
cat ~/.ssh/id_rsa.pub
# then paste this public key into each slave's ~/.ssh/authorized_keys file
# put the slaves' IP addresses / hostnames here:
vi /etc/openmpi/openmpi-default-hostfile
# just need to put the slaves' IP addresses there, simple.
###### SLAVE ######
cd /tmp
apt-get install openmpi-bin build-essential libbz2-dev libopenmpi-dev sshfs
wget http://compression.ca/mpibzip2/mpibzip2-0.6.tar.gz
tar xzvf mpibzip2-0.6.tar.gz
cd mpibzip2-0.6
# Need to edit Makefile
vi Makefile
## make sure the line with CC=c++ is changed into
# CC=mpic++
# Otherwise, we'll get the following error message:
# mpibzip2.cpp:72:17: fatal error: mpi.h: No such file or directory
make
make install
###### TEST mpibzip2 ######
# make sure the master and all slaves share a folder
# in this example, I'll use sshfs to share the folder
cp /var/log/syslog /tmp/
ssh root@slave1 sshfs root@master:/tmp/ /tmp
ssh root@slave2 sshfs root@master:/tmp/ /tmp
# let's run mpibzip2
mpirun -v -n 40 --hostfile /etc/openmpi/openmpi-default-hostfile /usr/bin/mpibzip2 -v /tmp/syslog
# this runs 20 mpibzip2 processes on each of slave1 and slave2
=====================
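The note above only says to list the slaves in the hostfile. If you want to be explicit about how many processes each machine should take, the Open MPI hostfile format accepts a `slots=N` entry per host. Here is a minimal sketch that writes such a file; the hostnames `slave1` and `slave2` come from the test step above, while the helper name `write_hostfile` and the slot count of 20 per host are assumptions chosen to match the 40-process run.

```shell
#!/bin/sh
# Hypothetical helper: write an example Open MPI hostfile to the given path.
# Hostnames and slot counts are assumptions; adjust them to your cluster.
write_hostfile() {
    cat > "$1" <<'EOF'
slave1 slots=20
slave2 slots=20
EOF
}
```

Running `write_hostfile /etc/openmpi/openmpi-default-hostfile` as root would overwrite the default hostfile, so back up the existing one first or write to another path and pass that to `--hostfile`.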
Hope you'll find this note useful.
At the moment, my 40-processor OpenMPI cluster is busy scouring the storage server for any uncompressed logfiles and quickly compressing them. Love this :)
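For the curious, that kind of sweep can be sketched as a small shell function. This is not the exact script I run, just a minimal version under some assumptions: logfiles end in `.log`, the compressor writes `FILE.bz2` next to its input the way bzip2 does, and no file is still being written when it is picked up. The names `compress_pending` and `COMPRESS_CMD` are made up for this example.

```shell
#!/bin/sh
# Hypothetical sweep: compress every *.log in a directory that has no .bz2
# sibling yet. COMPRESS_CMD defaults to the mpirun line used in this note;
# the -n value and hostfile path are assumptions to adjust for your cluster.
: "${COMPRESS_CMD:=mpirun -n 40 --hostfile /etc/openmpi/openmpi-default-hostfile mpibzip2}"

compress_pending() {
    dir=$1
    for f in "$dir"/*.log; do
        [ -f "$f" ] || continue        # glob matched nothing, or not a file
        [ -e "$f.bz2" ] && continue    # a compressed copy already exists
        # Left unquoted on purpose so the command splits into its arguments.
        $COMPRESS_CMD "$f" || echo "failed: $f" >&2
    done
}
```

Something like `compress_pending /tmp` from cron would then handle whatever has piled up since the last run. Setting `COMPRESS_CMD=bzip2` first is a handy way to try the loop without a cluster.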