Setup OpenMPI Cluster for bzip2

One job at a client involves managing terabytes of logfiles. Each logfile can be about 9 GB in size, and a new one is created every few minutes. Needless to say, however big your storage is, it will always be hard to keep up with such a rate of usage.

I alleviate this by compressing each logfile with bzip2 before sending it to the storage server.
One problem though: compressing a 9 GB logfile can take 2.5 hours! Gzip compresses faster, but the resulting file is about 2x bigger than with bzip2.

So I looked around for a solution, and found mpibzip2: http://compression.ca/mpibzip2/

Now a 9 GB logfile can be compressed in just 15 minutes. That's about 10x faster! Not bad 🙂

mpibzip2 achieves this by spreading the compression work across a cluster of many machines. Yes, we can now use a cluster to speed up compression (insert Beowulf cluster joke here) :-)
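
The general idea behind this kind of parallel compression can be sketched in plain shell on a single machine. To be clear, this is not mpibzip2's actual code. It only relies on one thing I am sure of: the stock bzip2 tool can decompress several compressed streams that have been concatenated into one file. The function name chunked_bzip2 and the default chunk size are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of mpibzip2: compress a file in
# independent chunks, then concatenate the compressed chunks.
# usage: chunked_bzip2 INPUT OUTPUT [CHUNK_BYTES]
chunked_bzip2() {
    input=$1
    output=$2
    chunk_bytes=${3:-1048576}     # assumed default: 1 MiB per chunk
    workdir=$(mktemp -d) || return 1

    # 1. cut the input into fixed-size pieces: chunk.aa, chunk.ab, ...
    split -b "$chunk_bytes" "$input" "$workdir/chunk." || return 1

    # 2. compress every piece as a background job (the "workers");
    #    a real tool would cap the number of simultaneous workers
    for piece in "$workdir"/chunk.*; do
        bzip2 "$piece" &
    done
    wait                          # block until all workers are done

    # 3. join the compressed pieces; the shell expands the glob in
    #    sorted order, which is the order split created them in
    cat "$workdir"/chunk.*.bz2 > "$output"
    rm -rf "$workdir"
}

# Example (paths are placeholders):
#   chunked_bzip2 /tmp/syslog /tmp/syslog.bz2
#   bzip2 -dc /tmp/syslog.bz2 | cmp - /tmp/syslog && echo "round trip OK"
```

With MPI the workers run on other machines instead of being background jobs, but the split / compress / join shape is the same.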

OpenMPI is an implementation of MPI (Message Passing Interface). An OpenMPI cluster enables us to run a program simultaneously on multiple machines: http://en.wikipedia.org/wiki/Message_Passing_Interface

Setting up an OpenMPI cluster may seem a daunting task at first.
It turned out to be quite easy on Debian (squeeze or newer). Here is how:

=====================
###### MASTER ######
cd /tmp
apt-get install openmpi-bin build-essential libbz2-dev libopenmpi-dev  
wget http://compression.ca/mpibzip2/mpibzip2-0.6.tar.gz
tar xzvf mpibzip2-0.6.tar.gz
cd mpibzip2-0.6

# Need to edit the Makefile
vi Makefile
# make sure the line with CC=c++ is changed into
#   CC=mpic++
# Otherwise, we'll get the following error message:
#   mpibzip2.cpp:72:17: fatal error: mpi.h: No such file or directory

make
make install

# let's test locally, on just this machine
# -n 2 = run 2 processes
mpirun -n 2 mpibzip2 /var/log/syslog

### OK, let's start setting up the cluster
# the master needs to be able to access the slaves without a password
# create SSH keys
ssh-keygen -t rsa -b 4096
# when asked for a passphrase, just press enter, twice

cat ~/.ssh/id_rsa.pub
# then paste this public key into each slave's ~/.ssh/authorized_keys file

# put the slaves' IP addresses / hostnames here, one per line:
vi /etc/openmpi/openmpi-default-hostfile
# that's all this file needs, simple.

###### SLAVE ######
cd /tmp
apt-get install openmpi-bin build-essential libbz2-dev libopenmpi-dev sshfs 
wget http://compression.ca/mpibzip2/mpibzip2-0.6.tar.gz
tar xzvf mpibzip2-0.6.tar.gz
cd mpibzip2-0.6

# Need to edit the Makefile
vi Makefile
# make sure the line with CC=c++ is changed into
#   CC=mpic++
# Otherwise, we'll get the following error message:
#   mpibzip2.cpp:72:17: fatal error: mpi.h: No such file or directory

make
make install

###### TEST mpibzip2 ######

# make sure there is a shared folder on the master & all slaves
# in this example, I'll use sshfs to share the master's /tmp
cp /var/log/syslog /tmp/
ssh root@slave1 sshfs root@master:/tmp/ /tmp
ssh root@slave2 sshfs root@master:/tmp/ /tmp

# let's run mpibzip2
mpirun -v -n 40 --hostfile /etc/openmpi/openmpi-default-hostfile /usr/bin/mpibzip2 -v /tmp/syslog
# this will run 20 mpibzip2 processes on each of slave1 & slave2

=====================
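
The two vi steps above can also be done non-interactively, which helps when the same setup has to be repeated on every slave. This is only a minimal sketch. The function names are made up, and the sed pattern assumes the Makefile line really looks like CC=c++ as described above, so check your copy first. The slots= keyword is OpenMPI's hostfile syntax for how many processes a host may run; as far as I know, newer OpenMPI releases may refuse to start more processes than the available slots, so stating it explicitly seems safer than relying on defaults.

```shell
#!/bin/sh
# Hypothetical helpers replacing the manual "vi" edits.

# Replace "CC=c++" with "CC=mpic++" in the given Makefile.
# The untouched original is kept next to it as <Makefile>.orig.
use_mpi_compiler() {
    makefile=$1
    cp "$makefile" "$makefile.orig" || return 1
    # in basic regular expressions "+" is a literal character,
    # so "c++" here matches exactly the text c++
    sed 's/^CC[[:space:]]*=[[:space:]]*c++[[:space:]]*$/CC=mpic++/' \
        "$makefile.orig" > "$makefile"
}

# Write a hostfile with one line per slave: "<host> slots=<n>".
# usage: write_hostfile FILE SLOTS HOST...
write_hostfile() {
    hostfile=$1
    slots=$2
    shift 2
    : > "$hostfile"               # start from an empty file
    for host in "$@"; do
        echo "$host slots=$slots" >> "$hostfile"
    done
}

# Example, matching the 2 x 20 processes used above:
#   use_mpi_compiler Makefile
#   write_hostfile /etc/openmpi/openmpi-default-hostfile 20 slave1 slave2
```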

Hope you'll find this note useful. 

At the moment, my 40-processor OpenMPI cluster is busy scouring the storage server for uncompressed logfiles and quickly compressing them. Love this :)
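
A sweep like this can be a short loop run from cron. The sketch below is a guess at how such a job might look, not the exact script I run: the function name, the *.log pattern and the COMPRESS variable are all made up for illustration. By default it calls the same mpirun command as in the test above.

```shell
#!/bin/sh
# Hypothetical sweep: compress every *.log file under a directory
# that does not already have a .bz2 file next to it.
# COMPRESS holds the command run on each file; override it as needed.
COMPRESS=${COMPRESS:-"mpirun -n 40 --hostfile /etc/openmpi/openmpi-default-hostfile /usr/bin/mpibzip2"}

# usage: compress_pending_logs DIRECTORY
compress_pending_logs() {
    logdir=$1
    find "$logdir" -type f -name '*.log' | while IFS= read -r logfile; do
        # skip files that were already compressed
        [ -e "$logfile.bz2" ] && continue
        echo "compressing $logfile"
        # COMPRESS is unquoted on purpose so it splits into words;
        # stdin comes from /dev/null so the command cannot swallow
        # the file list that the while loop is reading
        $COMPRESS "$logfile" < /dev/null || echo "failed: $logfile" >&2
    done
}

# Example (the directory is a placeholder):
#   compress_pending_logs /srv/logs
```

Since a new logfile shows up every few minutes, a real job would also need to skip files that are still being written, for example by only picking up files older than some age.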



66 thoughts on “Setup OpenMPI Cluster for bzip2”

  1. Interesting information. It's probably advanced level though, because a newbie like me gets a bit lost in the configuration part.

  2. I really like this part of the article. With a good and interesting topic, it has helped many people learn about things they did not know. You should publish more on this so that more people understand things that are rarely known. Good luck to you!!!
