MySQL : Make Your Query Runs (at least) 16800% Faster

:: One of our data analyst was asked to analyze a bunch of data from 2 tables. One of it  contained nearly 200 million records, and the other about 120 million records. The query to run against this dataset is doing several JOINs on non-unique fields, causing the resultset to be estimated at trillions of records.

As some of you may have already guessed – a week went by, and the query still hasn't finished 🙂

I was asked to help, so I tried almost every trick that I knew, including (but not limited to) :

# vertical partitioning : split the columns into a table each
# horizontal partitioning : splitting the table into 10 smaller tables
# putting the tables on fast SSDs
# putting the tables on RAM disk : on several occasions, I actually managed to saturate the memory bandwidth……
# enlarging the JOIN buffers : up to tens of gigabytes.
# etc, etc.

Still no joy. The query just won't finish. And I've been sleeping on my work desk for almost a week.

One day before the deadline (just like in the movies, huh?), another of our data analyst suggested that I hashes the fields being compared, and then compare the hashes instead.

To my surprise, it works GREAT. It's like a miracle.

The query which wasn't finished in a week, now finished in an hour. 
We got the job done, at the very least, 16800% faster.

Mind = Blown 🙂

So, here's how it's done, hopefully you'll find it useful too later.
Enjoy !


SELECT a.f1, a.f2, b.f1, b.f2  
FROM t1 AS a, t2 AS b 
WHERE a.f1=b.f1 AND a.f2=b.f2 AND a.f3=b.f3 AND a.f4=b.f4 AND a.f5=b.f5;


SELECT uid, MD5(CONCAT(f1,f2,f3,f4,f5)) AS hash  
FROM t1 

SELECT uid, MD5(CONCAT(f1,f2,f3,f4,f5)) AS hash  
FROM t2 

This is where the magic happens :) 
we compare the hashes, and then we store the uid related to those hashes.

CREATE TABLE hash_result (KEY(uid1), KEY(uid2)) ENGINE=MyISAM AS 
SELECT a.uid AS uid1, b.uid AS uid2 
FROM hash1 AS a, hash2 AS b 
WHERE a.hash=b.hash 

Collect the records with the corresponding uid :

SELECT a.f1, a.f2, b.f1, b.f2 
FROM t1 AS a, t2 AS b, hash_result AS c 
WHERE a.uid=c.uid1 AND b.uid=c.uid2 

fastest car in the world | World of Cars
world top leading car companies information and photos. lamborghini,bmw,audi,mercedes,maserati,pagani,pgo,toyota,honda. fastest car in the world. fastest car fast car. fastest car wallpaper. bugatti veyron. fast car. bugatti veyron. fast car. bugatti veyron. fast car. bugatti zonda. fast car …

Post imported by Google+Blog for WordPress.

23 thoughts on “MySQL : Make Your Query Runs (at least) 16800% Faster

  1. Great technique. I'm aware that the hashes were done against multiple columns, but am I correct in assuming that you're aware and accepting of the risk of hash collisions here? Especially when considering that you have hundreds of millions of records?

  2. +Arga Nugraha – Yup, jadi ingat di posting ini di Facebook, ada yang komentar, menyebutkan menggunakan CRC32 & secara tidak langsung merekomendasikan itu 😀 padahal CRC32 itu 32 bit = cuma ada 4 milyar kombinasi. 

    Kalau yang di hash ada ratusan juta… bakalan panen hash collisions ini….. 😀

    MD5 itu outputnya 128 bit, jadi ada 2^128 kombinasi = 3,4 x 10^38 ….. ratusan juta itu adalah 1 x 10^8, yah……  sejauh ini, terlalu kecil  kemungkinan hash collisionnya 🙂

  3. Nice post! The information you provided is very helpful if someone is planning to Digitalized our Business. I think Digitalopment is a better way to get all the information about Digital Marketing.

  4. Before choosing any SEO company you need to know their past record although you can check it by looking at their portfolio which is great to know besides this you can also look for their experience level as by this you will get the brief idea about their services.

  5. Nice post! The information you provided is very helpful if someone is planning to enjoy the holidays. I think ALNoorIsland is a better way to spend the joyful vacations.

Leave a Reply

Your email address will not be published. Required fields are marked *