Monday, December 17, 2012

MySQL, PostgreSQL and SAP HANA - A friendly comparison

UPDATE! Thanks to a comment by Alan Suffolk, I fixed my script a little bit, just moving the end = time.clock() and time_taken = end - start lines to right after the cur.execute(query), because that's exactly when the SAP HANA query ends...you can see the new processing time in the image below...
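Concretely, the fix amounts to this snippet of the SAP HANA script shown later in this post (the listing below still stops the clock at the very end, after building the HTML table):

    cur.execute(query)
    end = time.clock()        # the query has finished right here...
    time_taken = end - start  # ...so this now times only the SAP HANA query
    ret = cur.fetchall()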

My good friend Pakdi Decnud gave a great idea while we were having lunch the very same day of the SAP CodeJam Montreal event.

Pakdi told me..."Why don't you make a comparison between SAP HANA and, let's say...MongoDB"...I thought that it was of course a great idea...so yesterday I started exploring MongoDB...so you may ask yourselves..."Why then are you talking about MySQL and PostgreSQL?"

Easy answer...and here are my thoughts...
  • I really don't get MongoDB...the whole NoSQL thing is really alien to me...
  • MongoDB is "Document" based, meaning that you create collections of documents, no databases nor tables...
  • MongoDB doesn't support Inner Joins, and aggregates need a framework that's weirder than MongoDB itself...
  • MongoDB is not meant for enterprise applications

That's why I decided to shift a little bit and grab the two databases most used by start-ups and developers...MySQL and PostgreSQL.

For this blog, I wanted to have a lot of information...so as always, I grabbed my beloved Python and created a little script to generate 1 million records for each of two tables. One script per table.

The structure of the tables is as follows...

DOC_HEADER

Field Name    Data Type   Length
DOCUMENT_ID   VARCHAR     8
YEAR          VARCHAR     4
AREA          VARCHAR     2
NAME          VARCHAR     20
LAST_NAME     VARCHAR     20

DOC_DETAIL

Field Name    Data Type   Length
DOCUMENT_ID   VARCHAR     8
YEAR          VARCHAR     4
AREA          VARCHAR     2
AMOUNT        VARCHAR
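If you want to reproduce this, here's a minimal DDL sketch for those two tables, assuming MySQL and the same P075400 database used later in this post...the AMOUNT length isn't given above, so VARCHAR(10) is my assumption, and PostgreSQL and SAP HANA of course need their own equivalents...

Create_Tables.py
import mysql.connector

# Minimal DDL sketch for the two tables above (MySQL flavor).
# VARCHAR(10) for AMOUNT is an assumption...its length isn't specified.
conn = mysql.connector.Connect(host='localhost', user='root',
                               password='root', database='P075400')
cur = conn.cursor()
cur.execute('''CREATE TABLE DOC_HEADER (DOCUMENT_ID VARCHAR(8),
               YEAR VARCHAR(4), AREA VARCHAR(2),
               NAME VARCHAR(20), LAST_NAME VARCHAR(20))''')
cur.execute('''CREATE TABLE DOC_DETAIL (DOCUMENT_ID VARCHAR(8),
               YEAR VARCHAR(4), AREA VARCHAR(2), AMOUNT VARCHAR(10))''')
conn.commit()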

And here are the scripts to generate the 1 million records each, in nice .CSV files...

Doc_Header_Generator.py
import random
import csv
 
names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario",
"Julien", "Mike", "Michael", "Karin", "Rui", "John", "Rocky", "Sebastian",
"Kai-Yin", "Hester", "Katrin", "Uwe", "Vitaliy"]
last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada",
"Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka",
"Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau", "Hilbrecht",
"Staehr", "Kylau", "Rudnytskiy"]
area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
"2008", "2009", "2010", "2011", "2012"]
 
def Generate_File(pNumber):
    # Write pNumber random header rows to Doc_Header.csv
    # (Python 2 style: the csv file is opened in binary mode)
    iNumber = 0
    c = csv.writer(open("Doc_Header.csv", "wb"))

    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_names = random.randrange(0, 20)
        r_lastnames = random.randrange(0, 20)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(area[r_areas])
        queries.append(names[r_names])
        queries.append(last_names[r_lastnames])
        c.writerow(queries)

num_files = int(input("How many records?: \n"))
Generate_File(num_files)

Doc_Detail_Generator.py
import random
import csv
 
area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
"2008", "2009", "2010", "2011", "2012"]
 
def Generate_File(pNumber):
    # Write pNumber random detail rows to Doc_Detail.csv
    iNumber = 0
    c = csv.writer(open("Doc_Detail.csv", "wb"))

    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        r_amount = random.randrange(0, 10000)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(area[r_areas])
        queries.append(r_amount)
        c.writerow(queries)

num_files = int(input("How many records?: \n"))
Generate_File(num_files)


With the two files ready, I uploaded them to MySQL, PostgreSQL and SAP HANA.
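In case you wonder about the upload part, here's a minimal sketch for the PostgreSQL side using psycopg2's COPY support...the connection string matches the PostgreSQL script further below, and MySQL's LOAD DATA INFILE or SAP HANA's IMPORT would be the analogous routes...

Load_CSV.py
import psycopg2

# Sketch: bulk-load one of the generated .CSV files into PostgreSQL via COPY.
conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
cur = conn.cursor()
with open("Doc_Header.csv") as f:
    cur.copy_expert('''COPY "DOC_HEADER" FROM STDIN WITH CSV''', f)
conn.commit()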

To measure the speed, I created three Python scripts using...yes...again Bottle...

The basic idea is to join the two tables, select the Document_Id, Year, Area and the sum of Amount.

Let's start with the MySQL Script...

MySQL_Bottle_Documents.py
from bottle import get, run
import mysql.connector
import time


@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = mysql.connector.Connect(host='localhost', user='root',
                        password='root', database='P075400')
    cur = conn.cursor()

    query = '''SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
                  FROM DOC_HEADER AS A INNER JOIN DOC_DETAIL AS B
                    ON A.DOCUMENT_ID = B.DOCUMENT_ID
                   AND A.YEAR = B.YEAR
                   AND A.AREA = B.AREA
                 GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA'''
    cur.execute(query)

    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                         <TD>%s</TD><TD>%s</TD>''' % \
                   (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>SAP HANA - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)
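By the way, to try any of these scripts, just run them and point your browser to http://localhost:8080/show_query...that's the route exposed by the @get decorator.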

I let the script run...and after more than one hour...I simply got bored and interrupted the process...

So, I continued with PostgreSQL...

PostgreSQL_Bottle_Documents.py
from bottle import get, run
import psycopg2
import time


@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
    cur = conn.cursor()

    query = '''SELECT "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
                "DOC_HEADER"."AREA", SUM("DOC_DETAIL"."AMOUNT") FROM
                public."DOC_HEADER", public."DOC_DETAIL" WHERE
                "DOC_HEADER"."DOCUMENT_ID" = "DOC_DETAIL"."DOCUMENT_ID"
                AND "DOC_HEADER"."YEAR" = "DOC_DETAIL"."YEAR"
                AND "DOC_HEADER"."AREA" = "DOC_DETAIL"."AREA"
                GROUP BY "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
                "DOC_HEADER"."AREA"'''
    cur.execute(query)

    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                         <TD>%s</TD><TD>%s</TD>''' % \
                     (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>PostgreSQL - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

This time...I was lucky...


Out of 2 million records, PostgreSQL managed to aggregate the amount field and generate 7669 records in 36 seconds...not bad at all...

For SAP HANA, I decided to take full advantage of Calculation Views, so I created the following...



I joined both tables, used a projection, applied the aggregation and specified the result...then I wrote this Python script...
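Just for reference, the graphical view corresponds roughly to this kind of SQLScript using the CE functions...this is only a sketch of the idea, written the same way the script below embeds SQLScript in a Python string...the code HANA generates for the view is different, and the P075400 schema name is my assumption...

# Rough SQLScript equivalent of the Calculation View: join, then aggregate.
# Only a sketch...the P075400 schema name is an assumption.
calc_view_sketch = '''
lt = CE_COLUMN_TABLE("P075400"."DOC_HEADER");
rt = CE_COLUMN_TABLE("P075400"."DOC_DETAIL");
jt = CE_JOIN(:lt, :rt, ["DOCUMENT_ID", "YEAR", "AREA"]);
var_out = CE_AGGREGATION(:jt, [SUM("AMOUNT") AS "AMOUNT"],
                         ["DOCUMENT_ID", "YEAR", "AREA"]);
'''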

SAPHANA_Bottle_Documents.py
from bottle import get, run
import dbapi
import time


@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = dbapi.connect('hanasvr-02', 30015, 'P075400', '5pA5kb6i')
    cur = conn.cursor()

    try:
        ret = cur.execute("drop type test_out")
    except dbapi.Error:
        pass

    try:
        ret = cur.execute("drop procedure pyTest")
    except dbapi.Error:
        pass

    queries = ['''create type test_out as table (DOCUMENT_ID NVARCHAR(8),
               YEAR VARCHAR(4), AREA VARCHAR(2), AMOUNT BIGINT)''',
               '''create procedure pyTest (OUT out_test TEST_OUT)
               language sqlscript reads sql data with result view ProcView as
               begin
               out_test = CE_CALC_VIEW("_SYS_BIC"."blag/CV_DOCUMENTS",
               ["DOCUMENT_ID","YEAR","AREA","AMOUNT"]);
               end''']

    for query in queries:
        cur.execute(query)
        conn.commit()

    query = '''select DOCUMENT_ID, YEAR, AREA, AMOUNT from ProcView'''

    cur.execute(query)
    ret = cur.fetchall()

    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in ret:
        counter += 1
        document_id = str(row["DOCUMENT_ID"])
        year = str(row["YEAR"])
        area = str(row["AREA"])
        amount = str(row["AMOUNT"])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % \
                     (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    # See the UPDATE at the top: moving these two lines to right after
    # cur.execute(query) measures the query alone, not the HTML generation
    end = time.clock()
    time_taken = end - start
    output += "<H1>SAP HANA - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

After the execution...I couldn't be happier...here's the result...



SAP HANA managed the same 2 million records...generating the same 7669 aggregated records in only 18 seconds...that's twice as fast as PostgreSQL and...well...let's just say...way faster than MySQL...

Now...tell me that SAP HANA is not the coolest and fastest Database around...I dare you -:)

By doing that little fix on my Python script for SAP HANA...the new processing time, without generating the Bottle table, is...


Greetings,

Blag.

2 comments:

Henrique said...

Hey Blag, since you don't have any calculated fields, I suppose you'd get even better results with an Analytic view instead of a calc view. Worth a try!

Alvaro "Blag" Tejada Galindo dijo...

Henrique:

Good call -:) But you may be surprised...I just ran a new test using Analytical, Calculation and plain SQL...here are the results...

1.1289 seconds - Calculation view

1.5302 seconds - Analytical view

0.7286 seconds - plain SQL
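In case you wonder, the plain SQL variant is something along these lines...just a sketch using the same dbapi cursor as in the script above, with the P075400 schema name being my assumption...

# Sketch of the plain SQL variant...P075400 schema name is an assumption.
query = '''SELECT A."DOCUMENT_ID", A."YEAR", A."AREA", SUM(B."AMOUNT")
             FROM "P075400"."DOC_HEADER" A
            INNER JOIN "P075400"."DOC_DETAIL" B
               ON A."DOCUMENT_ID" = B."DOCUMENT_ID"
              AND A."YEAR" = B."YEAR"
              AND A."AREA" = B."AREA"
            GROUP BY A."DOCUMENT_ID", A."YEAR", A."AREA"'''
cur.execute(query)
ret = cur.fetchall()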

Greetings,

Blag.