My good friend Pakdi Decnud gave me a great idea while we were having lunch on the very same day of the SAP CodeJam Montreal event.
Pakdi told me..."Why don't you make a comparison between SAP HANA and let's say...MongoDB?"...I thought that it was of course a great idea...so yesterday I started exploring MongoDB...so you may ask yourselves..."Why then are you talking about MySQL and PostgreSQL?"
Easy answer...and here are my thoughts...
- I really don't get MongoDB...the whole NoSQL thing is really alien to me...
- MongoDB is "document" based, meaning that you create collections of documents instead of tables...
- MongoDB doesn't support inner joins, and aggregates need a framework that's weirder than MongoDB itself...
- MongoDB is not meant for enterprise applications
That's why I decided to shift gears a little and grab the two databases most used by start-ups and developers...MySQL and PostgreSQL.
For this blog, I wanted to have a lot of information...so as always, I grabbed my beloved Python and created a little script to generate 1 million records for each of two tables. One script per table.
The structure of the tables is as follows...
DOC_HEADER

| Field Name | Data Type | Length |
|---|---|---|
| DOCUMENT_ID | VARCHAR | 8 |
| YEAR | VARCHAR | 4 |
| AREA | VARCHAR | 2 |
| NAME | VARCHAR | 20 |
| LAST_NAME | VARCHAR | 20 |
DOC_DETAIL

| Field Name | Data Type | Length |
|---|---|---|
| DOCUMENT_ID | VARCHAR | 8 |
| YEAR | VARCHAR | 4 |
| AREA | VARCHAR | 2 |
| AMOUNT | VARCHAR | |
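The DDL itself isn't shown here, but based on the layouts above it would look roughly like this...a minimal sketch assuming MySQL syntax, with an integer AMOUNT since the queries later SUM() that field...

```python
# Minimal sketch of the table creation (assumed, not from the original
# post); connection parameters mirror the MySQL script further down
import mysql.connector

conn = mysql.connector.Connect(host='localhost', user='root',
                               password='root', database='P075400')
cur = conn.cursor()
cur.execute('''CREATE TABLE DOC_HEADER (DOCUMENT_ID VARCHAR(8),
                                        YEAR VARCHAR(4),
                                        AREA VARCHAR(2),
                                        NAME VARCHAR(20),
                                        LAST_NAME VARCHAR(20))''')
# AMOUNT is listed as VARCHAR with no length above, but it gets SUM()'ed
# later, so an integer type is assumed here
cur.execute('''CREATE TABLE DOC_DETAIL (DOCUMENT_ID VARCHAR(8),
                                        YEAR VARCHAR(4),
                                        AREA VARCHAR(2),
                                        AMOUNT INT)''')
conn.commit()
```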
And here are the scripts to generate the 1 million records in nice .CSV files...
Doc_Header_Generator.py

```python
import random
import csv

names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario",
         "Julien", "Mike", "Michael", "Karin", "Rui", "John", "Rocky",
         "Sebastian", "Kai-Yin", "Hester", "Katrin", "Uwe", "Vitaliy"]
last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada",
              "Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka",
              "Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau",
              "Hilbrecht", "Staehr", "Kylau", "Rudnytskiy"]
area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]


def Generate_File(pNumber):
    # Write pNumber random header records to Doc_Header.csv
    iNumber = 0
    c = csv.writer(open("Doc_Header.csv", "wb"))  # Python 2: binary mode
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_names = random.randrange(0, 20)
        r_lastnames = random.randrange(0, 20)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(str(area[r_areas]))
        queries.append(names[r_names])
        queries.append(last_names[r_lastnames])
        c.writerow(queries)

# Python 2: input() evaluates the typed text, so this yields an int
num_files = input("How many records?: \n")
Generate_File(num_files)
```
Doc_Detail_Generator.py

```python
import random
import csv

area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]


def Generate_File(pNumber):
    # Write pNumber random detail records to Doc_Detail.csv
    iNumber = 0
    c = csv.writer(open("Doc_Detail.csv", "wb"))  # Python 2: binary mode
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        r_amount = random.randrange(0, 10000, 1)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(str(area[r_areas]))
        queries.append(r_amount)
        c.writerow(queries)

# Python 2: input() evaluates the typed text, so this yields an int
num_files = input("How many records?: \n")
Generate_File(num_files)
```
With the two files ready, I uploaded them to MySQL, PostgreSQL and SAP HANA.
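The upload itself can go through each database's bulk-load path...here's a minimal sketch for MySQL and PostgreSQL (file paths and options are assumptions, and MySQL needs local_infile enabled):

```python
# Sketch of the CSV bulk-load (assumed approach, not from the original
# post): LOAD DATA for MySQL, COPY for PostgreSQL
import mysql.connector
import psycopg2

files = [('Doc_Header.csv', 'DOC_HEADER'), ('Doc_Detail.csv', 'DOC_DETAIL')]

# MySQL: LOAD DATA LOCAL INFILE (requires local_infile to be enabled)
conn = mysql.connector.Connect(host='localhost', user='root',
                               password='root', database='P075400')
cur = conn.cursor()
for csv_file, table in files:
    cur.execute("LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
                "FIELDS TERMINATED BY ','" % (csv_file, table))
conn.commit()

# PostgreSQL: COPY ... FROM STDIN through psycopg2
pconn = psycopg2.connect("dbname=P075400 user=postgres password=root")
pcur = pconn.cursor()
for csv_file, table in files:
    with open(csv_file) as f:
        pcur.copy_expert('COPY "%s" FROM STDIN WITH CSV' % table, f)
pconn.commit()
```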
To measure the speed, I created three Python scripts using...yes...again Bottle...
The basic idea is to join the two tables, select the Document_Id, Year and Area fields, and sum the Amount field, grouping by the first three.
Let's start with the MySQL Script...
MySQL_Bottle_Documents.py

```python
from bottle import get, run
import mysql.connector
import time


@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = mysql.connector.Connect(host='localhost', user='root',
                                   password='root', database='P075400')
    cur = conn.cursor()
    # Join the header and detail tables and aggregate the amounts
    query = '''SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
               FROM DOC_HEADER AS A INNER JOIN DOC_DETAIL AS B
               ON A.DOCUMENT_ID = B.DOCUMENT_ID
               AND A.YEAR = B.YEAR AND A.AREA = B.AREA
               GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>MySQL - %s records in %s seconds</H1></DIV>"\
        % (counter, time_taken)
    return output

run(host='localhost', port=8080)
```
I let the script run...and after more than one hour...I simply got bored and interrupted the process...
So, I continued with PostgreSQL...
PostgreSQL_Bottle_Documents.py

```python
from bottle import get, run
import psycopg2
import time


@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
    cur = conn.cursor()
    # Join the header and detail tables and aggregate the amounts
    query = '''SELECT "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
               "DOC_HEADER"."AREA", SUM("DOC_DETAIL"."AMOUNT")
               FROM public."DOC_HEADER", public."DOC_DETAIL"
               WHERE "DOC_HEADER"."DOCUMENT_ID" = "DOC_DETAIL"."DOCUMENT_ID"
               AND "DOC_HEADER"."YEAR" = "DOC_DETAIL"."YEAR"
               AND "DOC_HEADER"."AREA" = "DOC_DETAIL"."AREA"
               GROUP BY "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
               "DOC_HEADER"."AREA"'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>PostgreSQL - %s records in %s seconds</H1></DIV>"\
        % (counter, time_taken)
    return output

run(host='localhost', port=8080)
```
This time...I was lucky...
Out of the 2 million input records, PostgreSQL managed to aggregate the amount field and generate the 7669 resulting records in 36 seconds...not bad at all...
For SAP HANA, I decided to take full advantage of the Calculation Views, so I created the following...
I joined both tables, used a projection, applied the aggregation and specified the result.
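In SQLScript terms, what the view does is roughly equivalent to this sketch using the CE functions of that HANA generation (a hypothetical equivalent...the real view was built graphically, and the schema name is an assumption):

```python
# Hypothetical CE-function equivalent of the graphical Calculation View
# (the actual view was modeled, not scripted); schema name assumed
query = '''create procedure pyDocsView (OUT out_docs TEST_OUT)
           language sqlscript reads sql data as
           begin
             lt_header = CE_COLUMN_TABLE("P075400"."DOC_HEADER");
             lt_detail = CE_COLUMN_TABLE("P075400"."DOC_DETAIL");
             -- join on the shared key fields
             lt_join = CE_JOIN(:lt_header, :lt_detail,
                               ["DOCUMENT_ID", "YEAR", "AREA"]);
             -- projection: keep only the fields we need
             lt_proj = CE_PROJECTION(:lt_join,
                               ["DOCUMENT_ID", "YEAR", "AREA", "AMOUNT"]);
             -- aggregation: sum the amounts per document
             out_docs = CE_AGGREGATION(:lt_proj,
                               [SUM("AMOUNT") AS "AMOUNT"],
                               ["DOCUMENT_ID", "YEAR", "AREA"]);
           end'''
```

Then I wrote this Python script...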
SAPHANA_Bottle_Documents.py

```python
from bottle import get, run
import dbapi  # SAP HANA Python client
import time


@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = dbapi.connect('hanasvr-02', 30015, 'P075400', '5pA5kb6i')
    cur = conn.cursor()
    # Clean up the table type and procedure from previous runs
    try:
        ret = cur.execute("drop type test_out")
    except dbapi.Error:
        pass
    try:
        ret = cur.execute("drop procedure pyTest")
    except dbapi.Error:
        pass
    # Create the output table type and a procedure that exposes the
    # Calculation View through the result view ProcView
    queries = ['''create type test_out as table (DOCUMENT_ID NVARCHAR(8),
                  YEAR VARCHAR(4), AREA VARCHAR(2), AMOUNT BIGINT)''',
               '''create procedure pyTest (OUT out_test TEST_OUT)
                  language sqlscript reads sql data
                  with result view ProcView as
                  begin
                  out_test = CE_CALC_VIEW("_SYS_BIC"."blag/CV_DOCUMENTS",
                             ["DOCUMENT_ID", "YEAR", "AREA", "AMOUNT"]);
                  end''']
    for query in queries:
        cur.execute(query)
        conn.commit()
    query = '''select DOCUMENT_ID, YEAR, AREA, AMOUNT from ProcView'''
    cur.execute(query)
    ret = cur.fetchall()
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in ret:
        counter += 1
        document_id = str(row["DOCUMENT_ID"])
        year = str(row["YEAR"])
        area = str(row["AREA"])
        amount = str(row["AMOUNT"])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>SAP HANA - %s records in %s seconds</H1></DIV>"\
        % (counter, time_taken)
    return output

run(host='localhost', port=8080)
```
After the execution...I couldn't be happier...here's the result...
SAP HANA managed the same 2 million records...generating the same 7669 aggregated records in only 18 seconds...that's half the time PostgreSQL needed and...well...let's just say...way faster than MySQL...
Now...tell me that SAP HANA is not the coolest and fastest Database around...I dare you -:)
By making that little fix to my Python script for SAP HANA...the new processing time, without generating the Bottle HTML table, is...
Greetings,
Blag.
2 comments:
Hey Blag, since you don't have any calculated fields, I suppose you'd get even better results with an Analytic view instead of a calc view. Worth a try!
Henrique:
Good call -:) But you may be surprised...I just ran a new test using Analytic, Calculation and plain SQL...here are the results...
| Variant | Time (seconds) |
|---|---|
| Calculation | 1.1289 |
| Analytic | 1.5302 |
| Plain SQL | 0.7286 |
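The plain SQL run is essentially the same join and aggregation fired directly at the tables...roughly this sketch, not necessarily the exact statement behind the 0.7286 number...

```python
# Sketch of the plain SQL variant (assumed, not the exact statement):
# direct join plus aggregation, no view in between
query = '''SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT) AS AMOUNT
           FROM DOC_HEADER AS A INNER JOIN DOC_DETAIL AS B
           ON A.DOCUMENT_ID = B.DOCUMENT_ID
           AND A.YEAR = B.YEAR AND A.AREA = B.AREA
           GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA'''
```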
Greetings,
Blag.