My good friend Pakdi Decnud gave me a great idea while we were having lunch on the very same day of the SAP CodeJam Montreal event.
Pakdi told me..."Why don't you make a comparison between SAP HANA and, let's say...MongoDB?"...I thought that was of course a great idea...so yesterday I started exploring MongoDB...so you may ask yourselves..."Why then are you talking about MySQL and PostgreSQL?"
Easy answer...and here are my thoughts...
- I really don't get MongoDB...the whole NoSQL thing is really alien to me...
- MongoDB is "document" based, meaning that you create collections of documents, not databases or tables...
- MongoDB doesn't support inner joins, and aggregation requires a framework that's even weirder than MongoDB itself...
- MongoDB is not meant for enterprise applications
That's why I decided to shift gears a little and grab the two databases most used by start-ups and developers...MySQL and PostgreSQL.
For this blog, I wanted to have a lot of information...so as always, I grabbed my beloved Python and created a little script to generate 1 million records for each of two tables. One script per table.
The structure of the tables is as follows...
DOC_HEADER
| Field Name | Data Type | Length |
|---|---|---|
| DOCUMENT_ID | VARCHAR | 8 |
| YEAR | VARCHAR | 4 |
| AREA | VARCHAR | 2 |
| NAME | VARCHAR | 20 |
| LAST_NAME | VARCHAR | 20 |
DOC_DETAIL
| Field Name | Data Type | Length |
|---|---|---|
| DOCUMENT_ID | VARCHAR | 8 |
| YEAR | VARCHAR | 4 |
| AREA | VARCHAR | 2 |
| AMOUNT | VARCHAR | |
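The structure above can be sketched as DDL. Here's a minimal stand-in using Python's built-in sqlite3 (the real tables lived in MySQL, PostgreSQL and SAP HANA, whose exact column types may differ; the AMOUNT length was not specified in the original post):

```python
import sqlite3

# In-memory stand-in for the real MySQL/PostgreSQL/SAP HANA tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DOC_HEADER, per the field list above.
cur.execute("""CREATE TABLE DOC_HEADER (
    DOCUMENT_ID VARCHAR(8),
    YEAR        VARCHAR(4),
    AREA        VARCHAR(2),
    NAME        VARCHAR(20),
    LAST_NAME   VARCHAR(20))""")

# DOC_DETAIL; AMOUNT's length was left unspecified in the post.
cur.execute("""CREATE TABLE DOC_DETAIL (
    DOCUMENT_ID VARCHAR(8),
    YEAR        VARCHAR(4),
    AREA        VARCHAR(2),
    AMOUNT      VARCHAR)""")

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['DOC_DETAIL', 'DOC_HEADER']
```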
And here are the scripts to generate the 1 million records in nice .CSV files...
**Doc_Header_Generator.py**

```python
import random
import csv

names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario",
         "Julien", "Mike", "Michael", "Karin", "Rui", "John", "Rocky", "Sebastian",
         "Kai-Yin", "Hester", "Katrin", "Uwe", "Vitaliy"]
last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada",
              "Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka",
              "Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau", "Hilbrecht",
              "Staehr", "Kylau", "Rudnytskiy"]
area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]

def Generate_File(pNumber):
    iNumber = 0
    with open("Doc_Header.csv", "w", newline="") as f:
        c = csv.writer(f)
        while iNumber < pNumber:
            queries = []
            r_doc_id = random.randrange(1, 999999)
            r_names = random.randrange(0, 20)
            r_lastnames = random.randrange(0, 20)
            r_areas = random.randrange(0, 10)
            r_years = random.randrange(0, 13)
            iNumber += 1
            queries.append(r_doc_id)
            queries.append(year[r_years])
            queries.append(area[r_areas])
            queries.append(names[r_names])
            queries.append(last_names[r_lastnames])
            c.writerow(queries)

num_files = int(input("How many records?: \n"))
Generate_File(num_files)
```
**Doc_Detail_Generator.py**

```python
import random
import csv

area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]

def Generate_File(pNumber):
    iNumber = 0
    with open("Doc_Detail.csv", "w", newline="") as f:
        c = csv.writer(f)
        while iNumber < pNumber:
            queries = []
            r_doc_id = random.randrange(1, 999999)
            r_areas = random.randrange(0, 10)
            r_years = random.randrange(0, 13)
            r_amount = random.randrange(0, 10000)
            iNumber += 1
            queries.append(r_doc_id)
            queries.append(year[r_years])
            queries.append(area[r_areas])
            queries.append(r_amount)
            c.writerow(queries)

num_files = int(input("How many records?: \n"))
Generate_File(num_files)
```
With the two files ready, I uploaded them to MySQL, PostgreSQL and SAP HANA.
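For reference, this is roughly what loading one of the generated CSV files looks like, sketched against sqlite3 so it runs anywhere (MySQL's LOAD DATA INFILE, PostgreSQL's COPY and HANA's import tools do the same job server-side, and much faster):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DOC_HEADER (DOCUMENT_ID, YEAR, AREA, NAME, LAST_NAME)")

# A couple of rows in the same shape Doc_Header_Generator.py writes.
sample_csv = io.StringIO(
    "123456,2010,001,Alvaro,Tejada\n"
    "654321,2012,005,Anne,Hardy\n")

# Read the CSV rows and bulk-insert them.
rows = list(csv.reader(sample_csv))
conn.executemany("INSERT INTO DOC_HEADER VALUES (?, ?, ?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM DOC_HEADER").fetchone()[0]
print(count)  # 2
```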
To measure the speed, I created three Python scripts using...yes...again Bottle...
The basic idea is to join the two tables, select the Document_Id, Year, Area and the sum of Amount.
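On a toy data set, that query looks like this (sqlite3 used here as a neutral stand-in; the MySQL, PostgreSQL and HANA scripts below run the same aggregation over the full 2 million rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DOC_HEADER (DOCUMENT_ID, YEAR, AREA, NAME, LAST_NAME)")
conn.execute("CREATE TABLE DOC_DETAIL (DOCUMENT_ID, YEAR, AREA, AMOUNT)")
conn.execute("INSERT INTO DOC_HEADER VALUES ('1', '2010', '001', 'Anne', 'Hardy')")
conn.executemany("INSERT INTO DOC_DETAIL VALUES (?, ?, ?, ?)",
                 [('1', '2010', '001', 100), ('1', '2010', '001', 250)])

# Join header and detail on the shared key, then aggregate AMOUNT per document.
result = conn.execute("""
    SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
    FROM DOC_HEADER AS A
    INNER JOIN DOC_DETAIL AS B
      ON A.DOCUMENT_ID = B.DOCUMENT_ID
     AND A.YEAR = B.YEAR
     AND A.AREA = B.AREA
    GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA
""").fetchall()
print(result)  # [('1', '2010', '001', 350)]
```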
Let's start with the MySQL Script...
**MySQL_Bottle_Documents.py**

```python
from bottle import get, run
import mysql.connector
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.perf_counter()
    conn = mysql.connector.Connect(host='localhost', user='root',
                                   password='root', database='P075400')
    cur = conn.cursor()
    query = '''SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
               FROM DOC_HEADER AS A INNER JOIN DOC_DETAIL AS B
               ON A.DOCUMENT_ID = B.DOCUMENT_ID
               AND A.YEAR = B.YEAR
               AND A.AREA = B.AREA
               GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.perf_counter()
    time_taken = end - start
    output += "<H1>MySQL - %s records in %s seconds</H1></DIV>" \
              % (counter, time_taken)
    return output

run(host='localhost', port=8080)
```
I let the script run...and after more than one hour...I simply got bored and interrupted the process...
So, I continued with PostgreSQL...
**PostgreSQL_Bottle_Documents.py**

```python
from bottle import get, run
import psycopg2
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.perf_counter()
    conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
    cur = conn.cursor()
    query = '''SELECT "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
               "DOC_HEADER"."AREA", SUM("DOC_DETAIL"."AMOUNT") FROM
               public."DOC_HEADER", public."DOC_DETAIL" WHERE
               "DOC_HEADER"."DOCUMENT_ID" = "DOC_DETAIL"."DOCUMENT_ID"
               AND "DOC_HEADER"."YEAR" = "DOC_DETAIL"."YEAR"
               AND "DOC_HEADER"."AREA" = "DOC_DETAIL"."AREA"
               GROUP BY "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
               "DOC_HEADER"."AREA"'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.perf_counter()
    time_taken = end - start
    output += "<H1>PostgreSQL - %s records in %s seconds</H1></DIV>" \
              % (counter, time_taken)
    return output

run(host='localhost', port=8080)
```
This time...I was lucky...
Out of 2 million records, PostgreSQL managed to aggregate the amount field and generate 7669 records in 36 seconds...not bad at all...
For SAP HANA, I decided to take full advantage of the Calculation Views, so I created the following...
I joined both tables, used a projection, applied the aggregation and specified the result...then I wrote this Python script...
**SAPHANA_Bottle_Documents.py**

```python
from bottle import get, run
import dbapi
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.perf_counter()
    conn = dbapi.connect('hanasvr-02', 30015, 'P075400', '5pA5kb6i')
    cur = conn.cursor()
    try:
        ret = cur.execute("drop type test_out")
    except dbapi.Error:
        pass
    try:
        ret = cur.execute("drop procedure pyTest")
    except dbapi.Error:
        pass
    queries = ['''create type test_out as table (DOCUMENT_ID NVARCHAR(8),
                  YEAR VARCHAR(4), AREA VARCHAR(2), AMOUNT BIGINT)''',
               '''create procedure pyTest (OUT out_test TEST_OUT)
                  language sqlscript reads sql data with result view ProcView as
                  begin
                  out_test = CE_CALC_VIEW("_SYS_BIC"."blag/CV_DOCUMENTS",
                  ["DOCUMENT_ID","YEAR","AREA","AMOUNT"]);
                  end''']
    for query in queries:
        cur.execute(query)
        conn.commit()
    query = '''select DOCUMENT_ID, YEAR, AREA, AMOUNT from ProcView'''
    cur.execute(query)
    ret = cur.fetchall()
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in ret:
        counter += 1
        document_id = str(row["DOCUMENT_ID"])
        year = str(row["YEAR"])
        area = str(row["AREA"])
        amount = str(row["AMOUNT"])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.perf_counter()
    time_taken = end - start
    output += "<H1>SAP HANA - %s records in %s seconds</H1></DIV>" \
              % (counter, time_taken)
    return output

run(host='localhost', port=8080)
```
After the execution...I couldn't be happier...here's the result...
SAP HANA handled the same 2 million records and generated the same 7669 aggregated records in only 18 seconds...that's twice as fast as PostgreSQL and...well...let's just say...way faster than MySQL...
Now...tell me that SAP HANA is not the coolest and fastest database around...I dare you -:)
With that little fix to my SAP HANA Python script, the new processing time, without generating the Bottle table, is...
Greetings,
Blag.




Hey Blag, since you don't have any calculated fields, I suppose you'd get even better results with an Analytic view instead of a calc view. Worth a try!
Henrique:
Good call -:) But you may be surprised...I just ran a new test using Analytical, Calculation and plain SQL...here are the results...
- Calculation view: 1.1289 seconds
- Analytical view: 1.5302 seconds
- Plain SQL: 0.7286 seconds
Greetings,
Blag.