My good friend Pakdi Decnud gave me a great idea while we were having lunch the very same day of the SAP CodeJam Montreal event.
Pakdi told me..."Why don't you make a comparison between SAP HANA and let's say...MongoDB?"...I thought that it was of course a great idea...so yesterday I started exploring MongoDB...so you may ask yourselves..."Why then are you talking about MySQL and PostgreSQL?"
Easy answer...and here are my thoughts...
- I really don't get MongoDB...the whole NoSQL thing is really alien to me...
- MongoDB is "Document" based, meaning that you create collections of documents, not databases or tables...
- MongoDB doesn't support inner joins, and aggregates need a framework that's weirder than MongoDB itself...
- MongoDB is not meant for enterprise applications
That's why I decided to shift gears a bit and grab the two databases most used by start-ups and developers...MySQL and PostgreSQL.
For this blog, I wanted to have a lot of information...so as always, I grabbed my beloved Python and created a little script to generate 1 million records for each of two tables. One script per table.
The structure of the tables is as follows...
DOC_HEADER
| Field Name | Data Type | Length |
|---|---|---|
| DOCUMENT_ID | VARCHAR | 8 |
| YEAR | VARCHAR | 4 |
| AREA | VARCHAR | 2 |
| NAME | VARCHAR | 20 |
| LAST_NAME | VARCHAR | 20 |
DOC_DETAIL
| Field Name | Data Type | Length |
|---|---|---|
| DOCUMENT_ID | VARCHAR | 8 |
| YEAR | VARCHAR | 4 |
| AREA | VARCHAR | 2 |
| AMOUNT | VARCHAR | |
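If you want to recreate the tables yourself, here's a minimal sketch of the DDL (just an assumption on my side...the actual CREATE statements aren't shown in this post, and I'm assuming no keys or indexes):
| Create_Tables.sql (sketch) |
|---|
-- Hypothetical DDL matching the structures above
CREATE TABLE DOC_HEADER (
    DOCUMENT_ID VARCHAR(8),
    YEAR        VARCHAR(4),
    AREA        VARCHAR(2),
    NAME        VARCHAR(20),
    LAST_NAME   VARCHAR(20)
);

CREATE TABLE DOC_DETAIL (
    DOCUMENT_ID VARCHAR(8),
    YEAR        VARCHAR(4),
    AREA        VARCHAR(2),
    AMOUNT      INTEGER  -- listed as VARCHAR above, but SUM() needs a numeric type
);
|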
And here are the scripts to generate the 1 million records in nice .CSV files...
| Doc_Header_Generator.py |
|---|
import random
import csv
names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario",
"Julien", "Mike", "Michael", "Karin", "Rui", "John", "Rocky", "Sebastian",
"Kai-Yin", "Hester", "Katrin", "Uwe", "Vitaliy"]
last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada",
"Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka",
"Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau", "Hilbrecht",
"Staehr", "Kylau", "Rudnytskiy"]
area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
"2008", "2009", "2010", "2011", "2012"]
# Generate pNumber random header records and write them to Doc_Header.csv
def Generate_File(pNumber):
    iNumber = 0
    c = csv.writer(open("Doc_Header.csv", "wb"))  # "wb": this is a Python 2 script
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_names = random.randrange(0, 20)
        r_lastnames = random.randrange(0, 20)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(str(area[r_areas]))
        queries.append(names[r_names])
        queries.append(last_names[r_lastnames])
        c.writerow(queries)

num_files = input("How many records?: \n")
Generate_File(num_files)
|
| Doc_Detail_Generator.py |
|---|
import random
import csv
area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
"2008", "2009", "2010", "2011", "2012"]
# Generate pNumber random detail records and write them to Doc_Detail.csv
def Generate_File(pNumber):
    iNumber = 0
    c = csv.writer(open("Doc_Detail.csv", "wb"))  # "wb": this is a Python 2 script
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        r_amount = random.randrange(0, 10000)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(str(area[r_areas]))
        queries.append(r_amount)
        c.writerow(queries)

num_files = input("How many records?: \n")
Generate_File(num_files)
|
With the two files ready, I uploaded them to MySQL, PostgreSQL and SAP HANA.
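In case you want to load the .CSV files yourself, here's a minimal sketch of the bulk-load commands (an assumption...the post doesn't show how I actually imported the files, the paths are placeholders, and the HANA import needs the file on the server and may vary by version):
| Load_CSV.sql (sketch) |
|---|
-- MySQL: bulk-load from the client machine
LOAD DATA LOCAL INFILE 'Doc_Header.csv'
INTO TABLE DOC_HEADER
FIELDS TERMINATED BY ',';

-- PostgreSQL: server-side bulk copy (or \copy from psql)
COPY "DOC_HEADER" FROM '/path/to/Doc_Header.csv' WITH (FORMAT csv);

-- SAP HANA: server-side CSV import (hypothetical path and schema)
IMPORT FROM CSV FILE '/path/to/Doc_Header.csv' INTO "P075400"."DOC_HEADER";
|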
To measure the speed, I created three Python scripts using...yes...again Bottle...
The basic idea is to join the two tables, select the Document_Id, Year, Area and the sum of Amount.
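In plain, database-agnostic SQL, that idea looks like this (just the shape of the query...each script below adapts it to its own database):
| Benchmark query (generic SQL) |
|---|
-- Join header to detail on the composite key and aggregate the amounts
SELECT H.DOCUMENT_ID, H.YEAR, H.AREA, SUM(D.AMOUNT) AS AMOUNT
FROM DOC_HEADER H
INNER JOIN DOC_DETAIL D
    ON  H.DOCUMENT_ID = D.DOCUMENT_ID
    AND H.YEAR = D.YEAR
    AND H.AREA = D.AREA
GROUP BY H.DOCUMENT_ID, H.YEAR, H.AREA;
|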
Let's start with the MySQL Script...
| MySQL_Bottle_Documents.py |
|---|
from bottle import get, run
import mysql.connector
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = mysql.connector.Connect(host='localhost', user='root',
                                   password='root', database='P075400')
    cur = conn.cursor()
    # Join header and detail on the composite key and aggregate the amounts
    query = '''SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
               FROM DOC_HEADER AS A INNER JOIN DOC_DETAIL AS B
               ON A.DOCUMENT_ID = B.DOCUMENT_ID
               AND A.YEAR = B.YEAR
               AND A.AREA = B.AREA
               GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA'''
    cur.execute(query)
    # Build the HTML result table
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>MySQL - %s records in %s seconds</H1></DIV>" \
              % (counter, time_taken)
    return output

run(host='localhost', port=8080)
|
I let the script run...and after more than one hour...I simply got bored and interrupted the process...
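My guess (and it's only a guess, since the DDL isn't shown here): without indexes on the three join columns, MySQL has a really hard time matching the 2 million rows. A sketch of indexes that would normally help:
| Index sketch (hypothetical) |
|---|
-- Hypothetical: composite indexes on the join key of both tables
CREATE INDEX IDX_DOC_HEADER ON DOC_HEADER (DOCUMENT_ID, YEAR, AREA);
CREATE INDEX IDX_DOC_DETAIL ON DOC_DETAIL (DOCUMENT_ID, YEAR, AREA);
|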
So, I continued with PostgreSQL...
| PostgreSQL_Bottle_Documents.py |
|---|
from bottle import get, run
import psycopg2
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
    cur = conn.cursor()
    # Join header and detail on the composite key and aggregate the amounts
    query = '''SELECT "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
               "DOC_HEADER"."AREA", SUM("DOC_DETAIL"."AMOUNT") FROM
               public."DOC_HEADER", public."DOC_DETAIL" WHERE
               "DOC_HEADER"."DOCUMENT_ID" = "DOC_DETAIL"."DOCUMENT_ID"
               AND "DOC_HEADER"."YEAR" = "DOC_DETAIL"."YEAR"
               AND "DOC_HEADER"."AREA" = "DOC_DETAIL"."AREA"
               GROUP BY "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
               "DOC_HEADER"."AREA"'''
    cur.execute(query)
    # Build the HTML result table
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>PostgreSQL - %s records in %s seconds</H1></DIV>" \
              % (counter, time_taken)
    return output

run(host='localhost', port=8080)
|
This time...I was lucky...
Out of 2 million records, PostgreSQL managed to aggregate the amount field and generate 7669 records in 36 seconds...not bad at all...
For SAP HANA, I decided to take full advantage of Calculation Views, so I created one called CV_DOCUMENTS...
I joined both tables, used a projection, applied the aggregation and specified the result...then I wrote this Python script...
| SAPHANA_Bottle_Documents.py |
|---|
from bottle import get, run
import dbapi  # SAP HANA Python client
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = dbapi.connect('hanasvr-02', 30015, 'P075400', '5pA5kb6i')
    cur = conn.cursor()
    # Drop the table type and procedure if they already exist
    try:
        ret = cur.execute("drop type test_out")
    except dbapi.Error:
        pass
    try:
        ret = cur.execute("drop procedure pyTest")
    except dbapi.Error:
        pass
    # Wrap the Calculation View in a procedure with a result view,
    # so it can be queried with plain SQL
    queries = ['''create type test_out as table (DOCUMENT_ID NVARCHAR(8),
                  YEAR VARCHAR(4), AREA VARCHAR(2), AMOUNT BIGINT)''',
               '''create procedure pyTest (OUT out_test TEST_OUT)
                  language sqlscript reads sql data with result view ProcView as
                  begin
                  out_test = CE_CALC_VIEW("_SYS_BIC"."blag/CV_DOCUMENTS",
                  ["DOCUMENT_ID","YEAR","AREA","AMOUNT"]);
                  end''']
    for query in queries:
        cur.execute(query)
        conn.commit()
    query = '''select DOCUMENT_ID, YEAR, AREA, AMOUNT from ProcView'''
    cur.execute(query)
    ret = cur.fetchall()
    # Build the HTML result table
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in ret:
        counter += 1
        document_id = str(row["DOCUMENT_ID"])
        year = str(row["YEAR"])
        area = str(row["AREA"])
        amount = str(row["AMOUNT"])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year, area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>SAP HANA - %s records in %s seconds</H1></DIV>" \
              % (counter, time_taken)
    return output

run(host='localhost', port=8080)
|
After the execution...I couldn't be happier...here's the result...
SAP HANA managed the same 2 million records...generating the same 7669 aggregated records in only 18 seconds...that's half the time PostgreSQL needed and...well...let's just say...way faster than MySQL...
Now...tell me that SAP HANA is not the coolest and fastest Database around...I dare you -:)
By doing a little fix on my Python script for SAP HANA...measuring only the query, without generating the Bottle table...the processing time drops to around a second...the comment thread below shows the new numbers...
Greetings,
Blag.

2 comments:
Hey Blag, since you don't have any calculated fields, I suppose you'd get even better results with an Analytic view instead of a calc view. Worth a try!
Henrique:
Good call -:) But you may be surprised...I just ran a new test using Analytical, Calculation and plain SQL...here are the results...
1.1289 seconds - Calculation view
1.5302 seconds - Analytic view
0.7286 seconds - plain SQL
Greetings,
Blag.