Blag's bag of rants: Quick SAP HANA and R usecase

lunes, 13 de agosto de 2012

Quick SAP HANA and R usecase

DISCLAIMER: I'm not an SAP HANA expert or an R expert, not even a Python expert. I'm just a guy with a lot of ideas who loves to write blogs.

The other day I was thinking about making some nice with SAP HANA and R, because people doesn't seem to be enough interested in R yet, and that's a real shame...R is just awesome and SAP HANA is more awesome...so bringing them together is...I think you got the idea...

First thing that came into my mind was a survey...first name, last name, country, age, sex and a favourite programming language...of course...I need a lot of records...more that I could generate by hand...so...a Python script came to the rescue...I thought...why I don't take my team's names and last names, countries of origin, made up some ages and other things just to fill up the mix. The script will basically generate random entries that will get loaded into SAP HANA.

Vote_Generator.py
import os import sys import random names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario", "Julien", "Mike", "Michael", "Karin", "Rui", "John", "Rocky", "Sebastian", "Kai-Yin", "Hester", "Katrin", "Uwe", "Vitaliy"] last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada", "Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka", "Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau", "Hilbrecht", "Staehr", "Kylau", "Rudnytskiy"] age = [24, 34, 40, 38, 28, 36, 35, 42, 30, 37] sex = ["M", "F"] country = ["Germany", "France", "Polland", "Peru", "Russia", "USA", "China", "Philippines", "Portugal", "Spain"] vote = ["ABAP", "Node", "Ruby", "Python", "R", "PHP", "ActionScript", "Euphoria", "Java", "C++"] def Generate_File(pSchema, pNumber): iNumber = 0 pathname = os.path.dirname(sys.argv[0]) pathname = os.path.abspath(pathname) + "\HANA_File.txt" myfile = open(pathname, "a") while iNumber < pNumber: r_names = random.randrange(0, 20) r_lastnames = random.randrange(0, 20) r_age = random.randrange(0, 10) r_sex = random.randrange(0, 2) r_country = random.randrange(0, 10) r_vote = random.randrange(0, 10) iNumber += 1 myfile.write("insert into " + pSchema + " values('" + names[r_names] + "','" + last_names[r_lastnames] + "'," + str(age[r_age]) + ",'" + sex[r_sex] + "','" + country[r_country] + "','" + vote[r_vote] + "');\n") myfile.close() print 'The file ' + pathname + ' was written successfully' schema = raw_input("Schema: \n") num_files = input("How many records?: \n") Generate_File(schema, num_files)

Vote_Generator.py

import os
import sys
import random
 
names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario", "Julien",
"Mike", "Michael", "Karin", "Rui", "John", "Rocky", "Sebastian", "Kai-Yin",
"Hester", "Katrin", "Uwe", "Vitaliy"]
last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada",
"Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka",
"Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau", "Hilbrecht",
"Staehr", "Kylau", "Rudnytskiy"]
age = [24, 34, 40, 38, 28, 36, 35, 42, 30, 37]
sex = ["M", "F"]
country = ["Germany", "France", "Polland", "Peru", "Russia", "USA", "China",
"Philippines", "Portugal", "Spain"]
vote = ["ABAP", "Node", "Ruby", "Python", "R", "PHP", "ActionScript",
"Euphoria", "Java", "C++"]
 
def Generate_File(pSchema, pNumber):
    iNumber = 0
    pathname = os.path.dirname(sys.argv[0])
    pathname = os.path.abspath(pathname) + "\HANA_File.txt"
    myfile = open(pathname, "a")
    while iNumber < pNumber:
        r_names = random.randrange(0, 20)
        r_lastnames = random.randrange(0, 20)
        r_age = random.randrange(0, 10)
        r_sex = random.randrange(0, 2)
        r_country = random.randrange(0, 10)
        r_vote = random.randrange(0, 10)
        iNumber += 1
        myfile.write("insert into " + pSchema + " values('" +
        names[r_names] + "','" + last_names[r_lastnames] + "'," +
        str(age[r_age]) + ",'" + sex[r_sex] + "','" + country[r_country] +
        "','" + vote[r_vote] + "');\n")
    myfile.close()
    print 'The file ' + pathname + ' was written successfully'
 
schema = raw_input("Schema: \n")
num_files = input("How many records?: \n")
Generate_File(schema, num_files)

The execution is fair simple...we pass the schema and the number of files we want to generate...for this blog...200,000 records sounded like a good deal.

As you can see...some random funny names are generated, but after all, we're a team so I don't think they care too much.

Next step was to create the table in SAP HANA and load the records.

The main goal of our RHANA code is to have a detailed information on how many women or men of certain age from a certain country voted on the survey. Something like CHINA --> 24 --> F --> 890. (890 women of 24 years old from China voted on the survey). So of course, we need to create a table to handle that information as well.

With those pieces ready...we can code our script...

HANA_R_Votes.sql
DROP TYPE T_VOTES_DETAIL; CREATE TYPE T_VOTES_DETAIL AS TABLE ( COUNTRY VARCHAR(15), AGE CHAR(2), SEX CHAR(1), FREQUENCY INTEGER ); DROP PROCEDURE CALCULATE_VOTES; DROP PROCEDURE GET_VOTES; CREATE PROCEDURE CALCULATE_VOTES(IN votes VOTES, out votes_detail T_VOTES_DETAIL) LANGUAGE RLANG AS BEGIN Country<-votes$COUNTRY Age<-votes$AGE Sex<-votes$SEX Votes<-table(Country, Age, Sex) Stats<-data.frame(Votes) Stats<-subset(Stats,(Stats$Freq>0)) Stats<-Stats[order(Stats$Country,Stats$Age),] votes_detail<-data.frame(COUNTRY=Stats$Country,AGE=Stats$Age,SEX=Stats$Sex,FREQUENCY=Stats$Freq) END; CREATE PROCEDURE GET_VOTES() LANGUAGE SQLSCRIPT AS BEGIN tab_votes = SELECT * FROM VOTES; CALL CALCULATE_VOTES(:tab_votes, T_VOTES_DETAIL); INSERT INTO "VOTES_DETAIL" SELECT * FROM :T_VOTES_DETAIL; END; CALL GET_VOTES(); SELECT * FROM VOTES_DETAIL

HANA_R_Votes.sql

DROP TYPE T_VOTES_DETAIL;
 
CREATE TYPE T_VOTES_DETAIL AS TABLE (
COUNTRY VARCHAR(15),
AGE CHAR(2),
SEX CHAR(1),
FREQUENCY INTEGER
);
 
DROP PROCEDURE CALCULATE_VOTES;
DROP PROCEDURE GET_VOTES;
 
CREATE PROCEDURE CALCULATE_VOTES(IN votes VOTES, out votes_detail T_VOTES_DETAIL)
LANGUAGE RLANG AS
BEGIN
Country<-votes$COUNTRY
Age<-votes$AGE
Sex<-votes$SEX
Votes<-table(Country, Age, Sex)
Stats<-data.frame(Votes)
Stats<-subset(Stats,(Stats$Freq>0))
Stats<-Stats[order(Stats$Country,Stats$Age),]
votes_detail<-data.frame(COUNTRY=Stats$Country,AGE=Stats$Age,SEX=Stats$Sex,FREQUENCY=Stats$Freq)
END;
 
CREATE PROCEDURE GET_VOTES()
LANGUAGE SQLSCRIPT AS
BEGIN
tab_votes = SELECT * FROM VOTES;
CALL CALCULATE_VOTES(:tab_votes, T_VOTES_DETAIL);
INSERT INTO "VOTES_DETAIL" SELECT * FROM :T_VOTES_DETAIL;
END;
 
CALL GET_VOTES();
SELECT * FROM VOTES_DETAIL

When everything is ready, we call the GET_VOTES() procedure which is going to read the 200,000K records from SAP HANA, send them to our R server to process the data and then send it back to SAP HANA to fill up a table with the information we want to take care of.

Yes...the whole process took less that 2 seconds...for 200,000K records...is not bad at all...

We can now select the information from our condensed table...

Now that we have the information we want, we can use SAP HANA analysis capabilities to...analyse our data...

I gotta say...that was pretty easy and fast...might not be the best scenario ever, but I'm sure it will you a better perspective of what can be done with SAP HANA and R.

Greetings,

Blag.

No hay comentarios:

Publicar un comentario