Most of you might not be aware of a feature introduced on SAP HANA SPS5. This new feature is called "Flexible Tables", which means that you can define a table that will grow depending on your needs. Let's see an example...
You define a table with ID, NAME and LAST_NAME. The table works fine, but you realize that you need to also add the PhoneNumber and Address...in a normal situation, you will need to open the definition and add those fields...but using Flexibles Tables, you will need to only add those fields as part of the INSERT query and let SAP HANA do it's magic...
Of course, that scenario is very unlikely to happen, because for a couple of fields, it doesn't make sense...so...where do we use this Flexible Tables? Enterprise Search, were we need to have all products on one table, and this table can have a really big amount of columns...as different products will have different characteristics...
So...for this blog...I really break my head trying to find a simple scenario that could cover Flexible Tables...what I came up with? SAP HANA, Flexible Tables, R and Twitter...
The code is very complex, so I'm not going to explain line by line how it works...but of course, I'm going to give a nice overview...
The Twitter API (I'm using version 1 even when it's deprecated, simply because version 1.1 deals with Authentication and didn't want to spend too much time on that...) allows us to fetch information from Twitter...so in this case I was interested in the Hashtags...the ones that starts with an "#" and are used to identify and organize certain tweets related to an event, technology or famous person. (I'm using only Tweets, not taking Retweets into account)...
Using R, I read the User Timeline to get the most recent 200 tweets from a particular account. This information will be send back to SAP HANA to be stored.
With the 200 tweets, I extract all the Hashtags, summarize them and then save the information both in the final table and in a intermediate table (to be used for the next user). In this intermediate table, I will store a long string with all the Hashtags separated by a comma.
When the next user arrives, all 200 tweets are read, Hashtags extracted, combine with the one saved in the intermediate table, summarized (This is very important because we want to keep track of the previous Hashtags, both the ones that are common to both users and the ones that only exist on the first or second user) and the information will be saved in the final table and the information from the next user will be save (replacing the previous one) as a long string in the intermediate table.
Why I need to this? Simple...let's say that the first user has 3 Hashtags...#SAPHANA, #R and #Python with values 3, 2 and 1. The next user will have 5 Hashtags...#SAPHANA, #SAP, #Ruby, #IPhone and #Android with the values 2, 5, 1, 4 and 3.
When we store the first user we will have:
UserName | SAPHANA | R | Python |
---|---|---|---|
First_User | 3 | 2 | 1 |
When we store the second user, we will have...
UserName | SAPHANA | R | Python | SAP | Ruby | Android | IPhone |
---|---|---|---|---|---|---|---|
First_User | 3 | 2 | 1 | ? | ? | ? | ? |
Second_User | 3 | 2 | 1 | 5 | 1 | 4 | 3 |
Now...you may wonder...why R has "0" for the next user and SAP has "?" for the first user? Easy...as you can see...as we added more fields (at runtime) the table grow...the R field was already there for the first user so it got a "0" for the next user, however SAP wasn't there before, so we don't actually know what should the value for the first user...so a "?" will be in place...
I'm sure you will have a better picture when you see the images of the table after I show the source code...
First, we need to create a table called "TWITTER_USERS", that will hold the users that we want to work with...
Then, we need another table were we are going to store the Hashtags and its values as a long string. This table will be called "FIRST_HASH".
Now, things get interesting, as we're going to create our Flexible Table using a very simple command...this table will be called "TWITTER_HASHTAGS".
Twitter_Hashtags.sql |
---|
CREATE COLUMN TABLE TWITTER_HASHTAGS( USERNAME NVARCHAR(10) ) WITH SCHEMA FLEXIBILITY; |
This table will look pretty regular when watching its definition...but it's a Flexible Table...as you can see...we only defined one field...so this table can grow and grow and grow -;)
Next, we need to create a couple of type tables to allow to interact between SAP HANA and R...
Table_Types.sql |
---|
CREATE TYPE T_COL_NAMES AS TABLE( COL_NAMES NVARCHAR(1000) ); CREATE TYPE T_COL_VALUES AS TABLE( COL_VALUES NVARCHAR(1000) ); |
And now...we're ready to start with the code...one R procedure and two SQLScript procedures...
Get_Hashtags.sql |
---|
CREATE PROCEDURE GET_HASHTAGS(IN twittername TWITTER_USERS,IN first_hash FIRST_HASH, OUT out_col_names T_COL_NAMES, OUT out_col_values T_COL_VALUES) LANGUAGE RLANG AS BEGIN UserName = twittername$USERNAME hashline = first_hash$HASH_LINE hashvalues = first_hash$HASH_VALUES Get_Twitter<-function(p_source,p_pattern){ datalines = grep(p_pattern,web_page,value=TRUE) getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1) g_list = gregexpr(p_pattern,datalines) matches = mapply(getexpr,datalines,g_list) result = gsub(p_pattern,'\\1',matches) names(result) = NULL return(result) } Get_Hashtags<-function(p_source){ check<-!length(grep('\\"([^,\\"]+)\\"', as.character(p_source))) if(!check){ mypattern = '\\"([^,\\"]+)\\"' datalines = grep(mypattern,p_source,value=TRUE) getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1) g_list = gregexpr(mypattern,datalines) matches = mapply(getexpr,datalines,g_list) result = gsub(mypattern,'\\1',matches) names(result) = NULL return(result) }else{ result<-p_source return(result) } } url<-paste("http://api.twitter.com/1/statuses/user_timeline.xml?count=200&screen_name=",UserName,sep="") mypattern = '<text>([^<]*)</text>' web_page<-readLines(url) tweets<-Get_Twitter(web_page,mypattern) mypattern = '[^\\&]#(\\.?\\w+)' hash_list<-Get_Twitter(tweets,mypattern) hashtags<-sapply(hash_list,Get_Hashtags) hashtags<-as.vector(unlist(hashtags)) hashtags<-toupper(hashtags) dt.hashtags<-data.frame(UserName,hashtags) tab.hashtags<-table(dt.hashtags) dt.hashtags<-as.data.frame.matrix(tab.hashtags) hashtags_names<-names(dt.hashtags) hashtags_names<-gsub("^\\.",'',hashtags_names) if(length(hashline>=1)){ hash_line<-gsub("^(\\w)+\\,",'',hashline) hash_line<-unlist(strsplit(hash_line, split=",")) hash_values<-gsub("^(\\'+\\w+\\')+\\,",'',hashvalues) hash_values<-as.numeric(unlist(strsplit(hash_values, split=","))) hash_frame<-data.frame(names=hash_line,values=hash_values) hash_frame["values"]<-0 Col_Names<-"" Col_Values<-"" for(i in 1:length(hashtags_names)){ Col_Names<-paste(Col_Names,hashtags_names[i],sep=",") Col_Values<-paste(Col_Values,dt.hashtags[,i],sep=",") } Col_Names<-gsub("^\\,|\\.",'',Col_Names) Col_Values<-gsub("^\\,|\\.",'',Col_Values) Col_Names<-unlist(strsplit(Col_Names, split=",")) Col_Values<-as.numeric(unlist(strsplit(Col_Values, split=","))) new_hash_frame<-data.frame(names=Col_Names,values=Col_Values) new_hash_frame<-rbind(hash_frame,new_hash_frame) new_hash_frame<-aggregate(values ~ names, FUN = "sum", data = new_hash_frame) new_hash_names<-new_hash_frame$names new_hash_values<-new_hash_frame$values Col_Names<-"USERNAME" Col_Values<-paste("'",UserName,"'",sep="") for(i in 1:length(new_hash_names)){ Col_Names<-paste(Col_Names,new_hash_names[i],sep=",") Col_Values<-paste(Col_Values,new_hash_values[i],sep=",") } }else{ Col_Names<-"USERNAME" Col_Values<-paste("'",UserName,"'",sep="") for(i in 1:length(hashtags_names)){ Col_Names<-paste(Col_Names,hashtags_names[i],sep=",") Col_Values<-paste(Col_Values,dt.hashtags[,i],sep=",") } } col_names<-gsub("^\\,\\.?",'',Col_Names) col_values<-gsub("^\\,",'',Col_Values) out_col_names<-data.frame(COL_NAMES=col_names) out_col_values<-data.frame(COL_VALUES=col_values) END; |
Save_Hashtags.sql |
---|
CREATE PROCEDURE SAVE_HASHTAGS(IN in_col_names T_COL_NAMES, IN in_col_values T_COL_VALUES) LANGUAGE SQLSCRIPT AS v_select VARCHAR(2000); v_col_names_char NVARCHAR(1000); v_col_values_char NVARCHAR(1000); CURSOR c_cursor1 FOR SELECT COL_NAMES FROM :in_col_names; CURSOR c_cursor2 FOR SELECT COL_VALUES FROM :in_col_values; BEGIN OPEN c_cursor1; FETCH c_cursor1 into v_col_names_char; CLOSE c_cursor1; OPEN c_cursor2; FETCH c_cursor2 into v_col_values_char; CLOSE c_cursor2; DELETE FROM FIRST_HASH; INSERT INTO FIRST_HASH VALUES(:v_col_names_char,:v_col_values_char); v_select := 'INSERT INTO TWITTER_HASHTAGS (' || v_col_names_char || ') VALUES (' || v_col_values_char || ')'; EXEC v_select; END; |
Get_Twitter_Users.sql |
---|
CREATE PROCEDURE GET_TWITTER_USERS(UserName NVARCHAR(10)) LANGUAGE SQLSCRIPT AS BEGIN Twitter_Users = SELECT USERNAME FROM TWITTER_USERS WHERE USERNAME = :UserName; First_Hash = SELECT HASH_LINE, HASH_VALUES FROM FIRST_HASH; CALL GET_HASHTAGS(:Twitter_Users,:First_Hash,T_COL_NAMES,T_COL_VALUES); CALL SAVE_HASHTAGS(:T_COL_NAMES,:T_COL_VALUES); END; |
In order for this to work, we need to insert some values on our "TWITTER_USERS" table...
And then, simply call the "GET_TWITTER_USERS" procedure...
Call_Get_Twitter_Users.sql |
---|
CALL GET_TWITTER_USERS('Blag'); CALL GET_TWITTER_USERS('Schmerdy'); CALL GET_TWITTER_USERS('ggread'); |
When we execute the first call...that's it with user @Blag we will have the following on the "FIRST_HASH" table...
And this on our "TWITTER_HASHTAGS" Flexible Table...
As you can see...our table started only with USERNAME...but as we pass in the Hashtags and its values...the table grew to able to hold them...
When we call the next user...that's @Schmerdy we will have this on the Flexible Table...
As you can see...in all the Hashtags that belong to @Blag but doesn't belong to @Schmerdy we have a "0" value...so what will happen to the ones that belong to @Schmerdy but not to @Blag?
Those field will have a "?" value, as they didn't exits before we add them...and again...the table grew to hold all the new fields....
Now...something interesting is that @Schmerdy had more Hashtags than @Blag...so what will happen when we call the last user which is @ggread that by the way...has less Hashtags than @Schmerdy and @Blag...
@ggread will have a value "0" in all the Hashtags that doesn't belong to the user...but will have a value in the one that are similar to @Schmerdy...
So...what will happen with the Hashtags that belongs to @ggread but doesn't exist on @Schmerdy or @Blag? Easy...they will be added and some "?" value will be placed where those Hashtags didn't exist before...
I wish I could put the whole table...but it contains more than 50 columns...so better...I can export them to a .CSV file...and do some analysis using Visual Intelligence...
Here, we can see how often these three user have used the Hashtags #SAPHANA, #SAP and #SAPTECHED in their last 200 tweets...
So...that's it...a nice and simple way to demonstrate how the Flexible Tables work in SAP HANA by using my always beloved R -:)
Greetings,
Blag.
2 comentarios:
Thanks for sharing Blag. Was designing something similar, but it turns out the pivot was a bot! Will try this out some day.
Arun:
I'm glad you like it -:)
Greetings,
Blag.
Publicar un comentario