Monday, June 29, 2015

Exploring SparkR

A colleague from work asked me to look into Spark and R. So the most obvious thing to do was to look into SparkR -;)

I installed Scala, Hadoop, Spark and SparkR...not sure Hadoop is needed for this...but I wanted to have the full picture -:)

Anyway...I came across a piece of code that reads lines from a file and counts how many lines contain an "a" and how many lines contain a "b"...

For this code I used the lyrics of Girls Not Grey by AFI...

SparkR.R
library(SparkR)

start.time <- Sys.time()
# Start a local Spark context
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
# textFile and filterRDD are private SparkR functions, hence the ::: operator
logData <- SparkR:::textFile(sc, logFile)
# Count the lines that contain an "a" and the lines that contain a "b"
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken



0.3167355 seconds...pretty fast...I wonder how regular R will behave?

PlainR.R
library("stringr")

start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
logfile<-read.table(logFile,header = F, fill = T)
logfile<-apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df<-data.frame(lines=logfile)
a<-sum(apply(df,1,function(x) grepl("a",x)))
b<-sum(apply(df,1,function(x) grepl("b",x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken


Nice...0.01522398 seconds...wait...what? Isn't Spark supposed to be pretty fast? Well...I remembered that I read somewhere that Spark shines with big files...

Well...I prepared a file with 5 columns and 1 million records...let's see how that goes...
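For reference, a file like that can be generated with something along these lines (just a hypothetical sketch with made-up column values...the actual contents of Doc_Header.csv aren't shown in this post)...

Generate_Test_File.R
# Hypothetical sketch: build a CSV with 5 columns and 1 million records
set.seed(42)
n <- 1000000
test_data <- data.frame(
  col1 = sample(letters, n, replace = TRUE),
  col2 = sample(LETTERS, n, replace = TRUE),
  col3 = sample(1:1000, n, replace = TRUE),
  col4 = paste0("item", seq_len(n)),
  col5 = sample(c("alpha", "beta", "gamma"), n, replace = TRUE)
)
write.table(test_data, "/home/blag/R_Codes/Doc_Header.csv",
            sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)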

SparkR.R
library(SparkR)

start.time <- Sys.time()
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logData <- SparkR:::textFile(sc, logFile)
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken



26.45734 seconds for a million records? Nice job -:) Let's see if plain R wins again...

PlainR.R
library("stringr")

start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logfile<-read.csv(logFile,header = F)
logfile<-apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df<-data.frame(lines=logfile)
a<-sum(apply(df,1,function(x) grepl("a",x)))
b<-sum(apply(df,1,function(x) grepl("b",x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken


48.31641 seconds? Looks like Spark was almost twice as fast this time...and this is a pretty simple example...I'm sure that when complexity arises...the gap will be even bigger...

And sure...I know that a lot of people can take my plain R code and make it even faster than Spark...but...this is my blog...not theirs -;)

I will come back as soon as I learn more about SparkR -:D

UPDATE

So...I got a couple of comments claiming that read.csv() is too slow...and that I should be measuring the process, not the loading of a csv file...while I don't agree...because everything is part of the process...I did something as simple as moving the start.time to after the csv file is read...let's see how much of a change this brings...

SparkR

Around 1 second faster...which means that reading the csv was really efficient...
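The timing change looked roughly like this (just a sketch of the modification described above...note that textFile is lazy, so most of the actual reading still happens inside the timed count() calls)...

SparkR.R
library(SparkR)

sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logData <- SparkR:::textFile(sc, logFile)
# Timing now starts after the context creation and the file setup
start.time <- Sys.time()
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken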


Plain R

Around 6 seconds faster...read.csv is not that good...but...SparkR is almost 50% faster...
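And the plain R side, again just a sketch of the change (start.time moved to after read.csv)...

PlainR.R
library("stringr")

logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logfile <- read.csv(logFile, header = F)
# Timing now starts after the csv has been read into memory
start.time <- Sys.time()
logfile <- apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df <- data.frame(lines=logfile)
a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken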

HOLY CRAP UPDATE!

Markus from Spain gave me this code in the comments...I just added a couple of things to make it compliant...but...damn...I wish I could code like that in R! -:D Thanks Markus!!!

Markus's code
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
# The whole file is read once into a character vector; only the counting is timed
lines <- readLines(logFile)
start.time <- Sys.time()
# grepl works on the whole vector at once, and fixed=TRUE skips regex matching
a<-sum(grepl("a", lines, fixed=TRUE))
b<-sum(grepl("b", lines, fixed=TRUE))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken


Simply...superb! -:)

Greetings,

Blag.
Development Culture.

6 comments:

Eduardo said...

Greetings!

I believe that most of what your code is actually profiling is read.csv. Swap out the call with a call to data.table's fread or readr's read_csv, and let's see what the benchmark shows! Optionally, start timing after data ingestion.

I don't believe you are measuring the complexity of the transforms so much as you are measuring the speed at which the most basic data import function works. I could be wrong, but I would love to see if that's the case!

Anonymous said...

The way you measure it, most of the time will be reading and parsing the CSV, which is not what you want. Also read.csv is the slowest CSV reader on Earth, try data.table::fread instead, about 100x faster.

Alvaro "Blag" Tejada Galindo dijo...

Thanks for your comments...this post's aim is not to do a benchmark on SparkR versus R...as I'm simply learning and exploring...

That said...sure...I can replace read.csv with data.table::fread and post an update -:)

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo dijo...

Ok...just replaced read.csv with this...

logfile<-data.table::fread(logFile,header = F, stringsAsFactors=FALSE)

Execution time was basically the same...if you guys can post better code...you're more than welcome -:)

Greetings,

Blag.
Development Culture.

Anonymous said...

Hi, I'm Markus from Spain.
Could you try this?


start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
lines <- readLines(logFile)
sum(grepl("a", lines, fixed=TRUE))
sum(grepl("b", lines, fixed=TRUE))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Thanks! ;)

Alvaro "Blag" Tejada Galindo dijo...

Markus! Damn you're good! -:D Just updated the post...less than 1 second? Awesome...maybe I should get back and learn R again -:)

Greetings,

Blag.
Development Culture.