I installed Scala, Hadoop, Spark and SparkR...not sure Hadoop is needed for this...but I wanted to have the full picture -:)
Anyway...I came across a piece of code that reads lines from a file and counts how many lines contain an "a" and how many contain a "b"...
For this code I used the lyrics of Girls Not Grey by AFI...
SparkR.R

library(SparkR)
start.time <- Sys.time()
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
logData <- SparkR:::textFile(sc, logFile)
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
PlainR.R

library("stringr")
start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
logfile <- read.table(logFile, header = F, fill = T)
logfile <- apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df <- data.frame(lines=logfile)
a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Nice...0.01522398 seconds for plain R...wait...what? Isn't Spark supposed to be pretty fast? Well...I remembered that I read somewhere that Spark shines with big files...
Well...I prepared a file with 5 columns and 1 million records...let's see how that goes...
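If you want to build a similar file to play along, something like this would do...keep in mind the columns here are just random letters, so this is only a sketch and not my actual Doc_Header.csv...

set.seed(42)
n <- 1e6
# five columns of random letters, purely as filler data (hypothetical, not the real file)
df <- data.frame(replicate(5, sample(c(letters, LETTERS), n, replace = TRUE)))
# no header row, to match the header = F used when reading it back
write.table(df, "/home/blag/R_Codes/Doc_Header.csv", sep = ",", row.names = FALSE, col.names = FALSE)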
SparkR.R

library(SparkR)
start.time <- Sys.time()
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logData <- SparkR:::textFile(sc, logFile)
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
26.45734 seconds for a million records? Nice job -:) Let's see if plain R wins again...
PlainR.R

library("stringr")
start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logfile <- read.csv(logFile, header = F)
logfile <- apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df <- data.frame(lines=logfile)
a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
48.31641 seconds? Looks like Spark was almost twice as fast this time...and this is a pretty simple example...I'm sure that as the complexity grows, the gap gets even bigger...
And sure...I know that a lot of people can take my plain R code and make it even faster than Spark...but...this is my blog...not theirs -;)
I will come back as soon as I learn more about SparkR -:D
UPDATE
So...I got a couple of comments claiming that read.csv() is too slow...and that I should be measuring the process, not the loading of a csv file...while I don't agree...because everything is included in the process...I did something as simple as moving start.time to after the csv file is read...let's see how much of a change this brings...
SparkR
Around 1 second faster...which means that reading the csv was really efficient...
Plain R
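The change on the plain R side looks like this...basically the same PlainR.R from above, just with the clock starting after read.csv() is done...

library("stringr")
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
# load the CSV first...
logfile <- read.csv(logFile, header = F)
# ...and only then start the clock
start.time <- Sys.time()
logfile <- apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df <- data.frame(lines=logfile)
a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken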
HOLY CRAP UPDATE!
Markus from Spain gave me this code in the comments...I just added a couple of things to make it compliant with my original example...but...damn...I wish I could code like that in R! -:D Thanks Markus!!!
Markus's code

logFile <- "/home/blag/R_Codes/Doc_Header.csv"
lines <- readLines(logFile)
start.time <- Sys.time()
a <- sum(grepl("a", lines, fixed=TRUE))
b <- sum(grepl("b", lines, fixed=TRUE))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Simply...superb! -:) Reading the whole file once with readLines() and letting grepl() scan the vector in one go beats my apply() loops by a mile...
Greetings,
Blag.
Development Culture.
6 comments:
Greetings!
I believe that most of what your code is actually profiling is read.csv. Swap out the call with a call to data.table's fread or readr's read_csv, and let's see what the benchmark shows! Optionally, start timing after data ingestion.
I don't believe you are measuring the complexity of the transforms so much as you are measuring the speed at which the most basic data import function works. I could be wrong, but I would love to see if that's the case!
The way you measure it, most of the time will be reading and parsing the CSV, which is not what you want. Also read.csv is the slowest CSV reader on Earth, try data.table::fread instead, about 100x faster.
Thanks for your comments...this post's aim is not to do a benchmark on SparkR versus R...as I'm simply learning and exploring...
That said...sure...I can change read.csv by data.table::fread and post an update -:)
Greetings,
Blag.
Development Culture.
Ok...just replaced read.csv with this...
logfile<-data.table::fread(logFile,header = F, stringsAsFactors=FALSE)
Execution time was basically the same...if you guys can post a better code...you're more than welcome -:)
Greetings,
Blag.
Development Culture.
Hi, I'm Markus from Spain.
Could you try this?
start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
lines <- readLines(logFile)
sum(grepl("a", lines, fixed=TRUE))
sum(grepl("b", lines, fixed=TRUE))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Thanks! ;)
Markus! Damn you're good! -:D Just updated the post...less than 1 second? Awesome...maybe I should get back and learn R again -:)
Greetings,
Blag.
Development Culture.