I installed Scala, Hadoop, Spark and SparkR...not sure Hadoop is needed for this...but I wanted to have the full picture -:)
Anyway...I came across a piece of code that reads lines from a file and counts how many lines contain an "a" and how many contain a "b"...
For this code I used the lyrics of Girls Not Grey by AFI...
**SparkR.R**

```r
library(SparkR)

start.time <- Sys.time()

# Start a local Spark context (old AMPLab SparkR API; textFile() and
# filterRDD() are internal functions, hence the triple colon)
sc <- sparkR.init(master = "local")

logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
logData <- SparkR:::textFile(sc, logFile)

# Count the lines containing an "a" and the lines containing a "b"
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep = "")

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```
**PlainR.R**

```r
library("stringr")  # loaded in the original, though grepl() below is base R

start.time <- Sys.time()

logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
# read.table() splits each line into columns, so glue them back into full lines
logfile <- read.table(logFile, header = F, fill = T)
logfile <- apply(logfile[,], 1, function(x) paste(x, collapse = " "))
df <- data.frame(lines = logfile)

# Check every line for an "a" and a "b", one row at a time
a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep = "")

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```
Nice...0.01522398 seconds...wait...what? Isn't Spark supposed to be pretty fast? Well...I remembered that I read somewhere that Spark shines with big files...on a file this tiny, just spinning up the Spark context costs more than the actual work...
Well...I prepared a file with 5 columns and 1 million records...let's see how that goes...
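In case you want to follow along...here's roughly how such a file could be generated in R. This is just a sketch...the column contents are made up; only the shape (5 columns, a million rows, no header) matches what I used:

```r
# Hypothetical test-data generator -- the real Doc_Header.csv may differ
set.seed(42)
n <- 1000000

df <- data.frame(
  V1 = sample(letters, n, replace = TRUE),
  V2 = sample(LETTERS, n, replace = TRUE),
  V3 = sample(1:1000, n, replace = TRUE),
  V4 = sample(c("alpha", "beta", "gamma"), n, replace = TRUE),
  V5 = runif(n)
)

# No header row, to match the header = F reads below
write.table(df, "/home/blag/R_Codes/Doc_Header.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)
```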
**SparkR.R**

```r
library(SparkR)

start.time <- Sys.time()

# Same script as before, now pointed at the million-record CSV
sc <- sparkR.init(master = "local")

logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logData <- SparkR:::textFile(sc, logFile)

numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep = "")

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```
26.45734 seconds for a million records? Nice job -:) Let's see if plain R wins again...
**PlainR.R**

```r
library("stringr")

start.time <- Sys.time()

logFile <- "/home/blag/R_Codes/Doc_Header.csv"
# read.csv() parses the file into 5 columns; paste them back into single lines
logfile <- read.csv(logFile, header = F)
logfile <- apply(logfile[,], 1, function(x) paste(x, collapse = " "))
df <- data.frame(lines = logfile)

a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep = "")

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```
48.31641 seconds? Looks like Spark was almost twice as fast this time...and this is a pretty simple example...I'm sure that as the complexity grows...the gap gets even bigger...
And sure...I know that a lot of people can take my plain R code and make it even faster than Spark...but...this is my blog...not theirs -;)
I will come back as soon as I learn more about SparkR -:D
UPDATE
So...I got a couple of comments claiming that read.csv() is too slow...and that I should be measuring the process, not the loading of a CSV file...while I don't agree...because everything is part of the process...I did something as simple as moving start.time to after the CSV file is loaded...let's see how much of a change this brings...
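For the plain R script the tweak looks like this (my reconstruction of the change...the SparkR script got the equivalent treatment):

```r
library("stringr")

logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logfile <- read.csv(logFile, header = F)   # loading now happens before the timer

start.time <- Sys.time()                   # the clock starts here instead

logfile <- apply(logfile[,], 1, function(x) paste(x, collapse = " "))
df <- data.frame(lines = logfile)
a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep = "")

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```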
**SparkR**

Around 1 second faster...which means that reading the CSV was really efficient...

**Plain R**
HOLY CRAP UPDATE!
Markus from Spain gave me this code in the comments...I just added a couple of things to make it compliant...but...damn...I wish I could code like that in R! -:D Thanks Markus!!!
**Markus's code**

```r
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
lines <- readLines(logFile)  # read the raw lines; no column parsing needed

start.time <- Sys.time()

# One vectorized pass over all lines; fixed = TRUE means a literal
# match instead of a regular expression
a <- sum(grepl("a", lines, fixed = TRUE))
b <- sum(grepl("b", lines, fixed = TRUE))
paste("Lines with a: ", a, ", Lines with b: ", b, sep = "")

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```
Simply...superb! -:)
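If you're wondering where the speed comes from: grepl() is vectorized, so it scans the whole character vector in one pass at C level, while my apply() version called an R function once per row...and fixed = TRUE does a literal match, skipping the regex engine. A tiny sketch of the difference (made-up sample data...timings will vary by machine):

```r
# A million made-up lines, just to compare the two styles
lines <- rep(c("abc def ghi", "xyz uvw rst"), 500000)

# Row by row: one R function call per line (what my apply() version did)
system.time(sum(sapply(lines, function(x) grepl("a", x))))

# Vectorized: a single grepl() call over the whole vector (Markus's way)
system.time(sum(grepl("a", lines, fixed = TRUE)))
```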
Greetings,
Blag.
Development Culture.