Friday, May 16, 2014

Julia versus R - Playing around

So...as time goes by, I'm getting more proficient with Julia...which is fairly easy, as the learning curve is pretty gentle...

I decided to load a file with 590,209 records that I got from Freebase...the file in question contains Actors and Actresses from movies...you can have a quick look here...


For this test, I'm using my Linux box on VMware with 2 GB of RAM...running Ubuntu 12.04.4 (Precise)

For R, I'm not using any special package...just plain R, version 2.14.1...and for Julia, version 0.2.1, I'm using the DataFrames package...

Let's take a look at the R source code first along with its runtime processing...

Actors_Info.R
start.time <- Sys.time()
if(!exists("Actors")){
  Actors<-read.csv("Actors_Table.csv", header=TRUE,
                   stringsAsFactors=FALSE, colClasses="character", na.strings = "")
}
Actors<-unique(Actors)
Actors<-Actors[complete.cases(Actors),]
Actor_Info<-data.frame(Actor_Id=Actors$Actor_Id,Name=Actors$Name,Gender=Actors$Gender)
Actor_Info<-Actor_Info[order(Actor_Info$Gender),]
write.csv(Actor_Info,"Actor_Info_R.csv",row.names=TRUE)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

This script first checks whether the file has already been loaded...if not, it loads it...then it eliminates the repeated records, deletes all rows with null or NA values, creates a new Data Frame, sorts it by "Gender" and writes a new CSV file...the time is taken to measure its speed...we will run it twice...the first time the file is not loaded...the second time it is...and that should greatly improve the execution time...
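To make those steps concrete, here's a tiny sketch of the same pipeline on a made-up five-row data frame...the rows are invented for illustration only and are not taken from the Freebase file, and nothing is read from or written to disk...

```r
# Made-up rows for illustration only (not from the Freebase file)
Actors <- data.frame(
  Actor_Id = c("1", "2", "2", "3", "4"),
  Name     = c("Ann", "Bob", "Bob", NA, "Dee"),
  Gender   = c("Female", "Male", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

Actors <- unique(Actors)                    # drops the repeated "Bob" row
Actors <- Actors[complete.cases(Actors), ]  # drops the row whose Name is NA

# Same construction as in the script: new data frame, then sort by Gender
Actor_Info <- data.frame(Actor_Id = Actors$Actor_Id, Name = Actors$Name,
                         Gender = Actors$Gender, stringsAsFactors = FALSE)
Actor_Info <- Actor_Info[order(Actor_Info$Gender), ]
print(Actor_Info)
```

Three rows survive in this toy case: the duplicate and the incomplete row are gone, and the "Female" rows come before the "Male" one.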



As we can see...the times are really good...and the difference between the first and second run is pretty obvious...for the record...the generated file contains 105874 records...

Now...let's see the Julia version of the code...

Actors_Info.jl
using DataFrames
start = time()
isdefined(:Actors) || (Actors = readtable("Actors_Table.csv", header=true, nastrings=["","NA"]))
drop_duplicates!(Actors)
complete_cases!(Actors)
Actor_Info = DataFrame(Actor_Id=Actors["Actor_Id"],Name=Actors["Name"],Gender=Actors["Gender"])
sortby!(Actor_Info, [:Gender])
writetable("Actor_Info_Julia.csv", Actor_Info)
finish = time()
println("Time: ", finish-start)


Here...we're doing the same...we load the DataFrames package (but exclude that from the execution time), check if the file is loaded so we don't load it again on the second run...eliminate duplicates, delete all rows with null or NA values, create a new DataFrame, sort it by "Gender" and finally write a new CSV file...


Well...the difference between the first and second run is very significant...but of course...way slower than R...

But...let me tell you one simple thing...Julia is still a brand new language...the DataFrames package is not part of the core Julia language, which means...that it's even newer...and optimizations are being performed as we speak...I would say that for a young language...18 seconds to process 590,209 records is pretty awesome...and of course...my R experience greatly surpasses my Julia experience...

So...I don't really want to leave you with the impression that Julia is not good or not fast enough...because believe me...it is...and you're going to love my next experiment -;)

Let's take a look at the R source code first...

Random_Names.R
start.time <- Sys.time()
names<-c("Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
         "Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven")

last_names<-c("Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
              "Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau",
              "Fenlon","Flynn","Taylor")
full_names<-c()
for(i in 1:100000){
  name<-sample(1:15, 1)
  last_name<-sample(1:15, 1)
  full_name<-paste(names[name],last_names[last_name],sep=" ")
  full_names<-append(full_names,full_name)
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

So this code is fairly simple...we have a couple of vectors with names and last names...then we loop 100000 times, and on each pass we generate a couple of random numbers to index into the vectors, create a full name and append it to a new vector...giving us some funny random name combinations...



Well...the difference between both runs is not really good...the second run took a little bit longer...and 1 minute is kind of a lot...let's see how Julia behaves...

Here's the Julia source code...

Random_Numbers.jl
start = time()
names=["Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
       "Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven"]
last_names=["Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
            "Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau","Fenlon","Flynn","Taylor"]
full_names=String[]
full_name = ""
for i = 1:100000
        name=rand(1:15)
        last_name=rand(1:15)
        full_name = names[name] * " " * last_names[last_name]
        push!(full_names,full_name)
end
finish = time()
println("Time: ", finish-start)

So this code likewise creates two arrays with names and last names, loops 100000 times, generates a couple of random numbers, combines a name with a last name and then populates a new array with the mixed full names...


Just like in the R code...the second run took Julia a little bit longer...but...less than a second?! That's something like...amazingly fast and really took R by storm...

Now...I believe you will start to take Julia more seriously -:D

Hope you liked this blog...

Greetings,

Blag.
Development Culture.

26 comments:

Stephen Henderson said...

That would be a poor way to do this even in C. And it is completely opposite to the design and principles of R code. Here it is vectorised and running in ~86 nanoseconds

library(microbenchmark)
fun1=function()
{
names<-c("Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
"Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven")

last_names<-c("Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
"Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau",
"Fenlon","Flynn","Taylor")
full_names = rep(NA, 200000)
full_names[seq(1,200000, 2)] = sample(names, 100000, replace=T)
full_names[seq(2,200000, 2)] = sample(last_names, 100000, replace=T)
}
microbenchmark(fun1)

#Unit: nanoseconds
#expr min lq median uq max neval
#fun1 72 83 86 87 1696 100
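
A caveat on reading that figure, offered as a sketch rather than a verdict: microbenchmark(fun1) times evaluating the name fun1, not a call to fun1(), so the function body never runs. The same effect can be seen with base R's system.time; the function slow below is hypothetical and only sleeps for 0.2 seconds.

```r
# Hypothetical function that takes a noticeable amount of time
slow <- function() {
  Sys.sleep(0.2)
  "done"
}

t_symbol <- system.time(slow)[["elapsed"]]    # only looks up the name; body not run
t_call   <- system.time(slow())[["elapsed"]]  # actually calls the function

print(c(symbol = t_symbol, call = t_call))
```

Timing the call itself, e.g. microbenchmark(fun1()), would measure the function body instead.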

Owe said...

I suppose the second example is meant as a joke, as you are using the widely known worst way to get that vector of names (1st or 2nd ring of the R Inferno, I think). Just for the record, using vectors

nameVec<-sample(names, 100000, replace = T)
last_nameVec <-sample(last_names, 100000, replace = T)
full_name<-paste(nameVec,last_nameVec,sep=" ")


my result is

Time difference of 0.05203414 secs,

compared to a time for the loop of
Time difference of 34.8261 secs

Gergely Daróczi said...

This is of course slow in R, I mean e.g. using "append" 1e5 times is an extremely bad idea. As no one would ever do such a thing in R, IMHO it's not okay to use it in such a direct comparison.

Try e.g.:

system.time(replicate(1e5, paste(sample(names, 1), sample(last_names, 1))))
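
A rough way to see where the time goes, as a sketch under stated assumptions: the helper names grow, prealloc and vectorised are made up for this illustration, the name lists are shortened, and N is reduced to 20000 so the append version finishes quickly. Each call to append builds a new, longer vector, so the growing loop copies more and more data as it runs.

```r
# Shortened name lists and a smaller N (assumptions for a quick run)
names      <- c("Anne", "Gigi", "Blag", "Juergen", "Marek")
last_names <- c("Hardy", "Read", "Tejada", "Schmerder", "Kowalkiewicz")
N <- 20000

# Grows the result one element at a time, as in the original loop
grow <- function(n) {
  out <- c()
  for (i in 1:n) {
    out <- append(out, paste(sample(names, 1), sample(last_names, 1)))
  }
  out
}

# Same loop, but writes into a vector allocated once up front
prealloc <- function(n) {
  out <- character(n)
  for (i in 1:n) {
    out[i] <- paste(sample(names, 1), sample(last_names, 1))
  }
  out
}

# No explicit loop: sample and paste work on whole vectors
vectorised <- function(n) {
  paste(sample(names, n, replace = TRUE), sample(last_names, n, replace = TRUE))
}

print(system.time(a <- grow(N)))
print(system.time(b <- prealloc(N)))
print(system.time(v <- vectorised(N)))
```

All three produce N full names; only the way the result vector is built differs, so the printed timings isolate that cost on whatever machine runs it.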

Anonymous said...

You should avoid for loops in R at all cost.

Your version of Random_Names.R took 52 seconds on my laptop, and the version below took 0.046 seconds.

start.time <- Sys.time()
first.names<-c("Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
"Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven")

last.names<-c("Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
"Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau",
"Fenlon","Flynn","Taylor")
N <- 100000
full_names <- paste(first.names[sample(1:15, N, replace=TRUE)],
last.names[sample(1:15, N, replace=TRUE)],
sep=" ")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

MusX said...

If you are using packages in Julia you should also use some in R. The current comparison is pointless. Try the R data.table package.

Marcos F said...

I'm from Spain and could write in Spanish.
But, anyway, because we want to reach a wider audience I will write in English.


In R, it's usually better to vectorize the operations.
For example, how would I solve the second task (random names)? Just this way (execution time < 1sec):



start.time <- Sys.time()
names<-c("Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
"Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven")

last_names<-c("Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
"Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau",
"Fenlon","Flynn","Taylor")
full_names <- paste(
sample(names, 100000, replace=TRUE),
sample(last_names, 100000, replace=TRUE)
)

Sys.time() - start.time
head(full_names)

Anonymous said...

The R code in the second example is horribly inefficient, and not a proper way to write R. Try this. Julia may be faster in some cases, but not as fast as you say it is.

start.time <- Sys.time()
names<-c("Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
"Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven")

last_names<-c("Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
"Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau",
"Fenlon","Flynn","Taylor")
name<-sample(names,100000,T)
last_name<-sample(last_names,100000,T)
full_names<-paste(name,last_name)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

TM said...

If you're going to do this comparison, you need to write efficient code. Your R code in the second example breaks a lot of the rules for efficient code. Check out The R Inferno by Patrick Burns for more on this. If I replace your loop with:

name <- sample(1:15, 100000, replace=TRUE)
last_name <- sample(1:15, 100000, replace=TRUE)
full_names <- paste(names[name],last_names[last_name],sep=" ")

R completes in 0.062 seconds on my box, which is comparable to your Julia code. Your original code takes 47 seconds on my machine. I don't know Julia, but I wouldn't be surprised if you are creating bottlenecks in your Julia code in your first example.

Gabor Csardi said...

Your example is very artificial, nobody would write R code like that. Here is how to write it properly:

system.time({
  full_names <- paste(sample(names, 100000, replace=TRUE),
                      sample(last_names, 100000, replace=TRUE))
})
# user system elapsed
# 0.027 0.000 0.027

Ryan Raaum said...

"append" in R is really, really slow. If you prepare an empty character vector of the appropriate length, it gets much faster:

full_names<-character(100000)
for(i in 1:100000){
name<-sample(1:15, 1)
last_name<-sample(1:15, 1)
full_name<-paste(names[name],last_names[last_name],sep=" ")
full_names[i]<-full_name
}

It's not quite as fast as the Julia version, but gets close (~2 seconds on my computer).

Alvaro "Blag" Tejada Galindo said...

Ryan:

Shame on me -:( I forgot about full_names<-character(100000)...regarding the "append"...as I'm using "push" in Julia I wanted to have both languages using the same approach...but sure...your tips cut down the R runtime to 4.112477 seconds...amazing how something that simple can improve the time so much...

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

Stephen:

Thanks...your code is indeed pretty fast...however the idea is to generate 100000 records...so actually it would be like this...

full_names = rep(NA, 100000)
full_names[seq(1,100000, 2)] = sample(names, 50000, replace=T)
full_names[seq(2,100000, 2)] = sample(last_names, 50000, replace=T)

Anyway...it gives R a new runtime of 0.0598869

But...keep in mind that I wanted to show a comparison between Julia and R using almost the same code...which means...For vs. For...and in that sense...Julia beats R -:)

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

Owe:

If you think it was meant as a joke...then I hope you're laughing...

I wanted to give both Julia and R the same code...both using For and using Push or Append...

I have read the R Inferno...but I don't think it's relevant here...

Anyway...thanks for your comment -:)

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

Gergely:

I know that "append" and "for" are slow in R...but as you can see...I'm using "push" and "for" in Julia...so it's fair enough...it doesn't matter if no one would use R like that or not...it's just a comparison between core language elements...

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

Anonymous:

Loops in R are slow...but they are not slow in Julia...a comparison between both languages using the same "command"... -:)

Your example took 0.1488149 seconds on my machine...pretty impressive -:)

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

MusX:

If I'm using a library in Julia and not in R, it's simply because Julia can't handle DataFrames by itself...while R of course can...I don't see any pointlessness in the comparison...but you're free to have your opinion -:)

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

Marcos:

I'm from Peru...so I could write in Spanish as well...wait...I did it already -;) http://atejada.blogspot.com/2014/05/julia-versus-r-jugando-un-poco.html

Regarding your comment...sure...vectorize is better for sure...will keep it in mind for next time -;)

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

Anonymous:

I like your code too -:) And keep in mind that I don't work with R, nor am I a Data Scientist...so sure...my R code can be really crappy sometimes -;)

Greetings,

Blag.
Development Culture.

iandgow said...

I also think that doing it the non-R way in R makes the code less readable:

```
start.time <- Sys.time()
names <- c("Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
"Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven")

last_names <- c("Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
"Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau",
"Fenlon","Flynn","Taylor")
N <- 100000
full_name <- paste(sample(names, N, replace=TRUE),
sample(last_names, N, replace=TRUE), sep=" ")

end.time <- Sys.time()
end.time - start.time
```

Alvaro "Blag" Tejada Galindo said...

TM:

I have of course read the R Inferno...more than once in fact...but for me...a real comparison is using the same code for both languages...the Julia code is not optimized...why should the R code be?

And regarding bottlenecks in Julia...sure...I only started learning Julia like a week ago...so I'm by no means an expert...I'm just waiting for some Julia folks to jump in and fix my code -;)

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

Gabor:

Artificial or not...I'm doing a comparison using the same commands...the R and the Julia codes are almost identical...and for me...that's fair...

Your example is for sure faster and better written...but then we would need to fix the Julia code as well...and see which one is faster...

Greetings,

Blag.
Development Culture.

Alvaro "Blag" Tejada Galindo said...

iandgow:

Sure...the code is less readable...but it's in fact the same construct as in Julia...it doesn't really matter if it is readable or not...anyway...thanks for your comment -:)

Greetings,

Blag.
Development Culture.

Anonymous said...

The post is very interesting because it shows how important it is to adapt to the specific properties and flaws of each language. In the case of R, loops and memory allocation.

Still, the comparison should be about doing the same task, indeed, but each language should be allowed to use its best "weapons"; forcing R to use code that is "unnatural" to its design is a bad way to assess its performance.

Best,

Miguel

Alvaro "Blag" Tejada Galindo said...

Miguel:

Sure, you have a fair point...maybe I should have optimized the R code...but then...I don't know enough Julia to optimize it as well...so it's kind of a weird situation...thanks for your comment -:)

Greetings,

Blag.
Development Culture.

Anonymous said...

It's people like you and posts like this that are the reason I'm starting to hate Julia.

Computer languages are not about identical structure, they are about getting things done. And they use different paradigms to do so.

Using R against its very core principles is not "fair".

You're comparing tools to build a shelf. And to be "fair" you compare a hammer and a screwdriver at nailing, instead of using screws with the screwdriver...

Alvaro "Blag" Tejada Galindo said...

Dear Anonymous (Or whatever your name is):

It's people like you and comments like yours that make me believe that the R community is full of closed-minded people -:) So I guess we're even...

Regarding fairness...I think it's totally and absolutely fair to compare both languages using their core features...

Greetings,

Blag.
Development Culture.