Tuesday, September 23, 2014

Web scraping with Haskell and PhantomJS

Some time ago I wrote a blog post called "Web scrapping with Julia and PhantomJS"...today...I wanted to do the same but using Haskell instead...

The concept is the same...we create a PhantomJS script that opens a user's Twitter page, scrolls down 5 times so that more tweets get loaded, and collects the hashtags it finds...here's the PhantomJS script...

Hashtags.js
var system = require('system');

var webpage = require('webpage').create();
webpage.viewportSize = { width: 1280, height: 800 };
webpage.scrollPosition = { top: 0, left: 0 };

var userid = system.args[1];
var profileUrl = "http://www.twitter.com/" + userid;

webpage.open(profileUrl, function(status) {
 if (status === 'fail') {
  console.error('webpage did not open successfully');
  phantom.exit(1);
  return; // don't schedule the scrolling below once we've asked to exit
 }
 var i = 0,
 top,
 queryFn = function() {
  return document.body.scrollHeight;
 };
 // Every 3 seconds, scroll to the bottom of the page so that more tweets get loaded
 setInterval(function() {
  top = webpage.evaluate(queryFn);
  i++;

  webpage.scrollPosition = { top: top + 1, left: 0 };

  // After 5 scrolls, collect the text of every hashtag link and print one per line
  if (i >= 5) {
   var twitter = webpage.evaluate(function () {
    var twitter = [];
    var forEach = Array.prototype.forEach; // declared with var to avoid leaking a global
    var tweets = document.querySelectorAll('[data-query-source="hashtag_click"]');
    forEach.call(tweets, function(el) {
     twitter.push(el.innerText);
    });
    return twitter;
   });

   twitter.forEach(function(t) {
    console.log(t);
   });

   phantom.exit();
  }
 }, 3000);
});

If we run the script (for example, phantomjs --ssl-protocol=any Hashtags.js followed by a Twitter username)...it prints the hashtags it found, one per line...

Now...what I want to do with this information...is to send it to Haskell...and get the most used hashtags...so I will count how many times each one appears and then get rid of the ones that appear fewer than 5 times...

Let's see the Haskell code...

hashtags.hs
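import System.Process (readProcess)
import Data.List (group, sort, sortBy)

-- Run the PhantomJS script for the given user and print the
-- (hashtag, count) pairs, most used first.
hashTags :: String -> IO ()
hashTags user = do
 output <- readProcess "phantomjs" ["--ssl-protocol=any","Hashtags.js",user] []
 mapM_ print $ sortBy sortGT $ count output

-- Count each hashtag and keep only the ones that appear at least 5 times.
count :: String -> [(String,Int)]
count xs = filter ((>=5).snd) $
     map (\ws -> (head ws, length ws)) $
           group $ sort $ words xs

-- Order by count, descending; ties are ordered by hashtag, ascending.
sortGT :: (Ord a, Ord a1) => (a1, a) -> (a1, a) -> Ordering
sortGT (a1, b1) (a2, b2)
  | b1 < b2 = GT
  | b1 > b2 = LT
  | otherwise = compare a1 a2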

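When we run this code (for example, by loading it in GHCi and calling hashTags with a username)...it prints each hashtag together with its count, most used first...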
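
To try the counting and ordering step on its own, without PhantomJS or a network connection, here is a minimal sketch. The names countAtLeast, byCountDesc, sampleOutput and topTags are hypothetical and not part of the code above, the sample hashtags are made up, and the threshold is a parameter (the post hard-codes 5) so that a tiny sample still yields results. The ordering uses comparing and Down from Data.Ord, which should match what sortGT does.

```haskell
import Data.List (group, sort, sortBy)
import Data.Ord (comparing, Down(..))

-- Count each whitespace-separated word and keep those seen at least
-- `minCount` times. (`minCount` is a hypothetical parameter.)
countAtLeast :: Int -> String -> [(String, Int)]
countAtLeast minCount =
  filter ((>= minCount) . snd)
    . map (\ws -> (head ws, length ws))  -- `group` never yields empty lists
    . group . sort . words

-- Most frequent first; ties are ordered by hashtag, ascending.
byCountDesc :: [(String, Int)] -> [(String, Int)]
byCountDesc = sortBy (comparing (Down . snd) <> comparing fst)

-- Made-up sample standing in for the PhantomJS output (one hashtag per line).
sampleOutput :: String
sampleOutput = unlines
  [ "#Haskell", "#SAP", "#Haskell", "#Julia", "#SAP", "#Haskell", "#R" ]

-- Hashtags from the sample that appear at least twice, most used first.
topTags :: [(String, Int)]
topTags = byCountDesc (countAtLeast 2 sampleOutput)
```

In GHCi, mapM_ print topTags prints ("#Haskell",3) followed by ("#SAP",2)...the hashtags that appear only once are filtered out.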


The nice thing about this app is that we can pass any username as a parameter and the result is going to be nicely ordered and filtered...another reason to love Haskell ;-)

Greetings,

Blag.
Development Culture.