Make Your Own Website Scraper


I recently did a project where I needed to scrape a subset of webpages from the internet. Back in the 1990s, websites were pretty much straight-up HTML, which made crawling them and extracting their data easy: grab the HTML files, run them through a tokenizer, and parse out all the juicy bits. Today things are a bit more complicated, because most websites rely on some form of JavaScript execution (often jQuery or AngularJS). To crawl these types of sites, the web scraper needs to be able to execute JavaScript. There are several ways to do this, each with its own benefits and drawbacks; I chose to go with Selenium WebDriver.
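
Getting a WebDriver session up takes only a few lines. Here is a minimal sketch, assuming the chromedriver binary is installed and on your PATH; the URL is just a stand-in for whatever JavaScript-heavy page you want to scrape. Because a real browser executes the page's scripts, getPageSource() hands back the rendered DOM rather than the raw HTML a plain HTTP fetch would return.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class RenderedFetch {
    public static void main(String[] args) {
        // launch a real Chrome instance; it runs the page's JavaScript for us
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("http://example.com/js-heavy-page");
            // the DOM after scripts have run, not the raw HTML on the wire
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml);
        } finally {
            driver.quit();
        }
    }
}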

Of the various implementations of the WebDriver – Chrome, Firefox, Internet Explorer, HtmlUnit, and so on – the fastest by far is Chrome. The basic idea of a web scraper is to load a page as the user would see it, grab the text you need from the page, massage it, and store it off somewhere for data mining later. While parsing each page you grab all the anchor tags and push them onto a list of websites "to be crawled," and once you have scraped a particular website you move it to a list of "crawled" websites. That's a web scraper in a nutshell; the implementation of one can get a bit more complex, though, depending on what your goals are.
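
Here is a minimal sketch of that bookkeeping, assuming a single-threaded crawl; scrapePage() is a hypothetical placeholder for the Selenium work described later in this post.

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Crawler {
    private final Deque<String> toBeCrawled = new ArrayDeque<>();
    private final Set<String> crawled = new HashSet<>();

    public void crawl(String seedUrl) {
        toBeCrawled.add(seedUrl);
        while (!toBeCrawled.isEmpty()) {
            String url = toBeCrawled.poll();
            if (!crawled.add(url)) {
                continue; // already moved to the "crawled" list, skip it
            }
            // scrape the page, then queue every anchor href we have not seen yet
            for (String link : scrapePage(url)) {
                if (!crawled.contains(link)) {
                    toBeCrawled.add(link);
                }
            }
        }
    }

    // placeholder: load the page, store its text, and return the anchor hrefs
    private List<String> scrapePage(String url) {
        return Collections.emptyList();
    }
}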

One of the major problems I ran into early on was how Angular routes store references to pages: they use the form baseurl.com/#/rest_of_link (or baseurl.com/#!/rest_of_link in hashbang mode). This is a problem because you also want to strip the "#" hash sign from ordinary URLs like foo.com/page#div, where the fragment only references the id of an HTML element to scroll to when the link loads, so the same page would otherwise show up in your crawl list under many different URLs. The following is the regex I used: it strips everything from a plain "#" onward, so foo.com/page#div becomes foo.com/page, while a hashbang route like foo.com/#!/users is left intact. It uses a regex negative lookahead to accomplish this.

// assumes driver is an initialized WebDriver (e.g. ChromeDriver) already pointed at the page
List<WebElement> list = driver.findElements(By.xpath("//*[@href]"));
for (WebElement e : list) {
    String href = e.getAttribute("href");
    if (href != null && href.contains("#")) {

        // strip plain #fragment references, but leave hashbang (#!) routes intact
        href = href.replaceAll("(?!#!)#.*", "");

        UrlModel newModel = new UrlModel(href);
        //SUBMIT TO DB
    }
}

Then, once you have a list of URLs to crawl, you can just initialize a new ChromeDriver instance with the URL in question and grab all the WebElements in the body. Once you have all the web elements you can grab their text values. How you store the data and mine it later is up to you. I did not need this to go particularly fast, but you can parallelize the scraping, map-reduce style, by running multiple instances of the ChromeDriver on several different machines; Selenium Grid has a good way of doing this.
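
For the per-page step, here is a minimal sketch of what I mean, again assuming chromedriver is available; storePageText() is a hypothetical stand-in for whatever storage you end up using.

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class PageScraper {
    public void scrape(String url) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);
            // every element under the body; getText() returns its visible text
            List<WebElement> elements = driver.findElements(By.xpath("//body//*"));
            for (WebElement el : elements) {
                String text = el.getText();
                storePageText(url, text);
            }
        } finally {
            driver.quit();
        }
    }

    // placeholder: massage the text and persist it for mining later
    private void storePageText(String url, String text) {
    }
}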
