Thursday, 28 February 2013

Starter's guide to web scraping with the HTML Agility Pack for .NET

I recently wanted to get a rough average MPG for each car listed on a website. Unfortunately there was no API for me to access the values, so I turned to Google and came across the NuGet package HTML Agility Pack. This post will get you up to speed on using the HTML Agility Pack, basic XPath and some LINQ.

Before we start, please make sure to check the terms and conditions and any copyright terms that may apply to the data you are retrieving. You should be able to find these on the website, although they may vary from country to country. Please also keep in mind that a scraper can request pages far faster than a person browsing, so it is sensible to save each downloaded page to your local disk for later use and to add a delay between page downloads.
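As a rough illustration of that advice, here is a minimal sketch of a download helper that caches each page on disk and waits before every real request. The PoliteDownloader class, the GetPage method, the "cache" folder name and the two-second delay are all hypothetical names and values chosen for this example; they are not part of the HTML Agility Pack.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

static class PoliteDownloader
{
    // Hypothetical helper: returns the HTML for a URL, using a local file cache when possible.
    // The optional 'download' delegate lets you swap in a fake downloader for experiments.
    public static string GetPage(string url, string cacheDirectory, TimeSpan delay,
                                 Func<string, string> download = null)
    {
        Directory.CreateDirectory(cacheDirectory);

        // Escape the URL so it can be used as a file name.
        string cachePath = Path.Combine(cacheDirectory, Uri.EscapeDataString(url) + ".html");

        // Already downloaded: read it from disk and leave the website alone.
        if (File.Exists(cachePath))
        {
            return File.ReadAllText(cachePath);
        }

        // Not cached: pause first so repeated calls do not hammer the site.
        Thread.Sleep(delay);
        string html = download != null ? download(url) : DownloadFromWeb(url);
        File.WriteAllText(cachePath, html);
        return html;
    }

    static string DownloadFromWeb(string url)
    {
        using (var webClient = new WebClient())
        {
            return webClient.DownloadString(url);
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: pass the page URL as the first argument.");
            return;
        }

        string html = GetPage(args[0], "cache", TimeSpan.FromSeconds(2));
        Console.WriteLine("Fetched {0} characters.", html.Length);
    }
}
```

The string returned by GetPage can be handed to HtmlDocument.LoadHtml, so re-running your scraper while you tweak your queries does not download the same page again.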

Getting Started

Identifying the data

The first thing you need to do is find where in the HTML the data you want to download lives. Let's try going to Fuelly and browsing all the cars. As you can see, there is a large variety of cars available, and if you click on one of the models it takes you to a page that displays the year of the model and its average MPG. For my personal project, I wanted to obtain four values: Manufacturer, Model, Year and Average MPG. With this data I can then run queries such as "which vehicles between 2003 and 2008 give an MPG figure above 50?" However, for the purpose of this blog post and for simplicity, let's just retrieve a list of some models from a manufacturer.

So, let's start on the browse all cars page, and view the page source. If we look at the first manufacturer header on the page, we see Abarth, followed by AC, etc.

Open the page source (for Google Chrome: Settings > Tools > View Source), and find "Abarth":

You will see that the Manufacturer is wrapped in an <h3> tag. Below this is a <div> that contains each model under the Abarth name: 500, Grande Punto and Punto Evo. Let's try to get hold of these models, but first we need to set up the project.

Setting up the project

Create a new project in Visual Studio; a simple console project should suffice for this blog post. Add the HTML Agility Pack to the project via NuGet and add the following code:
    using System.Linq; // needed later on for the LINQ query

    class Program
    {
        static void Main(string[] args)
        {
            // Set this to the URL of the page you want to scrape.
            const string WEBSITE_LOCATION = @"";
            var htmlDocument = new HtmlAgilityPack.HtmlDocument();

            // Download the page and parse the response stream into the document.
            using (var webClient = new System.Net.WebClient())
            using (var stream = webClient.OpenRead(WEBSITE_LOCATION))
            {
                htmlDocument.Load(stream);
            }
        }
    }

This simply loads the HTML page into the HtmlDocument type so that we can run XPath queries against it to eventually get the values we are looking for.

Navigating Nodes with XPath

So, as noted earlier, we want to get the list of models for a manufacturer. Each list of models sits in a <div> tag with the id "inline-list" (the page reuses the same id for every manufacturer), which we can use in our query (see "inline-list" below).
    <h3><a href="/car/abarth" style="text-decoration:none;color:#000;">Abarth</a></h3>
    <div id="inline-list">
         <ul>
              <li><nobr><a href="/car/abarth/500">500</a> <span class="smallcopy">(34)</span> &nbsp; </nobr></li>
              <li><nobr><a href="/car/abarth/grande punto">Grande Punto</a> <span class="smallcopy">(1)</span> &nbsp;</nobr></li>
              <li><nobr><a href="/car/abarth/punto evo">Punto EVO</a> <span class="smallcopy">(3)</span> &nbsp; </nobr></li>
         </ul>
    </div>

We can use this specific hook to get the values we want. So let's add the code:
    HtmlAgilityPack.HtmlNodeCollection divTags = htmlDocument.DocumentNode.SelectNodes("//div[@id='inline-list']");

The above code returns a collection of <div> tags, one for each Manufacturer listed on the page. Be aware that SelectNodes returns null, not an empty collection, when nothing matches.

Let's break up the XPath syntax to make sense of it:
// - At the start of an expression, selects matching nodes anywhere in the document, no matter how deeply they are nested.
div - The name of the element we are interested in.
[@id] - Predicate that keeps only the nodes that have an id attribute.
[@id='inline-list'] - Predicate that keeps only the nodes whose id attribute has the value 'inline-list'.
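To see how each part of the expression narrows the result, here is a small sketch that runs three queries against an inline HTML string instead of the live page. The sample markup is a cut-down version of the snippet above with two extra <div> tags added purely for illustration, and the XPathDemo class and Count helper are hypothetical names used only in this example.

```csharp
using System;
using HtmlAgilityPack;

static class XPathDemo
{
    // Made-up sample: one model list, one div with a different id, one div with no id.
    const string SampleHtml = @"
<h3><a href=""/car/abarth"">Abarth</a></h3>
<div id=""inline-list""><ul><li><a href=""/car/abarth/500"">500</a></li></ul></div>
<div id=""footer"">Footer text</div>
<div>No id here</div>";

    // Counts the nodes matched by an XPath expression.
    public static int Count(string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);

        // SelectNodes returns null when nothing matches, so treat null as zero.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        Console.WriteLine(Count("//div"));                    // 3: every div in the document
        Console.WriteLine(Count("//div[@id]"));               // 2: only the divs that have an id
        Console.WriteLine(Count("//div[@id='inline-list']")); // 1: only the div with that exact id
    }
}
```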

We can now dig deeper into each div tag, using a little more XPath and some LINQ to get the values we want.

Accessing the data using LINQ

OK, so within each of the <div> tags we still have a load of rubbish we don't really need. All we want is the car models. Well, we know each car model sits between <a> (hyperlink) tags that have an href attribute. So using XPath and a little LINQ we can extract the data we need:

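1:  HtmlAgilityPack.HtmlNode firstDiv = divTags.FirstOrDefault();
2:  var modelList = from hyperlink in firstDiv.SelectNodes(".//a[@href]")
3:                  where hyperlink != null
4:                  select hyperlink.InnerText;

Line 1: get the <div> for the first manufacturer in the list (FirstOrDefault needs a using System.Linq; directive at the top of the file).
Line 2: for each hyperlink inside that div, i.e. every <a> tag that has an href attribute (the leading . keeps the search inside this div rather than the whole document),
Line 3: where the hyperlink isn't null (a defensive check only; nodes in the returned collection are never null, so it filters nothing out), then:
Line 4: select the text inside the hyperlink.

Bear in mind that SelectNodes returns null when nothing matches, so if the page layout changes, divTags or the inner SelectNodes call can be null and this code will throw an exception.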
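The same query can return each model's link as well as its name. The following is a minimal sketch under a few assumptions: it parses an inline, simplified copy of the Abarth markup rather than the live page, the ModelLinks class and Extract method are hypothetical names, and it adds explicit null checks because SelectSingleNode and SelectNodes return null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ModelLinks
{
    // Returns (model name, href) pairs for every link in the first inline-list div.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var empty = new List<KeyValuePair<string, string>>();
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode firstDiv = document.DocumentNode.SelectSingleNode("//div[@id='inline-list']");
        if (firstDiv == null) return empty;

        HtmlNodeCollection links = firstDiv.SelectNodes(".//a[@href]");
        if (links == null) return empty;

        return (from hyperlink in links
                select new KeyValuePair<string, string>(
                    // Decode entities such as &amp; and trim surrounding whitespace.
                    HtmlEntity.DeEntitize(hyperlink.InnerText).Trim(),
                    hyperlink.GetAttributeValue("href", string.Empty))).ToList();
    }

    static void Main()
    {
        // Simplified copy of the Abarth markup, used here instead of a download.
        const string html = @"
<div id=""inline-list""><ul>
  <li><a href=""/car/abarth/500"">500</a></li>
  <li><a href=""/car/abarth/grande punto"">Grande Punto</a></li>
  <li><a href=""/car/abarth/punto evo"">Punto EVO</a></li>
</ul></div>";

        foreach (var model in Extract(html))
        {
            Console.WriteLine("{0} -> {1}", model.Key, model.Value);
        }
    }
}
```

Keeping the href alongside the name gives you the address of each model's own page, which is where values such as the year and average MPG would come from.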


So, there you have it. You've learned how to pull specific values out of a piece of HTML using a little XPath and some LINQ. Although the end data in this example probably isn't too useful, hopefully you can now see how you would expand upon this to find specific values and URLs to build a more complex system.
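As one possible next step, here is a hedged sketch that pairs every manufacturer heading with its own list of models. It assumes that each <h3> is followed by its "inline-list" <div> as a sibling, as in the snippet shown earlier; the real page may be laid out differently. The ManufacturerModels class and its Extract method are hypothetical names, and the second manufacturer in the sample markup is made up for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ManufacturerModels
{
    // Maps each manufacturer name (from an h3) to the model names in the div that follows it.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNodeCollection headers = document.DocumentNode.SelectNodes("//h3");
        if (headers == null) return result;

        foreach (HtmlNode header in headers)
        {
            // following-sibling:: looks at later siblings of the h3; [1] takes the nearest match.
            HtmlNode modelDiv = header.SelectSingleNode("following-sibling::div[@id='inline-list'][1]");
            if (modelDiv == null) continue;

            HtmlNodeCollection links = modelDiv.SelectNodes(".//a[@href]");
            List<string> models = links == null
                ? new List<string>()
                : links.Select(link => HtmlEntity.DeEntitize(link.InnerText).Trim()).ToList();

            result[HtmlEntity.DeEntitize(header.InnerText).Trim()] = models;
        }

        return result;
    }

    static void Main()
    {
        // Simplified markup; the second manufacturer and its model are invented sample data.
        const string html = @"
<h3><a href=""/car/abarth"">Abarth</a></h3>
<div id=""inline-list""><ul>
  <li><a href=""/car/abarth/500"">500</a></li>
  <li><a href=""/car/abarth/punto evo"">Punto EVO</a></li>
</ul></div>
<h3><a href=""/car/example"">Example Motors</a></h3>
<div id=""inline-list""><ul>
  <li><a href=""/car/example/one"">One</a></li>
</ul></div>";

        foreach (var entry in Extract(html))
        {
            Console.WriteLine("{0}: {1}", entry.Key, string.Join(", ", entry.Value));
        }
    }
}
```

One limitation of this approach: if a heading had no model list of its own, the query would pick up the next manufacturer's list instead, so a real scraper would need a stricter check.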