Writing a WP7 website scraper application
April 16, 2012
In this tutorial I will explain how to write a WP7 application that uses the Html Agility Pack to work with information scraped from a website.
Website scraping is the act of retrieving information from a web page, an act considered stealing by some and borrowing by others. Let’s leave that debate to others. In this post I will show how easy it is to scrape content from a website so that you can (re)use it in your Windows Phone 7 application. Most of this information will of course also work in other, non-WP7 projects.
Sometimes website scraping is the only means available to consume certain information from a website. If the website doesn’t have a publicly available API or web service you can use, you’re pretty much left with scraping, whether you like it or not.
Now before reading on, it is extremely important to understand that there are legal issues concerning scraping: basically, as far as I understand it, you’re only allowed to use scraped data if you have clearance to do so by the website owner (i.e. the one that ‘owns’ the data).
To get started, we will first need the Html Agility Pack (HAP), which is a very thorough HTML parser (read all about it here). The nice thing about HAP is that it also supports XPath queries and LINQ to Objects, making it actually fun (in a geeky way) to perform web scraping in C#. Unfortunately, ‘HAP for WP7’ currently doesn’t support XPath queries, so we’ll use LINQ to Objects for the remainder of this tutorial.
For some nice demos of what can be done with HAP, make sure to check out the following tutorial.
When you download HAP from CodePlex, you will still need to build the WP7 dll manually. The following post describes the steps needed to do this. For the lazy people amongst us, here you will find the compiled dll, based on the HAP version of February 2012 (make sure to ‘unblock’ the dll if you downloaded it from this site: right-click the file, choose Properties and then click ‘Unblock’).
Add reference to HAP in WP7
Next we need to add a reference to the HAP dll in order to use all the sweetness it contains. Right-click the References folder in Solution Explorer and choose Add Reference.
Next point to the downloaded or compiled HtmlAgilityPack.dll file and choose Add.
Finally add a using statement and we are all set to go:
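The directive you need is the one for the HAP namespace. A minimal sketch of the top of the code file; the System.Linq line is my own addition, included only because the LINQ queries further down need it:

```csharp
using System.Linq;      // LINQ to Objects extension methods (FirstOrDefault, First, ...)
using HtmlAgilityPack;  // HtmlWeb, HtmlDocument, HtmlNode
```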
Download page to scrape
We use the static LoadAsync method of the HAP class HtmlWeb to download and parse our html file. We provide the url to download, as well as the callback method to invoke once the file has been downloaded and processed by HAP:
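A minimal sketch of such a call. The class name ComicDownloader and the callback name DownloadCompleted are made up for this example, and the event-args type HtmlDocumentLoadCompleted is what I remember from the WP7 build of HAP, so check it against the dll you compiled:

```csharp
using System;
using HtmlAgilityPack;

public class ComicDownloader
{
    public void Start()
    {
        // First argument: the url to download.
        // Second argument: the method HAP calls once the page is downloaded and parsed.
        HtmlWeb.LoadAsync("http://xkcd.com", DownloadCompleted);
    }

    // Callback stub: HAP hands over the parsed document (or an error) through e.
    private void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
    }
}
```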
Next we define the callback method in which we immediately check if the download went well. Once that is done, we can start querying our parsed html document:
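A sketch of what such a callback could look like. The class and method names are hypothetical, and the Error and Document properties are assumed to be what the WP7 build of HAP exposes on its HtmlDocumentLoadCompleted event args:

```csharp
using System;
using HtmlAgilityPack;

public class ComicCallbacks
{
    private void DownloadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
        // Something went wrong while downloading or parsing: stop here.
        if (e.Error != null)
        {
            return;
        }

        HtmlDocument doc = e.Document;
        if (doc != null)
        {
            // The page was parsed: doc.DocumentNode is the root to query from.
        }
    }
}
```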
Alternatives to downloading pages directly
As a side note, you might as well use the WP7 WebClient.DownloadStringAsync method to download the html. Afterwards you can then feed the downloaded string (or a string you retrieved through some other obscure manner) to HAP using the LoadHtml() method:
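A minimal sketch of this alternative, using the standard WebClient events; the class and handler names are only there for illustration:

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

public class StringDownloadExample
{
    public void Start()
    {
        WebClient client = new WebClient();
        client.DownloadStringCompleted += OnDownloadStringCompleted;
        client.DownloadStringAsync(new Uri("http://xkcd.com"));
    }

    private void OnDownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
    {
        // e.Result throws if the download failed, so check for errors first.
        if (e.Cancelled || e.Error != null)
        {
            return;
        }

        // Feed the raw html string to HAP.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(e.Result);

        // doc.DocumentNode can now be queried just like a document loaded by HtmlWeb.
    }
}
```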
Discovering the page layout
For the purpose of this demo, we would like to retrieve the url of the latest xkcd.com joke image. The best way to do this is to open the page in your browser and open the developer tools by hitting the F12 key (works for me in Chrome and IE9). I prefer to use Chrome for this because the currently selected element is highlighted on the page itself, making it easier to rapidly drill down to the element you need.
To rapidly view the position of a specific element inside the html DOM, simply right click the element on the page and choose “Inspect element”. So we right click the image and choose “Inspect element”:
Using the developer tools you now have to look for the element(s) and/or attributes you need. Once you have correctly identified the needed element, you need to find the most straightforward way to retrieve that element in your code:
- if the html has abundant div and/or other elements with unique ids, it is simply a matter of finding that specific element by filtering on the unique id.
- if the element has no unique identifier and is part of a group of similar elements, you will need to iterate over all these elements and do some manual comparisons using, for example, a regular expression; or hardcode the exact position (e.g. retrieve the 3rd img element from a given node).
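Both approaches can be sketched with a few lines of LINQ. This is only an illustration: the class ElementLookup and its method names are made up, and GetAttributeValue is used so that nodes without an id attribute simply don't match:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ElementLookup
{
    // Approach 1: filter on a unique id. Returns null when no element has that id.
    public static HtmlNode ById(HtmlDocument doc, string id)
    {
        return doc.DocumentNode.Descendants()
                  .FirstOrDefault(n => n.GetAttributeValue("id", "") == id);
    }

    // Approach 2: hardcode the position. Returns the n-th img child (1-based)
    // of the given node, or null when there are fewer img children.
    public static HtmlNode NthImage(HtmlNode parent, int position)
    {
        return parent.Elements("img").ElementAtOrDefault(position - 1);
    }
}
```

Hardcoding a position is obviously the more fragile of the two: one extra image on the page and you retrieve the wrong element.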
Note, at the bottom of the developer tools you’ll also notice the full ‘path’ of the element, which can be handy if you’re getting lost in more advanced (or crappy) pages.
To be honest, the ‘hardest part’ in my opinion is identifying the correct path to the wanted element.
Using Linq-to-objects to get the goodies
Once we have identified the correct way of retrieving the element (or a certain attribute value of that element) it is time to start writing the necessary LINQ code. The basic principles are explained very clearly in this blog, so I’ll immediately dive into some more ‘hardcore’ *ahem* html.
For this demo we need the value of the “src” attribute of the img-element, inside the uniquely named div-element with id “comic”.
First we’ll try to ‘capture’ the unique div, if we don’t find that one we can be pretty certain that some error occurred (e.g. our url is wrong, the site has changed its html, etc.):
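A sketch of such a query, wrapped in a hypothetical helper class so it stands on its own. I use GetAttributeValue with an empty default here so that div-elements without an id are simply skipped:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ComicQueries
{
    // Returns every div in the document whose id attribute equals "comic".
    public static IEnumerable<HtmlNode> FindComicDivs(HtmlDocument doc)
    {
        return from d in doc.DocumentNode.Descendants("div")
               where d.GetAttributeValue("id", "") == "comic"
               select d;
    }
}
```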
So here we capture all the child nodes of the document that are div-elements whose “id” attribute equals “comic”. Notice that a new IEnumerable collection is returned: we can feed this collection to a new query or iterate over this collection with, for example, a foreach loop.
Because we actually know that there will be only one element, or none at all, we will change our query to:
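The changed query could look roughly like this (helper class and method name are again only for illustration):

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class SingleComicQuery
{
    // Returns the first div with id "comic", or null when the page has none.
    public static HtmlNode FindComicDiv(HtmlDocument doc)
    {
        return (from d in doc.DocumentNode.Descendants("div")
                where d.GetAttributeValue("id", "") == "comic"
                select d).FirstOrDefault();
    }
}
```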
By using FirstOrDefault() we can then check if the correct element was found or not (it will be null if not found), without having to cope with possible exceptions:
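A small sketch of that check. The method and its "found" / "not found" return values are invented for the example; in a real app you would continue scraping in the first branch and show an error in the second:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ComicCheck
{
    public static string GetComicStatus(HtmlDocument doc)
    {
        HtmlNode comicDiv = doc.DocumentNode.Descendants("div")
            .FirstOrDefault(d => d.GetAttributeValue("id", "") == "comic");

        if (comicDiv != null)
        {
            // The div exists: keep drilling down from comicDiv here.
            return "found";
        }

        // Wrong url, changed page layout, ...: handle it without crashing.
        return "not found";
    }
}
```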
Once inside the if-braces, we can be fairly certain we’ll find the desired image url.
Because we know that inside the comic-div there will be only one element of the img-kind we can retrieve the value of its attributes using the following statement:
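A sketch of that statement, wrapped in a hypothetical helper that receives the comic-div found earlier:

```csharp
using HtmlAgilityPack;

public static class ComicImage
{
    // comicDiv is assumed to be the div with id "comic" and to contain an img child.
    public static string GetImageUrl(HtmlNode comicDiv)
    {
        return comicDiv.Element("img").Attributes["src"].Value;
    }
}
```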
The Element() method returns the single element of the “img” type. Next we retrieve the value of the “src” attribute of the img-element.
It should be noted that the slightest mistake in your queries can result in an exception being thrown; make it a practice to catch these (otherwise your WP7 app will simply quit without any notification should a HAP exception occur).
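One way to do this is to wrap the whole query in a try/catch. A rough sketch with a made-up helper that returns null instead of letting the exception escape; catching the general Exception type is a shortcut for the demo, not a recommendation:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class SafeScraper
{
    // Returns the image url, or null when anything goes wrong while querying.
    public static string TryGetImageUrl(HtmlDocument doc)
    {
        try
        {
            // First() throws when no div with id "comic" exists.
            HtmlNode comicDiv = doc.DocumentNode.Descendants("div")
                .First(d => d.GetAttributeValue("id", "") == "comic");

            // Throws a NullReferenceException when the img or its src is missing.
            return comicDiv.Element("img").Attributes["src"].Value;
        }
        catch (Exception)
        {
            // Log the problem or inform the user here instead of letting the app die.
            return null;
        }
    }
}
```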
Important update (7/12/12)
A problem I circumvented in the previous examples was the exception that occurs if you try to access the value of an attribute that doesn’t exist. For example, suppose we need all elements that have a class-attribute with a specific value. If we’d write the following, we would get an exception as soon as there is at least one element in the document that doesn’t have the class-attribute:
var q = from s in doc.DocumentNode.Descendants() where s.Attributes["class"].Value == "someVal" select s;
This is why, in the previous examples, I so painfully slowly walked down the nodes to get what I wanted.
However, there’s a pretty straightforward way of accomplishing all the above: simply check whether an element has the attribute we need at all, as shown here:
var q = from s in doc.DocumentNode.Descendants() where s.Attributes["class"] !=null && s.Attributes["class"].Value == "someVal" select s;
It is important that you FIRST check if the attribute exists on the current element (s.Attributes["class"] != null) and THEN check if the value compares to the one we need. Doing this the other way around will again result in an exception, because the operands of && are evaluated from left to right and the second one is skipped when the first is false.
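To see the guard at work, here is the same query wrapped in a small hypothetical helper that you can run against a document mixing elements with and without a class-attribute:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassFilter
{
    // Returns all nodes whose class attribute equals the given value.
    // Nodes without a class attribute are skipped instead of causing an exception.
    public static IEnumerable<HtmlNode> WithClass(HtmlDocument doc, string value)
    {
        return from s in doc.DocumentNode.Descendants()
               where s.Attributes["class"] != null          // first: does it exist?
                  && s.Attributes["class"].Value == value   // then: compare the value
               select s;
    }
}
```

If you prefer a single call, s.GetAttributeValue("class", "") == value should give the same result, since it falls back to the default when the attribute is missing.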
Using Element and Elements
Depending on your preference, there are several ways to retrieve the data you need. For example, if we are 100% certain that the page will have the html layout we need, we can drill down to the needed value using the Element/Elements methods, as follows.
- You use Element(string type) if you are certain that the current node will have one and only one child element of the given type, which is passed as the parameter.
- You use Elements(string type) to retrieve all the child nodes of the current node of a given type.
In order to retrieve our image url, we could open the developer tool and note the unique path to it on the bottom of the window:
It’s now a matter of translating this path to the correct fluent method chain (is that a correct word?), resulting in the following:
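A sketch of such a chain. Note the assumption: I take the path to be html > body > div#middleContainer > div#comic > img, where the id “middleContainer” is only an illustration; use whatever path your developer tool actually shows:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class PathScraper
{
    // Walks down the (assumed) path html > body > div#middleContainer > div#comic > img.
    public static string GetImageUrlByPath(HtmlDocument doc)
    {
        return doc.DocumentNode
            .Element("html")
            .Element("body")
            .Elements("div").First(d => d.GetAttributeValue("id", "") == "middleContainer")
            .Elements("div").First(d => d.GetAttributeValue("id", "") == "comic")
            .Element("img")
            .Attributes["src"].Value;
    }
}
```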
In fact, we could even hardcode this more (not always recommended). We know that we need the second div-element of the body, and inside that div we again need the second div-element. So we could write the following and thus skip the need for specifying the filters:
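A sketch of the position-based version. ElementAt(1) picks the second element because the index is zero-based; the class and method names are hypothetical:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class PositionScraper
{
    // Second div of the body, then the second div inside that one, then its img.
    public static string GetImageUrlByPosition(HtmlDocument doc)
    {
        return doc.DocumentNode
            .Element("html")
            .Element("body")
            .Elements("div").ElementAt(1)
            .Elements("div").ElementAt(1)
            .Element("img")
            .Attributes["src"].Value;
    }
}
```

This breaks as soon as the site adds or removes a div, which is why the filtered version is usually the safer choice.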
Note: I can’t really proclaim to be a skilled Linq writer, so if any of these steps can be done more quickly, don’t hesitate to mention so.
Cherry on the pie: show the image
To show that this works, suppose we have the following, very empty, WP7 xaml page on which we wish to load the joke:
All that needs to be done is assign the retrieved url as a new source to the Image control, named jokeimg:
BitmapImage b = new BitmapImage(new Uri(imgurl));
jokeimg.Source = b;
That’s all folks!
Update: You can download the full demo-solution here.