Writing a WP7 website scraper application

In this tutorial I will explain how to write a WP7 application that uses the Html Agility Pack to work with information scraped from a website.
Website scraping is the act of retrieving information from a web page. Some consider it stealing, others borrowing; let’s leave that debate to others. In this post I will show how easy it is to scrape content from a website so that you can (re)use it in your Windows Phone 7 application. For the most part, this information also applies to other, non-WP7 projects.
Sometimes website scraping is the only means available to consume certain information from a website. If the website doesn’t have some publicly available API or web service you can use you’re pretty much left with scraping, whether you like it or not.

Now before reading on, it is extremely important to understand that there are legal issues concerning scraping: basically, as far as I understand it, you’re only allowed to use scraped data if you have clearance to do so by the website owner (i.e. the one that ‘owns’ the data).

HtmlAgility Pack

To get started, we first need the Html Agility Pack (HAP), a very thorough HTML parser (read all about it here). The nice thing about HAP is that it also supports XPath queries and Linq to Objects, making it actually fun (in a geeky way) to perform web scraping in C#. Unfortunately, ‘HAP for WP7’ currently doesn’t support XPath queries, so we’ll use Linq to Objects for the remainder of this tutorial.

For some nice demos of what can be done with HAP, make sure to check out the following tutorial.

When you download HAP from CodePlex, you still need to build the WP7 dll manually. The following post describes the steps needed to do this. For the lazy people amongst us, here you will find the compiled dll, based on the HAP version of February 2012 (make sure to ‘unblock’ the dll if you downloaded it from this site: right-click the file, choose Properties and then click ‘Unblock’).

Add reference to HAP in WP7

Next we need to add a reference to the HAP dll in order to use all the sweetness it contains. In Solution Explorer, right-click the References folder and choose Add Reference.

[Screenshot: the Add Reference menu item]

Next point to the downloaded or compiled HtmlAgilityPack.dll file and choose Add.

Finally add a using statement and we are all set to go:

[Code screenshot: the using statement for HAP]
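As a minimal sketch, the directives at the top of the code-behind file would look something like this. HtmlAgilityPack is the namespace the library ships its types in; the System.Linq line is my own addition, assuming you will write the Linq to Objects queries used later on.

```csharp
// Brings HtmlDocument, HtmlNode and HtmlWeb into scope.
using HtmlAgilityPack;
// Assumed addition: needed for Linq to Objects queries (where, select, FirstOrDefault).
using System.Linq;
```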

Download page to scrape

We use the static LoadAsync method of the HAP class HtmlWeb to download and parse our html file. We provide the url to download, as well as the callback method that is invoked once the file has been downloaded and processed by HAP:

[Code screenshot: the call to HtmlWeb.LoadAsync]
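A rough sketch of what that call can look like in the page constructor. The url is an assumption (the demo targets xkcd.com), and the callback is only a stub here so that the fragment stands on its own; the names follow the code posted in the comments below.

```csharp
using HtmlAgilityPack;
using Microsoft.Phone.Controls;

public partial class MainPage : PhoneApplicationPage
{
    public MainPage()
    {
        InitializeComponent();

        // Start the download; HAP invokes DownLoadCompleted once the
        // page has been fetched and parsed. The url is an assumption.
        HtmlWeb.LoadAsync("http://xkcd.com", DownLoadCompleted);
    }

    // Stub callback: the real work happens here once the page has arrived.
    void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
    }
}
```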

Next we define the callback method in which we immediately check if the download went well. Once that is done, we can start querying our parsed html document:

[Code screenshot: the callback method]
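The callback could be sketched roughly as follows. The Error and Document members are the ones used in the code posted in the comments below; the message text and the early returns are my own choices, not necessarily what the original screenshot showed.

```csharp
using System.Windows;
using HtmlAgilityPack;
using Microsoft.Phone.Controls;

public partial class MainPage : PhoneApplicationPage
{
    void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
        // First check whether the download went well.
        if (e.Error != null)
        {
            MessageBox.Show("Download failed: " + e.Error.ToString());
            return;
        }

        HtmlDocument doc = e.Document;
        if (doc == null)
        {
            return;
        }

        // From here on the parsed document can be queried.
    }
}
```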

Alternatives to downloading pages directly

As a side note, you might as well use the WP7 WebClient.DownloadStringAsync method to download the html. Afterwards you can then feed the downloaded string (or a string you retrieved through some other obscure manner) to HAP using the LoadHtml() method:

[Code screenshot: feeding a string to HAP with LoadHtml]
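A minimal sketch of that alternative, assuming the standard WebClient events. PageDownloader, Download and onParsed are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper: downloads a page as a string and lets HAP parse it.
public class PageDownloader
{
    public void Download(string url, Action<HtmlDocument> onParsed)
    {
        var client = new WebClient();
        client.DownloadStringCompleted += (sender, e) =>
        {
            // Give up quietly when the download failed or was cancelled.
            if (e.Error != null || e.Cancelled)
            {
                return;
            }

            // Feed the downloaded string to HAP.
            var doc = new HtmlDocument();
            doc.LoadHtml(e.Result);
            onParsed(doc);
        };
        client.DownloadStringAsync(new Uri(url));
    }
}
```

The same LoadHtml call works for any html string, whatever its origin.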

Discovering the page layout

For the purpose of this demo, we would like to retrieve the url of the latest xkcd.com joke image. The best way to do this is to open your browser and open the developer tools by hitting the F12 key (works for me in Chrome and IE9). I prefer Chrome for this because the currently selected element is highlighted on the page itself, making it easier to rapidly drill down to the element you need.

To rapidly view the position of a specific element inside the html DOM, simply right click the element on the page and choose “Inspect element”. So we right click the image and choose “Inspect element”:

[Screenshot: the browser developer tools with the comic image selected]

Using the developer tools you now have to look for the element(s) and/or attributes needed. Once you have correctly identified the needed element, you need to find the most straightforward way to retrieve that element in your code:

  • if the html has abundant div and/or other elements with unique ids, it is simply a matter of finding that specific element by filtering on the unique id.
  • if the element has no unique identifier and is part of a group of equal elements, you will need to iterate over all these elements and do some manual comparisons using, for example, a regular expression; or hardcode the exact position (e.g. retrieve the 3rd img element from a given node).

Note, at the bottom of the developer tools you’ll also notice the full ‘path’ of the element, which can be handy if you’re getting lost in more advanced (or crappy) pages.

To be honest, the ‘hardest part’ in my opinion is identifying the correct path to the wanted element.

Using Linq-to-objects to get the goodies

Once we have identified the correct way of retrieving the element (or a certain attribute value of that element) it is time to start writing the necessary Linq code. The basic principles are explained very clearly in this blog, so I’ll immediately dive into some more ‘hardcore’ *ahem* html.

For this demo we need the value of the “src” attribute of the img-element inside the uniquely named div-element with id “comic”.

First we’ll try to ‘capture’ the unique div, if we don’t find that one we can be pretty certain that some error occurred (e.g. our url is wrong, the site has changed its html, etc.):

[Code screenshot: the first query]
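Such a query could be sketched like this. ComicQueries and FindComicDivs are hypothetical names, and the null check on the id attribute is my own precaution for divs that have no id at all.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ComicQueries
{
    // Returns every div whose id attribute equals "comic".
    public static IEnumerable<HtmlNode> FindComicDivs(HtmlDocument doc)
    {
        return from divnode in doc.DocumentNode.Descendants("div")
               where divnode.Attributes["id"] != null
                     && divnode.Attributes["id"].Value == "comic"
               select divnode;
    }
}
```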

So here we capture all the child nodes of the document that are div-elements whose “id” attribute equals “comic”. Notice that a new IEnumerable collection is returned: we can feed this collection to a new query or iterate over this collection with, for example, a foreach loop.

Because we actually know that there will be only one element, or none at all, we will change our query to:

[Code screenshot: the query using FirstOrDefault]

By using FirstOrDefault() we can then check if the correct element was found or not (it will be null if not found), without having to cope with possible exceptions:

[Code screenshot: the null check that starts the scraping]
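Put together, the FirstOrDefault() variant and the null check might look roughly like the sketch below. ComicScraper, FindComicDiv and PageLooksRight are hypothetical names chosen for this illustration.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class ComicScraper
{
    // Returns the div with id "comic", or null when the page has none.
    public static HtmlNode FindComicDiv(HtmlDocument doc)
    {
        return (from divnode in doc.DocumentNode.Descendants("div")
                where divnode.Attributes["id"] != null
                      && divnode.Attributes["id"].Value == "comic"
                select divnode).FirstOrDefault();
    }

    public static bool PageLooksRight(HtmlDocument doc)
    {
        HtmlNode comicDiv = FindComicDiv(doc);
        if (comicDiv != null)
        {
            // The expected div is there: safe to continue scraping inside it.
            return true;
        }

        // Wrong url, changed html layout, and so on.
        return false;
    }
}
```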

Once inside the if-braces, we can be fairly sure we’ll find the desired image url.

Because we know that inside the comic-div there will be only one element of the img-kind we can retrieve the value of its attributes using the following statement:

[Code screenshot: retrieving the src attribute of the img element]
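A sketch of such a statement, wrapped in a hypothetical helper (ComicImage.GetImageUrl) so that it is clear where comicDiv comes from:

```csharp
using HtmlAgilityPack;

static class ComicImage
{
    // comicDiv is assumed to be the div with id "comic" found earlier.
    public static string GetImageUrl(HtmlNode comicDiv)
    {
        // Take the img child of the div and read its src attribute.
        return comicDiv.Element("img").Attributes["src"].Value;
    }
}
```

Note that this throws if the div has no img child or the img has no src attribute.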

The Element() method returns the single child element of the “img” type. Next we retrieve the value of the “src” attribute of that img-element.

It should be noted that the slightest mistake in your queries can result in an exception being thrown; make it a practice to catch these (otherwise your WP7 app will simply quit without any notification should a HAP exception occur).
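One possible way to do that is to wrap the query in a try/catch and return null on failure. SafeScraper and TryGetImageUrl are hypothetical names; catching the general Exception type is a deliberate simplification for this sketch.

```csharp
using System;
using HtmlAgilityPack;

static class SafeScraper
{
    // Returns the image url, or null when anything goes wrong in the query.
    public static string TryGetImageUrl(HtmlNode comicDiv)
    {
        try
        {
            return comicDiv.Element("img").Attributes["src"].Value;
        }
        catch (Exception)
        {
            // For instance a NullReferenceException because the img
            // element or its src attribute is missing.
            return null;
        }
    }
}
```

The caller then only has to check the result for null before using it.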

Important update (7/12/12)

A problem I circumvented in the previous examples was exceptions occurring if you tried to access the value of an attribute that didn’t exist. For example, suppose we need all elements that have a class-attribute with a specific value. If we’d write the following, we could get an exception if there is at least one element in the document that doesn’t have the class-attribute:

var q = from s in doc.DocumentNode.Descendants() where  s.Attributes["class"].Value == "someVal" select s;

This is why, in the previous examples, I so painfully slowly walked down the nodes to get what I wanted.
However, there’s a pretty straightforward way of accomplishing all the above by simply checking whether an element has the attribute we need at all, as shown here:

var q = from s in doc.DocumentNode.Descendants() where s.Attributes["class"] !=null &&  s.Attributes["class"].Value == "someVal" select s;

It is important that you FIRST check if the attribute exists on the current element (s.Attributes["class"] != null) and THEN check if the value compares to the one we need. Doing this the other way around will again result in an exception, because the && operator evaluates its operands from left to right and skips the right-hand one when the left-hand one is false.
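As a possible alternative, HtmlNode also offers GetAttributeValue, which returns a default value when the attribute is missing. The sketch below assumes your HAP build exposes that method; ClassFilter and WithClass are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ClassFilter
{
    // Returns all elements whose class attribute equals the given value.
    public static IEnumerable<HtmlNode> WithClass(HtmlDocument doc, string value)
    {
        // GetAttributeValue falls back to "" when there is no class
        // attribute, so no separate null check is needed.
        return doc.DocumentNode.Descendants()
                  .Where(s => s.GetAttributeValue("class", "") == value);
    }
}
```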

Using Element and Elements

Depending on your preference, there are several ways to retrieve the data you need. For example, if we are 100% certain that the page will have the html layout we need, we can drill down to the needed value using the Element/Elements methods, as follows.

  1. You use Element(string type) if you are certain that the current node will have one and only one child element of the given type, which is passed as a parameter.
  2. You use Elements(string type) to retrieve all the child nodes of the current node that are of a given type.

In order to retrieve our image url, we could open the developer tool and note the unique path to it on the bottom of the window:

[Screenshot: the element path at the bottom of the developer tools]

It’s now a matter of translating this path to the correct fluent method chain (is that a correct word?), resulting in the following:

[Code screenshot: the Element/Elements query with filters]
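A sketch of such a chain is shown below. PathQuery, DivWithId and the outerDivId parameter are hypothetical: the id of the outer div has to be read from the path in your own developer tools, and only the “comic” id comes from this tutorial.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class PathQuery
{
    // Picks the child div of 'parent' that carries the given id.
    static HtmlNode DivWithId(HtmlNode parent, string id)
    {
        return parent.Elements("div")
                     .First(d => d.Attributes["id"] != null
                                 && d.Attributes["id"].Value == id);
    }

    // Walks html > body > outer div > div "comic" > img and returns its src.
    public static string GetImageUrl(HtmlDocument doc, string outerDivId)
    {
        HtmlNode body = doc.DocumentNode.Element("html").Element("body");
        HtmlNode outer = DivWithId(body, outerDivId);
        HtmlNode comic = DivWithId(outer, "comic");
        return comic.Element("img").Attributes["src"].Value;
    }
}
```

First() throws when no div matches, so this variant only suits pages whose layout you trust.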

In fact, we could even hardcode this more (not always recommended). We know that we need the second div element of the body, and inside that div we again need the second div-element. So we could write the following and thus skip the need for specifying the filters:

[Code screenshot: the Element/Elements query by position]
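The positional variant could be sketched like this, using ElementAt(1) to pick the second div at each level. PositionalQuery is a hypothetical name, and the code assumes the page layout described above.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class PositionalQuery
{
    public static string GetImageUrl(HtmlDocument doc)
    {
        return doc.DocumentNode
                  .Element("html")
                  .Element("body")
                  .Elements("div").ElementAt(1)   // second div of the body
                  .Elements("div").ElementAt(1)   // second div inside that div
                  .Element("img")
                  .Attributes["src"].Value;
    }
}
```

Any change in the order of the divs on the page breaks this query, which is why hardcoding positions is not always recommended.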

Note: I can’t really proclaim to be a skilled Linq writer, so if any of these steps can be done more quickly, don’t hesitate to mention so.

Cherry on the pie: show the image

To show that this works, suppose we have the following, very empty, WP7 xaml page on which we wish to load the joke:

[Screenshot: the xaml page]

All that needs to be done is to assign the retrieved url as a new source to the Image control, named jokeimg:

BitmapImage b= new BitmapImage(new Uri(imgurl));
jokeimg.Source = b;
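If the src attribute holds a relative path such as /image.jpg (as one of the commenters below ran into), new Uri(imgurl) will not accept it. A minimal sketch of one way around this is to resolve the value against the url of the scraped page; UrlHelper and ToAbsolute are hypothetical names.

```csharp
using System;

static class UrlHelper
{
    // Resolves src against the url of the page it was scraped from.
    // A src that is already an absolute url is returned unchanged.
    public static Uri ToAbsolute(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }
}
```

The result can be passed straight to the BitmapImage constructor, for example new BitmapImage(UrlHelper.ToAbsolute("http://xkcd.com/", imgurl)).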

That’s all folks!

Update: You can download the full demo-solution here.


About timdams
C#, .NET, Microsoft, security, .... Read more on : http://timdams.com/

20 Responses to Writing a WP7 website scraper application

  1. Pjotri says:

    I have only read a little bit of your tutorial but wow thanks! I had so many problems building HAP.dll and there you are handing it out for nothing :)

  2. Steve says:

    Hey Tim, Thanks for the tutorial, this is something I want to create though I am a complete beginner at coding. I’ve struggled to get some things to work, probably putting in the wrong location or even in the wrong files. Would you be able to share your project file for this tutorial? Would help me learn the basics while creating an app I have a use for.

    • timdams says:

      Hi Steve,
      I’ve added a link to the sourcode-solution at the end of the blog.
      Thx for the suggestion.

      • Steve says:

        Thank you Tim!
        I’ll have a look through tonight after work :)
        Really appreciate it!

  3. naifmhd says:

    Hello. Thank you for the nice tutorial. I would like to know if you can explain how to scrap a content of text in a website with this method. thank you.

  4. Steve says:

    naifmhd :
    Hello. Thank you for the nice tutorial. I would like to know if you can explain how to scrap a content of text in a website with this method. thank you.

    I too would like to know this, got images to work perfectly (apart from sites source that use /image.jpg as the src”” instead of the full url).

    Scrapping text is the main use I have and I will continue to google for advice on how to do it, though if it’s not too difficult for you to write an addition to this tutorial it would be appreciated.

  5. timdams says:

    You can use the InnerText and InnerHtml properties of each HtmlNode you’ve scraped. So simply capture the div or whatever that contains your text, strip out stuff you don’t need, et voilà.
    This example strips out the header text in the upper-right corner of the xkcd-site:

    // The null check skips divs that have no id attribute at all.
    var newsdiv = (from divnode in doc.DocumentNode.Descendants("div")
                   where divnode.Attributes["id"] != null && divnode.Attributes["id"].Value == "news"
                   select divnode).FirstOrDefault();

    MessageBox.Show(HttpUtility.HtmlDecode(newsdiv.InnerText));

    Which will show the following text in the messagebox:
    XKCD updates every Monday, Wednesday, and Friday.
    You can get prints, posters, and t-shirts in the store.

  6. Daniel Pino says:

    Hello Tim,

    Thanks for the tutorial,

    I am creating a website for a club and I would like to create a windows phone app that is synced with the news on the club’s website. Would this be a good technique to use? Also, your examples show how to get a single piece of text, but how would I go about getting multiple pieces of text and applying them to a listbox of textblocks. I’m not sure if my question is clear, but please help me out. Or is there an easier alternative?

    Thanks

    • timdams says:

      Hi Daniel
      1° If you need multiple pieces of text (or divs, or whatever), you can simply write a query that returns a collection of the stuff you need. You then need to iterate over that collection and add each item to the list.
      For example, suppose we’d like to add each piece of text in the topleft div on the xkcd.com page (Archive, Forums, Blag, Store, About), we can write:

      // The null check skips divs that have no id attribute at all.
      var links = (from divnode in doc.DocumentNode.Descendants("div")
                   where divnode.Attributes["id"] != null && divnode.Attributes["id"].Value == "topLeft"
                   select divnode).FirstOrDefault().Element("ul").Elements("li");
      foreach (var htmlNode in links)
      {
          lb.Items.Add(htmlNode.InnerText);
      }

      2° Scraping isn’t very interesting if you yourself are the owner of the content. If you’d like to make a mobile version, I think it’d be better to create a mobile version of the page as explained here http://www.webpagefx.com/design-build-mobile-web-site.html or google for convert page mobile (because basically webdesign is not my cup of tea :) ).

      Goodluck

  7. naifmhd says:

    Hello. If its not a too much trouble for you. can you help me understand why this code is not working.

    public MainPage()
    {
        InitializeComponent();
        HtmlWeb.LoadAsync("http://www.haveeru.com.mv/dhivehi/news/120946", DownLoadCompleted);

    }
    void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
        if (e.Error == null)
        {
            HtmlDocument doc = e.Document;
            if (doc != null)
            {
                //Start scraping

                var newsdiv = (from divnode in doc.DocumentNode.Descendants("div")
                               where divnode.Attributes["id"].Value == "article"
                               select divnode).FirstOrDefault();

                MessageBox.Show(HttpUtility.HtmlDecode(newsdiv.InnerHtml));

            }
        }
    }

    I want to take innertext of
    appreciate any help.

  8. Steve says:

    Hi Tim,

    I changed ‘MessageBox.Show(HttpUtility.HtmlDecode(newsdiv.InnerText));’ this to

    var news = HttpUtility.HtmlDecode(newsdiv.InnerText);
    MessageBox.Show(news);

    Which works fine, though I’m struggling to call this variable from Xaml to use it as the default content of a textblock.

    Been reading and playing with DataContext and Bindings for two days but rather frustratingly I just cannot get it to work.

    Hoping you’d be able to help me out.

    I have realised I’ve stepped into a too advanced first project, but i’m not going to be able to get this out of my head until it’s resolved. Once completed (this stage anyways) I’m going to look at some more basic tutorials and learn how xaml and c# work together.

    Steve.

  9. Nice blog, I would like to tell you that you have given me much knowledge about it. Thanks for everything.

  10. NewbieScraper says:

    Thanks for this nice tutorial :) I had never worked with scraping before your tutorial and it was so easy for me to follow and to implement in my own project, that I really wanted to thank you via this reply. Keep up the good work ;)

  11. Arash says:

    I have found out that on heavy pages it takes time to download and work with a page you want to scrape from. If the page is not downloaded by the time you call the methods it will simply skip over the method. My question is: how do I put in a pause and check whether the page is downloaded, so I can search in it?

  12. cambo78 says:

    Hey! Really nice blog post, but as a novice, I had no idea how to scrap thing in this website.

    Could someone helps me getting those text from here http://www.newsdlrp.com/dlpwaittime.html to a WP7 app ?

  13. Dominic says:

    Excellent blog post! Thanks for all the extra links to the background tutorials, really helped. This is exactly the type of tutorial I’ve been looking for. Thanks again Tim!

  14. Savio Mody says:

    Howdy would you mind letting me know which web host you’re working with? I’ve loaded your blog in 3 different browsers
    and I must say this blog loads a lot quicker then most.
    Can you suggest a good internet hosting provider at a honest price?
    Thank you, I appreciate it!

  15. Deepak Sharma says:

    how about extracting some text from that page instead of image…
    help…please

    • timdams says:

      Hi Deepak,
      You can use the InnerText and InnerHtml properties for that. It’s part of HtmlNode.

  16. Hello, Thanks for the article :)
    I am facing one issue while implementing it. I downloaded your source code and while building I am getting one issue as
    ***The type 'System.Xml.XPath.IXPathNavigable' is defined in an assembly that is not referenced. You must add a reference to assembly 'System.Xml.XPath, Version=2.0.5.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. C:\Users\Sanket\Documents\Visual Studio 2012\Projects\Scraping\Scraping\MainPage.xaml.cs****
    And this error is pointing to the statement “HtmlDocument doc = e.Document;”
    Can you help me out as why it is showing this error?

    Note: I am using Visual studio 2012, I have implemented the HtmlAgility packet properly through nuget manager.
