April 16, 2012 / April 19, 2020, by timdams, 22 comments

Writing a WP7 website scraper application

In this tutorial I will explain how to write a WP7 application that uses the Html Agility Pack to work with information scraped from a website.
Website scraping is the act of retrieving information from a web page. Some consider it stealing, others borrowing. Let’s leave that debate to others. In this post I will show how easy it is to scrape content from a website so that you can (re)use it in your Windows Phone 7 application. For the most part, this information will of course also work in other, non-WP7 projects.
Sometimes website scraping is the only means available to consume certain information from a website. If the website doesn’t offer a publicly available API or web service, you’re pretty much left with scraping, whether you like it or not.

Now before reading on, it is extremely important to understand that there are legal issues concerning scraping: basically, as far as I understand it, you’re only allowed to use scraped data if you have clearance to do so by the website owner (i.e. the one that ‘owns’ the data).

HtmlAgility Pack

To get started, we first need the Html Agility Pack (HAP), which is a very thorough HTML parser (read all about it here). The nice thing about HAP is that it also supports XPath queries and LINQ to Objects, making it actually fun (in a geeky way) to perform web scraping in C#. Unfortunately, ‘HAP for WP7’ currently doesn’t support XPath queries, so we’ll use LINQ to Objects for the remainder of this tutorial.

For some nice demos of what can be done with HAP, make sure to check out the following tutorial.

When you download HAP from CodePlex, you will still need to build the WP7 dll manually. The following post describes the steps needed to do this. For the lazy people amongst us, here you will find the compiled dll, based on the HAP version of February 2012 (make sure to ‘unblock’ the dll if you downloaded it from this site: right-click the file, choose Properties and then click ‘Unblock’).

Add reference to HAP in WP7

Next we need to add a reference to the HAP dll in order to use all the sweetness it contains. In the Solution Explorer, right-click the References folder and choose Add Reference.

Next point to the downloaded or compiled HtmlAgilityPack.dll file and choose Add.

Finally, add a using statement for the HtmlAgilityPack namespace (using HtmlAgilityPack;) and we are all set to go.

Download page to scrape

We use the static LoadAsync method of the HAP class HtmlWeb to download and parse our html file. We provide the url to download, as well as the callback method that is called once the file is downloaded and processed by HAP:

Next we define the callback method in which we immediately check if the download went well. Once that is done, we can start querying our parsed html document:
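A minimal sketch of both steps, assuming the WP7 build of HAP, where LoadAsync is a static method on HtmlWeb and the callback receives an HtmlDocumentLoadCompleted argument (the same shape readers use in the comments below). The url and the method name DownLoadCompleted are only illustrative, and other HAP builds may expose a different API.

```csharp
using System.Windows;
using HtmlAgilityPack;
using Microsoft.Phone.Controls;

public partial class MainPage : PhoneApplicationPage
{
    public MainPage()
    {
        InitializeComponent();
        // Start the download; HAP calls DownLoadCompleted once the page is parsed.
        HtmlWeb.LoadAsync("http://xkcd.com", DownLoadCompleted);
    }

    private void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
    {
        // First check that the download went well.
        if (e.Error == null)
        {
            HtmlDocument doc = e.Document;
            if (doc != null)
            {
                // Start scraping: doc.DocumentNode is the root to query.
            }
        }
    }
}
```

Because the call is asynchronous, any code that needs the document has to live in (or be called from) the callback; the constructor returns long before the page has arrived.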

Alternatives to downloading pages directly

As a side note, you could just as well use the WP7 WebClient.DownloadStringAsync method to download the html. Afterwards you can feed the downloaded string (or a string you retrieved in some other obscure manner) to HAP using the LoadHtml() method:
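A rough sketch of that alternative. The class HtmlStringLoader and its two methods are names I made up for this example; WebClient.DownloadStringAsync and HtmlDocument.LoadHtml are the only library calls involved.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

public static class HtmlStringLoader
{
    // Parses an html string, however it was obtained.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Downloads the page with WebClient and passes the parsed document to onLoaded.
    // Nothing is reported on failure in this sketch; add your own error handling.
    public static void Download(string url, Action<HtmlDocument> onLoaded)
    {
        var client = new WebClient();
        client.DownloadStringCompleted += (sender, e) =>
        {
            if (e.Error == null && !e.Cancelled)
            {
                onLoaded(Parse(e.Result));
            }
        };
        client.DownloadStringAsync(new Uri(url));
    }
}
```

Calling HtmlStringLoader.Download("http://xkcd.com", doc => { /* start scraping */ }); then plays the same role as the HtmlWeb call.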

Discovering the page layout

For the purpose of this demo, we would like to retrieve the url of the latest xkcd.com joke image. The best way to do this is to open the page in your browser and open the developer tools by hitting the F12 key (works for me in Chrome and IE9). I prefer Chrome for this, simply because the currently selected element is highlighted on the page itself, making it easier to rapidly drill down to the element you need.

To rapidly view the position of a specific element inside the html DOM, simply right click the element on the page and choose “Inspect element”. So we right click the image and choose “Inspect element”:

Using the developer tools you now have to look for the element(s) and/or attributes you need. Once you have correctly identified the needed element, you need to find the most straightforward way to retrieve that element in your code:

if the html has plenty of div and/or other elements with unique ids, it is simply a matter of finding that specific element by filtering on the unique id.
if the element has no unique identifier and is part of a group of similar elements, you will need to iterate over all these elements and do some manual comparisons using, for example, regular expressions; or hardcode the exact position (e.g. retrieve the 3rd img element from a given node).

Note, at the bottom of the developer tools you’ll also notice the full ‘path’ of the element, which can be handy if you’re getting lost in more advanced (or crappy) pages.

To be honest, the ‘hardest part’ in my opinion is identifying the correct path to the wanted element.

Using Linq-to-objects to get the goodies

Once we have identified the correct way of retrieving the element (or a certain attribute value of that element) it is time to start writing the necessary linq code. The basic principles are explained very clearly in this blog, so I’ll immediately dive into some more ‘hardcore’ *ahem * html.

For this demo we need the value of the “src” attribute of the img-element, inside the uniquely named div-element with id “comic”.

First we’ll try to ‘capture’ the unique div, if we don’t find that one we can be pretty certain that some error occurred (e.g. our url is wrong, the site has changed its html, etc.):

So here we capture all the child nodes of the document that are div-elements whose “id” attribute equals “comic”. Notice that a new IEnumerable collection is returned: we can feed this collection to a new query or iterate over this collection with, for example, a foreach loop.
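Such a query could look roughly like this. The class and method names are just for illustration, and the extra null check on the id attribute is there so that div elements without an id don't cause an exception.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ComicQueries
{
    // Returns every div in the document whose id attribute equals "comic".
    public static IEnumerable<HtmlNode> FindComicDivs(HtmlDocument doc)
    {
        return from divnode in doc.DocumentNode.Descendants("div")
               where divnode.Attributes["id"] != null
                  && divnode.Attributes["id"].Value == "comic"
               select divnode;
    }
}
```

The result is lazy: nothing is searched until you iterate over it, for example with foreach (HtmlNode div in ComicQueries.FindComicDivs(doc)) { ... }.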

Because we actually know that there will be only one element, or none at all, we will change our query to:

By using FirstOrDefault() we can then check if the correct element was found or not (it will be null if not found), without having to cope with possible exceptions:
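Put together, the changed query and the null check might look like the sketch below. ComicLookup and its methods are hypothetical names, and the sketch keeps the attribute null check so that a div without an id can't throw.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ComicLookup
{
    // Returns the first div with id "comic", or null when the page has none.
    public static HtmlNode FindComicDiv(HtmlDocument doc)
    {
        return (from divnode in doc.DocumentNode.Descendants("div")
                where divnode.Attributes["id"] != null
                   && divnode.Attributes["id"].Value == "comic"
                select divnode).FirstOrDefault();
    }

    // Shows the check: a null result means something went wrong
    // (wrong url, changed page layout, ...).
    public static bool PageHasComic(HtmlDocument doc)
    {
        HtmlNode comicDiv = FindComicDiv(doc);
        if (comicDiv != null)
        {
            // Safe to keep digging inside comicDiv here.
            return true;
        }
        return false;
    }
}
```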

Once inside the if-braces, we can be fairly certain we’ll find the desired image url.

Because we know that inside the comic-div there will be only one element of the img-kind we can retrieve the value of its attributes using the following statement:

The Element() method returns the single element of the “img” type. Next we retrieve the value of the “src” attribute of that img-element.
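As a sketch, the one-liner is comicDiv.Element("img").Attributes["src"].Value. The helper below (a made-up name) does the same thing but returns null instead of throwing when the img element or its src attribute is missing.

```csharp
using HtmlAgilityPack;

public static class ComicImage
{
    // Returns the src of the img child of the given div, or null if it is missing.
    public static string GetImageUrl(HtmlNode comicDiv)
    {
        HtmlNode img = comicDiv.Element("img");
        if (img == null || img.Attributes["src"] == null)
        {
            return null;
        }
        return img.Attributes["src"].Value;
    }
}
```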

It should be noted that the slightest mistake in your queries can result in an exception being thrown; make it a practice to catch these (otherwise your WP7 app will simply quit without any notification should an HAP exception occur).

Important update (7/12/12)

A problem I circumvented in the previous examples was the exception that occurs if you try to access the value of an attribute that doesn’t exist. For example, suppose we need all elements that have a class-attribute with a specific value. If we’d write the following, we would get an exception as soon as there is at least one element in the document that doesn’t have the class-attribute:

var q = from s in doc.DocumentNode.Descendants() where  s.Attributes["class"].Value == "someVal" select s;

This is why, in the previous examples, I so painfully slowly walked down the nodes to get what I wanted.
However, there’s a pretty straightforward way of accomplishing all of the above: simply check whether an element has the attribute we need at all, as shown here:

var q = from s in doc.DocumentNode.Descendants() where s.Attributes["class"] !=null &&  s.Attributes["class"].Value == "someVal" select s;

It is important that you FIRST check if the attribute exists on the current element (s.Attributes["class"] != null) and THEN check if the value compares to the one we need. Doing this the other way around will again result in an exception, because the && operator evaluates its operands from left to right and only evaluates the second one if the first is true.
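If your HAP build has it, HtmlNode.GetAttributeValue(name, default) wraps the same check: it returns the default when the attribute is missing. I know it from the desktop builds, so treat its presence in the WP7 build as an assumption. A small sketch with a made-up helper name:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassFilter
{
    // Returns all nodes whose class attribute equals the given value.
    // Nodes without a class attribute yield "" and are skipped
    // (as long as value itself is not the empty string).
    public static IEnumerable<HtmlNode> WithClass(HtmlDocument doc, string value)
    {
        return from s in doc.DocumentNode.Descendants()
               where s.GetAttributeValue("class", "") == value
               select s;
    }
}
```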

Using Element and Elements

Depending on your preference, there are several ways to retrieve the data you need. For example, if we are 100% certain that the page will have the html layout we need, we can drill down to the needed value using the Element/Elements methods, as follows.

You use Element(string type) if you are certain that the current node will have one and only one child element of the given type, which is passed as a parameter.
You use Elements(string type) to retrieve all the child nodes of the current node of a given type.
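To make the difference concrete, here is a small sketch on a hypothetical fragment (a div holding one ul with several li items). The helper name is made up.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ChildElements
{
    // Collects the text of every li inside the single ul child of parent.
    public static List<string> ListItemTexts(HtmlNode parent)
    {
        var texts = new List<string>();
        HtmlNode list = parent.Element("ul");            // one child element
        if (list == null)
        {
            return texts;                                // no ul: nothing to collect
        }
        foreach (HtmlNode item in list.Elements("li"))   // all child elements of a type
        {
            texts.Add(item.InnerText);
        }
        return texts;
    }
}
```

Both methods only look at direct children of the node, unlike Descendants(), which searches the whole subtree.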

In order to retrieve our image url, we could open the developer tool and note the unique path to it on the bottom of the window:

It’s now a matter of translating this path to the correct fluent method chain (is that a correct word?), resulting in the following:

In fact, we could even hardcode this more (not always recommended). We know that we need the second div element of the body, and inside that div we again need the second div-element. So we could write the following and thus skip the need for specifying the filters:
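Both variants can be sketched as follows. The markup in SampleHtml is a hypothetical stand-in, not the real xkcd page, and the ids "container" and "menu" are assumptions made for this example; only the shape of the two chains matters.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class PathWalk
{
    // Hypothetical page: body > second div > second div > img.
    public const string SampleHtml =
        "<html><body>" +
        "<div id=\"header\"></div>" +
        "<div id=\"container\">" +
        "<div id=\"menu\"></div>" +
        "<div id=\"comic\"><img src=\"http://example.com/comic.png\" /></div>" +
        "</div>" +
        "</body></html>";

    private static bool HasId(HtmlNode node, string id)
    {
        return node.Attributes["id"] != null && node.Attributes["id"].Value == id;
    }

    // Variant 1: follow the path and filter each step on the id attribute.
    public static string ByFilter(HtmlDocument doc)
    {
        return doc.DocumentNode
            .Element("html").Element("body")
            .Elements("div").First(d => HasId(d, "container"))
            .Elements("div").First(d => HasId(d, "comic"))
            .Element("img").Attributes["src"].Value;
    }

    // Variant 2: rely purely on position (second div, then second div again).
    public static string ByPosition(HtmlDocument doc)
    {
        return doc.DocumentNode
            .Element("html").Element("body")
            .Elements("div").ElementAt(1)
            .Elements("div").ElementAt(1)
            .Element("img").Attributes["src"].Value;
    }
}
```

Either chain throws as soon as one step of the path is missing, so both belong inside a try/catch; the positional variant additionally breaks when the site merely reorders its divs.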

Note: I can’t really proclaim to be a skilled Linq writer, so if any of these steps can be done more quickly, don’t hesitate to mention so.

Cherry on the pie: show the image

To show that this works, suppose we have the following, very empty, WP7 xaml page on which we wish to load the joke:

All that needs to be done is assign the retrieved url as a new source to the Image control, named jokeimg:

BitmapImage b= new BitmapImage(new Uri(imgurl));
jokeimg.Source = b;
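One thing to watch out for, as a reader notes in the comments: some sites use a relative src such as /image.jpg, while new Uri(imgurl) expects an absolute url. A minimal sketch of a workaround, assuming you still know the url of the page you scraped; the class and method names are made up for this example.

```csharp
using System;

public static class ImageUrls
{
    // Resolves src against the url of the page it was scraped from.
    // An src that is already absolute is returned unchanged.
    public static Uri ToAbsolute(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }
}
```

The result can be passed straight to the image: jokeimg.Source = new BitmapImage(ImageUrls.ToAbsolute("http://xkcd.com/", imgurl));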

That’s all folks!

Update: You can download the full demo-solution here.

22 thoughts on “Writing a WP7 website scraper application”

Pjotri, April 18, 2012, 15:31

I have only read a little bit of your tutorial but wow, thanks! I had so many problems building HAP.dll and there you are handing it out for nothing 🙂

Steve, May 2, 2012, 19:31

Hey Tim, Thanks for the tutorial, this is something I want to create though I am a complete beginner at coding. I’ve struggled to get some things to work, probably putting in the wrong location or even in the wrong files. Would you be able to share your project file for this tutorial? Would help me learn the basics while creating an app I have a use for.

1. timdams (post author), May 3, 2012, 09:27
  
  Hi Steve,
  I’ve added a link to the source code solution at the end of the blog post.
  Thx for the suggestion.
  
  1. Steve, May 3, 2012, 10:14
    
    Thank you Tim!
    I’ll have a look through tonight after work 🙂
    Really appreciate it!
    
naifmhd, May 3, 2012, 08:53

Hello. Thank you for the nice tutorial. I would like to know if you can explain how to scrap a content of text in a website with this method. thank you.

Steve, May 6, 2012, 12:21

naifmhd :
Hello. Thank you for the nice tutorial. I would like to know if you can explain how to scrap a content of text in a website with this method. thank you.

I too would like to know this, got images to work perfectly (apart from sites whose source uses /image.jpg as the src instead of the full url).

Scrapping text is the main use I have and I will continue to google for advice on how to do it, though if it’s not too difficult for you to write an addition to this tutorial it would be appreciated.

timdams (post author), May 7, 2012, 14:36

You can use the InnerText and InnerHtml properties of each HtmlNode you’ve scraped. So simply capture the div or whatever contains your text, strip out the stuff you don’t need, et voilà.
This example pulls out the header text in the upper right corner of the xkcd site:

var newsdiv = (from divnode in doc.DocumentNode.Descendants("div")
               where divnode.Attributes["id"] != null && divnode.Attributes["id"].Value == "news"
               select divnode).FirstOrDefault();

// newsdiv is null when the page has no div with id "news"
if (newsdiv != null)
    MessageBox.Show(HttpUtility.HtmlDecode(newsdiv.InnerText));

Which will show the following text in the messagebox:
XKCD updates every Monday, Wednesday, and Friday.
You can get prints, posters, and t-shirts in the store.

Daniel Pino, May 8, 2012, 10:14

Hello Tim,

Thanks for the tutorial,

I am creating a website for a club and I would like to create a windows phone app that is synced with the news on the club’s website. Would this be a good technique to use? Also, your examples show how to get a single piece of text, but how would I go about getting multiple pieces of text and applying them to a listbox of textblocks. I’m not sure if my question is clear, but please help me out. Or is there an easier alternative?

Thanks

1. timdams (post author), May 8, 2012, 10:25
  
  Hi Daniel
  1° If you need multiple pieces of text (or divs, or whatever), you can simply write a query that returns a collection of the stuff you need. You then need to iterate over that collection and add each item to the list.
  For example, suppose we’d like to add each string in the top left div on the xkcd.com page (Archive, Forums, Blag, Store, About), we can write:
  
  var topLeft = (from divnode in doc.DocumentNode.Descendants("div")
                 where divnode.Attributes["id"] != null && divnode.Attributes["id"].Value == "topLeft"
                 select divnode).FirstOrDefault();
  // Only walk the list when both the div and its ul were found
  if (topLeft != null && topLeft.Element("ul") != null)
  {
      foreach (var htmlNode in topLeft.Element("ul").Elements("li"))
      {
          lb.Items.Add(htmlNode.InnerText);
      }
  }
  
  2° Scraping isn’t very interesting if you yourself are the owner of the content. If you’d like to make a mobile version, I think it’d be better to create a mobile version of the page as explained here http://www.webpagefx.com/design-build-mobile-web-site.html or google for convert page mobile (because basically webdesign is not my cup of tea 🙂 ).
  
  Goodluck
  
naifmhd, May 9, 2012, 22:00

Hello. If its not a too much trouble for you. can you help me understand why this code is not working.

public MainPage()
{
    InitializeComponent();
    HtmlWeb.LoadAsync("http://www.haveeru.com.mv/dhivehi/news/120946", DownLoadCompleted);
}

void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    if (e.Error == null)
    {
        HtmlDocument doc = e.Document;
        if (doc != null)
        {
            // Start scraping
            var newsdiv = (from divnode in doc.DocumentNode.Descendants("div")
                           where divnode.Attributes["id"].Value == "article"
                           select divnode).FirstOrDefault();

            MessageBox.Show(HttpUtility.HtmlDecode(newsdiv.InnerHtml));
        }
    }
}

I want to take innertext of
appreciate any help.

Steve, May 10, 2012, 13:18

Hi Tim,

I changed ‘MessageBox.Show(HttpUtility.HtmlDecode(newsdiv.InnerText));’ this to

var news = HttpUtility.HtmlDecode(newsdiv.InnerText);
MessageBox.Show(news);

Which works fine, though I’m struggling to call this variable from Xaml to use it as the default content of a textblock.

Been reading and playing with DataContext and Bindings for two days but rather frustratingly I just cannot get it to work.

Hoping you’d be able to help me out.

I have realised I’ve stepped into a too advanced first project, but i’m not going to be able to get this out of my head until it’s resolved. Once completed (this stage anyways) I’m going to look at some more basic tutorials and learn how xaml and c# work together.

Steve.

Web Data Extractor, May 16, 2012, 04:06

Nice blog, I would like to tell you that you have given me much knowledge about it. Thanks for everything.

NewbieScraper, July 14, 2012, 19:48

Thanks for this nice tutorial 🙂 I had never worked with scraping before your tutorial and it was so easy for me to follow and to implement in my own project, that I really wanted to thank you via this reply. Keep up the good work 😉

Arash, August 23, 2012, 14:39

I have found out that on heavy pages it takes time to download and work with a page you want to scrap from. If the page is not downloaded by the time you call the methods it wil simply jump the method over. My question is how do i 1. put a pause and check if the page is downloaded, so i can search in it?

cambo78, September 14, 2012, 15:57

Hey! Really nice blog post, but as a novice, I had no idea how to scrap thing in this website.

Could someone helps me getting those text from here http://www.newsdlrp.com/dlpwaittime.html to a WP7 app ?

Dominic, November 2, 2012, 13:37

Excellent blog post! Thanks for all the extra links to the background tutorials, really helped. This is exactly the type of tutorial I’ve been looking for. Thanks again Tim!

Savio Mody, January 4, 2013, 10:10

Howdy would you mind letting me know which web host you’re working with? I’ve loaded your blog in 3 different browsers
and I must say this blog loads a lot quicker then most.
Can you suggest a good internet hosting provider at a honest price?
Thank you, I appreciate it!

Deepak Sharma, January 5, 2013, 12:19

how about extracting some text from that page instead of image…
help…please

1. timdams (post author), January 5, 2013, 21:01
  
  Hi Deepak,
  You can use the InnerText and InnerHtml properties for that. It’s part of HtmlNode.
  
Sanket Ghorpade, November 30, 2013, 14:48

Hello, Thanks for the article 🙂
I am facing one issue while implementing it. I downloaded your source code and while building I am getting one issue as
***The type ‘System.Xml.XPath.IXPathNavigable’ is defined in an assembly that is not referenced. You must add a reference to assembly ‘System.Xml.XPath, Version=2.0.5.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35’. C:\Users\Sanket\Documents\Visual Studio 2012\Projects\Scraping\Scraping\MainPage.xaml.cs****
And this error is pointing to the statement “HtmlDocument doc = e.Document;”
Can you help me out as why it is showing this error?

Note: I am using Visual studio 2012, I have implemented the HtmlAgility packet properly through nuget manager.

1. Shorty, August 19, 2014, 06:56
  
  Sharp thngniki! Thanks for the answer.
  
  1. CAGRValue.com, May 30, 2018, 03:20
    
    What ?
    