Thursday, 30 May 2013

How to Scrape Website Content

While this tip might be a little more advanced, it shows how easy it is to “scrape” content from websites into your spreadsheets. We’ll be relying on the fact that HTML is a form of XML (strictly speaking XHTML is a valid form of XML, but for all intents and purposes we can assume HTML is close enough to a form of XML).

    Go to the website you want to scrape content from. In this example, we’ll be taking data from a person’s twitter stream, for example http://twitter.com/brickandrews.
    By looking the page source, determine what html element on the page the content you want is found in. For example, the twitter updates on my twitter page are found in span elements of the class “entry-content”:



    In your Google Docs spreadsheet, enter the following formula:

=importXML("http://twitter.com/brickandrews","//span[@class='entry-content']")

    The spreadsheet will fill in with the content from the imported xml:


The importXML function allows us to scrape the content from my twitter page. It takes two arguments:

    The url of the xml you want to import (in this case, its the html from the page at http://twitter.com/brickandrews!).
    An xpath query that specifies what part of the xml you want to import. This is the trickiest part. In my example, the xpath simply specifies any span element with a class attribute equal to “entry-content”. It’s easy enough to hack around with the xpath, but you also might like reading up on how to create xpath queries.

Because my twitter page is constantly updated, the spreadsheet will constantly be updated to reflect this “live” data. Of course web page content isn’t the only kind of xml you can import using the importXML function. Many web apps (like twitter, Google Apps, and more) provide xml based API’s which you could also import from using the same method described above. If you’ve got an interesting example, please share it in the comments!

As always, you can check out the spreadsheet used in this example.

Source: http://googledocstips.com/2011/04/15/how-to-scrape-website-content/

No comments:

Post a Comment