Thursday, 30 May 2013

How to Scrape Website Content

While this tip might be a little more advanced, it shows how easy it is to “scrape” content from websites into your spreadsheets. We’ll be relying on the fact that HTML is a form of XML (strictly speaking, only XHTML is valid XML, but for our purposes we can assume HTML is close enough).

    Go to the website you want to scrape content from. In this example, we’ll be taking data from a person’s Twitter stream: http://twitter.com/brickandrews.
    By looking at the page source, determine which HTML element on the page contains the content you want. For example, the updates on my Twitter page are found in span elements of the class “entry-content”.



    In your Google Docs spreadsheet, enter the following formula:

=importXML("http://twitter.com/brickandrews","//span[@class='entry-content']")

    The spreadsheet will fill in with the content from the imported XML.


The importXML function allows us to scrape the content from my twitter page. It takes two arguments:

    The URL of the XML you want to import (in this case, it’s the HTML from the page at http://twitter.com/brickandrews).
    An XPath query that specifies which part of the XML you want to import. This is the trickiest part. In my example, the XPath simply selects any span element with a class attribute equal to “entry-content”. It’s easy enough to hack around with the XPath, but you might also like to read up on how to create XPath queries.
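
To get a feel for what the XPath query selects, here is a minimal Python sketch that runs the same kind of query outside the spreadsheet. It is only an illustration: SAMPLE_PAGE is a made-up, well-formed stand-in for the real page, select_text is a hypothetical helper name, and Python's ElementTree supports only a subset of XPath (hence the leading "." in the query).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup (not the real page).
SAMPLE_PAGE = """
<html>
  <body>
    <span class="entry-content">First update</span>
    <span class="meta">2 hours ago</span>
    <div><span class="entry-content">Second update</span></div>
  </body>
</html>
"""


def select_text(xml_text, xpath):
    """Return the text of every element matched by a (limited) XPath query."""
    root = ET.fromstring(xml_text.strip())
    # ".//" searches all descendants of the root element, in document order.
    return ["".join(el.itertext()).strip() for el in root.findall(xpath)]


# Same idea as //span[@class='entry-content'] in the spreadsheet formula.
updates = select_text(SAMPLE_PAGE, ".//span[@class='entry-content']")
for row in updates:
    print(row)
```

Each matched span becomes one entry in the list, much as each match fills one spreadsheet cell. Real-world HTML is often not well-formed XML, so a strict parser like this one may reject pages that the spreadsheet function accepts.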

Because my Twitter page is constantly updated, the spreadsheet will be updated to reflect this “live” data. Of course, web page content isn’t the only kind of XML you can import using the importXML function. Many web apps (like Twitter, Google Apps, and more) provide XML-based APIs, which you can import from using the same method described above. If you’ve got an interesting example, please share it in the comments!

As always, you can check out the spreadsheet used in this example.

Source: http://googledocstips.com/2011/04/15/how-to-scrape-website-content/

Monday, 27 May 2013

Scraping Amazon.com Product Reviews

The Code

This Perl script builds a URL to the review page for a given ASIN, uses regular expressions to find the reviews, and breaks the review into its pieces: rating, title, date, reviewer, and the text of the review.

Save the following script to a file called get_reviews.pl:

    #!/usr/bin/perl -w
    # get_reviews.pl
    #
    # A script to scrape Amazon, retrieve
    # reviews, and print them to standard output.
    # Usage: perl get_reviews.pl <asin>
    use strict;
    use LWP::Simple;

    # Take the ASIN from the command line.
    my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

    # Assemble the URL from the passed ASIN.
    my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

    # Set up unescape-HTML rules: a small lookup table for the few entities we expect.
    my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
    my $unescape_re = join '|' => keys %unescape;

    # Request the URL.
    my $content = get($url);
    die "Could not retrieve $url" unless $content;

    # Loop through the HTML, looking for matches
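    # Loop through the HTML, looking for matches. The pattern must stay on one
    # line: without /x, a line break inside it would be matched literally.
    while ($content =~ m!<img.*?stars-(\d)-0\.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {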

        my($rating,$title,$date,$reviewer,$review) =
                          ($1||'',$2||'',$3||'',$4||'',$5||'');
        $reviewer =~ s!<.+?>!!g;   # drop all HTML tags
        $reviewer =~ s!\(.+?\)!!g; # remove anything in parentheses
        $reviewer =~ s!\n!!g;      # remove newlines
        $review =~ s!<.+?>!!g;     # drop all HTML tags
        $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

        # Print the results
        print "$title\n" . "$date\n" . "by $reviewer\n" .
              "$rating stars.\n\n" . "$review\n\n";

    }
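
For readers who want to try the matching and cleanup steps without fetching a live page, the following is a rough Python sketch of the same approach. It is not the script above: SAMPLE_HTML is hypothetical markup written only to fit the layout the pattern expects, parse_reviews is an illustrative name, and any real review page may be laid out quite differently.

```python
import re

# Hypothetical markup, written only to match the layout the pattern expects.
SAMPLE_HTML = """<table>
<tr><td><img src="stars-4-0.gif" alt="4 stars"> <b>Solid and useful</b>, May 27, 2013
</td></tr>
<tr><td>Reviewer:
<b>
<a href="/profile">A. Reader</a> (Springfield)</b></td></tr>
</table>
I liked it a lot. It is &quot;great&quot; &amp; cheap.<br>
<br>
"""

# Same pattern as the Perl script; DOTALL lets "." cross line breaks.
REVIEW_RE = re.compile(
    r"<img.*?stars-(\d)-0\.gif.*?>.*?<b>(.*?)</b>, (.*?)\n"
    r".*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>",
    re.IGNORECASE | re.DOTALL,
)

# Small lookup table for the handful of entities we expect.
UNESCAPE = {"&quot;": '"', "&amp;": "&", "&nbsp;": " "}
UNESCAPE_RE = re.compile("|".join(re.escape(key) for key in UNESCAPE))


def parse_reviews(html):
    """Return one dict per review found in the given HTML text."""
    reviews = []
    for rating, title, date, reviewer, text in REVIEW_RE.findall(html):
        reviewer = re.sub(r"<.+?>", "", reviewer)    # drop all HTML tags
        reviewer = re.sub(r"\(.+?\)", "", reviewer)  # drop parenthesized text
        reviewer = reviewer.replace("\n", "").strip()
        text = re.sub(r"<.+?>", "", text)            # drop all HTML tags
        text = UNESCAPE_RE.sub(lambda m: UNESCAPE[m.group(0)], text)
        reviews.append({
            "rating": int(rating),
            "title": title,
            "date": date,
            "reviewer": reviewer,
            "text": text,
        })
    return reviews


reviews = parse_reviews(SAMPLE_HTML)
for review in reviews:
    print(review)
```

Patterns like this are tied to one exact page layout, so a small change in the markup is enough to make them stop matching.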



Source: http://oreilly.com/pub/h/977

Friday, 24 May 2013

Amazon Price Scraping

Running a software company means that you have to be dynamic, creative, and most of all innovative. I strive every day to create unique and interesting new ways to do business online. Many of my clients sell their products on Amazon, Google Merchant Center, Shopping.com, PriceGrabber, Nextag, and other shopping sites.

Amazon is by far the most powerful, and so I focus much of my efforts on creating software specifically for their portal. I’ve created very lightweight programs that move data from CSV, XML, and other formats to Amazon AWS using the Amazon Inventory API. I’ve also created programs that push data from Magento directly to Amazon, and do this automatically, updating every few hours like clockwork. Some of my customers sell hundreds of thousands of products on Amazon due to this technology.

Doctrine ORM and Magento

I’m a strong believer in the power of Doctrine ORM in combination with Zend Framework, and I was an early adopter of this technology in production environments. More recently, I’ve been using Doctrine to generate models for Magento and then using these models to develop advanced information scraping systems that price match my clients’ products against Amazon’s merchants. I prefer to use Doctrine because the documentation is awesome, the object model makes sense, and it is far easier to use outside of the Magento core.

What is price matching?

Source: http://www.christopherhogan.com/2011/11/12/amazon-price-scraping/

Thursday, 16 May 2013

5 Awesome Amazon Product Reviews

In the age of social media, brands are obsessed with audience engagement. Brands want you to tweet at them, to retweet them, to comment on their Facebook pages and mention them in your blog posts. They’re just dying for feedback.

People have caught on to how much brands want to hear from them, and some have taken their time to share their opinions about brands and their products in a new and creative way: funny Amazon product reviews. It’s become a new Web genre of sorts. Check out these five examples of entertaining Amazon reviews for some pretty random products.

Tuscan Whole Milk, 1 Gallon, 128 fl oz: Who knew that whole milk could inspire such creative prose, and even poetry? Check out these three creative reviews, one of which is actually a take on Edgar Allan Poe’s “The Raven” rewritten about milk. Bravo. The review below eloquently outlines how to properly savor this fine milk and its complex flavor profile.

Guardian Angel: This bizarre-looking thing is apparently some therapeutic, massage-type product. The strange, egg-like appearance wasn’t lost on the reviewer below who apparently was body-snatched halfway through the review.

Hutzler 571 Banana Slicer: The Hutzler Banana slicer makes the perfect gift for inmates and parolees who can’t be around sharp objects or own firearms. And as a bonus, one reviewer noted, it can even save your marriage.

Wheelmate Laptop Steering Wheel Desk: The reviews that poured in about this ingenious product, which creates a mini-desk surface right on your car’s steering wheel so you can multi-task while driving, naturally praised the product for giving people the opportunity to do other important things while driving — like making cocktails. The best part about the customer reviews for this one, though, wasn’t the reviews so much as the user-submitted photos of car crashes.

BIC Cristal For Her Ball Pen, 1.0mm, Black, 16ct (MSLP16-Blk): Of course women were thrilled when BIC came out with a special pack of pens just for ladies, featuring pastel colors and thinner barrels. Naturally, they took to Amazon to voice their enthusiasm for these pens made just for them.

Source: http://www.digiday.com/brands/5-awesome-amazon-product-reviews/

Monday, 6 May 2013

Amazon Scraper – Scraping Amazon’s Product Catalog using Optimal Product Finder!

Dominate the Amazon Affiliate World with the new Optimal Product Finder!
Now You Too Can Quickly & Easily Scrape Amazon’s US Product Catalog!


Optimal Product Finder is the premier Windows-based solution for scraping Amazon’s US Product Advertising API. Optimal Product Finder (OPF) compiles a CSV (Comma Separated Value) file to use in other scripts to create Affiliate Niche Sites and earn money with Amazon’s Associates Program.

In just minutes, you can find hundreds of targeted products for even the smallest niche using Optimal Product Finder deep-scan technology.

However, OPF does not stop here! It not only allows you to search and scrape Amazon’s US API for specific products using 16 search parameters, but also sort through and clean the information and save it as a CSV data file with your affiliate parameter already included in the product links.
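
To make the idea of a CSV file with the affiliate parameter already included in the product links concrete, here is a small Python sketch. It does not show how OPF works internally: the column names, the sample products, the tag value "mytag-20", and the assumption that the affiliate ID travels as a "tag" query parameter are all illustrative.

```python
import csv
import io
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def add_affiliate_tag(url, tag):
    """Return url with a tag=<tag> query parameter set (assumed convention)."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["tag"] = tag
    return urlunparse(parts._replace(query=urlencode(query)))


def products_to_csv(products, tag):
    """Serialize products to CSV text, tagging every product link."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["asin", "title", "link"], lineterminator="\n"
    )
    writer.writeheader()
    for product in products:
        writer.writerow({
            "asin": product["asin"],
            "title": product["title"],
            "link": add_affiliate_tag(product["link"], tag),
        })
    return buf.getvalue()


# Made-up sample data; these are not real products.
PRODUCTS = [
    {"asin": "B000000001", "title": "Example Widget, Blue",
     "link": "http://www.amazon.com/dp/B000000001"},
    {"asin": "B000000002", "title": "Example Gadget",
     "link": "http://www.amazon.com/dp/B000000002?ref=demo"},
]

csv_text = products_to_csv(PRODUCTS, "mytag-20")
print(csv_text)
```

Using a CSV writer rather than joining strings by hand means that titles containing commas are quoted correctly, which matters when other scripts read the file back in.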

Why Use Optimal Product Finder?

If you are building any sort of Amazon niche or price comparison site, you know how frustrating and time-consuming it can be to copy product details and images by hand. It is easier to use an automated Amazon research tool and scraper like Optimal Product Finder.

Here are a few of the features of this POWERFUL Amazon Product Scraper:

Source: http://www.sitecb.com/amazon-scraper-scraping-amazons-product-catalog-using-optimal-product-finder.php