Scrape Amazon Products: Scraping Amazon.com Product Reviews

The Code

This Perl script builds a URL to the review page for a given ASIN, uses a regular expression to find each review, and breaks each review into its pieces: rating, title, date, reviewer, and the text of the review.

Save the following script to a file called get_reviews.pl:

 #!/usr/bin/perl -w
 # get_reviews.pl
 #
 # A script to scrape Amazon, retrieve
 # reviews, and print them to standard output.
 # Usage: perl get_reviews.pl <asin>
 use strict;
 use LWP::Simple;

 # Take the ASIN from the command line.
 my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

 # Assemble the URL from the passed ASIN.
 my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

 # Set up unescape-HTML rules. A small lookup table avoids loading HTML::Entities.
 my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
 my $unescape_re = join '|' => keys %unescape;

 # Request the URL.
 my $content = get($url);
 die "Could not retrieve $url" unless $content;

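 # Loop through the HTML, looking for matches. The pattern is tied to
 # the review page's markup and must be kept on a single line.
 while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {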

 my($rating,$title,$date,$reviewer,$review) =
 ($1||'',$2||'',$3||'',$4||'',$5||'');
 $reviewer =~ s!<.+?>!!g; # drop all HTML tags
 $reviewer =~ s!\(.+?\)!!g; # remove anything in parentheses
 $reviewer =~ s!\n!!g; # remove newlines
 $review =~ s!<.+?>!!g; # drop all HTML tags
 $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

 # Print the results
 print "$title\n" . "$date\n" . "by $reviewer\n" .
 "$rating stars.\n\n" . "$review\n\n";

 }
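
The cleanup substitutions inside the loop can be tried on their own, without fetching a page. The following is a minimal sketch, not part of get_reviews.pl: clean_field is a hypothetical helper that combines the tag-stripping, parenthesis-removal, newline-removal, and unescaping steps into one routine (the script applies them selectively to $reviewer and $review), and the sample string is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Entity table in the same spirit as the script's %unescape hash.
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', map { quotemeta } keys %unescape;

# clean_field() is a hypothetical helper, not part of get_reviews.pl.
sub clean_field {
    my ($text) = @_;
    $text =~ s!<.+?>!!gs;                       # drop all HTML tags
    $text =~ s!\(.+?\)!!gs;                     # remove anything in parentheses
    $text =~ s!\n!!g;                           # remove newlines
    $text =~ s/($unescape_re)/$unescape{$1}/g;  # unescape known entities
    $text =~ s/^\s+|\s+$//g;                    # trim surrounding whitespace
    return $text;
}

# Made-up sample input, for illustration only.
my $sample  = "<a href=\"/profile\">Jane&nbsp;Doe</a> (Springfield)\n";
my $cleaned = clean_field($sample);
print "$cleaned\n";
```

Tags are stripped before entities are unescaped so that an escaped angle bracket in the review text is never mistaken for markup. Only the three entities in the table are handled; anything else passes through unchanged.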

Source: http://oreilly.com/pub/h/977

Monday, 27 May 2013
