Wednesday, 31 December 2014

Data Scraping Services with Proxy Data Scraping

Have you ever heard of "data scraping? Data Scraping is the process of gathering relevant information in the public domain on the internet (private areas even if the conditions are met) and stored in databases or spreadsheets for later use in various applications. Scraping data technology is not new and a successful businessman his fortune by using data scraping technology.

Sometimes owners of sites that are not derived much pleasure from the automated harvesting of their data. Webmasters have learned to deny access to web scrapers their websites using tools or methods that some IP addresses to block the content of the site here. scrapers data is left to either target a different site, or the script to move the harvest of a computer using a different IP address each time and get as much information as possible to "all computers finally blocked the nozzle.

Fortunately, there is a modern solution to this problem. Proxy data scraping technology solves the problem by using a proxy IP addresses. When your data scraping program performs an extraction of a website, the site thinks that it comes from a different IP address. For site owner, proxies just like scratching a short period of increased traffic around the world. They have very limited resources and tedious to block such a scenario, but more importantly - for the most part, they simply do not know they are scraped.

Now you can ask. "Where can I proxy data scraping technology for my project" The "do-it-yourself solution is free, unfortunately, not easy at all Creation of a database scraping proxy network takes time and requires you to either a group of IP addresses and servers can be used in place yet, the computer guru you need to call to get everything configured. You may consider hiring proxy servers hosting providers to select, but this option is usually quite expensive, but probably better than the alternative: dangerous and unreliable servers (but free) public proxy.

There are literally thousands of free proxy servers located all over the world are fairly easy to use. The trick is to find them. Hundreds of sites, list servers, but by placing a functioning, open and supports standard protocols that you need to a lesson in perseverance, trial and error will be. However, if you manage to find a working public representatives, there are dangers inherent in their use. First, you do not know who owns the server or activities taking place elsewhere on the server. Send applications or sensitive data via an open proxy is a bad idea. It's easy enough for a proxy server to keep all information you send or send it back to you to catch. If you choose the method of replacing the public, make sure you never a transaction through which you or anyone else would jeopardize the case of unsavory types are made aware of the data to send.

A less risky scenario for data scraping proxy is to hire a proxy connection that runs through the rotation of a large number of private IP addresses. There are a number of these companies available that claim to remove all Web logs, which you harvest anonymously on the web with a minimal threat of retaliation. Companies such as enterprise solutions offer a large http://www.Anonymizer.com anonymous proxy, but often carry significant costs of installing enough for you to continue.

The other advantage is that companies that own such networks can often help design and implement a set of proxy data scraping custom program instead of trying to work with a generic bone scraping. After performing a simple Google search, I quickly found a company (www.ScrapeGoat.com) that an anonymous proxy server provides for data scraping purposes. Or, according to their website, if you want to make life even easier, scrap goat can retrieve data for you and a variety of different formats to deliver, often before you could finish up your plate from the scraping program.

Whatever path you choose for your data scraping proxy need not let a few simple tips to thwart access to all the wonderful information that is stored on the World Wide Web!

Source:http://www.articlesbase.com/small-business-articles/data-scraping-services-with-proxy-data-scraping-4697825.html

Tuesday, 30 December 2014

Web Data Scraping Services At Lowest Rate For Business Directory

We are the world's most trusted provider directory, your business data scrape, and scrape email scraping and sending the data needed. We scour the entire directory database or doctors, lawyers, brokers, financial advisers, etc. As the scraping of a particular industry category wise database scraping or data that can be adapted.

We are pioneers in the worldwide web scraping and data services. We must understand the value of our customer database, we email id with the greatest effort to collect data. We are lawyers, doctors, brokers, realtors, schools, students, universities, IT managers, pubs, bars, nightclubs, dance clubs, financial advisers, liquor stores, Face book, Twitter, pharmaceutical companies, mortgage broker scraped data, accounting firms, car dealers , artists, shop health and job portals.

Our business database development services to try and get real quality at the lowest possible industry. Example worked. We have a quick turnaround time can be a business mailing database. Our business database development services to try and get real quality at the lowest possible industry. Example worked. We have a quick turnaround time can be a business mailing database.

We are the world's most trusted provider directory, your business data scrape, and scrape email scraping and sending the data needed. We scour the entire directory database or doctors, lawyers, brokers, financial advisers, etc., as the scraping of a particular industry category wise database scraping or data that can be adapted.

We are pioneers in the worldwide web scraping and data services. We must understand the value of our customer database, we email id with the greatest effort to collect data. We are lawyers, doctors, brokers, realtors, schools, students, universities, IT managers, pubs, bars, nightclubs, dance clubs, financial advisers, liquor stores, Face book, Twitter, pharmaceutical companies, mortgage broker scraped data, accounting firms, car dealers , artists, shop health and job portals.

What a great resource for specific information or content with little success to gather and have tried to organize themselves in a folder? You no longer need to worry, and data processing services through our website search are the best solution for your problem.

We currently have an "information explosion" phase of the walk, where there is so much information and content information for an event or a small group of channels.

Order without the benefit of you and your customers a little truth to that information. You use information and material is easy to organize in a way that is needed. Something other than a small business guide, simply create a separate folder in less than an hour.

Our technology-specific Web database for you to a similar configuration and database development to use. In addition, we finished our services can help you through the data to identify the sources of information for web pages to follow. This is a cost effective way to create a database.

We offer directory database, company name, address, the state, country, phone, email and website URL to take. In recent projects we have completed. We have a quick turnaround time can be a business mailing database. Our business database development services to try and get real quality at the lowest possible industry.

Source:http://www.articlesbase.com/outsourcing-articles/web-data-scraping-services-at-lowest-rate-for-business-directory-5757029.html

Saturday, 27 December 2014

What Kind of Legal Problems Can Web Scraping Cause

Web scraping software is readily available and has been used by many for legitimate purposes. It has also been used for illegal purposes. A website that engages in this practice should know the legal dangers of the activity.

Related Articles

Black Hat SEO Popular Techniques

General Knowledge- VII

The idea of web scraping is not new. Search engines have used this type of software to determine which results appear when someone conducts a search. They use special software software to extract data from a website and this data is then used to calculate the rankings of the website. Websites work very hard to improve their ranking and their chance of being found by anyone making a search. This use of this practice is understood and is considered to be a legitimate use for the software. However, there are services that provide web scraping and screen scraping prevention services and help the webmaster to remain safe from the attack of bad bots.

The problem with duplicacy is that it is often used for less than legitimate reasons. Since the software responsible can collect all sorts of data from websites and store the information that is collected, it represents a danger to anyone who might be affected by it. The information that can be collected can be used for many practices that are not so legitimate and may even be illegal. Anyone who is involved in this practice of content duplicacy should be aware of the legal issues implicated with this practice. It may be wise for anyone who has a website to find ways to prevent a site from being scraped or to use professional services to block site scraping.

Legal problems

The first thing to worry about, if you have a website or are using web scraping software, is when you might run into legal problems. Some of the issues that web scraping can cause include:

•    Access. If the software is used to access sites it does not have the right to access and takes information that it is not entitled to, the owner of the web scarping software may find themselves in legal trouble.

•    Re-use. The software can collect and reuse information. If that information is copyrighted, that might be a legal problem. Any information that is reused without permission may create legal issues for anyone who uses it.

•    Robots. Some states have enacted laws that are designed to keep people from using scraping robots. These automatically search out information on websites and using them may be illegal in some states. It is up to the user of the web scraping software to comply with any laws in the state in which they are operating.

Who is Responsible

The laws and regulations surrounding this practice are not always clear. There are many grey areas that allow this practice to occur. The question is, who is responsible for determining whether the use of web scraping software is legal?

Websites collect the information, but they may not be the entity using the web scraping software. If they are using this type of software, it is not always enough to inform the website's visitors that this practice is occurring. Putting this information into the user agreement may or may not protect the website from legal problems.

It is also partly the responsibility of a site owner to prevent a site from being scraped. There is software that can be used that will do this for a website and will keep any information that is collected safe and secure. A website may or may not be held legally responsible for any web scraper that is able to collect information they have. It will depend on why the data was collected, how it was used, who collected it, and whether precautions were taken.

What to expect

The issue of content copying and the legal issues surrounding it will continue to evolve. As more courts take on this issue, the lines between legal and illegal web scraping will become clearer. Many of the cases that have been brought to court have occurred in civil court, although there are some that have been taken up in a criminal court. There will be times when such practice may actually be a felony.

Before you use spying software, you need to realize that the laws surrounding its use are not clear. If you operate a website, you need to know the legal issues that you may face if scraping software is used on your website. The best step is to use the software available to protect your website and stop web scraping and be honest on your site if web scraping is used.

Source: http://www.articlesbase.com/technology-articles/what-kind-of-legal-problems-can-web-scraping-cause-6780486.html

Wednesday, 24 December 2014

Central Qld Coal: Mining for Needed Investments

The Central Qld Coal Project is situated in the Galilee Coal Basin, Central Queensland with the purpose of establishing a mine to service international export markets for thermal coal. An estimated cost to such a project would be around $ 7.5 billion - the amount proves that the mining industry is one serious business to begin with.

In addition to the mine, the Central Qld Coal Project also proposes to construct a railway, potentially in excess of 400km depending on the final option: Either to transport processed coal to an expanded facility at Abbot Point or new export terminal to be established at Dudgeon Point. However, this would require new major water and power supply infrastructure to service the mine and port - hence, the extremely high cost. Because mining areas usually involve desolate areas where there is no direct risk to developed regions where the populace thrives, setting up new major water and power supplies would simply demand costs as high as the estimated cost - but this is not the only major percent of the whole budget of the Central Qld Coal Project.

The location for the Central Qld Coal Project is situated 40km northwest of Alpha, approximately 450 km west of Rockhampton and contains an amount of more than three billion tons. The proposed open-cut mine of the Central Qld Coal Project is expected to be developed in stages. It shall have an initial export capacity of 30 million tons per annum with a mine life expectancy of 30 years.

In terms of employment regarding Central Qld Coal Project, there will be around a total of 2,500 people to be employed during the construction and 1,600 permanent positions shall be employed in the operation stage of the Central Qld Coal Project.

Australia is a major coal exporter - the largest exporter of coal and fourth largest producer of coal. Australia is also the second largest producer of gold, second only to China. As for Opal, Australia is responsible for 95% of its production, thereby making her the largest producer worldwide. Australia would not also lose in terms of commercially viable diamond deposits - being third next after Russia and Botswana. This pretty much explains the significance of the mining industry to Australia. It is like the backbone of its economy; an industry focused on claiming the blessings the earth has giver her lands. The Central Qld Coal Project was made to further the exports and improve the trade. However, the Central Qld Coal Project requires quite a large sum for its project. It is only through the financial support of investments, both local and international, can it achieve its goals and begin reaping the fruits of the land.

Source: http://ezinearticles.com/?Central-Qld-Coal:-Mining-for-Needed-Investments&id=6314576

Monday, 22 December 2014

Scraping table from html web with CloudStat

You need to use the data from internet, but don’t type, you can just extract or scrape them if you know the web URL.

Thanks to XML package from R. It provides amazing readHTMLtable() function.

For a study case,

I want to scrape data:

    US Airline Customer Score.
    World Top Chess Players (Men).

A. Scraping US Airline Customer Score table from

http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines

Code:

airline = ‘http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines’

airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)

Result:

B. Scraping World Top Chess players (Men) table from http://ratings.fide.com/top.phtml?list=men

Code:

chess = ‘http://ratings.fide.com/top.phtml?list=men’

chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)

Result:

Done. You had successfully scraping data from any web page with CloudStat.

You can get the full version of this study case (code and result) at Scraping table from html web.

Then, you can analyze as usual! Great! No more retype the data. Enjoy!

Source:http://www.r-bloggers.com/scraping-table-from-html-web-with-cloudstat/

Thursday, 18 December 2014

Data Extraction - A Guideline to Use Scrapping Tools Effectively

So many people around the world do not have much knowledge about these scrapping tools. In their views, mining means extracting resources from the earth. In these internet technology days, the new mined resource is data. There are so many data mining software tools are available in the internet to extract specific data from the web. Every company in the world has been dealing with tons of data, managing and converting this data into a useful form is a real hectic work for them. If this right information is not available at the right time a company will lose valuable time to making strategic decisions on this accurate information.

This type of situation will break opportunities in the present competitive market. However, in these situations, the data extraction and data mining tools will help you to take the strategic decisions in right time to reach your goals in this competitive business. There are so many advantages with these tools that you can store customer information in a sequential manner, you can know the operations of your competitors, and also you can figure out your company performance. And it is a critical job to every company to have this information at fingertips when they need this information.

To survive in this competitive business world, this data extraction and data mining are critical in operations of the company. There is a powerful tool called Website scraper used in online digital mining. With this toll, you can filter the data in internet and retrieves the information for specific needs. This scrapping tool is used in various fields and types are numerous. Research, surveillance, and the harvesting of direct marketing leads is just a few ways the website scraper assists professionals in the workplace.

Screen scrapping tool is another tool which useful to extract the data from the web. This is much helpful when you work on the internet to mine data to your local hard disks. It provides a graphical interface allowing you to designate Universal Resource Locator, data elements to be extracted, and scripting logic to traverse pages and work with mined data. You can use this tool as periodical intervals. By using this tool, you can download the database in internet to you spread sheets. The important one in scrapping tools is Data mining software, it will extract the large amount of information from the web, and it will compare that date into a useful format. This tool is used in various sectors of business, especially, for those who are creating leads, budget establishing seeing the competitors charges and analysis the trends in online. With this tool, the information is gathered and immediately uses for your business needs.

Another best scrapping tool is e mailing scrapping tool, this tool crawls the public email addresses from various web sites. You can easily from a large mailing list with this tool. You can use these mailing lists to promote your product through online and proposals sending an offer for related business and many more to do. With this toll, you can find the targeted customers towards your product or potential business parents. This will allows you to expand your business in the online market.

There are so many well established and esteemed organizations are providing these features free of cost as the trial offer to customers. If you want permanent services, you need to pay nominal fees. You can download these services from their valuable web sites also.

Source: http://ezinearticles.com/?Data-Extraction---A-Guideline-to-Use-Scrapping-Tools-Effectively&id=3600918

Tuesday, 16 December 2014

Online Data Entry and Data Mining Services

Data entry job involves transcribing a particular type of data into some other form. It can be either online or offline. The input data may include printed documents like Application forms, survey forms, registration forms, handwritten documents etc.

Data entry process is an inevitable part of the job to any organization. One way or other each organization demands data entry. Data entry skills vary depends upon the nature of the job requirement, in some cases data to be entered from a hard copy formats and in some other cases data to be entered directly into a web portal. Online data entry job generally requires the data to be entered in to any online data base.

For a super market, data associate might be required to enter the goods which have sold in a particular day and the new goods received in a particular day to maintain the stock well in order. Also, by doing this the concerned authorities will get an idea about the sale particulars of each commodity as they requires. In another example, an office the account executive might be required to input the day to day expenses in to the online accounting database in order to keep the account well in order.

The aim of the data mining process is to collect the information from reliable online sources as per the requirement of the customer and convert it to a structured format for the further use. The major source of data mining is any of the internet search engine like Google, Yahoo, Bing, AOL, MSN etc. Many search engines such as Google and Bing provide customized results based on the user's activity history. Based on our keyword search, the search engine lists the details of the websites from where we can gather the details as per our requirement.

Collect the data from the online sources such as Company Name, Contact Person, Profile of the Company, Contact Phone Number of Email ID Etc. are doing for the marketing activities. Once the data is gathered from the online sources into a structured format, the marketing authorities will start their marketing promotions by calling or emailing the concerned persons, which may result to create a new customer. So basically data mining is playing a vital role in today's business expansions. By outsourcing the data entry and its related works, you can save the cost that would be incurred in setting up the necessary infrastructure and employee cost.

Source:http://ezinearticles.com/?Online-Data-Entry-and-Data-Mining-Services&id=7713395

Monday, 15 December 2014

Git workflow for Scrapy projects

Our customers often ask us what’s the best workflow for working with Scrapy projects. A popular approach we have seen and used in the past is to split the spiders folder (typically project/spiders) into two folders: project/spiders_prod and project/spiders_dev, and use the SPIDER_MODULES setting to control which spiders are loaded on each environment. This works reasonably well, until you have to make changes to common code used by many spiders (ie. code outside the spiders folder), for example common base spiders.

Nowadays, DVCS (in particular, git) have become more popular and people are quite used to branching, so we recommend using a simple git workflow (similar to GitHub flow) where you branch for every change you make. You keep all changes in a branch while they’re being tested and finally merge to master when they’re finished. This means that master branch is always stable and contains only “production-ready” spiders.

If you are using our Scrapy Cloud platform, you can have 2 projects (myproject-dev, myproject-prod) and use myproject-dev to test the changes in your branch.  scrapy deploy in Scrapy 0.17 now adds the branch name to the version name (when using version=GIT or version=HG), so you can see which branch you are going to run directly on the panel. This is particularly useful with large teams working on a single Scrapy project, to avoid stepping into each other when making changes to common code.

Here is a concrete example to illustrate how this workflow works:y

•    the developer decides to work on issue 123 (could be a new spider or fixes to an existing spider)
•    the developer creates a new branch to work on the issue
•    git checkout -b issue123
•    the developer finishes working on the code and deploys to the panel (this assumes scrapy.cfg is configured with a deploy target, and using version=GIT – see here for more information)
•    scrapy deploy dev
•    the developer goes into the panel and runs the spider, where he’ll see the branch name (issue123) that will be run
•    the developer checks the scraped data looks fine through the item browser in the panel
•    whenever issues are found, the developer makes more fixes (always working on the same branch) and deploys new versions
•    once all issues are fixed, the developer merges the branch and deploys to production project
•    git checkout master
•    git merge issue123
•    git pull # make sure to pull latest code before deploying
•    scrapy deploy prod

We recommend you keep your common spiders well-tested and use Spider Contracts extensively to test your final spiders. Otherwise experience tell us that base spiders end up being copied (instead of reused) out of fear of breaking old spiders that depend on them, thus turning their maintenance into a nightmare.

Source:http://blog.scrapinghub.com/2013/03/06/git-workflow-scrapy-projects/

Saturday, 13 December 2014

Handling exceptions in scrapers

When requesting and parsing data from a source with unknown properties and random behavior (in other words, scraping), I expect all kinds of bizarrities to occur. Managing exceptions is particularly helpful in such cases.

Here is some ways that an exception might be raised.
[][0] #The list has no zeroth element, so this raises an IndexError
{}['foo'] #The dictionary has no foo element, so this raises a KeyError

Catching the exception is sometimes cleaner than preventing it from happening in the first place. Here are some examples handling bizarre exceptions in scrapers.

Example 1: Inconsistant date formats

Let’s say we’re parsing dates.
import datetime
This doesn’t raise an error.
datetime.datetime.strptime('2012-04-19', '%Y-%m-%d')
But this does.
datetime.datetime.strptime('April 19, 2012', '%Y-%m-%d')

It raises a ValueError because the date formats don’t match. So what do we do if we’re scraping a data source with multiple date formats?

Ignoring unexpected date formats

A simple thing is to ignore the date formats that we didn’t expect.

import lxml.html
import datetime
def parse_date1(source):
    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text
    try:
         cleandate = datetime.datetime.strptime(rawdate, '%Y-%m-%d')
    except ValueError:
         cleandate = None
    return cleandate

print parse_date1('<div id="date">2012-04-19</div>')

If we make a clean date column in a database and put this in there, we’ll have some rows with dates and some rows with nulls. If there are only a few nulls, we might just parse those by hand.

Trying multiple date formats

Maybe we have determined that this particular data source uses three different date formats. We can try all three.

import lxml.html
import datetime

def parse_date2(source):

    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text

    for date_format in ['%Y-%m-%d', '%B %d, %Y', '%d %B, %Y']:

        try:
             cleandate = datetime.datetime.strptime(rawdate, date_format)
             return cleandate
        except ValueError:
             pass
    return None

print parse_date2('<div id="date">19 April, 2012</div>')

This loops through three different date formats and returns the first one that doesn’t raise the error.

Example 2: Unreliable HTTP connection

If you’re scraping an unreliable website or you are behind an unreliable internet connection, you may sometimes get HTTPErrors or URLErrors for valid URLs. Trying again later might help.

import urllib2
def load(url):
    retries = 3
    for i in range(retries):
        try:
            handle = urllib2.urlopen(url)
            return handle.read()
        except urllib2.URLError:
            if i + 1 == retries:
                raise
            else:
                time.sleep(42)
    # never get here

print load('http://thomaslevine.com')

This function tries to download the page thee times. On the first two fails, it waits 42 seconds and tries again. On the third failure, it raises the error. On a success, it returs the content of the page.

Example 3: Logging errors rather than raising them

For more complicated parses, you might find loads of errors popping up in weird places, so you might want to go through all of the documents before deciding which to fix first or whether to do some of them manually.

import scraperwiki
for document_name in document_names:
    try:
        parse_document(document_name)
    except Exception as e:
        scraperwiki.sqlite.save([], {
            'documentName': document_name,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')

This catches any exception raised by a particular document, stores it in the database and then continues with the next document. Looking at the database afterwards, you might notice some trends in the errors that you can easily fix and some others where you might hard-code the correct parse.

Example 4: Exiting gracefully

When I’m scraping over 9000 pages and my script fails on page 8765, I like to be able to resume where I left off. I can often figure out where I left off based on the previous row that I saved to a database or file, but sometimes I can’t, particularly when I don’t have a unique index.


for bar in bars:
    try:
        foo(bar)
    except:
        print('Failure at bar = "%s"' % bar)
        raise

This will tell me which bar I left off on. It’s fancier if I save the information to the database, so here is how I might do that with ScraperWiki.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
    try:
        foo(bar)
    except:
        scraperwiki.sqlite.save_var('resume_index', i)
        raise
scraperwiki.sqlite.save_var('resume_index', 0)

ScraperWiki has a limit on CPU time, so an error that often concerns me is the scraperwiki.CPUTimeExceededError. This error is raised after the script has used 80 seconds of CPU time; if you catch the exception, you have two CPU seconds to clean up. You might want to handle this error differently from other errors.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
    try:
        foo(bar)
    except scraperwiki.CPUTimeExceededError:
        scraperwiki.sqlite.save_var('resume_index', i)
    except Exception as e:
        scraperwiki.sqlite.save_var('resume_index', i)
        scraperwiki.sqlite.save([], {
            'bar': bar,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')
scraperwiki.sqlite.save_var('resume_index', 0)

tl;dr

Expect exceptions to occur when you are scraping a randomly unreliable website with randomly inconsistent content, and consider handling them in ways that allow the script to keep running when one document of interest is bizarrely formatted or not available.

Source: https://blog.scraperwiki.com/2012/05/handling-exceptions-in-scrapers/

Thursday, 11 December 2014

Seven tools for web scraping – To use for data journalism & creating insightful content

I’ve been creating a lot of (data driven) creative content lately and one of the things I like to do is gathering as much data as I can from public sources. I even have some cases it is costing to much time to create and run database queries and my personal build PHP scraper is faster so I just wanted to share some tools that could be helpful. Just a short disclaimer: use these tools on your own risk! Scraping websites could generate high numbers of pageviews and with that, using bandwidth from the website you are scraping.

1. Scraper (Chrome plugin)

    Scraper is a simple data mining extension for Google Chrome™ that is useful for online research when you need to quickly analyze data in spreadsheet form.

You can select a specific data point, a price, a rating etc and then use your browser menu: click Scrape Similar and you will get multiple options to export or copy your data to Excel or Google Docs. This plugin is really basic but does the job it is build for: fast and easy screen scraping.

2. Simple PHP Scraper

PHP has a DOMXpath function. I’m not going to explain how this function works, but with the script below you can easily scrape a list of URLs. Since it is PHP, use a cronjob to hourly, daily or weekly scrape the desired data. If you are not used to creating Xpath references, use the Scraper for Chrome plugin by selecting the data point and see the Xpath reference directly.

scraper-example

– Click here to download the example script.

3. Kimono Labs

Kimono has two easy ways to scrape specific URLs: just paste the URL into their website or use their bookmark. Once you have pointed out the data you need, you can set how often and when you want the data to be collected. The data is saved in their database. I like the facts that their learning curve is not that steep and it doesn’t look like you need a PHD in engineering to use their software. The disadvantage of this tool is the fact you can’t upload multiple URLs at once.

4. Import.io

Import.io is a browser based web scraping tool. By following their easy step-by-step plan you select the data you want to scrape and the tool does the rest. It is a more sophisticated tool compared to Kimono. I like it because of the fact it shows a clear overview of all the scrapers you have active and you can scrape multiple URLs at once.

5. Outwit Hub

I will start with the two biggest differences compared to the previous tool: it is a softwarepackage to use on your PC or laptop and to use its full potential it will cost you 75 USD. The free version can only scrape 100 rows of data. What I do like is the number of preprogrammed options to scrape which makes it easy to start and learn about web scraping.

6. ScraperWiki

This tool is really for people wanting to scrape on a massive scale. You can code your own scrapers (in PHP, Ruby & Python) and pricing is really cheap looking to what you can get: 29USD / month for 100 datasets. You are completely free in using libraries and timers. And if your programming skills are not good enough, they can help you out (paid service though). Compared to other tools, this is the most advanced tool that offers the basics of web scraping.

7. Fminer.com

This tool made it possible to finally scrape all the data inside Google Webmaster Tools since it can deal with JavaScript and AJAX interfaces. Read my extensive review on this page: Scraping Webmaster Tools with FMiner!

But on the end, building your individual project scrapers will always be more effective than using predefined scrapers. Am I missing any tools in this sum up of tools?

Source: http://www.notprovided.eu/7-tools-web-scraping-use-data-journalism-creating-insightful-content/

Thursday, 4 December 2014

Multiple Listing Service Gets Favorable Appellate Ruling in Scraping Lawsuit

This is a follow-up to our massive post on anti-scraping lawsuits in the real estate industry from New Year’s Eve 2012 (Note: the portion on MRIS is about halfway through the post, labeled “Same Writ, Different Plaintiff”).

AHRN is a California real estate broker that owns and operates NeighborCity.com. The site gets its data in part by scraping from MLS databases–in this case, MRIS. As part of the scraping, however, AHRN had collected and displayed copyrighted photographs among the bits and pieces of general textual information about the properties. MRIS sent a cease and desist letter to AHRN, and filed suit alleging various copyright claims after the parties failed to agree on a license to use the photographs. Ultimately, a district court in Maryland granted a motion made by MRIS for a preliminary injunction.

When we last left off, the district court had revised its preliminary injunction order to enjoin only AHRN’s use of MRIS’s photographs–not the compilation itself or any textual elements that may be considered a part of it. Since then, AHRN appealed the injunction. On July 18th, the Fourth Circuit Court of Appeals affirmed.

Background

shutterstock_108008486.jpgAHRN argued that MRIS failed to show a likelihood of success on its copyright infringement claim because MRIS: (1) failed to register its copyright in the individual photographs when it registered the database, and (2) did not have a copyright interest in the photographs because the subscribers’ electronic agreement to MRIS’s terms of use failed to transfer those rights.

 MRIS Did Not Fail to Register Its Interest in the Photographs

This first question revolved around the scope of MRIS’s registrations. AHRN argued that MRIS’s collective work registrations did not cover the individual photographs because MRIS did not identify the names of the authors and titles of those works. MRIS argued that 17 U.S.C. §409 did not require any such identification when applied to collective works, and that its general description of the pre-existing photographs’ inclusion sufficed.

The court began its discussion by noting the “ambiguous” nature of §409’s language and its varying judicial interpretations. Some courts have barred infringement suits because the collective work registrant failed to list the authors, while others have allowed infringement suits where the registrant owns the rights to the component works as well as the collective work.

In this case, the court agreed with MRIS and found that the latter approach was more consistent with the relevant statutes and regulations:

    Adding impediments to automated database authors’ attempts to register their own component works conflicts with the general purpose of Section 409 to encourage prompt registration . . . and thwarts the specific goal embodied in Section 408 of easing the burden on group registrations[.]

As part of its decision, the court looked favorably upon the 3Taps case, in which Craigslist sued 3Taps and Padmapper for scraping and repackaging its online classified ads. In that case, the court reasoned that it would be “inefficient” to require registrants to list each author of an extremely large number of component works to which the registrant already had obtained an exclusive license.

Having found that MRIS’s general description satisfied § 409’s pre-suit registration requirement, the court moved on to the merits of MRIS’s infringement claim–more specifically, the question of whether MRIS’s Terms of Use actually transferred a copyright interest to its subscribers’ photographs.

E-SIGN Applies to Assignments of Copyrights and Overrides § 204

AHRN challenged MRIS’s ownership of the photographs by arguing that an MLS subscriber’s electronic agreement to MRIS’s Terms of Use does not operate as an assignment of rights under § 204, which requires a signed “writing.”

In a bad sign for AHRN, the court began its discussion by volunteering an argument that MRIS did not even bring up:

    [I]n situations where “the copyright [author] appears to have no dispute with its [assignee] on this matter, it would be anomalous to permit a third party infringer to invoke [Section 204(a)’s signed writing requirement] against the [assignee].”

With that in mind, the court went on to discuss the E-SIGN act’s impact on the conveyance of copyrights. After establishing the meaning of “e-signature,” the court focused on whether the act was limited from covering this type of situation.

    The Act provides that it “does not . . . limit, alter, or otherwise affect any requirement imposed by a statute, regulation, or rule of law . . . other than a requirement that contracts or other records be written, signed, or in nonelectric form[.]”

The court emphasized the phrase “other than,” reasoning that a plain reading of the E-SIGN language showed that Congress intended the provisions to limit § 204. It also noted that Congress did not list copyright assignments among the various agreements to which E-SIGN did not apply–nor was there a catchall that included such assignments.

The court then turned to the Hermosilla case, in which a district court in Florida upheld the validity of a copyright conveyance via e-mail. It emphasized the Hermosilla court’s reliance on the purpose of § 204–“to resolve disputes between copyright owners and transferees and to protect copyright holders from persons mistakenly or fraudulently claiming oral licenses or copyright ownership.” The appellate court agreed with the Hermosilla court that allowing assignment via e-mail actually helped cut down on these types of disputes.

    To invalidate copyright transfer agreements solely because they were made electronically would thwart the clear congressional intent embodied in the E-Sign Act.

All in all, the court basically said “we don’t see why E-SIGN shouldn’t apply.” Note that it did not pass judgment specifically on whether MRIS’s Terms of Use constituted a valid contract. It simply mentioned that AHRN waived that argument by not bringing it up sooner.

Source: http://blog.ericgoldman.org/archives/2013/07/multiple_listin_1.htm

Sunday, 30 November 2014

The Roots of Web Scraping and the Wisdom behind It

You may be wondering how data mining came into existence. This effective and innovative trend in business and research is indeed something commendable and the genius behind it is worth great reward. To have a clear view of the origin of web scraping, the following important factors that contribute to the creation of this phenomenon called data collection or web scraping are considered.

Foundations

Unlike any other innovation, no specific date can be clearly pointed out as the birthdate of data mining. It has come into existence as a result of several problem solving processes in major data gathering and handling situations. It appears that cyber technology has opened a Pandora box of “anything can happen” experiences. Moreover, the shift from physical to virtual data collection has resulted in a bulk of database that needed to be organized, analyzed and utilized.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/roots-web-scraping-wisdom-behind/

Thursday, 27 November 2014

Scraping Online Communities for your Outreach Campaigns

Online communities offer a wealth of intelligence for blog owners and business owners alike.

Exploring the data within popular communities will help you to understand who the major influencers are, what content is popular and who are the key content aggregators within your niche.

This is all fair and well to talk about, but is it feasible to be manually sorting through online communities to find this information out? Probably not.

This is where data scraping comes in.
What is Scraping and What Can it do?

I’m not going to go into great detail on what data scraping actually means, but to simplify this, here’s a definition from the Wikipedia page:

    “Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.”

Let me explain this with a little example…

Imagine a huge community full of individuals within your industry. Each person within the community has a personal profile page that contains information about their interests, contact details, social profiles, etc.

If you were tasked with gathering all of this data on all of the individuals then you might start to hyperventilating at the thought of all the copy and pasting you’d need to do.

Well, an alternative is to scrape all of this content so that you can automate all of this process and easily export all of this information into a manageable, more consumable format in a matter of seconds. It’d be pretty awesome, right?
Luckily for you, I’m going to show you how to do just that!
The Example of Inbound.org

Recently, I wanted to gather a list of digital marketers that were fairly active on social media and shared a lot of content online within communities. These people were going to be some of my core targets to get content from the blog in front of.

To do this, I first found some active communities online where these types of individuals hang out. Being a digital marketer myself, this process was fairly easy and I chose Inbound.org as my starting place.

Scoping out Data Requirements
Each community is different and you’ll be able to gather varying information within each.

The place to look for this information is within the individual user profile pages. This is usually where the contact information or links to social media accounts are likely to be displayed.

For this particular exercise, I wanted to gather the following information:

    Full name
    Job title
    Company name and URL
    Location
    Personal website URL
    Twitter URL, handle and follower/following stats
    Google+ URL, follower count and list of contributor URLs
    Profile image URL
    Facebook URL
    LinkedIn URL

With all of this information I’ll be able to get a huge amount of intelligence about the community members. I’ll also have a list of social media accounts to add and engage with.
On top of this, with all the information on their websites and sites that they write for, I’ll have a wealth of potential link building prospects to work on.

Inbound.org Profiles

You’ll see in the above screenshot that a few of the pieces of data are available to see on the Inbound.org user profiles. We’ll need to get the other bits of information from the likes of Twitter and Google+, but this will all stem from the scraping of Inbound.org.

Sign Up To My Newsletter
Scraping the Data

The idea behind this is that we can set up a template based on one of the user profiles and then automate the data gathering across the rest of the profiles on the site.

This is where you’ll need to install the SEO Tools plugin for Excel (it’s free). If you’ve not used this plugin before, don’t worry – I’ve put together a full tutorial here.

Once you’ve installed the plugin, you’re good to go on the actual scraping side of things…
Quick Note: Don’t worry if you don’t have a good knowledge of coding – you don’t need it. All you’ll need is a very basic understanding of reading some code and some basic Excel skills.

To begin with, you’ll need to do a little Excel admin. Simply add in some column titles based around the data that you’re gathering. For example, with my example of Inbound.org, I had, ‘Name’, ‘Position’, ‘Company’, ‘Company URL’, etc. which you can see in the screenshot below. You’ll also want to add in a sample profile URL to work on building the template around.

spreadsheet admin
Now it’s time to start getting hands on with XPath.
How to Use XPathOnURL()

This handy little formula is made possible within Excel by the SEO Tools plugin. Now, I’m going to keep this very basic because there are loads of XPath tutorials available online that can go into the very advanced queries that are possible to use.

For this, I’m simply going to show you how to get the data we want and you can have a play around yourself afterwards (you can download the full template at the end of this post).

Here’s an example of an XPath query that gathers the name of the person within the profile that we’re scraping:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

A2 is simply referencing the cell that contains the URL that we’re scraping. You’ll see in the screenshot above that this is Jason Acidre’s profile page.

The next part of the formula is the XPath.

What this essentially says is to scrape through the HTML to find a tag that has ‘user-profile’ id attached to it. This could be a div, span, a or whatever.

Once it’s found this tag, it then needs to look at the first h2 tag within this area and grab the text within it. This is Jason’s name, which you’ll see in the screenshot below of the code:

website code

Don’t be put off at this stage because you don’t need to go manually trawling through code to build these queries, there’s a much simpler way.

The easiest way to do this is by right-clicking on the element you want to scrape on the webpage (within Chrome); for example, on Inbound.org, this would be the profile name. Now click ‘Inspect element’.

inspect element

The developer tools window should now appear at the bottom of your browser (or in a separate window). Within that, you should see the element that you’ve drilled down on.

All you need to do now is right-click on it and press ‘Copy XPath’.
copy XPath

This will now copy the XPath code for your Excel formula to the clipboard. You’ll just need to add in the first part of the query, i.e. =XPathOnUrl(A2,

You can then paste in the copied XPath after this and add a closing bracket.

Note: When you use ‘Copy XPath’ it will wrap some parts of the code in double apostrophes (“) which you’ll need to change to single apostrophes. You’ll also need to wrap the copied XPath in double apostrophes.

Your finished code will look like this:
=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

You can then apply this formula against any Inbound.org profile and it will automatically grab the user’s full name. Pretty good, right?

Check out the full video tutorial below that I’ve put together that talks you through this whole process:

[sws_blue_box box_size=””] Want more useful video tutorials? Subscribe to my YouTube channel now![/sws_blue_box]

XPath Examples for Grabbing Other Data

As you’re probably starting to see, this technique could be scaled across any website online. This makes big data much more attainable and gives you the kind of results that an expensive paid tool would offer without any of the cost – bonus!

Here’s a few more examples of XPath that you can use in conjunction with the SEO Tools plugin within Excel to get some handy information.

Twitter Follower Count

If you want to grab the number of followers for a Twitter user then you can use the following formula. Simply replace A2 with the Twitter profile URL of the user you want data on. Just a quick word of warning with this one; it looks like it’s really long and complicated, but really I’ve just used another Excel formula to snip of the text ‘followers’ from the end.

=RIGHT(XPathOnUrl(D57,"//li[@class='ProfileNav-item ProfileNav-item--followers']"),LEN(XPathOnUrl(D57,"//li[@class='ProfileNav-item ProfileNav-item--followers']"))-10)

Google+ Follower Count

Like with the Twitter follower formula, you’ll need to replace A2 with the full Google+ profile URL of the user you want this data for.

=XPathOnUrl(H67,"//span[@class='BOfSxb']")

List of ‘Contributor to’ URLs

I don’t think I need to tell you the value of pulling in a list of websites that someone contributes content to. If you do want to know then check out this post that I wrote.

This formula is a little more complex than the rest. This is because I’m pulling in a list of URLs as opposed to just one entity. This requires me to use the StringJoin function to separate all of the outputs with a comma (or whatever character you’d like).

Also, you may notice that there is an additional section to the XPath query, “href”. This pulls in the link within the specific code block instead of the text.

As you’ll see in the full Inbound.org scraper template that I’ve made, this is how I pull in the Twitter, Google+, Facebook and LinkedIn profile links.

You’ll want to replace A2 with the Google+ profile URL of the person you wish to gather data on.

=StringJoin(", ",XPathOnUrl(A2,"//a[@rel='contributor-to nofollow']","href"))

Twitter Profile Image URL
If you want to get a large version of someone’s Twitter profile image then I’ve got just the thing for you.
Again, you’ll just need to substitute A2 with their Twitter profile URL.
=XPathOnUrl(A2,"//*[@class='profile-picture media-thumbnail js-tooltip']","data-resolved-url-large")


Some Findings from the Data I’ve Gathered

With all big data sets will come some interesting findings. Here’s a few little things that I’ve found from the top 100 influential users on Inbound.org.

average followers chart

The chart above maps out the average number of followers that the top 100 users have on both Twitter (12,959) and Google+ (9,601). As well as this, it shows the average number of users that they follow on Twitter (1,363).

The next thing that I’ve looked at is the job titles of the top 100 users. You can see the most common occurrences of terms within the tag cloud below:

Job titlesFinally, I had a look through all of the domains listed within each of the top 100 Inbound.org users’ Google+ ‘contributor to’ sections and mapped out the most frequently mentioned sites.

Here’s the spread of domains that were the most popular to be contributed to:

domain frequency
It Doesn’t Stop There

As you’ve probably gathered, this can be scaled out across pretty much any community/forum/directory/website online.

With this kind of intelligence in your armoury, you’ll be able to gather more intelligence on your targets and increase the effectiveness of your outreach campaigns dramatically.

Also, as promised, you can download my full Inbound.org scraper template below:

[sdfile url=”http://www.matthewbarby.com/goodies/MatthewBarby-Inbound-Scraper.xlsx” redirect=”http://www.matthewbarby.com/thanks-downloading-inbound-scraper/”]

TL;DR

    Online communities hold valuable data on your target audiences – use it!
    Scale out your intelligence gathering by brushing up on your XPath.
    Download my Inbound.org scraper template and let it work its magic.

Source: http://www.matthewbarby.com/scraping-communities-with-xpath/

Wednesday, 26 November 2014

Web Data Extraction: driving data your way

Most businesses rely on the web to gather data such as product specifications, pricing information, market trends, competitor information, and regulatory details. More often than not, companies collect this data manually—a process that not only takes a significant amount of time, but also has the potential to introduce costly errors.

By automating data extraction, you're able to free yourself (and your pointer finger) from hours of copy/pasting, eliminate human errors, and focus on the parts of your job that make you feel great.

Web data extraction: What it is, why it's used, and how to get it right on an ongoing basis

Web data extraction, screen scraping, web harvesting—while these terms may have different connotations they all essentially point to the same thing: plucking data from the web and populating it in an organized way in another place for further analysis or more focused use. In an era where “big data” has become a commonplace concept, the appeal of web data extraction has grown; it’s an extremely efficient alternative to web browsing, and culls very specific data for a focused purpose.

How it's used

While each company’s needs vary, data extraction is often used for:

    Competitive intelligence, including web popularity, social perception, other sites linking to them, and placement of competitor advertisements

    Gathering financial data including stock market movement, product pricing, and more

    Creating continuity between price sheets and online websites, catalogs, or inventory databases

    Capturing product specifications like dimensions, color, and materials

    Pulling tabular data from multiple sources for in-depth analysis

Interestingly, some people even find that web data extraction can aid them in their leisure time as well, pulling data from blogs and websites that pertain to their hobbies or interests, and creating their own library of organized information on a topic. Perhaps, for instance, you want a list of all the designers that George Clooney wears (hey- we won’t question what you do in your free time). By using web scraping tools, you could automatically extract this type of data from, say, a fashion blogger who follows celebrity style, and create your own up-to-date shopping list of items.

How it's done

When you think of gathering data from the web, you should mentally juxtapose two different images: one of gathering a bucket of sand one grain at a time, and one of filling a bucket with a shovel that has the perfect scoop size to fill it completely in one sitting. While clearly the second method makes the most sense, the majority of web data extraction happens much like the first process--manually, and slowly.

Let’s take a look at a few different ways organizations extract data today.

The least productive way: manually

While this method is the least efficient, it’s also the most widespread. On the plus side, you need to learn absolutely nothing except “Ctrl+C/V” to use this method, which explains why it is the generally preferred method, despite the hours of time it can take. Imagine, for instance, managing a sales spreadsheet that keeps inventory up to date so that the information can be properly disseminated to a global sales team. Not only does it take a significant amount of time to update the spreadsheet with information from, say, your internal database and manufacturer’s website, but information may change rapidly, leaving sales reps with inaccurate information regardless.

Finding someone in the organization with a talent for programming languages like Python

Generally, automating a task without dedicated automation software requires programming, and therefore an internal resource with a solid familiarity with programming languages to create the task and corresponding script. While most organizations do, in fact, have a resource in IT or engineering with this type of ability, it often doesn’t seem like a worthy time investment for that person to derail the initiatives he or she is working on to automate web data extraction. Additionally, if companies do choose to automate using in-house resources, that person will find himself beholden to a continuing obligation, since he or she will need to adjust scripting if web objects and attributes change, disabling the task.

Outsourcing via Elance or oDesk

Unless there is a dedicated resource ready to automate and maintain data extraction processes (and most organizations wouldn’t necessarily choose to use their in-house employee time this way), companies might turn to outsourcing companies such as Elance or oDesk to hire contract help. While this is an effective way to automate a task using a resource that has a level of acumen in automation, it represents an additional cost--be it one time or on a regular basis as data extraction requirements change or increase.

Using Excel web queries

Since more often than not, data extracted from the web is often populated into an Excel spreadsheet, it’s no wonder that Excel includes web query tools expressly for that purpose. These tools are particularly useful in pulling tabular data from a website (such as product specifications, legal codes, stock prices, and a host of other information) and automatically pushing the data into a spreadsheet. Excel queries do have limitations and a learning curve, however, particularly when creating dynamic web queries. And clearly, if you’re hoping to populate the information in other sources, such as external databases, there is yet another level of difficulty to navigate.

How automation simplifies web data extraction

Culling web data quickly

Using automation is the simplest way to extract web data. As you execute the steps necessary to perform the task one time, a macro recorder captures each action, automatically generates an easily-editable script, and lets you specify how often you would like to repeat the task, and at what speed.

Maintaining the highest level of accuracy

With humans copy/pasting data, or comparing between multiple screens and entering data manually into a spreadsheet, you’re likely to run into accuracy issues (sometimes directly proportionate to the amount of time spent on the task and amount of coffee in the office!) Automation software ensures that “what you see is what you get,” and that data is picked up from the web and put back down where you want it without a hitch.

Storing web data in your preferred format

Not only can you accurately transfer data with automation software, you can also ensure that it’s populated into spreadsheets or databases in the format you prefer. Rather than simply dumping the data into a spreadsheet, you can ensure that the right information is put into the proper column, row, field, and style (think, for instance, of the difference between writing a birth date as “03/13/1912” and “12/3/13”).

Simplifying data analysis

Automation software allows you to aggregate data from disparate sources or enormous stockpiles of structured or unstructured data in a way that makes sense for your business analysis needs. This way, the majority of employees in an organization can perform some level of analysis on their own, making it easier to surface information that informs business decisions.

Reacting to changes without a hitch

Because automation software is built to recognize icons, images, symbols, and other objects regardless of their position on a screen, it can automate processes in a self-perpetuating manner. For example, let’s say you automate data retrieval from a certain chart on a retailer’s website without automation software. If the retailer decides to move that object to another area of the screen, your task would no longer produce accurate results (or work at all), leaving you to make changes to the script (or find someone who can), or re-record the task altogether. With image recognition capabilities, however, the system “memorizes” the object itself, not merely its coordinates, so that the task can continue to run irrespective of changes.

The wide sweeping appeal of automation software

Companies often pick a comprehensive automation solution not only because of its ability to effectively automate any web data extraction task, but also because it goes beyond data extraction. Automation software can permeate into other areas of the business as well, making tasks such as application integration, data migration, IT processes, Excel automation, testing, and routine tasks such as launching applications or formatting files faster and more accurate. Because it requires no programming experience to use, adoption rates are higher and businesses get more “bang for their buck.”

Almost any organization can benefit from using automation software, particularly as they grow and scale. If you are looking to quit “moving grains of sand” and start claiming back time in your day, there are a few steps you can take:

 Watch a short video that shows how web data extraction is done with automation software

 Download a free trial and start reaping the benefits of downloading even just a couple of tasks today.

 See how tasks are automated with our short, step-by-step how-to-sheets (and then give it a try yourself!)

Source: https://www.automationanywhere.com/web-data-extraction

Sunday, 23 November 2014

Outsourcing Data Mining is a Wise Business Decision

Most businesses nowadays have a large volume of raw data that is never processed, because of the lack of time or resources. If your business is facing a similar situation, then you are missing out on valuable information. Without the right information, your company will be unable to make accurate business decisions.

The right information can play a key role in promoting the growth of your business. When unprocessed data is entered, filtered, classified and converted into a workable format, it can be used to maximize your profits, ameliorate your risks and run a seamless workflow.

Over the years, data mining has proved to be extremely useful in various industries, be it, healthcare, direct marketing, e-commerce, finance, customer relationship management or telecommunications. With the right information, companies have been able to make fast and effective business decisions.

Why outsource data mining?

Data mining requires the expertise of professional business and financial analysts who understand how to acquire important information from vast amounts of data. If data mining is done in-house, it can become expensive and time consuming. It can also shift your focus away from core business activities. Outsourcing data mining on the other hand is more fast, cost-effective and can give you access to professional services.

4 commonly outsourced data mining functions

Most companies outsource one or more of the following data mining functions to India:

1. Data congregation: Data is extracted from various web pages and websites, by using methods like web and screen scraping. The collected data is then entered into a database.

2. Contact data collection: Different websites are searched and information concerning contacts is collected.

3. E-commerce data: Data about varied online stores are collected, taking into account information about prices, discounts and products.

4. Data about competitors: Data about business competitors are collected to help a company gauge itself against its competition. With such valuable data, you can effectively re-design your marketing strategy and pricing matrix.

8 advantages of outsourcing data mining to India

With data mining out of your hands, your business can make huge savings in terms of time, money and infrastructure. The following are some of the benefits that you can leverage by outsourcing data mining to India:

    Get qualified and highly skilled data mining experts to work for you at an extremely affordable cost

    Be assured of the quality of information, as Indian data entry companies only extract information from reliable websites and databases

    Save on the cost of investing on the latest data mining software and technology, as your Indian service provider will be making these investments

    Get your data processed within a short turnaround time of 3,6 or 12 hours as Indian data mining companies can provide efficient data mining within a few hours

    When compared to in-house data mining, outsourcing data mining can be a lot cheaper and also bring you better results

    Stay assured about the complete privacy, security and confidentiality of your valuable data as Indian data mining companies use the latest technology to ensure 100% safety

    Get access to data with a wide market coverage as your Indian data mining provider will be serving many business with varied data mining needs

    Improve your overall productivity and generate more profits by making informed decisions about your business

Have you outsourced data mining before? If yes, which data mining service did you outsource? Did you find outsourcing more advantageous that in-house data mining. Let us know.

Source: http://blog.flatworldsolutions.com/outsourcing-data-mining-is-a-wise-business-decision/

Wednesday, 19 November 2014

Online Data Entry & Web Scraping Services

To operate any type of organization smoothly, it is essential to have precise data that is accurate and reliable. When your business expands, data entry on an ongoing basis is a tedious job. It’s a very time consuming task that can often distract employees focusing on core business areas.

Webpop offers all forms of online data entry services that are quick and accurate. We provide data entry services across all verticals that can be completely customized to your business requirements.

Database Population Services

Database population involves content collection from various database sources. This requires a lot of attention to detail, dedication and awareness and can prove a formidable task, especially for websites that largeley depend on it.

Webpop offer a quick and efficient database population service that helps relieve the stress from an extremely laborius task and leaves you more time to focus on more important aspects of your business. By investing just a fraction of the cost, you can outsource your database population tasks to us.

Web Scraping Services

Webpop have been assisting clients in searching, extracting and collecting data from the web for the past 5 years using the latest techniques in web scraping techology. We can scrape all types of information from a variety of sources such as websites, blogs, online directories, e-commerce websites and podcasts to name a few. We use a varied selection of automated and manual web scraping technologies to extract, gather and collect all of the required data you require from any chosen website(s) on the World Wide Web.

We can simplify the whole process from collection to population, converting your scraped data in to structured formats that are applicable to your website. This can be offered as a one time service or an ongoing basis that will assist you in constantly keeping your website’s content fresh and up to date. We can crawl competitors websites, gather sales leads, product details, pricing methodologies and also creat custom campaigns to suit your project’s requirements.

Over the years Webpop has grown from strength-to-strength by providing all types of data entry, database population and web scraping services. All of our data entry services are performed with care, due dilligence and attention to detail. We enjoy a challenge and pride ourselves on delivering results whilst working on precarious projects that require precision and total commitment.

Source:http://www.webpopdesign.com/services/data-entry/

Monday, 17 November 2014

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since been a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the nearly half-dozen or so years they continued their careers paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This ended up giving them the idea for Kimono, as it turned out. To get the data they needed for the site, they had to scrape data from several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.

Screen Shot 2014-02-18 at 4.29.05 PM

Screen Shot 2014-02-18 at 4.29.27 PM

That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.

Screen Shot 2014-02-18 at 4.29.50 PM

For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand*. And even though only a month old, they’ve seen active users in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source:http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Thursday, 13 November 2014

Big Data Democratization via Web Scraping

Big Data Democratization via Web Scraping

If  we had to put democratization of data inline with the classroom definition of democracy, it would read- Data by the people, for the people, of the people. Makes a lot of sense, doesn’t it? It resonates with the generic feeling we have these days with respect to easy access to data for our daily tasks. Thanks to the internet revolution, and now the social media.

Big-data-crawling

Big Data web Crawling

By the people- most of the public data on the web is a user group’s sentiments, analyses and other information.

Of the people- Although the “of” here does not literally mean that the data is owned, all such data on the internet either relates to the user group itself or its views on things.

For the people- Most of this data is presented via channels (either social media, news, etc.) for public benefit be it travel tips, daily news feeds, product price comparisons, etc.

Essentially, data democratization has come to mean that by leveraging cloud computing, data that’s mostly user-generated on the internet has become accessible by all industries- big or small for their own internal use (commercial or not). This democratization has been put to use for unearthing hidden patterns from big blobs of datasets. Use cases have evolved with the consumer internet landscape and Big Data is now being used for various other means quite unanticipated.

With respect to the democratization, we’ve also heard enough about how data analytics is paving way beyond data analysts within companies and becoming available to even the non-tech-savvies. But did anyone mention DaaS providers who aid in the very first phase of data acquisition? Data scraping or web crawling (whatever your lingo is) has come to become an indivisible part of data democratization, especially when talking large-scale. The first step into bringing the public data to use is acquiring it which is where setting up web crawlers internally or partnering with DaaS providers comes to play. This blog guides towards making a choice. Its not always all the data that companies crunch or should crunch from the web. There’s obviously certain channels that are of more interest to the community than the rest and there lies the barrier- to identify sources of higher ROI and acquire data in a machine-readable format.

DaaS providers usually come to help with the entire data acquisition pipeline- starting from picking the right sources through crawl, extraction, dedup as well as data normalization based on specific requirements. Once the data has been acquired, its most likely published on another channel. Such network effect bolsters the democracy.

Steps in Data Acquisition Pipeline

crawl-extract-norm

Note- PromptCloud only delivers structured data as per the schema provided.

So while democratization may refer to easy access of computing resources in order to draw patterns from Big Data, it could also be analogous to ensuring right data in the right format at right intervals. In fact, DaaS providers have themselves used this democracy to empower it further.

Source:https://www.promptcloud.com/blog/big-data-democratization-using-web-scraping-2/

Wednesday, 12 November 2014

Why Businesses Need Data Scraping Service?

With the ever-increasing popularity of internet technology there is an abundance of knowledge processing information that can be used as gold if used in a structured format. We all know the importance of information. It has indeed become a valuable commodity and most sought after product for businesses. With widespread competition in businesses there is always a need to strive for better performances.

Taking this into consideration web data scraping service has become an inevitable component of businesses as it is highly useful in getting relevant information which is accurate. In the initial periods data scraping process included copying and pasting data information which was not relevant because it required intensive labor and was very costly. But now with the help of new data scraping tools like Mozenda, it is possible to extract data from websites easily. You can also take the help of data scrapers and data mining experts that scrape the data and automatically keep record of it.

How Professional Data Scraping Companies and Data Mining Experts Device a Solution?

Data Scraping Plan and Solutions

ImageCredit:http://www.loginworks.com/images/newscapingpage/data-as-service-plan.png

Why Data Scraping is Highly Essential for Businesses?

Data scraping is highly essential for every industry especially Hospitality, eCommerce, Research and Development, Healthcare, Financial and data scraping can be useful in marketing industry, real estate industry by scraping properties, agents, sites etc., travel and tourism industry etc. The reason for that is it is one of those industries where there is cut-throat competition and with the help of data scraping tools it is possible to extract useful information pertaining to preferences of customers, their preferred location, strategies of your competitors etc.

It is very important in today’s dynamic business world to understand the requirements of your customers and their preferences. This is because customers are the king of the market they determine the demand. Web data scraping process will help you in getting this vital information. It will help you in making crucial decisions which are highly critical for the success of business. With the help of data scraping tools you can automate the data scraping process which can result in increased productivity and accuracy.

Reasons Why Businesses Opt. For Website Data Scraping Solutions:

Website Scraping
Demand For New Data:

There is an overflowing demand for new data for businesses across the globe. This is due to increase in competition. The more information you have about your products, competitors, market etc. the better are your chances of expanding and persisting in competitive business environment. The manner in which data extraction process is followed is also very important; as mere data collection is useless. Today there is a need for a process through which you can utilize the information for the betterment of the business. This is where data scraping process and data scraping tools come into picture.

ImageCredit:3idatascraping.com
Capitalize On Hot Updates:

Today simple data collection is not enough to sustain in the business world. There is a need for getting up to date information. There are times when you will have the information pertaining to the trends in the market for your business but they would not be updated. During such times you will lose out on critical information. Hence; today in businesses it is a must to have recent information at your disposal.

The more recent update you have pertaining to the services of your business the better it is for your growth and sustenance. We are already seeing lot of innovation happening in the field of businesses hence; it is very important to be on your toes and collect relevant information with the help of data scrapers. With the help of data scrapping tools you can stay abreast with the latest developments in your business albeit; by spending extra money but it is necessary tradeoff in order to grow in your business or be left behind like a laggard.

Analyzing Future Demands:

Foreknowledge about the various major and minor issues of your industry will help you in assessing the future demand of your product / service. With the help of data scraping process; data scrapers can gather information pertaining to possibilities in business or venture you are involved in. You can also remain alert for changes, adjustments, and analysis of all aspects of your products and services.

Appraising Business:

It is very important to regularly analyze and evaluate your businesses. For that you need to evaluate whether the business goals have been met or not. It is important for businesses to know about your own performance. For example; for your businesses if the world market decides to lower the prices in order to grow their customer base you need to be prepared whether you can remain in the industry despite lowering the price. This can be done only with the help of data scraping process and data scraping tools.

Source:http://www.habiledata.com/blog/why-businesses-need-data-scraping-service

Monday, 10 November 2014

Example of Scraping with Selenium WebDriver in C#

In this article I will show you how it is easy to scrape a web site using Selenium WebDriver. I will guide you through a sample project which is written in C# and uses WebDriver in conjunction with the Chrome browser to login on the testing page and scrape the text from the private area of the website.

Downloading the WebDriver

First of all we need to get the latest version of Selenium Client & WebDriver Language Bindings and the Chrome Driver. Of course, you can download WebDriver bindings for any language (Java, C#, Python, Ruby), but within the scope of this sample project I will use the C# binding only. In the same manner, you can use any browser driver, but here I will use Chrome.

After downloading the libraries and the browser driver we need to include them in our Visual

Studio solution:

VS Solution

Creating the scraping program

In order to use the WebDriver in our program we need to add its namespaces:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;


Then, in the main function, we need to initialize the Chrome Driver:

using (var driver = new ChromeDriver())

{

 This piece of code searches for the chromedriver.exe file. If this file is located in a directory different from the directory where our program is executed, then we need to specify explicitly its path in the ChromeDriver constructor.

When an instance of ChromeDriver is created, a new Chrome browser will be started. Now we can control this browser via the driver variable. Let’s navigate to the target URL first:

driver.Navigate().GoToUrl("http://testing-ground.scraping.pro/login");

Then we can find the web page elements needed for us to login in the private area of the website:

var userNameField = driver.FindElementById("usr");
var userPasswordField = driver.FindElementById("pwd");
var loginButton = driver.FindElementByXPath("//input[@value='Login']");

Here we search for user name and password fields and the login button and put them into the corresponding variables. After we have found them, we can type in the user name and the password  and press the login button:

userNameField.SendKeys("admin");
userPasswordField.SendKeys("12345");
loginButton.Click();


At this point the new page will be loaded into the browser, and after it’s done we can scrape the text we need and save it into the file:

var result = driver.FindElementByXPath("//div[@id='case_login']/h3").Text;

File.WriteAllText("result.txt", result);

That’s it! At the end, I’d like to give you a bonus – saving a screenshot of the current page into a

file:

driver.GetScreenshot().SaveAsFile(@"screen.png", ImageFormat.Png);

The complete program listing

using System.IO;
using System.Text;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;


namespace WebDriverTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize the Chrome Driver
            using (var driver = new ChromeDriver())
            {
                // Go to the home page
                driver.Navigate().GoToUrl("http://testing-ground.scraping.pro/login");

                // Get the page elements
                var userNameField = driver.FindElementById("usr");
                var userPasswordField = driver.FindElementById("pwd");
                var loginButton = driver.FindElementByXPath("//input[@value='Login']");

                // Type user name and password
                userNameField.SendKeys("admin");
                userPasswordField.SendKeys("12345");

                // and click the login button
                loginButton.Click();

                // Extract the text and save it into result.txt
                var result = driver.FindElementByXPath("//div[@id='case_login']/h3").Text;
                File.WriteAllText("result.txt", result);

                // Take a screenshot and save it into screen.png
                driver.GetScreenshot().SaveAsFile(@"screen.png", ImageFormat.Png);
            }
        }
    }
}

Also you can download a ready project here.

Conclusion

I hope you are impressed with how easy it is to scrape web pages using the WebDriver. You can naturally press keys and click buttons as you would in working with the browser. You don’t even need to understand what kind of HTTP requests are sent and what cookies are stored; the browser does all this for you. This makes the WebDriver a wonderful tool in the hands of a web scraping specialist.

Source:http://scraping.pro/example-of-scraping-with-selenium-webdriver-in-csharp/

Friday, 7 November 2014

Web Scraping: Business Intelligence

Web scraping is simply getting of information that is both hidden and unhidden from the internet. Web scraping is one of the latest technologies used in harvesting data from WebPages. It has been used to extract useful information for practical and beneficial applications and its interpretation has been tested in decision making. Web scraping is a new term that overshadows the traditional data harvesting technique that was used before. It has been regarded as knowledge discovery in databases for research and even marketing monitoring.

This article explores the various business intelligence ways in which web scraping can be used to be of importance.

Web scraping services has been used by many companies that have a strong customer focus. These companies range from sectors like retail, financial services, and marketing and communication organizations. It quite important to realize that web scraping has great signifies and impact in the varied commercial applications for the better understanding and prediction of the critical data. The data may range from stocks to consumer behaviors. The consumer behaviors are better shown in trends like customer profiles, purchasing and industry analysis among others.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-business-intelligence/

Wednesday, 5 November 2014

Web Scraping Popularity Soars

The world is stirred because of the ever-growing web scraping success in almost all of its services. Success stories pertaining to the benefits of online data collection in business, research, politics, health, and almost all aspects of human life are endless. With this popularity surge, it has become a hot issue and many are questioning its legality and reliability.

Looking back, this simple harvesting of pertinent data from competitors and the global market in general like anything else started as a non-threatening and advanced form of web research. Eventually, when the benefits begin to manifest and the system improves, many are lured into it that it has become one of the strongest and fastest growing business in the world.

Simple Beginnings

As naturally as a law of life that great things come from small beginnings, data mining was conceived as a process in gaining information, mostly in research. This act of collecting information through the internet was never imagined to be what it has become nowadays.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-popularity-soars/

Monday, 8 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python

mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is

streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500

contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use

browser.open() at that url all you get is the first ~5 rows.

I'm guessing its because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a

time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no javascript or AJAX in the website (or at least none are visible when you use the 'view-

source'). SO I don't think its a javascript issue.

Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an html parser like lxml with xpath expressions.

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>



Similarly, you can create xpath expression of your choice to work with.


Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-

fashion