No matter how intelligent and sophisticated your technology is, what you ultimately need for Big Data Analysis is data. Lots of data. Versatile and coming from many sources in different formats. In many cases, your data will come in a machine-readable format ready for processing — data from sensors is an example. Such formats and protocols for automated data transfer are rigidly structured, well-documented and easily parsed. But what if you need to analyze information meant for humans? What if all you have are numerous websites?
This is where data scraping, or web scraping, steps in: the process of importing information from a website into a spreadsheet or a local file saved on your computer. In contrast to regular parsing, data scraping processes output intended for display to an end user rather than as input to another program, so it is usually neither documented nor structured. To process such data successfully, data scraping often involves ignoring binary data, such as images and multimedia, along with display formatting, redundant labels, superfluous commentary, and other information deemed irrelevant.
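To make the idea of "ignoring what is irrelevant" concrete, here is a minimal sketch in Python that uses only the standard library's html.parser module. The sample page, the TextScraper class, the scrape_text helper and the list of skipped tags are all assumptions made up for illustration; a real scraper would need rules tuned to the site it targets.

```python
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Collects human-visible text and skips non-content elements."""

    # Tags whose contents are code or styling, not readable text
    # (an assumed list for this sketch, not an exhaustive one).
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []      # visible text fragments, in page order

    def handle_starttag(self, tag, attrs):
        # Images and other media carry no text data, so they are ignored
        # simply by never reading their attributes.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def scrape_text(html):
    """Return the visible text fragments found in an HTML string."""
    parser = TextScraper()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: markup meant for a browser, not for a program.
SAMPLE_PAGE = """
<html><head><style>p { color: red; }</style></head>
<body>
  <h1>Price list</h1>
  <img src="logo.png" alt="logo">
  <p>Widget: <b>9.99</b> USD</p>
  <script>trackVisit();</script>
</body></html>
"""

print(scrape_text(SAMPLE_PAGE))
```

On this sample the sketch keeps only the heading and the price fragments, while the stylesheet, the image and the tracking script are dropped. Turning those fragments into structured records (for example, pairing "Widget:" with "9.99") is the site-specific part that every scraper has to work out for itself.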
Applications of Data Scraping
When we start thinking about data scraping, the first application that comes to mind, and an irritating one, is email harvesting: uncovering people’s email addresses to sell them on to spammers or …