Web scraping, or data scraping, is the process of extracting vast amounts of unstructured data from websites using software and storing it in a structured form.
Web scraping is an important part of how the modern world and businesses work. Many online services rely on web scraping to build their databases. Google, for example, invests heavily in large-scale crawling and scraping to build its search indexes (Goldman, 2015).
There are several ways to extract data from the web:
I. The most common way of accessing web data is through APIs (Application Programming Interfaces). Most social media sites, such as Facebook and Twitter, provide APIs through which structured data can be accessed.
II. People with a computing background can use either a ‘web crawler’, an automated program, or a high-level programming language such as Python to scrape web data (Ray, 2015).
“Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.”
III. AgentMat, an implemented scraping system, can extract information in any desired format (Beňo, Mišek & Zavoral, 2009).
IV. With the help of WebHarvy, a visual web scraper, and Import.io, a data-mining tool, even a non-programmer can scrape online information (Ray, 2015).
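To illustrate the programming route (II) above, the following is a minimal sketch of turning unstructured HTML into structured data using only Python's standard library. The sample HTML and the choice of extracting <li> items are illustrative assumptions; real scrapers typically fetch live pages and use libraries such as BeautifulSoup or Scrapy.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="product">Widget</li>
    <li class="product">Gadget</li>
  </ul>
</body></html>
"""

class ProductScraper(HTMLParser):
    """Collects the text of every <li> element into a structured list."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())

scraper = ProductScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.items)  # unstructured HTML reduced to a structured list
```

The same callback pattern scales to real pages: the parser walks the tag stream and the handlers decide which fragments to keep.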
People scrape data all the time, especially data scientists and data managers. But is it legal? Scraping one's own website is not illegal, but scraping someone else's website without permission may violate that site's ‘Terms of Service’ (ToS). Yet many people are unconcerned with the legal and ethical challenges of web scraping.

Nowadays, people share every moment of their lives on social networking sites. These data are visible to the public, but that does not make them freely available to everyone for any use, and it should remain that way. According to the UK Data Protection Act, ‘personal data’ is information relating to an individual who can be identified from it. Accessing someone's personal data without permission is a violation of data security, and the Data Protection Act gives individuals the right to find out what information the government and other organisations hold about them and to take action accordingly to protect their privacy.
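One small, concrete step toward scraping with permission is checking a site's robots.txt before fetching anything. The sketch below parses an inline robots.txt with Python's standard library; the file contents and the user-agent name "MyScraper" are hypothetical, and a real scraper would download the file from the target site instead. Note that robots.txt is a courtesy convention, not a substitute for reading a site's Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real scraper would fetch
# https://example.com/robots.txt with RobotFileParser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check each URL before requesting it.
print(rp.can_fetch("MyScraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper", "https://example.com/private/data"))  # False
```

A polite scraper skips any URL for which `can_fetch` returns False.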
THE OKCUPID AND FACEBOOK CASES ARE EXAMPLES OF UNAUTHORIZED DATA SCRAPING THAT CROSSED ETHICAL LINES. THE QVC CASE SHOWS THAT WEB SCRAPING IS NOT ONLY AGAINST ETHICS; IT CAN CREATE LEGAL ISSUES TOO.
In 2016, researchers from Aarhus University released a dataset containing intimate details (usernames, age, gender, religion, relationship preferences and more) of almost 70,000 users of the online dating site OkCupid (Hackett, 2016). They scraped these data with software in order to study people's behaviour and personality. Because the data were already public, they saw no harm in using them, or in publishing them alongside a draft paper on the Open Science Framework for other researchers to use. In 2008, Harvard researchers used the same “already public” excuse to publish a dataset on 1,700 college students compiled from their Facebook profiles (Zimmer, 2016). Cases like OkCupid and Facebook breach both privacy and people's sense of security. Unapproved data harvesting is unethical, and using that information without permission is illegal, even when it serves academic purposes.
In May 2014, the automated scraper of Resultly, a start-up shopping app, overloaded the servers of QVC (an American TV retailer) and cost QVC around $2 million in revenue (Goldman, 2015). QVC blocked Resultly and took legal action under the Computer Fraud and Abuse Act. Many companies now run their own web scrapers, but they should not practise unauthorized collection and use of data; and companies whose websites are scraped frequently need to take care to secure their data. Breach of contract, copyright infringement and trespass to chattels are among the offences unauthorized scraping can cause, and there are serious laws against these misdeeds. Maintaining morality in society and ethics in business is also essential for a bright future.
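The QVC incident was a failure of throttling: the scraper issued requests faster than the server could absorb. A minimal sketch of client-side rate limiting follows; the one-request-per-second interval and the stubbed `fetch` function are illustrative assumptions, not a prescription.

```python
import time

MIN_INTERVAL = 1.0   # assumed minimum seconds between requests
_last_request = None  # monotonic timestamp of the previous request

def polite_fetch(url, fetch=lambda u: f"<html>{u}</html>"):
    """Wait until at least MIN_INTERVAL has passed since the last call,
    then perform the (stubbed) request."""
    global _last_request
    now = time.monotonic()
    if _last_request is not None:
        elapsed = now - _last_request
        if elapsed < MIN_INTERVAL:
            time.sleep(MIN_INTERVAL - elapsed)  # throttle instead of flooding
    _last_request = time.monotonic()
    return fetch(url)

# Back-to-back calls are automatically spaced MIN_INTERVAL apart.
page_a = polite_fetch("https://example.com/a")
page_b = polite_fetch("https://example.com/b")
```

Production scrapers usually add backoff on HTTP 429/503 responses as well, but even a fixed delay like this would have avoided the overload described above.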
These recent incidents of unethical web scraping have prompted the ICO to publish this policy brief. The Data Protection Act 1998 protects and controls the personal information that the government and various organisations hold about people. The new regulation, the GDPR (General Data Protection Regulation), which takes effect in 2018, sets out how organisations must protect personal data (Scope, 2016). According to the GDPR, organisations of all sizes should observe these principles:
“DATA IS PUBLICLY ACCESSIBLE” DOES NOT GIVE ANYONE PERMISSION TO USE PERSONAL DATA WITHOUT THE OWNER’S CONSENT.
· Organisations should adopt proper technical and organisational measures to protect personal data.
· Organisations that hold personal information on EU residents should adapt to the new regulation and change their infrastructure and data-processing methods accordingly.
· Every organisation needs to keep records of its data-processing activities and make them available to the supervisory authority on request.
· Risk analysis and obtaining consent before processing data are necessary.
· If an organisation becomes aware of a data breach, it must report it to the supervisory authority within 72 hours.
· Organisations face fines of up to 4% of annual global turnover or €20 million, whichever is greater, for non-compliance with the new regulation.
· Depending on the gravity of a breach, individuals can even take legal action against the organisation.
Web scraping is not a trivial issue, and this is an earnest request to all senior managers not to take it lightly. Data security should be a key concern for every organisation.