What Is Web Scraping, and How Is It Used in Data Science?

For computer scientists, the term Data Science will almost certainly ring a bell, because it is often regarded as one of the specializations within the Computer Science major. Data Science is a field that combines programming, analytics, and statistics. It studies how data is gathered, stored, processed, and manipulated so that it can be analyzed and put to use later. According to IBM, data science "combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data."

As its name and the explanation above suggest, data is the most important ingredient of Data Science, but how is that data actually collected? There are various ways of collecting data, and one of them is Web Scraping. Web Scraping is a form of data scraping that extracts data from websites. Web Scraping software accesses the World Wide Web, either directly over HTTP or through a web browser, to fetch and download a page; the page's content can then be parsed, reformatted, and extracted. Web Scraping can be done manually, but the term usually refers to an automated process carried out by software such as bots, web crawlers, or programs built on libraries like Beautiful Soup.
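To make the fetch-then-parse flow concrete, here is a minimal sketch that uses only the Python standard library. The names `LinkExtractor`, `fetch_html`, and `extract_links`, as well as the `example-scraper/0.1` user agent string, are hypothetical and chosen for illustration; a real scraper would likely need more error handling.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_html(url, user_agent="example-scraper/0.1"):
    """Step 1: download the page (the user agent is an illustrative assumption)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Step 2: parse the downloaded HTML and pull out the data we want."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

In use, the two steps chain together as `extract_links(fetch_html(url))`, where `url` is a page you are permitted to scrape. Keeping the download and the parsing in separate functions also makes the parsing step easy to try on a saved HTML string without touching the network.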

Beautiful Soup is a Python library for parsing structured data that lets its user interact with HTML or XML documents. The current Beautiful Soup 4 releases run on Python 3; Python 2.7 was supported only up to version 4.9.3. It provides idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work. Beautiful Soup was created by Leonard Richardson in 2004 and is supported through Tidelift.
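As a rough illustration of navigating, searching, and modifying a parse tree, the sketch below runs Beautiful Soup on a small inline document. It assumes the `beautifulsoup4` package is installed; the sample HTML, its class names, and the helper functions `parse_items` and `retitle` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, inlined so the example needs no network access.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Daily Prices</h1>
  <ul>
    <li class="item"><span class="name">Apple</span> <span class="price">1.20</span></li>
    <li class="item"><span class="name">Banana</span> <span class="price">0.50</span></li>
  </ul>
</body></html>
"""


def parse_items(html):
    """Search the tree for every list item and return (name, price) pairs."""
    soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser
    items = []
    for li in soup.find_all("li", class_="item"):
        name = li.find("span", class_="name").get_text(strip=True)
        price = float(li.find("span", class_="price").get_text(strip=True))
        items.append((name, price))
    return items


def retitle(html, new_title):
    """Modify the tree by replacing the heading text, then read it back."""
    soup = BeautifulSoup(html, "html.parser")
    soup.find("h1", id="title").string = new_title
    return soup.h1.get_text()
```

Here `find_all` and `find` do the searching, `soup.h1` navigates by tag name, and assigning to `.string` modifies the tree in place.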

While Web Scraping allows Data Scientists to gather data easily, it still sits in a somewhat gray area between legal and illegal, which is why ethical Web Scraping is needed. There are some principles that a web scraper should adhere to. Firstly, give credit to the website owners and don’t claim the scraped content as our own. They put a lot of effort into creating those articles and data, so we need to respect them. Secondly, refrain from scraping a website that doesn’t want to be scraped, which many websites state in their Terms of Use. Lastly, a scraper sends requests to a website repeatedly, which can degrade the website’s performance and slow it down. Too many requests in a short time can resemble a denial-of-service (DoS) attack, so Web Scraping needs to be done responsibly, without disrupting the regular traffic of the website.
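One way to put the last two principles into practice is to consult a site's robots.txt file and pause between requests. The sketch below uses Python's standard `urllib.robotparser` module. Note that robots.txt is a separate mechanism from the Terms of Use mentioned above, so checking it does not replace reading those terms. The sample rules, the `example-scraper` user agent, and the helper names are all assumptions made for illustration.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, user_agent, urls):
    """Keep only the URLs that the site permits this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_delay(rules, user_agent, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, rules, user_agent="example-scraper"):
    """Fetch permitted URLs one at a time, sleeping between requests.

    `fetch` is any callable that takes a URL and returns its content.
    """
    delay = polite_delay(rules, user_agent)
    pages = {}
    for index, url in enumerate(allowed_urls(rules, user_agent, urls)):
        if index > 0:
            time.sleep(delay)  # spread requests out to limit load on the site
        pages[url] = fetch(url)
    return pages
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would load the real robots.txt instead of the inline sample. Passing `fetch` in as a parameter keeps the politeness logic separate from however the pages are actually downloaded.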

 

  • Writer: Michael Christopher