Web Scraping: What is it and what is it good for?

M.Sc. Chris Wojzechowski

In IT security and data science, the topic of so-called web scraping comes up again and again. But what exactly is scraping, and why is it so important to know about it in these fields? In today's article, we take a closer look.

Here we explain in detail what web scraping is all about and why we, as IT security professionals, need to use it regularly. Of course, we also explain how web scraping works technically. And finally, you can find out here whether scraping is even legal.

What exactly is web scraping?

The English word "scraping" literally means scratching or scraping something together, and that is basically what web scraping is. To put it bluntly, it is about scraping information together from websites. Web scraping is thus an automated method of reading, collecting, and storing specific data from the web.

Basically, web scraping is about retrieving specific data in order to perform further analyses on it. As mentioned, this is predominantly relevant in data science and IT security, since there is often more data freely available than one would initially assume. Web scraping is used to collect as much of it as possible, so that it can then be evaluated in more detail.

A practical and simple example of web scraping would be collecting phone numbers. Since these are often found on a website's legal notice page (in Germany, the Impressum), we could use web scraping to collect all phone numbers from a set of websites and then sort them by area code. With web scraping, we would quickly have a long list of phone numbers. But be careful: that is exactly what would be illegal in this case. We say a few more words about legality below.

How does web scraping work?

Technically, web scraping is closely associated with Python, both in general and in everyday usage. Whenever scraping is mentioned, Python usually comes up as well, for the simple reason that the Python programming language is particularly well suited to scraping. That, in turn, is partly because Python has many features that play an important role in web scraping, such as how easy it is to adapt and how well it handles the subsequent text processing.
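To make this concrete, here is a minimal sketch of a scraper in Python using the widely used requests and BeautifulSoup libraries. The URL and the CSS class are placeholders made up for illustration, not a real data source:

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with a page you are actually allowed to scrape
url = "https://example.com/products"

# Fetch the raw HTML; an honest, descriptive User-Agent is good practice
response = requests.get(url, headers={"User-Agent": "my-research-bot/1.0"}, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out all elements with a (hypothetical) CSS class
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.select(".product-title")]

for title in titles:
    print(title)

In barely a dozen lines, the page is downloaded, parsed, and reduced to exactly the data we are interested in, which is a big part of why Python is so popular for this purpose.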

In practice, web scraping always takes place automatically, because it usually involves particularly large amounts of data. Copying a few pieces of content by hand therefore has little to do with scraping. The goal of web scraping is always large data sets that can then be processed further. Basically, specific content is extracted automatically from websites so that it can be used elsewhere.

As mentioned in the example, we could send out a bot that scans websites for a link to the legal notice page and then recognizes and copies the phone number it finds there. In this way, we would collect phone numbers by means of web scraping. In the same way, however, we could gather all sorts of other information, extract it, and store it separately to work with: prices from online stores, search results for certain keywords, and much more.
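A rough sketch of what such a bot could look like in Python follows below. The link text it searches for and the phone number pattern are deliberately simplified assumptions, and remember that actually storing such personal data would conflict with the GDPR, as discussed later:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Deliberately simplified phone number pattern -- real data needs more care
PHONE_RE = re.compile(r"\+?\d[\d\s/()-]{6,}\d")

def find_phone_numbers(start_url):
    """Follow the legal notice ('Impressum') link and extract phone numbers."""
    html = requests.get(start_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Look for a link whose text mentions the legal notice page
    link = soup.find("a", string=re.compile("impressum", re.IGNORECASE))
    if link is None:
        return []

    imprint_url = urljoin(start_url, link["href"])
    imprint_html = requests.get(imprint_url, timeout=10).text
    return PHONE_RE.findall(BeautifulSoup(imprint_html, "html.parser").get_text())

print(find_phone_numbers("https://example.com"))  # placeholder URL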

What is web scraping used for?

Besides the phone number example just mentioned, databases of contacts could also be created. A scraper can also retrieve specific website data that is not accessible via an API but that you want to store or display elsewhere. Basically, it is possible to scrape any content of a website, at least if you manage to grab it in a suitable way.

Scrapers are often used when there are no official APIs that provide the corresponding data, or when those APIs cost a lot of money and someone would like to grab the data for free. Moreover, in the field of data science, where gigantic data sets and corresponding analyses are involved, there are often no official sources at all. The data that is needed must therefore be gathered by scraping.

Unfortunately, however, scraping is also used for content theft. Entire websites are replicated, files are copied, and download links are harvested that should not actually be publicly accessible. Which brings us to the next point: the legality of web scraping.

In fact, there are already various court rulings on this. In general, courts currently consider web scraping legal: the content a scraper extracts is publicly available and may therefore be accessed, even if the access happens automatically.

However, you are always in a problematic gray area when the webmaster of a website tries to hide the content in question. If a directory is actively hidden, this could be interpreted as a kind of protection mechanism, and circumventing it is then, logically, illegal.

If firewalls are bypassed or protection mechanisms are deliberately ignored, web scraping is illegal in any case. This also applies to the phone number and address example in this article: such data is covered by the GDPR, and storing it is generally only permitted with the consent of the persons concerned.

The situation is different with Open Data, for example. The city of Gelsenkirchen provides extensive data sets on its Open Data portal. These are freely available and can be queried and processed without scraping.

The city of Gelsenkirchen makes extensive data sets freely available at https://opendata.gelsenkirchen.de. Scraping the data is not necessary. (Source: gelsenkirchen.de)

However, whether the data is available via download or can be automatically retrieved via API and subsequently processed depends on the platform.

The Open Data platform of the city of Gelsenkirchen also provides information about new datasets. These are also suitable for end users, as some of the data is available for download as PDF or Excel files. (Source: gelsenkirchen.de)
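Where a portal offers direct downloads, such data can often be pulled and inspected with a few lines of Python. The dataset URL below is a made-up placeholder, not a real endpoint of the Gelsenkirchen portal:

import pandas as pd

# Hypothetical dataset URL -- check the portal for the actual download links
CSV_URL = "https://opendata.gelsenkirchen.de/example/dataset.csv"

# pandas fetches the file over HTTP itself and parses it into a table
df = pd.read_csv(CSV_URL)

print(df.head())      # first rows as a quick sanity check
print(df.describe())  # simple summary statistics of the numeric columns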

Can web scraping be prevented?

In principle, yes. Firewalls can effectively prevent certain accesses, or bots such as scrapers, from reaching the website in the first place. It can also be assumed that accesses show up in the log files and are seen and evaluated accordingly by the admin. You are therefore not anonymous, and accesses are always noticed on the server in one way or another.
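On the server side, one very simple (and easily circumvented) variant of such blocking is to reject known scraper user agents before a request is processed. Here is a minimal sketch with Flask, where the blocklist entries are made-up examples:

from flask import Flask, abort, request

app = Flask(__name__)

# Made-up blocklist -- real setups usually filter at the firewall or reverse proxy
BLOCKED_AGENTS = ("python-requests", "scrapy", "my-research-bot")

@app.before_request
def block_scrapers():
    agent = request.headers.get("User-Agent", "").lower()
    if any(bot in agent for bot in BLOCKED_AGENTS):
        abort(403)  # refuse the request before it reaches any route

@app.route("/")
def index():
    return "Hello, human visitor!"

Of course, a scraper can simply send a browser-like User-Agent header, which is why serious protection happens at the firewall, as described above.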

Moreover, anyone who circumvents such security systems is acting illegally. Because, as just mentioned, scraping is legal only when it accesses public content that has not been protected in any way. If the content is protected, or the server blocks scrapers with a firewall, it is no longer legal.
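If you want to stay on the right side of that line, a sensible first step is to check a site's robots.txt before scraping it. It is a convention rather than a technical barrier, but it states clearly what the operator allows. A small sketch using only Python's standard library, again with example.com as a placeholder:

from urllib import robotparser

# Parse the site's robots.txt -- example.com is a placeholder
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether our bot may fetch a given path
if rp.can_fetch("my-research-bot/1.0", "https://example.com/impressum"):
    print("Crawling this page is permitted by robots.txt")
else:
    print("Crawling is disallowed -- the scraper should back off")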

Scraping can therefore very well be prevented, and it will be as soon as it becomes a nuisance. Anyone who prefers their own scraper over an available API will certainly get into trouble quickly, at the latest when the load on the respective provider becomes too great. If you scrape purely for data analysis, however, you are usually on the safe side, not least because the goal is a one-time collection of a large data set, not tapping the same source again and again.

Conclusion on automated data extraction from web pages

Without web scraping, many offerings would not exist at all, and some things cannot easily be solved via an API, especially since not every website offers an API for its content. So web scraping is useful and sensible, depending on the application, of course.

Especially in IT security and data science, web scraping is what makes large-scale tests or studies possible in the first place. Only with as much data as possible can you perform a broad analysis and draw meaningful conclusions. A handful of websites is not enough for that, and collecting the data manually is hardly feasible.

Those who want to prevent web scraping can take steps to do so: bots can be locked out, specific user agents can be blocked, and a firewall can stop scrapers at the door, making it clear that any further access crosses into illegality. We hope our little insight into the topic of web scraping has helped you and provided a bit more clarity.

AWARE7 GmbH, together with the Institute for Internet Security, is one of the leading institutions in the field of web measurements. For example, we have defined criteria that make reproducible and replicable measurements of the Internet possible.

M.Sc. Chris Wojzechowski

My name is Chris Wojzechowski, and I completed my Master's degree in Internet Security in Gelsenkirchen a few years ago. I am one of two managing directors of AWARE7 GmbH, a trained IT Risk Manager and IT-Grundschutz practitioner (TÜV), and I possess the test procedure competence for § 8a BSIG. Our bread-and-butter business is performing penetration tests. We are also committed to promoting a broad understanding of IT security in Europe, which is why we offer the majority of our products free of charge.