REST and web scraping

HTML is not the only output

Web pages are written in HTML, a markup language that gives structure to plain text.

A number of larger online platforms also offer output formats other than HTML. For example, you can extract data from Wikipedia as JSON. This is especially useful if you want to develop your own software that uses data from other websites. In some cases this is free; in other cases you have to pay for the service. After payment you receive an authentication key that you must include in your code.

To retrieve the data, it is often sufficient to build the URL correctly. The technique of requesting information in standardized and/or structured formats is called REST (Representational State Transfer).

Take the test with REST

Watch the video to see how to get started with the exercises.

To use REST effectively, you need to know how the URL of a web page is structured. In many cases you can search the underlying database through the URL.

If you search for "ucll" in Google and press the search button, the following URL will appear in the address bar:

https://www.google.be/search?safe=off&sxsrf=ACYBGNRxaumzC8SvVTsipHrNLzXv9-iP_Q%3A1570048077035&source=hp&ei=TAiVXaXVO5KmwQKJq4-YAg&q=ucll&oq=ucll&gs_l=psy-ab.3..35i39j0i131j0l8.3582.4016..4193...0.0..0.82.245.4......0....1..gws-wiz.......0i131i67.jtGR6yVwP50&ved=0ahUKEwilq6LvtP7kAhUSU1AKHYnVAyMQ4dUDCAY&uact=5

The domain name www.google.be/ is followed by the name of the web page or web service, "search", and a question mark. If you look closely at everything after the question mark, you will see key-value pairs separated by an "&" sign. Within each pair, the key and the value are separated by an "=" sign.

The following key-value pairs are of interest to us:

Key     Value
oq      ucll
q       ucll
...     ...
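The split into key-value pairs can also be done in code. The following is a minimal sketch using Python's standard urllib.parse module. The URL is a shortened version of the Google link above (most of the long tracking parameters are left out for readability), and the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened version of the Google URL shown above (assumption: the long
# tracking parameters are omitted to keep the example readable).
url = "https://www.google.be/search?safe=off&source=hp&q=ucll&oq=ucll&uact=5"

parts = urlsplit(url)           # split the URL into scheme, domain, path, query
params = parse_qs(parts.query)  # turn "key=value&key=value" into a dictionary

print(parts.netloc)  # the domain name
print(parts.path)    # the web service, here "/search"
for key, values in params.items():
    # parse_qs stores a list per key, because a key may occur more than once
    print(key, "=", values[0])
```

Note that parse_qs returns a list of values for every key, since a URL may repeat the same key several times.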


We can now prepare links ourselves:

https://www.google.be/search?q=YOUR QUERY

You will notice that this link works perfectly without all the other parameters from the previous link. In effect, we used the URL itself as a simple REST-style interface to Google Search to generate custom search results.
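A query that contains spaces or special characters has to be encoded before it is placed in a URL. As a small sketch, assuming only the "q" key described above, the standard urlencode function can build such a link. The function name google_search_url and the query "ucll leuven" are made up for this illustration.

```python
from urllib.parse import urlencode

def google_search_url(query: str) -> str:
    """Build a Google search link that only contains the "q" key."""
    # urlencode escapes spaces and special characters in the value
    return "https://www.google.be/search?" + urlencode({"q": query})

# Example query (hypothetical); the space is encoded as "+"
print(google_search_url("ucll leuven"))
```

Pasting the printed link in the address bar of a browser should give the same result as typing the query in the search box.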

REST with other output formats

Google Books

Base URL: https://www.googleapis.com/books/v1/volumes?

Key          Value
q            a search term (string)
maxResults   the number of results you want to get (integer)
startIndex   the position from which you want to start (integer)
Example: https://www.googleapis.com/books/v1/volumes?q=Kris%20Merckx%20in%20geen%20tijd&maxResults=1&startIndex=0
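The same request can be sent from a program. The sketch below uses only the Python standard library and the three keys from the table. The function names books_url and fetch_titles are hypothetical, and fetch_titles assumes that the JSON answer contains an "items" list in which every entry has a "volumeInfo" object with a "title". Check the actual response in your browser before relying on that.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.googleapis.com/books/v1/volumes?"

def books_url(query, max_results=1, start_index=0):
    """Build a Google Books URL from the three keys in the table above."""
    params = {"q": query, "maxResults": max_results, "startIndex": start_index}
    return BASE_URL + urlencode(params)

def fetch_titles(query, max_results=1):
    """Send the request (needs an internet connection) and return the titles."""
    with urlopen(books_url(query, max_results)) as response:
        data = json.load(response)
    # Assumption: each result stores its title under "volumeInfo"
    return [item["volumeInfo"].get("title", "") for item in data.get("items", [])]

# Building the URL does not need a connection
print(books_url("Kris Merckx in geen tijd"))
```

Calling fetch_titles("Kris Merckx in geen tijd") performs the actual request. It is kept in a separate function so that the URL can be built and inspected first.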


Example of a "book service" built with the Google Books API: https://www.drukhier.be/booksprint/apps/biblio/bib.php?search=Kris+Merckx+in+geen+tijd


Wikipedia

Base URL: http://nl.wikipedia.org/w/api.php?

Key        Value
action     query
list       search
srsearch   your search term (string)
format     jsonfm ...
Example: http://nl.wikipedia.org/w/api.php?action=query&list=search&srsearch=Charles%20Darwin&format=jsonfm
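A program would normally ask for format=json instead of jsonfm, because jsonfm wraps the JSON in an HTML page meant for reading in a browser. The following sketch builds the search URL with the keys from the table and reads the titles of the hits. The function names and the User-Agent text are invented for this example, and the code assumes that the hits are found under "query" and then "search" in the answer.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "http://nl.wikipedia.org/w/api.php?"

def wikipedia_search_url(term):
    """Build a search URL with the keys from the table above."""
    params = {"action": "query", "list": "search", "srsearch": term, "format": "json"}
    return API_URL + urlencode(params)

def search_titles(term):
    """Send the request (needs an internet connection) and return page titles."""
    # Hypothetical User-Agent: identifies the script to the server
    request = Request(wikipedia_search_url(term),
                      headers={"User-Agent": "rest-course-example/0.1"})
    with urlopen(request) as response:
        data = json.load(response)
    # Assumption: hits are listed under data["query"]["search"]
    return [hit["title"] for hit in data["query"]["search"]]

print(wikipedia_search_url("Charles Darwin"))
```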

OpenStreetMap

Base URL: https://nominatim.openstreetmap.org/search?

Key          Value
format       html, xml, json, jsonv2, geojson, geocodejson
street       house number and street name
city         municipality
county       county name
state        state
country      country
postalcode   postal code
q            your search term (string)
Example: https://nominatim.openstreetmap.org/search?format=geocodejson&street=77%20Processieweg&city=Hakendover&postalcode=3300
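Looking up an address from code could look like the sketch below. It builds the URL from keyword arguments that match the keys in the table. The function names and the User-Agent text are hypothetical, and the geocode function assumes that format=json returns a list of results that each contain a "lat" and a "lon" value.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://nominatim.openstreetmap.org/search?"

def nominatim_url(**params):
    """Build a search URL from keyword arguments such as street= or city=."""
    return BASE_URL + urlencode(params)

def geocode(**params):
    """Send the request (needs internet); return (lat, lon) of the first hit."""
    params["format"] = "json"
    # Hypothetical User-Agent: identifies the script to the server
    request = Request(nominatim_url(**params),
                      headers={"User-Agent": "rest-course-example/0.1"})
    with urlopen(request) as response:
        results = json.load(response)
    if not results:
        return None  # nothing found for this address
    return float(results[0]["lat"]), float(results[0]["lon"])

print(nominatim_url(format="geocodejson", street="77 Processieweg",
                    city="Hakendover", postalcode="3300"))
```

A call such as geocode(street="77 Processieweg", city="Hakendover", postalcode="3300") would then perform the actual lookup.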

Important to know

1. The order of the keys in the construction of the URL is of no importance.

2. Not all keys are required, but some are. Experimenting with the URLs will show you which ones.
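The first point can be illustrated with a short sketch: two URLs with the same keys in a different order yield the same set of key-value pairs once they are parsed. The helper name query_params and the shortened search term are only illustrative, and the comparison says something about the URLs themselves, not about how a particular server handles them.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Return the key-value pairs of a URL as a dictionary."""
    return parse_qs(urlsplit(url).query)

# Same keys and values, different order
url_a = "https://www.googleapis.com/books/v1/volumes?q=Kris%20Merckx&maxResults=1&startIndex=0"
url_b = "https://www.googleapis.com/books/v1/volumes?startIndex=0&q=Kris%20Merckx&maxResults=1"

# Dictionaries ignore order, so both URLs carry the same parameters
same = query_params(url_a) == query_params(url_b)
print(same)
```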

Exercise

You have already received the assignment as a PDF. Upload your exercises to Toledo.

What is web scraping?



Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. (Source: https://en.wikipedia.org/wiki/Web_scraping)
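To give an idea of what such automated extraction looks like, the sketch below collects all links from a piece of HTML with the html.parser module from the Python standard library. The class name LinkCollector and the sample HTML are invented for this illustration. A real scraper would first download the page and would often use a dedicated library.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that is encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented sample page; a real scraper would download the HTML first
SAMPLE_HTML = """
<html><body>
  <h1>Links</h1>
  <p><a href="https://en.wikipedia.org/wiki/Web_scraping">Web scraping</a></p>
  <p><a href="https://nominatim.openstreetmap.org/">Nominatim</a></p>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```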

VIDEO WILL FOLLOW

Kris Merckx - 2020
