REST and web scraping


HTML is not the only output

Web pages are written in HTML, a language that gives structure to plain text data.

A number of larger online platforms also offer output formats other than HTML. For example, you can retrieve data from Wikipedia as JSON. This is especially useful if you want to develop your own software that uses data from other websites. Some of these services are free; for others you have to pay. After payment you receive an authentication key that you must include in your code.

To retrieve the data, it is often enough to build the URL correctly. The technique for requesting information in standardized and/or structured formats is called REST (Representational State Transfer).

Try out REST

Watch the video to see how to get started with the exercises.

To use REST conveniently, you need to know how the URL of a web page is structured. In many cases you can search the underlying database through the URL.

If you search for "ucll" in Google and press the search button, the following URL appears in the address bar:

https://www.google.be/search?safe=off&sxsrf=ACYBGNRxaumzC8SvVTsipHrNLzXv9-iP_Q%3A1570048077035&source=hp&ei=TAiVXaXVO5KmwQKJq4-YAg&q=ucll&oq=ucll&gs_l=psy-ab.3..35i39j0i131j0l8.3582.4016..4193...0.0..0.82.245.4......0....1..gws-wiz.......0i131i67.jtGR6yVwP50&ved=0ahUKEwilq6LvtP7kAhUSU1AKHYnVAyMQ4dUDCAY&uact=5

The domain name www.google.be/ is followed by the name of the web page or web service, "search", and a question mark. If you look closely at everything after the question mark, you will see key-value pairs separated by an "&" sign.

The following key-value pairs are of interest to us:

Key	Value
oq	ucll
q	ucll
...	...
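The structure described above can be inspected with a few lines of Python. This is a minimal sketch that uses only the standard library; the URL is a shortened version of the Google URL above, with just a few of its parameters kept for readability.

```python
from urllib.parse import urlsplit, parse_qsl

# Shortened version of the Google search URL (assumption: only a few of
# the original parameters are kept so the output stays readable).
url = "https://www.google.be/search?safe=off&source=hp&q=ucll&oq=ucll"

parts = urlsplit(url)
print(parts.netloc)  # the domain name
print(parts.path)    # the web page or web service

# Everything after the question mark, split into (key, value) pairs.
pairs = parse_qsl(parts.query)
for key, value in pairs:
    print(key, value)
```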

We can now prepare links ourselves:

https://www.google.be/search?q=YOUR QUERY

You will notice that this link works perfectly without all the other parameters from the previous link. Cool, we used the Google Search Engine REST API to generate custom search results.
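Preparing such a link can also be left to a program. The sketch below builds the search URL from a query; build_search_url is a hypothetical helper name chosen for this example. The standard library function urlencode takes care of characters that are not allowed in a URL, such as spaces.

```python
from urllib.parse import urlencode

def build_search_url(query):
    """Return a Google search URL for the given query (hypothetical helper)."""
    # urlencode escapes special characters; a space becomes "+".
    return "https://www.google.be/search?" + urlencode({"q": query})

print(build_search_url("ucll leuven"))
```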

REST with other output formats

Google Books

Base URL: https://www.googleapis.com/books/v1/volumes?

Key	Value
q	a search term (string)
maxResults	the number of results you want to get (integer)
startIndex	the position from which you want to start (integer)

Example: https://www.googleapis.com/books/v1/volumes?q=Kris%20Merckx%20in%20geen%20tijd&maxResults=1&startIndex=0

Example of a "book service" built with the Google Books API: https://www.drukhier.be/booksprint/apps/biblio/bib.php?search=Kris+Merckx+in+geen+tijd
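The sketch below shows how a program might build the Google Books URL and read titles from the JSON answer. The function names are hypothetical, and SAMPLE_RESPONSE is made-up data that only imitates the general shape of an answer (a list "items" whose entries have a "volumeInfo" with a "title"). The function fetch_titles needs an internet connection and is therefore only defined here, not called.

```python
import json
from urllib.parse import urlencode, quote
from urllib.request import urlopen

BASE_URL = "https://www.googleapis.com/books/v1/volumes?"

def build_books_url(query, max_results=1, start_index=0):
    """Build a Google Books URL from the three keys in the table above."""
    params = {"q": query, "maxResults": max_results, "startIndex": start_index}
    # quote_via=quote encodes a space as %20, as in the example URL.
    return BASE_URL + urlencode(params, quote_via=quote)

def extract_titles(response):
    """Collect the titles from an already parsed JSON answer."""
    return [item["volumeInfo"]["title"] for item in response.get("items", [])]

def fetch_titles(query):
    """Download and parse the answer (requires network access)."""
    with urlopen(build_books_url(query)) as reply:
        return extract_titles(json.load(reply))

# Made-up data for illustration, not a real answer from the service.
SAMPLE_RESPONSE = {"items": [{"volumeInfo": {"title": "Example title"}}]}

print(build_books_url("Kris Merckx in geen tijd"))
print(extract_titles(SAMPLE_RESPONSE))
```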

Wikipedia

Base URL: http://nl.wikipedia.org/w/api.php?

Key	Value
action	query
list	search
srsearch	your search term (string)
format	jsonfm ...

Example: http://nl.wikipedia.org/w/api.php?action=query&list=search&srsearch=Charles%20Darwin&format=jsonfm
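The format jsonfm produces a page that is pleasant to read in a browser; a program would normally ask for format=json instead. The sketch below assumes that choice. The function names and the User-Agent text are hypothetical, and search_titles needs an internet connection, so it is only defined here and not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://nl.wikipedia.org/w/api.php?"

def build_wikipedia_url(term, output_format="json"):
    """Build a Wikipedia search URL from the keys in the table above."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": output_format,
    }
    return BASE_URL + urlencode(params)

def titles_from_response(data):
    """Read the page titles from an already parsed search answer."""
    return [hit["title"] for hit in data["query"]["search"]]

def search_titles(term):
    """Download and parse the answer (requires network access)."""
    # Assumption: the service expects a descriptive User-Agent header.
    request = Request(
        build_wikipedia_url(term),
        headers={"User-Agent": "rest-course-example/0.1"},
    )
    with urlopen(request) as reply:
        return titles_from_response(json.load(reply))

print(build_wikipedia_url("Charles Darwin"))
```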

OpenStreetMap

Base URL: https://nominatim.openstreetmap.org/search?

Key	Value
format	html, xml, json, jsonv2, geojson, geocodejson
street	house number and street name
city	municipality
county	county name
state	state
country	country
postalcode	postal code
q	your search term (string)

Example: https://nominatim.openstreetmap.org/search?format=geocodejson&street=77%20Processieweg&city=Hakendover&postalcode=3300
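Because most of these keys are optional, a program can simply leave out the ones it has no value for. The sketch below rebuilds the example URL that way; build_nominatim_url is a hypothetical helper, and the keys are passed as keyword arguments named after the table above.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://nominatim.openstreetmap.org/search?"

def build_nominatim_url(output_format="geocodejson", **address):
    """Build a search URL, skipping every key whose value is None."""
    params = {"format": output_format}
    params.update({key: value for key, value in address.items() if value is not None})
    # quote_via=quote encodes a space as %20, as in the example URL.
    return BASE_URL + urlencode(params, quote_via=quote)

url = build_nominatim_url(
    street="77 Processieweg",
    city="Hakendover",
    postalcode="3300",
    country=None,  # unknown, so it is left out of the URL
)
print(url)
```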

Important to know

1. The order of the keys in the URL does not matter.

2. Not all keys are required, but some are. You will notice which ones by experimenting.
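The first point can be illustrated without contacting any server: two URLs with the same keys in a different order contain exactly the same key-value pairs. The sketch below only compares the URLs themselves; query_parameters is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

def query_parameters(url):
    """Return the key-value pairs of a URL as a dictionary."""
    return parse_qs(urlsplit(url).query)

first = "https://www.googleapis.com/books/v1/volumes?q=ucll&maxResults=1&startIndex=0"
second = "https://www.googleapis.com/books/v1/volumes?startIndex=0&q=ucll&maxResults=1"

# Dictionaries compare equal regardless of the order of their keys.
same = query_parameters(first) == query_parameters(second)
print(same)
```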

Exercise

You have already received the assignment as a PDF. Upload your exercises to Toledo.

What is web scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. (Source: https://en.wikipedia.org/wiki/Web_scraping)
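As a small illustration of the idea, the sketch below extracts all links from a piece of HTML with Python's standard html.parser module. SAMPLE_HTML is a made-up page and LinkCollector is a hypothetical class name; a real scraper would first download the page and would usually need more robust handling.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that is encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Made-up page used only for this example.
SAMPLE_HTML = """
<html><body>
<h1>Courses</h1>
<a href="/rest">REST</a>
<a href="/scraping">Web scraping</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```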

VIDEO WILL FOLLOW

Kris Merckx - 2020