REST and web scraping
Web pages are written in HTML, a markup language that gives structure to plain text data.
A number of larger online platforms also offer output formats other than HTML. For example, you can extract data from Wikipedia as JSON. This is especially useful if you want to develop your own software that uses data from other websites. In some cases this is free; in other cases you have to pay for the service. After payment you receive an authentication key (often called an API key) that you must include in your code.
To retrieve the data, it is often sufficient to build the URL correctly. This way of requesting information in standardized and/or structured formats is called REST (Representational State Transfer).
Putting REST to the test
Watch the video to see how to get started with the exercises.
To use REST comfortably, you need to know how the URL of a web page is structured. In many cases you can search the underlying database directly through the URL.
If you search for "ucll" in Google and press the search button, the following URL will appear in the address bar:
The domain name www.google.be/ is followed by the web page or web service named "search" and a question mark. If you look closely at everything after the question mark, you will see key-value pairs separated by an "&" sign.
The following key-value pairs are of interest to us:
Key | Value |
oq | ucll |
q | ucll |
... | ... |
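To see the key-value pairs without reading the address bar by eye, a URL can be taken apart in code. The following is a minimal sketch using Python's standard library. The URL in it is a shortened, made-up stand-in for a real Google search URL (an assumption for illustration), and the function name split_url is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Return the domain, the service path and the list of (key, value) pairs."""
    parts = urlsplit(url)
    return parts.netloc, parts.path, parse_qsl(parts.query)


# Made-up example URL; a real Google URL carries many more parameters.
example_url = "https://www.google.be/search?q=ucll&oq=ucll"

domain, service, pairs = split_url(example_url)
print("domain :", domain)
print("service:", service)
for key, value in pairs:
    print(key, "=", value)
```

The part before the question mark identifies the service, and everything after it is split on "&" and "=" into the pairs shown in the table.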
We can now prepare links ourselves:
https://www.google.be/search?q=YOUR QUERY
You will notice that this link works perfectly without all the other parameters from the previous link. Cool, we used the Google search URL like a REST API to generate custom search results.
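When the query contains spaces or characters such as "&", it has to be encoded before it goes into the URL. As a rough sketch, Python's urlencode takes care of that. The function name google_search_url is an assumption made for this example.

```python
from urllib.parse import urlencode


def google_search_url(query):
    """Build a search link from a free-text query, encoding special characters."""
    return "https://www.google.be/search?" + urlencode({"q": query})


print(google_search_url("ucll leuven"))
print(google_search_url("R&D"))
```

Spaces become "+" and the "&" inside the query becomes "%26", so it is not mistaken for a separator between two keys.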
REST with other output formats
Google Books
Base URL: https://www.googleapis.com/books/v1/volumes?
Key | Value |
q | a search term (string) |
maxResults | the number of results you want to get (integer) |
startIndex | the index of the first result you want to get (integer) |
Example: https://www.googleapis.com/books/v1/volumes?q=Kris%20Merckx%20in%20geen%20tijd&maxResults=1&startIndex=0
Example of a "book service" built with the Google Books API: https://www.drukhier.be/booksprint/apps/biblio/bib.php?search=Kris+Merckx+in+geen+tijd
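The maxResults and startIndex keys together make it possible to walk through a long result list page by page. The sketch below only builds the URLs and does not send any request. The function name page_urls and the search term are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/books/v1/volumes?"


def page_urls(query, page_size, pages):
    """Return one URL per page of results, each starting where the previous ended."""
    urls = []
    for page in range(pages):
        params = {
            "q": query,
            "maxResults": page_size,
            "startIndex": page * page_size,  # 0, page_size, 2 * page_size, ...
        }
        urls.append(BASE_URL + urlencode(params))
    return urls


for url in page_urls("darwin", 10, 3):
    print(url)
```

Each URL can be pasted into a browser to inspect the JSON that comes back for that page.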
Wikipedia
Base URL: http://nl.wikipedia.org/w/api.php?
Key | Value |
action | query |
list | search |
srsearch | your search term (string) |
format | jsonfm ... |
Example: http://nl.wikipedia.org/w/api.php?action=query&list=search&srsearch=Charles Darwin&format=jsonfm
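The value jsonfm returns JSON wrapped in an HTML page that is pleasant to read in a browser; for use in your own software, format=json returns the raw JSON instead. The sketch below shows how the titles could be read from such a response. The sample text is hand-written and heavily shortened (an assumption for illustration, not real output), and the function name search_titles is hypothetical.

```python
import json

# Hand-written, shortened sample in the shape of a search response.
# The titles and snippets are placeholders, not real API output.
sample_response = """
{
  "query": {
    "search": [
      {"title": "Charles Darwin", "snippet": "placeholder text"},
      {"title": "Example page", "snippet": "placeholder text"}
    ]
  }
}
"""


def search_titles(response_text):
    """Extract the page titles from the JSON text of a search response."""
    data = json.loads(response_text)
    return [hit["title"] for hit in data["query"]["search"]]


print(search_titles(sample_response))
```

The same function could be applied to the text downloaded from the example URL once format is set to json.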
OpenStreetMap
Base URL: https://nominatim.openstreetmap.org/search?
Key | Value |
format | html, xml, json, jsonv2, geojson, geocodejson |
street | house number and street name |
city | municipality |
county | county name |
state | state |
country | country |
postalcode | postal code |
q | your search term (string) |
Example: https://nominatim.openstreetmap.org/search?format=geocodejson&street=77%20Processieweg&city=Hakendover&postalcode=3300
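When calling this service from a program rather than a browser, the Nominatim usage policy asks that the application identifies itself, for example with a User-Agent header. The sketch below prepares such a request without sending it. The function name nominatim_request and the User-Agent value are assumptions made for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://nominatim.openstreetmap.org/search?"


def nominatim_request(**address):
    """Prepare (but do not send) a structured address lookup."""
    params = {"format": "geocodejson"}
    # Leave out keys that were given an empty value.
    params.update({key: value for key, value in address.items() if value})
    return Request(
        BASE_URL + urlencode(params),
        headers={"User-Agent": "rest-course-example/0.1"},  # hypothetical name
    )


req = nominatim_request(
    street="77 Processieweg", city="Hakendover", postalcode="3300", county=""
)
print(req.full_url)
# To actually send it: urllib.request.urlopen(req).read()
```

Only the keys that have a value end up in the URL, which matches the example above.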
Important to know
1. The order of the keys in the construction of the URL is of no importance.
2. Not all keys are required, but some are. You will find out which ones by experimenting.
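The first point can be illustrated in code: two URLs whose keys appear in a different order carry the same set of key-value pairs. This is a small sketch with a hypothetical helper, same_query. It only compares the URLs themselves; how a particular service treats them is up to that service.

```python
from urllib.parse import urlsplit, parse_qs


def same_query(url_a, url_b):
    """True if both URLs point to the same service with the same key-value pairs."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    same_service = (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
    # parse_qs returns a dictionary, so the order of the keys plays no role.
    return same_service and parse_qs(a.query) == parse_qs(b.query)


first = "https://www.googleapis.com/books/v1/volumes?q=darwin&maxResults=1&startIndex=0"
second = "https://www.googleapis.com/books/v1/volumes?startIndex=0&q=darwin&maxResults=1"
print(same_query(first, second))
```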
Exercise
You already received the assignment as a PDF. Upload your exercises to Toledo.
What is web scraping?
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. (Source: https://en.wikipedia.org/wiki/Web_scraping)
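To make the idea of automated extraction concrete, here is a minimal sketch that collects all links from a piece of HTML using only Python's standard library. The HTML is a made-up sample rather than a real page, and the class name LinkCollector is an assumption for this example. A real scraper would first download the page and should respect the terms of use of the website.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up sample page used instead of a download.
sample_html = """
<html><body>
  <h1>Courses</h1>
  <a href="https://example.org/rest">REST</a>
  <p>More about <a href="/scraping">scraping</a>.</p>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

The collected links could then be stored in a spreadsheet or database for later analysis, as the definition describes.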
Kris Merckx - 2020