Introduction
Web scraping is used to extract data from websites. This is incredible easy with R. In this tutorial you will learn how to extract a table from a complex site. In the Wikipedia article (Wind power by country)[https://en.wikipedia.org/wiki/Wind_power_by_country] you find a table with installed wind power capacity, embedded in a complex site with text, graphs and other tables. How can you extract information of this specific table?
- Objective: Learn how to extract components of websites
- Requirements: Google Chrome, R Basics
Preparation
First, you need to install and load the package rvest.
# install package once
#install.packages("rvest")
# load library rvest
library("rvest")
Now you can define the url.
url <- "https://en.wikipedia.org/wiki/Wind_power_by_country"
To get the information of the specific table you need to get xpath information. xpath can be used to navigate in XML documents. We will use it to identify the table of interest. If you have installed Google Chrome it is easy. (If you are using a different browser, there are many addins for this purpose.)
If you in the browser and on the website, just click CTRL + SHIFT + I or right-click and Inspect. On the right side of your screen a new windows appears called DevTools. If you hover over the items the objects on the website are highlighted. Now you browse until you find the table highlighted.
Now you only have to copy the xpath string by right-clicking –> Copy –> Copy XPath. This is used within R for accessing the table.
wind <- read_html(url) %>%
html_node(xpath = '//*[@id="mw-content-text"]/div/table[3]') %>%
html_table(fill = T)
In a first step the complete url is downloaded. In a second step only the relevant component, identified by xpath, is filtered. Finally, this needs to be converted from HTML to a dataframe. For this html_table() is applied.
Now, a dataframe wind is available for further processing.
Finally, let’s take a look at the table and compare it to the source. It worked perfectly.
kable(head(wind[, 2:ncol(wind)]))
Nation | 2006 | 2007 | 2008[19] | 2009[20] | 2010[21] | 2011[22] | 2012[23] | 2013[24] | 2014[25] | 2015[4] | 2016[26] | 2017[27] |
---|---|---|---|---|---|---|---|---|---|---|---|---|
China | 2599 | 5912 | 12210 | 25104 | 44733 | 62733 | 75564 | 91412 | 114763 | 145104 | 168690 | 188232 |
European Union | 48122 | 56614 | 65255 | 74919 | 84278 | 93957 | 106454 | 117384 | 128752 | 141579 | 153730 | 169319 |
United States | 11603 | 16819 | 25170 | 35159 | 40200 | 46919 | 60007 | 61110 | 65879 | 74472 | 82183 | 89077 |
Germany | 20622 | 22247 | 23903 | 25777 | 27214 | 29060 | 31332 | 34250 | 39165 | 44947 | 50019 | 56132 |
India | 6270 | 7850 | 9587 | 10925 | 13064 | 16084 | 18421 | 20150 | 22465 | 27151 | 28665 | 32848 |
Spain | 11630 | 15145 | 16740 | 19149 | 20676 | 21674 | 22796 | 22959 | 22987 | 23025 | 23075 | 23170 |