Web scraping

Introduction

Web scraping is used to extract data from websites. This is incredible easy with R. In this tutorial you will learn how to extract a table from a complex site. In the Wikipedia article (Wind power by country)[https://en.wikipedia.org/wiki/Wind_power_by_country] you find a table with installed wind power capacity, embedded in a complex site with text, graphs and other tables. How can you extract information of this specific table?

  • Objective: Learn how to extract components of websites
  • Requirements: Google Chrome, R Basics

Preparation

First, you need to install and load the package rvest.

# install package once
#install.packages("rvest")

# load library rvest
library("rvest")

Now you can define the url.

url <- "https://en.wikipedia.org/wiki/Wind_power_by_country"

To get the information of the specific table you need to get xpath information. xpath can be used to navigate in XML documents. We will use it to identify the table of interest. If you have installed Google Chrome it is easy. (If you are using a different browser, there are many addins for this purpose.)

If you in the browser and on the website, just click CTRL + SHIFT + I or right-click and Inspect. On the right side of your screen a new windows appears called DevTools. If you hover over the items the objects on the website are highlighted. Now you browse until you find the table highlighted.

Now you only have to copy the xpath string by right-clicking –> Copy –> Copy XPath. This is used within R for accessing the table.

wind <- read_html(url) %>% 
    html_node(xpath =  '//*[@id="mw-content-text"]/div/table[3]') %>% 
    html_table(fill = T)

In a first step the complete url is downloaded. In a second step only the relevant component, identified by xpath, is filtered. Finally, this needs to be converted from HTML to a dataframe. For this html_table() is applied.

Now, a dataframe wind is available for further processing.

Finally, let’s take a look at the table and compare it to the source. It worked perfectly.

kable(head(wind[, 2:ncol(wind)]))
Nation 2006 2007 2008[19] 2009[20] 2010[21] 2011[22] 2012[23] 2013[24] 2014[25] 2015[4] 2016[26] 2017[27]
China 2599 5912 12210 25104 44733 62733 75564 91412 114763 145104 168690 188232
European Union 48122 56614 65255 74919 84278 93957 106454 117384 128752 141579 153730 169319
United States 11603 16819 25170 35159 40200 46919 60007 61110 65879 74472 82183 89077
Germany 20622 22247 23903 25777 27214 29060 31332 34250 39165 44947 50019 56132
India 6270 7850 9587 10925 13064 16084 18421 20150 22465 27151 28665 32848
Spain 11630 15145 16740 19149 20676 21674 22796 22959 22987 23025 23075 23170

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close