We will learn to work with strings. For this we will analyse one of my favorite books: George Orwell’s 1984.
- Objectives: Learn string handling, e.g. functions grep(), gsub(), nchar(), strsplit(), and many more
- Requirements: None
Import
Luckily this book is available for download. The link is stored in the variable "url", and the text is downloaded with readLines() and saved in "text_1984".
url <- "http://gutenberg.net.au/ebooks01/0100021.txt"
text_1984 <- readLines(url)
Filtering
The book has some overhead: introductory text at the beginning and an appendix at the end. We want to analyse the pure book, so we filter the text to its core.
text_1984_filt <- text_1984[47:length(text_1984)]  # drop the introductory header
text_1984_filt <- text_1984_filt[1:9865]           # drop the appendix
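The cut points 47 and 9865 were found by inspecting the raw text. As a sketch, the boundaries could also be located with grep(), assuming the novel opens with a line "PART ONE" and closes with a line "THE END" before the appendix (both patterns are assumptions about the file, not taken from the article):
# locate the boundaries by pattern instead of hard-coded indices (a sketch)
start <- grep("^PART ONE", text_1984)[1]
end   <- grep("^THE END", text_1984)[1]
text_1984_filt <- text_1984[start:end]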
What structure does this object have?
str(text_1984_filt)
## chr [1:9865] "PART ONE" "" "" "" "Chapter 1" "" "" "" ...
It is a character vector with 9865 elements. There is just one problem: some elements contain several words, and some don't contain a single word. Our aim is a vector with exactly one word per element.
Concatenation with paste()
First, we collapse this vector into one single string. This can be done with the paste() function; the separator between the words is a space. In a second step we convert all letters to lowercase with tolower().
# collapse to a single string
text_1984_one_single_string <- paste(text_1984_filt, collapse = " ")
text_1984_one_single_string <- tolower(text_1984_one_single_string)
Separate a String with strsplit()
Now we create the character vector of single words. We can use strsplit() and split at each space " ". The result is a list with one element, which we access with "[[1]]".
# separate each word
text_1984_separate_words <- strsplit(x = text_1984_one_single_string, split = " ")[[1]]
head(text_1984_separate_words, n=10)
## [1] "part" "one" "" "" "" "chapter" "1" ## [8] "" "" ""
This looks as desired.
Finding Patterns with grep()
Are there numbers in the text? We will find out with grep(), which searches for pattern matches within a character vector.
But how can we define all numbers? The naive way is to run grep() once for each digit "0", "1", "2", … But that takes quite some effort and contradicts the DRY principle (don't repeat yourself).
There is a better way: "[0-9]" matches any digit from 0 to 9. This is a regular expression. Regular expressions are extremely powerful; some links are given at the end of this article.
head(grep("[0-9]", text_1984_separate_words, value = T))
## [1] "1" "300" "4th," "1984." "1984." "1944"
We can use this regular expression to remove the numbers; hyphens can be handled the same way with gsub().
# delete numbers
text_1984_separate_words <- gsub("[0-9]", "", text_1984_separate_words)
# replace hyphens with spaces
text_1984_separate_words <- gsub("-", " ", text_1984_separate_words)
head(text_1984_separate_words)
## [1] "part" "one" "" "" "" "chapter"
Detect Empty Words with nchar()
There are still empty elements, which we will delete in the next step. Empty elements have zero characters, which we can detect with nchar(), so we keep only elements with nchar() > 0.
# delete empty words
text_1984_separate_words <- text_1984_separate_words[nchar(text_1984_separate_words) > 0]
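As a quick sanity check (my addition), we can count how many empty elements remain:
# sanity check: should return 0
sum(nchar(text_1984_separate_words) == 0)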
Character Occurrences
The main characters are "Winston", "Julia", "O'Brien" and, in a way, "Big Brother". With table() the number of occurrences is shown. We concentrate on the main characters and filter for them with "[ ]".
table(text_1984_separate_words)[c("winston", "julia", "o\'brien", "brother")]
## text_1984_separate_words
## winston   julia o'brien brother
##     315      44     120      40
Not surprisingly, "Winston" as the main character has the most appearances. This is not sorted, though. We can order the table with sort(). The default is ascending order, but with the parameter "decreasing = T" it is changed to descending.
sort(table(text_1984_separate_words)[c("winston", "julia", "o\'brien", "brother")], decreasing = T)
## text_1984_separate_words
## winston o'brien   julia brother
##     315     120      44      40
Finding the Shortest Word
I am curious to find out what the shortest word is. We already know the length of a word can be found with nchar(). Now we need to find the position of the minimum, so we use which.min().
pos_min <- which.min(nchar(text_1984_separate_words))
text_1984_separate_words[pos_min]
## [1] "a"
Surprise, surprise. The shortest word is “a”. The opposite is which.max(). Find out for yourself what the longest word is.
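If you want to verify your answer, the same pattern works with which.max():
# position of the longest word
pos_max <- which.max(nchar(text_1984_separate_words))
text_1984_separate_words[pos_max]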
Word Lengths
What is the distribution of word lengths? Let's plot a histogram with hist().
hist(nchar(text_1984_separate_words), breaks = seq(1, 30, 1))
Letter Frequencies
How often does each letter appear in the text? To find out, we split our single text string at every position with split = "". With table() we can count the occurrences of each letter.
single_chars <- tolower(strsplit(text_1984_one_single_string, "")[[1]])
char_freq <- table(single_chars)[letters]
The relative frequencies are shown in the graph below.
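The plotting code is not shown here; a minimal base-R sketch that produces such a bar chart:
# relative letter frequencies as a bar chart (a sketch)
char_freq_rel <- char_freq / sum(char_freq)
barplot(char_freq_rel, las = 2, ylab = "relative frequency")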
What is this good for? Well, each language has its unique distribution of letters. Without being able to speak English at all, we can tell that this is an English text: just compare this distribution with the Wikipedia article on letter frequencies (link at the end of the article).
If a monoalphabetic encryption is used, this distribution is the key to decrypting the message and recovering the plain text. If you are interested in a simple encryption system, read this article.
Wordclouds
Finally, we will use a nice visualisation technique: wordclouds. A wordcloud takes all words and presents the most common ones; size represents the number of occurrences.
suppressPackageStartupMessages(library(wordcloud))
wordcloud(words = text_1984_separate_words[1:2000])
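wordcloud() also accepts precomputed frequencies instead of raw words; a sketch (the parameters max.words and random.order are my choices, not from the original):
# alternative: pass explicit word frequencies
word_freq <- sort(table(text_1984_separate_words), decreasing = TRUE)
wordcloud(words = names(word_freq), freq = as.numeric(word_freq),
          max.words = 100, random.order = FALSE)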
More Information
- 1984 Book Text http://gutenberg.net.au/ebooks01/0100021.txt
- Regular expressions http://www.regular-expressions.info/rlanguage.html
- More Regular expression https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
- Letter Frequency https://en.wikipedia.org/wiki/Letter_frequency