If you deal with datasets that contain date and time, these are usually imported as factors, which is not helpful for further processing. You need to apply some special measures to transform it to an R date object or to extract information like days, or hours. In this tuturial you will learn how this works.

Objectives: learning to create R-time objects (POSIX) from characters or factors, extract components from POSIX timestamp
Requirements: R Basics; ggplot and plyr might be helpful, but are not required

Data Import and Understanding

We use “CalIt2 Building People Counts Data Set” from “UCI Machine Learning Repository”. It includes observations on people flowing in or out of a building, measured over 15 weeks, half hour count aggregates.

For analysis we need dplyr package. ggplot2 is used for visualisation. Please make sure you have installed it before trying to load it.

suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))

“url” gets the http address of data. Data is downloaded with read.csv(). Column names are not included in the data, so we add this manually. Information on inflow or outflow is coded with numerics (7 for outflow, 9 for inflow). We create an additional column with corresponding information as string (“out” or “in”). Original “flow” column is deleted afterwards. Finally, we take a look at the data with tbl_df() from dplyr package.

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/CalIt2.data"
building <- read.csv(url, header = F)
colnames(building) <- c("flow", "date", "time", "count")
building$flow_str <- NA
building$flow_str[building$flow == 7] <- "out"
building$flow_str[building$flow == 9] <- "in"
building$flow <- as.factor(building$flow_str)
tbl_df(building)

## # A tibble: 10,080 x 5
##    flow  date     time     count flow_str
##    <fct> <fct>    <fct>    <int> <chr>   
##  1 out   07/24/05 00:00:00     0 out     
##  2 in    07/24/05 00:00:00     0 in      
##  3 out   07/24/05 00:30:00     1 out     
##  4 in    07/24/05 00:30:00     0 in      
##  5 out   07/24/05 01:00:00     0 out     
##  6 in    07/24/05 01:00:00     0 in      
##  7 out   07/24/05 01:30:00     0 out     
##  8 in    07/24/05 01:30:00     0 in      
##  9 out   07/24/05 02:00:00     0 out     
## 10 in    07/24/05 02:00:00     0 in      
## # ... with 10,070 more rows

There is a column “date” and a column “time”. Both have the type factor. You can’t use factor to plot a date or time information. You first need to convert it to a “POSIX” object.

Time and Date Conversion

Information on date and time are stored in two separate columns, but it makes more sense to have it as a single timestamp, because it describes one unique timestamp. Thus, both columns are concatenated with paste().

building$timestamp_chr <- paste(building$date, building$time)
str(building$timestamp_chr)

##  chr [1:10080] "07/24/05 00:00:00" "07/24/05 00:00:00" ...

The result is of type character, not helpful either, we will change this in the next step. But now we know which component of timestamp is stored at which position and what the separator is. R cannot know by itself which component is a year, and which one a month. This knowledge we need to explain to R in the next step.

We found out that our timestamp has the following components: first month, then day, year, hour, minute, and second. With as.POSIXct() we convert the characters to type “POSIXct”. There is a special syntax you have to use.

%m for month
%d for day
%y for a year with two digits
%Y for a year with four digits
%H for hour
%M for minute
%S for second

To create variable “timestamp” as.POSIXct() function is called with “timestamp_chr” information and a format string. Format string defines order of components and separators in between. Here, we use “%m/%d/%y %H:%M:%S” for matching. Please make sure that each component is at the right place, as well as each separator, otherwise conversion will fail.

building$timestamp <- as.POSIXct(building$timestamp_chr, format = "%m/%d/%y %H:%M:%S")
str(building$timestamp)

##  POSIXct[1:10080], format: "2005-07-24 00:00:00" "2005-07-24 00:00:00" ...

It worked. Column is of type “POSIXct” and timestamp information was captured correctly.

Let’s take a look at in and outflow. To see some details, only the first 500 datasets are shown. Details of ggplot() are not described here. If you are not familiar with ggplot() please take a look at ggplot tutorial.

g <- ggplot(building[1:500, ], aes(x = timestamp, y = count, color = flow))
g <- g + geom_line()
g

Extract Timestamp Information

We want to know how many people went in and out of the building in each hour of the day. For this, we need to extract “hour” from timestamp. We need to convert the timestamp (currently POSIXct) to POSIXlt format. With unclass() components are extracted and stored in a named list. From this, “hour” is the information we need.

building$hour <- unclass(as.POSIXlt(building$timestamp))$hour

Other information in unclassed object are

sec for seconds
min for minutes
mday for monthday
mon for month (0 refers to January, 11 to December; so typically you need to add 1)
year for year
yday for day of the year
isdst for is daylight saving time (summer time)
zone for timezone
gmtoff for offset to GMT

We are done with time and date handling at this point. If you are interested in diurnal flow, please read on.

What is diurnal flow pattern? We can analyse it with ddply() from plyr package and visualise it with ggplot.

If you are not familiar with ddply(), here is a short explanation. Our data “building” is used as input. We create a summary for “hour” and “flow”. For each combination of “hour” and “flow” data is splitted and a summary of following functions is created for the subgroup. We create a summary for “count”, which is the sum of all counts in this group. The result is visualised with ggplot().

hour_flow <- building %>% 
    ddply(.(hour, flow), summarise,
          count = sum(count))

g <- ggplot(hour_flow, aes(x = hour, y = count, color = flow))
g <- g + geom_line()
g <- g + scale_x_continuous(breaks = seq(0, 23, 1))
g

Most people start working between 9AM and 10AM, and head home between 5 and 7PM. It is likely there is a canteen in the building since there is a spike of inflow around 1PM.

More Information

CalIt2 Building People Counts Data Set http://archive.ics.uci.edu/ml/datasets/CalIt2+Building+People+Counts