Loading and Piping Data

Can load data into R using read.csv() to get data from a comma separated variable (CSV) file.

DataFrameName <- read.csv(MyDataFile.csv)

Tidyverse package provides a lot of useful functions for manipulating data.

  • head(DataFrameName) – Shows first 6 rows of the data
  • tail(DataFrameName) – Shows last 6 rows of data
  • view(DataFrameName) – Shows all rows in tabular format

Can address elements of a dataframe using square brackets or the dollar sign.

  • DataFrameName[1 , 3] – Show the value in the third column of the first row
  • DataFrameName[ , 3] – Show the value in the third column for all rows
  • DataFrameName$ColumnName – Show the value in the column ColumnName for all rows

R supports UNIX style piping using %>% to represent the pipe. Below code installs the tidyverse package (line is commented out as only needs to be run once per installation of R then to will be available to all scripts), loads tidyverse into the session using require() and then loads the data from a CSV file into a dataframe called BusData using read.csv(). It then pipes the data through the select() function to just get 4 columns:

  • Financial year – The financial year being reported on (April to March in this case)
  • Month – The month within that financial year being reported on
  • Total.Bus.Patronage.per.month..Concessionary..Non.Concessionary – The total number of bus journeys taken in that month (both fare paying and free concessionary)
  • Free.Concessionary.per.month – The number of journeys where a free concessionary bus pass was used

The reduced dataset is then filtered to only return those rows for the Financial Year 2008 (April 2008 to March 2009).

# install.packages("tidyverse")
BusData <- read.csv("rawBusPatronage.csv")
BusData %>%
select(Financial.year, Month, Total.Bus.Patronage.per.month..Concessionary...Non.Concessionary., Free.Concessionary.per.month) %>%
filter(Financial.year == 2008)

This is the equivalent of the SQL statement (some column names changed to be legal in common RDBMSes):

select FinancialYear
, Month
, TotalMonthlyBusJournies
, TotalMonthlyConcessionaryJournies
from BusData
where Financialyear=2008;

The data set used here is based on an open data set downloaded from Birmingham Data Factory and then manipulated in Excel.

The arrange() function could be used to sort the output.

Published by stephenboothuk

A former Oracle DBA, then Technical Business Analyst and now I'm not sure what I am. If you want to find out more about me, my LinkedIn profile can be found at: http://www.linkedin.com/in/stephenboothuk

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: