Stephen is learning data analysis

Loading and Piping Data

Can load data into R using read.csv() to get data from a comma separated variable (CSV) file.

DataFrameName <- read.csv(MyDataFile.csv)

Tidyverse package provides a lot of useful functions for manipulating data.

head(DataFrameName) – Shows first 6 rows of the data
tail(DataFrameName) – Shows last 6 rows of data
view(DataFrameName) – Shows all rows in tabular format

Can address elements of a dataframe using square brackets or the dollar sign.

DataFrameName[1 , 3] – Show the value in the third column of the first row
DataFrameName[ , 3] – Show the value in the third column for all rows
DataFrameName$ColumnName – Show the value in the column ColumnName for all rows

R supports UNIX style piping using %>% to represent the pipe. Below code installs the tidyverse package (line is commented out as only needs to be run once per installation of R then to will be available to all scripts), loads tidyverse into the session using require() and then loads the data from a CSV file into a dataframe called BusData using read.csv(). It then pipes the data through the select() function to just get 4 columns:

Financial year – The financial year being reported on (April to March in this case)
Month – The month within that financial year being reported on
Total.Bus.Patronage.per.month..Concessionary..Non.Concessionary – The total number of bus journeys taken in that month (both fare paying and free concessionary)
Free.Concessionary.per.month – The number of journeys where a free concessionary bus pass was used

The reduced dataset is then filtered to only return those rows for the Financial Year 2008 (April 2008 to March 2009).

# install.packages("tidyverse")
require("tidyverse")
BusData <- read.csv("rawBusPatronage.csv")
BusData %>%
select(Financial.year, Month, Total.Bus.Patronage.per.month..Concessionary...Non.Concessionary., Free.Concessionary.per.month) %>%
filter(Financial.year == 2008)

This is the equivalent of the SQL statement (some column names changed to be legal in common RDBMSes):

select FinancialYear
, Month
, TotalMonthlyBusJournies
, TotalMonthlyConcessionaryJournies
from BusData
where Financialyear=2008;

The data set used here is based on an open data set downloaded from Birmingham Data Factory and then manipulated in Excel.

The arrange() function could be used to sort the output.

Date Formats – Lessons Not Learned

An article from BATIMES popped up in my inbox this morning, the author raises the issue of varying date formats causing problems of misinterpretation. Different countries, different companies within the same country, different departments in the same company and different systems within the same department use differing formats. Even different reports from the same system use different formats (PlanView, I’m looking at you). Worst case I’m aware of is there’s a report I produce daily in a shared Excel spreadsheet where most people who use it want the DD/MM/YYYY format but one person insists on changing the format to M/D/YY whenever he looks at a tab.

The author of the article points out that we’re slipping back into the 2 digit year issue that cause problems with Y2K and that combined with different formats leads to her wonder if 11/05/18 is 11th May 2018 (DD/MM/YY), 5th November 2018 (MM/DD/YY) or 18th May 2011 (YY/MM/DD). Without a standard format, that is universally used, we can not be sure, especially going forward where the data may be separated from any accompanying documentation.

Looking at the datasets I have to deal with often the more fundemental problem is ensuring that users put only the data they are supposed to in a field. ‘N/A’ is not a valid date format in any country, nor is ‘Will provide next week’.

Intro to R Video on YouTube

Data Cleansing m/d/yyyy Dates

A common issue I’ve found with data cleansing when pulling into Excel is the date format wil often default to US format. If it’s a 2 digit month and day (mm/dd/yyyy) then that’s not much of a problem but where it can be one or two digit (m/d/yyyy so 1/1/2020 and 10/10/2020 are both possiblilities) then it can be more tricky. I found this page which describes a way to fix the issue. It seems like a kludge but when I tried it it did work.

First bash at R

The book I’m starting on is “Statistics for Linguists – An Introduction Using R” by Bodo Winter. This selection was made purely because I heard him speak at BirminghamR meetup and he mentioned that the book had just come out and was aimed at undergraduates. I’ve also got the O’Reilly “R for Data Science” book because it was recommended by someone at work. I’ve tried learning R before from a book but struggled because the author went straight into a complex scenario about feeding the Chinese army so the R got lost in the scenario. I was looking for the R equivalent of ‘Hello, world!’, not writing a full UNIX Kernel from scratch.

So after plodding through for about an hour yesterday I’ve discovered that if you type 2 + 2 then R responds [1] 4 and if you type sqrt(4) it unsurprisingly responds [1] 2. The number in the square brackets means that this is the first element of a vector, all variables in R are vectors.

Vector seems to mean something slightly different here to what I learned in school. In school we learned that vectors have scale and direction, as opposed to a scalar which has only scale; so 100 miles is a scalar but 100 miles north is a vector. In R vector seems to be what in programming would be called an array. Also, rather worryingly, vectors seem not to be typed, that is you can’t define what type of data a vector should store so you could put a 1 in the vector x then put the letter ‘a’ in the same x. Could be an issue if later you want to do sqrt(PI*X^2).

I also discovered in that hour that abs(x) will return the absolute value of x, that is the unsigned value so if x==2 or x==-2 then abs(x) will be 2.

If you want to see what vectors you have then you need to use ls(), which kind of makes sense to me as ls is the UNIX shell command to list the files in a directory. To see what is stored in each vector just type the vector name at the prompt.

Assigning a value to a vector is done using <- which reminds me somewhat of C++ where you would use << to pass variables into a string for output. So to put the number 4 into x (or more properly the first element of the verctor x) you use x <- 4, or the plain single equals sign also works but it’s not standard and most scripts written by others will use <- so you need to get use to it and unless you want to get confused, and confuse others, use x <- 4 not x = 4.

To put more than one value into a vector in a single command you can use the c() function. So x <- c(1, 3, 5, 7, 9) puts the numbers 1, 3, 5, 7 and 9 into x as the 1st, 2nd, 3rd, 4th and 5th elements. Unlike most programming languages, which start arrays at element 0, R starts it’s vectors at element 1, there is no zero. If you want to put a range of numbers (e.g. the numbers 1 to 10) into a vector you use a colon between them such as x <- 1:10, you don’t need the c() in this case.

The book also introduced a number of functions that work with vectors:

sum(x) – Adds up all of the elements of x

min(x) – returns the smallest value of the elements of x

max(x) – returns the largest value of the elements of x

range(x) – returns the smallest and largest elements of x

diff(range(x)) – returns the difference between the smallest and largest elements of x

mean(x) – returns the mean (average) of the values of the elements of x

median(x) – returns the median (another type of average) of the values of the elements of x

var(x) – returns the variance of the values of the elements of x

sd(x) – returns the standard deviation of the values of the elements of x, if memory serves that means that sd(x) == sqrt(var(x))

length(x) – returns the number of elements in x

If you need to address a specific element of a vector you can use the square bracket notation. So, x[1] is the first element and x[47] is the 47th element. the numbers in the square brackets are referred to as the index of the element, so the first element has an index of 1 and the 47th and index of 47. If you want to return a range of elements then you can use the colon notation again, with your square brackets. So, to get the first 4 elements of x use x[1:4]. A negative number in the square brackets returns every element except that one. So, x[-2] returns every element except the second element.

If you carry out a mathematical operation on a vector with more than one element then that will be carried out on each element individually so x *5 will return a vector containing the same number of elements as x but with each value in x multiplied by five. Note that x itself is not changed so if you want to use the result you will need to assign it to another vector, e.g. z <- x^2 will square each value in x and assign it to a corresponding element in z. So if x holds the radii of some circles and you want to calculate the areas of those circles then z <- (pi*x^2) will populate z with the areas of the circles whose radii are stored in x.

R also includes the usual crop of comparison operators:

x==y – returns TRUE or FALSE for if each element of x is equal to the corresponding element of y

x > y – returns TRUE or FALSE for if each element of x is greater than the corresponding element of y

x < y – returns TRUE or FALSE for if each element of x is less than the corresponding element of y

x >= y – returns TRUE or FALSE for if each element of x is greater than or equal to the corresponding element of y

x <= y – returns TRUE or FALSE for if each element of x is less than or equal to the corresponding element of y

x != y – returns TRUE or FALSE for if each element of x is not equal to the corresponding element of y

You can use these to filter the values in a vector and assign the indexes of the values for which the conmditional is true to another vector. So, morethanthree <- x > 3 will populate the vector morethenthree with the indices for those elements of x that are more than 3.

I can see that being useful where you want to work with only a subset of data that meets specific criteria. For example if you had surveyed people as to how many dogs they owned (which could be zero) and put the results in a vector Dogs then you could answer the question “Given someone owns at least one dog, what is the average (mean) number of dogs owned?” with

HasDogs <- Dogs > 0

mean(Dogs[HasDogs])

You could also work out the percentage of people who own at least 1 dog with (length(Dogs[HasDogs])/length(Dogs))*100

The oldest apprentice in town (probably)

My employer, keen to get some value out of the apprenticeship levy, has decided to put a number of us on appreticeship programmes (I suppose it’s cheaper than regular training). In my case, a few months short of turning 50, that means a Level 4 Data Analysis diploma because I know a bit about Excel.

I had my induction last week and am still waiting to see if the training materials will be sent through so I’ve spent too much of my own hard earned cash on a second hand laptop and a book on learning R. After R it will mostly likley be Python and probably PowerBI.