The book I’m starting on is “Statistics for Linguists – An Introduction Using R” by Bodo Winter. This selection was made purely because I heard him speak at a BirminghamR meetup, where he mentioned that the book had just come out and was aimed at undergraduates. I’ve also got the O’Reilly “R for Data Science” book because it was recommended by someone at work. I’ve tried learning R before from a book but struggled because the author went straight into a complex scenario about feeding the Chinese army, so the R got lost in the scenario. I was looking for the R equivalent of ‘Hello, world!’, not writing a full UNIX kernel from scratch.
So after plodding through for about an hour yesterday I’ve discovered that if you type 2 + 2 then R responds [1] 4 and if you type sqrt(4) it unsurprisingly responds [1] 2. The number in the square brackets is the index of the first value shown on that line of output, here the first element of a vector; even a single number in R is a vector of length one.
Vector seems to mean something slightly different here to what I learned in school. In school we learned that vectors have magnitude and direction, as opposed to a scalar which has only magnitude; so 100 miles is a scalar but 100 miles north is a vector. In R a vector seems to be what in programming would be called an array. Also, rather worryingly, you don’t declare what type of data a vector should store, so you could put a 1 in the vector x and later put the letter ‘a’ in the same x. R won’t complain; every element of a vector has to be the same type, so it quietly converts the numbers to text. That could be an issue if later you want to do sqrt(pi*x^2), which fails on text.
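A minimal sketch of what happens when a number and a letter end up in the same vector. The values and the name msg are made up for illustration; tryCatch is only there so the error message is captured rather than stopping the script.

```r
# Start with a purely numeric vector.
x <- c(1, 2, 3)
class(x)   # "numeric"

# Putting a letter into one element converts every element to text.
x[2] <- "a"
class(x)   # "character"
x          # "1" "a" "3"

# Arithmetic on the converted vector now fails; tryCatch captures the
# error message instead of halting.
msg <- tryCatch(
  sqrt(pi * x^2),
  error = function(e) conditionMessage(e)
)
msg
```

So the vector does have a type after all; it just changes silently to whatever type can hold everything you put in it.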
I also discovered in that hour that abs(x) will return the absolute value of x, that is the value without its sign, so if x == 2 or x == -2 then abs(x) will be 2.
If you want to see what vectors (and any other objects) you have created then you need to use ls(), which kind of makes sense to me as ls is the UNIX shell command to list the files in a directory. To see what is stored in a vector just type its name at the prompt.
Assigning a value to a vector is done using <- which reminds me somewhat of C++ where you would use << to send values to an output stream. So to put the number 4 into x (or more properly the first element of the vector x) you use x <- 4. The plain single equals sign also works, but it’s not the convention and most scripts written by others will use <- so you need to get used to it. Unless you want to get confused, and confuse others, use x <- 4 not x = 4.
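A quick sketch of both assignment styles side by side, assuming a fresh session; the second name y is just for illustration.

```r
x <- 4     # the conventional way to assign
y = 4      # also works at the prompt, but less common in shared scripts

x                # typing the name prints the value: [1] 4
length(x)        # 1, because x is a vector with a single element
x[1]             # 4, the first (and only) element
identical(x, y)  # TRUE, both forms stored the same value
ls()             # lists the objects created so far, including "x" and "y"
```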
To put more than one value into a vector in a single command you can use the c() function. So x <- c(1, 3, 5, 7, 9) puts the numbers 1, 3, 5, 7 and 9 into x as the 1st, 2nd, 3rd, 4th and 5th elements. Unlike most programming languages, which start arrays at element 0, R starts its vectors at element 1; there is no element zero. If you want to put a range of numbers (e.g. the numbers 1 to 10) into a vector you use a colon between them such as x <- 1:10; you don’t need the c() in this case.
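Here is a small sketch of the two ways of building a vector described above. The names odds and one_to_ten are my own, chosen for illustration.

```r
# Build a vector from individual values with c().
odds <- c(1, 3, 5, 7, 9)

# Build a vector from a range with the colon operator.
one_to_ten <- 1:10

odds[1]             # 1, indexing starts at 1
length(one_to_ten)  # 10
odds[0]             # numeric(0), an empty vector: nothing lives at index 0
```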
The book also introduced a number of functions that work with vectors:
sum(x) – adds up all of the elements of x
min(x) – returns the smallest value of the elements of x
max(x) – returns the largest value of the elements of x
range(x) – returns the smallest and largest elements of x
diff(range(x)) – returns the difference between the smallest and largest elements of x
mean(x) – returns the mean (average) of the values of the elements of x
median(x) – returns the median (another type of average) of the values of the elements of x
var(x) – returns the variance of the values of the elements of x
sd(x) – returns the standard deviation of the values of the elements of x; if memory serves that means that sd(x) == sqrt(var(x))
length(x) – returns the number of elements in x
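To check my understanding of the list above I tried each function on a small vector. The values are made up so that the answers are easy to verify by hand.

```r
x <- c(1, 3, 5, 7, 9)

sum(x)          # 25
min(x)          # 1
max(x)          # 9
range(x)        # 1 9
diff(range(x))  # 8
mean(x)         # 5
median(x)       # 5
var(x)          # 10, the sample variance (divides by n - 1)
sd(x)           # 3.162278
length(x)       # 5

# The standard deviation is the square root of the variance.
all.equal(sd(x), sqrt(var(x)))  # TRUE
```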
If you need to address a specific element of a vector you can use the square bracket notation. So, x[1] is the first element and x[47] is the 47th element. The numbers in the square brackets are referred to as the index of the element, so the first element has an index of 1 and the 47th an index of 47. If you want to return a range of elements then you can use the colon notation again, with your square brackets. So, to get the first 4 elements of x use x[1:4]. A negative number in the square brackets returns every element except that one. So, x[-2] returns every element except the second element.
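A short sketch of the three kinds of indexing just described, using made-up values so that it is obvious which elements come back.

```r
x <- c(10, 20, 30, 40, 50, 60)

x[1]    # 10, the first element
x[1:4]  # 10 20 30 40, the first four elements
x[-2]   # 10 30 40 50 60, everything except the second element
```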
If you carry out a mathematical operation on a vector with more than one element then it will be carried out on each element individually, so x * 5 will return a vector containing the same number of elements as x but with each value in x multiplied by five. Note that x itself is not changed, so if you want to use the result you will need to assign it to another vector, e.g. z <- x^2 will square each value in x and assign it to a corresponding element in z. So if x holds the radii of some circles and you want to calculate the areas of those circles then z <- pi * x^2 will populate z with the areas of the circles whose radii are stored in x.
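The circle example can be sketched like this, assuming three circles with radii 1, 2 and 3 (my own numbers, picked for illustration).

```r
x <- c(1, 2, 3)   # radii of three circles

x * 5             # 5 10 15, each element multiplied by five
x                 # 1 2 3, x itself is unchanged

z <- pi * x^2     # area of each circle
round(z, 2)       # 3.14 12.57 28.27
```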
R also includes the usual crop of comparison operators:
x == y – returns TRUE or FALSE for each element of x, depending on whether it is equal to the corresponding element of y
x > y – returns TRUE or FALSE for each element of x, depending on whether it is greater than the corresponding element of y
x < y – returns TRUE or FALSE for each element of x, depending on whether it is less than the corresponding element of y
x >= y – returns TRUE or FALSE for each element of x, depending on whether it is greater than or equal to the corresponding element of y
x <= y – returns TRUE or FALSE for each element of x, depending on whether it is less than or equal to the corresponding element of y
x != y – returns TRUE or FALSE for each element of x, depending on whether it is not equal to the corresponding element of y
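To see the element-by-element behaviour I tried the operators on two short vectors of the same length. The values are made up for illustration.

```r
x <- c(1, 5, 3)
y <- c(2, 5, 1)

x == y   # FALSE  TRUE FALSE
x > y    # FALSE FALSE  TRUE
x < y    #  TRUE FALSE FALSE
x >= y   # FALSE  TRUE  TRUE
x <= y   #  TRUE  TRUE FALSE
x != y   #  TRUE FALSE  TRUE
```

Each result is a vector of the same length as x and y, with one TRUE or FALSE per pair of elements.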
You can use these to filter the values in a vector by storing the result of the comparison in another vector. So, morethanthree <- x > 3 will populate the vector morethanthree with a TRUE or FALSE for each element of x, TRUE where the element is more than 3. It doesn’t hold the indices themselves; which(x > 3) returns those, and x[morethanthree] returns the matching values.
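A small sketch of what the comparison actually stores, using made-up values. The result is a TRUE/FALSE vector the same length as x; which() turns that into index positions, and using it inside the square brackets pulls out the matching values.

```r
x <- c(1, 3, 5, 7, 9)

morethanthree <- x > 3
morethanthree         # FALSE FALSE  TRUE  TRUE  TRUE

which(morethanthree)  # 3 4 5, the positions where the test is TRUE
x[morethanthree]      # 5 7 9, the values that passed the test
```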
I can see that being useful where you want to work with only a subset of data that meets specific criteria. For example if you had surveyed people as to how many dogs they owned (which could be zero) and put the results in a vector Dogs then you could answer the question “Given someone owns at least one dog, what is the average (mean) number of dogs owned?” with
HasDogs <- Dogs > 0
mean(Dogs[HasDogs])
You could also work out the percentage of people who own at least 1 dog with (length(Dogs[HasDogs])/length(Dogs))*100
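Putting the dog example together as a runnable sketch. The survey answers below are invented purely for illustration; the last line is an alternative I believe gives the same percentage, because TRUE counts as 1 and FALSE as 0 when you take a mean.

```r
# Hypothetical survey: number of dogs owned by eight people.
Dogs <- c(0, 1, 2, 0, 3, 0, 1, 0)

HasDogs <- Dogs > 0      # TRUE for each person with at least one dog

mean(Dogs[HasDogs])      # 1.75, mean dogs among owners: (1 + 2 + 3 + 1) / 4

# Percentage of people who own at least one dog.
(length(Dogs[HasDogs]) / length(Dogs)) * 100   # 50

# Same percentage via the mean of the TRUE/FALSE vector.
mean(HasDogs) * 100                            # 50
```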