Dipping back into “Statistics for Linguists” by Bodo Winter for a bit.
The example given is gender <- c('F', 'M', 'F', 'F'). Apparently you can use either single or double quotes, but when you enter gender on its own to display it on the console, R responds with double quotes.
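A quick way to check the quoting claim is to build the vector both ways and compare. This is only a small sketch; the variable names gender_single and gender_double are made up for the comparison and are not from the book.

```r
# Same vector written with single and with double quotes
gender_single <- c('F', 'M', 'F', 'F')
gender_double <- c("F", "M", "F", "F")

# Both spellings produce the same character vector
same_vector <- identical(gender_single, gender_double)
print(same_vector)

# R prints character values with double quotes either way
print(gender_single)
```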
You can use class(gender) to confirm that this is indeed a character vector, and you can address elements individually using square bracket notation, as you can with numeric vectors, so gender[2] returns "M". You can also use logical statements in the square brackets, which Bodo mentions like it’s been said before, but I don’t recall it being. Anyhow, gender[gender == 'F'] returns the three elements that are F. I’d have thought it would be more useful to return the indices of those elements, so that if you had a matching vector of names you could use something like names[gender[gender=='F']] to get the elements of names corresponding to those in gender that are F. I tried this and it just returned the first element of names three times. What does work is names[gender == 'F'], which does the job but seems less clear.
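For what it's worth, base R does have a function that returns the indices: which(). The sketch below shows both routes side by side. The vector speaker_names and its contents are invented for illustration (the book's example has no such vector), and it is deliberately not called names, since names() is also a base R function. As far as I can tell, the nested version fails because gender[gender == 'F'] hands back the values themselves ("F", or the factor code 1 if gender is already a factor) rather than positions, and a factor used as an index is treated as its integer codes.

```r
gender <- c('F', 'M', 'F', 'F')

# Hypothetical matching vector, made up for this example
speaker_names <- c('Ann', 'Bob', 'Cat', 'Dee')

# The comparison gives one TRUE/FALSE per element
is_f <- gender == 'F'

# which() turns the logical vector into integer positions
f_idx <- which(is_f)

# Indexing by the logical mask and by the positions gives the same result
by_mask  <- speaker_names[is_f]
by_index <- speaker_names[f_idx]

print(f_idx)
print(by_mask)
```

So names[gender == 'F'] and names[which(gender == 'F')] are interchangeable here; the second just spells out the indices step.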
Next we’re introduced to factors, but not told what they are. gender <- as.factor(gender) converts the vector gender to a factor. The key change seems to be that when you display the vector on the console, R doesn’t put the letters in quotes and prints an additional line that reads Levels: F M. From the description it looks like the values are tokenised. The levels() function displays the valid levels for the vector, and if you try to replace a value with one that isn’t a valid level, e.g. gender[3] <- 'not declared', then you get a warning message and the element is replaced with NA. If you need to add a new level you can do so by assigning to levels(), using the c() function to populate it: levels(gender) <- c('F', 'M', 'not declared'). Then you can do your gender[3] <- 'not declared'.
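The whole factor sequence can be run end to end as below. This is a minimal sketch rather than the book's own listing; the helper variables codes and was_na are only there to make the intermediate states visible. as.integer() shows the integer codes stored behind the labels, which seems to fit the "tokenised" impression.

```r
gender <- as.factor(c('F', 'M', 'F', 'F'))

# The valid levels, and the integer codes stored behind them
print(levels(gender))
codes <- as.integer(gender)
print(codes)

# Assigning a value that is not a level triggers a warning
# ("invalid factor level, NA generated") and stores NA
gender[3] <- 'not declared'
was_na <- is.na(gender[3])

# Extend the levels first, then the same assignment succeeds
levels(gender) <- c('F', 'M', 'not declared')
gender[3] <- 'not declared'
print(gender)
```

Writing the new levels as c(levels(gender), 'not declared') would do the same thing without retyping the existing ones.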