Dipping back into “Statistics for Linguists” by Bodo Winter for a bit.
The example given is
gender <- c('F', 'M', 'F', 'F'). Apparently you can use either single or double quotes. When you type
gender on its own to display it on the console, R responds with double quotes.
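A quick way to check the quoting behaviour for yourself is sketched below. The variable names gender_single and gender_double are mine, not from the book.

```r
# Single and double quotes create the same character vector
gender_single <- c('F', 'M', 'F', 'F')
gender_double <- c("F", "M", "F", "F")
identical(gender_single, gender_double)  # TRUE

# Printing the vector shows the elements in double quotes
print(gender_single)
```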
You can use
class(gender) to confirm that this is indeed a character vector, and address elements individually using the square bracket notation, as you can with numeric vectors, so
gender[2] returns
"M". You can also use logical statements in the square brackets, which Bodo mentions like it’s been said before, but I don’t recall it being. Anyhow,
gender[gender == 'F'] returns the three elements that are F. I’d have thought it would be more useful to return the indices of those elements, so that if you had a matching vector of names you could use something like
names[gender[gender == 'F']] to get the elements in names that correspond to the F elements in gender. I tried this and it just returned the first element of names three times. To get the matching names you can use
names[gender == 'F'], which works but seems less clear.
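Here is a small sketch of the indexing ideas above. The speaker_names vector and its contents are made up for illustration (I've avoided calling it names so it doesn't clash with R's built-in names() function), and which() is the base R function that gives the indices I was after.

```r
gender <- c('F', 'M', 'F', 'F')
# Hypothetical matching vector of names, invented for this example
speaker_names <- c('Ann', 'Bob', 'Cat', 'Dee')

class(gender)   # "character"
gender[2]       # "M"

# The comparison produces a logical vector, one TRUE/FALSE per element
is_f <- gender == 'F'   # TRUE FALSE TRUE TRUE
gender[is_f]            # "F" "F" "F"

# which() turns the logical vector into the positions of the TRUE values
f_positions <- which(is_f)   # 1 3 4

# Either the logical vector or the positions can index the matching vector
speaker_names[is_f]          # "Ann" "Cat" "Dee"
speaker_names[f_positions]   # "Ann" "Cat" "Dee"
```

The point is that gender[gender == 'F'] hands back the values themselves rather than their positions, so it can't be used to pick out the matching names.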
Next we’re introduced to factors, but not what they are.
gender <- as.factor(gender) converts the vector gender to a factor. The key change seems to be that when you display the vector on the console R doesn’t put the letters in quotes and returns an additional line that reads
Levels: F M. From the description it looks like the values are tokenised. The
levels() function displays the valid levels for the vector, and if you try to replace an element with a value that isn’t a valid level, e.g.
gender[3] <- 'not declared', then you get a warning message and the element is replaced with NA. If you need to add a new level you can do so using the
levels() function and the
c() function to populate it,
levels(gender) <- c('F', 'M', 'not declared'), and then you can do your
gender[3] <- 'not declared'.
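The whole factor sequence can be run end to end as a rough sketch. The choice of element 3 is arbitrary, and the as.integer() call is my own addition to peek at the integer codes behind the levels, which seems to back up the idea that the values are tokenised.

```r
gender <- as.factor(c('F', 'M', 'F', 'F'))
levels(gender)       # "F" "M"
as.integer(gender)   # 1 2 1 1, the integer codes stored behind the labels

# Assigning a value that is not a valid level gives a warning and stores NA
gender[3] <- 'not declared'
is.na(gender[3])     # TRUE

# Extend the set of valid levels, then the assignment goes through
levels(gender) <- c('F', 'M', 'not declared')
gender[3] <- 'not declared'
gender
```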