A factor is a special type of vector, normally used to hold a categorical variable (such as smoker/nonsmoker, state of residency, or zip code) in many statistical functions. Such vectors have class “factor”. Factors are primarily used in Analysis of Variance (ANOVA) or other situations where “categories” are needed. When a factor is used as a predictor variable, the corresponding indicator variables are created (more later).
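As a brief preview of the indicator variables mentioned above, here is a minimal sketch. The `smoker` vector is made up for illustration, and `model.matrix()` is used only to display the coding that modelling functions would typically build under R's default treatment contrasts.

```r
# Hypothetical data, invented for this illustration
smoker <- factor(c("smoker", "nonsmoker", "smoker", "nonsmoker"))

# Build the design matrix a model with `smoker` as predictor would use.
# With default treatment contrasts, the first level ("nonsmoker") is the
# baseline and the remaining level gets a 0/1 indicator column.
X <- model.matrix(~ smoker)
X
```

The column named `smokersmoker` is 1 for rows where the value is "smoker" and 0 otherwise.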
A note of caution: factors in R often look like character vectors when printed, but their values are not wrapped in double quotes. R stores a factor as integer codes together with a key of level names, so a factor sometimes behaves like a numeric vector.
# create the character vector
citizen <- c("uk","us","no","au","uk","us","us","no","au")
# convert to factor
citizenf <- factor(citizen)
citizen
[1] "uk" "us" "no" "au" "uk" "us" "us" "no" "au"
citizenf
[1] uk us no au uk us us no au
Levels: au no uk us
# convert factor back to character vector
as.character(citizenf)
[1] "uk" "us" "no" "au" "uk" "us" "us" "no" "au"
# convert to numeric vector
as.numeric(citizenf)
[1] 3 4 2 1 3 4 4 2 1
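The numbers returned above are the internal level codes, not the original values. This matters most when the labels themselves look like numbers. The sketch below uses a hypothetical `yearf` factor (not part of the example data) to show the difference between converting the codes and converting the labels.

```r
# Hypothetical factor whose labels look like numbers
yearf <- factor(c("2010", "2005", "2010", "1999"))

# as.numeric() on the factor returns the integer level codes
codes <- as.numeric(yearf)

# Converting to character first recovers the numeric values of the labels
values <- as.numeric(as.character(yearf))

codes
values
```

Going through `as.character()` first is a common way to avoid silently analysing level codes instead of the values they stand for.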
R stores many data structures as vectors with “attributes” and a “class”; a factor is an integer vector with a levels attribute and class “factor” (just so you have seen this).
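To see this storage directly, the following sketch inspects the factor built earlier. It rebuilds `citizenf` so that it runs on its own and uses only base R functions.

```r
citizen <- c("uk", "us", "no", "au", "uk", "us", "us", "no", "au")
citizenf <- factor(citizen)

class(citizenf)       # the class attribute
typeof(citizenf)      # the underlying storage type
levels(citizenf)      # the key of level names
attributes(citizenf)  # all attributes attached to the vector
unclass(citizenf)     # the bare integer codes, with the levels still attached
```

`unclass()` removes the class, so what prints is the plain integer vector that R keeps underneath the factor.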
Tabulating factors is a useful way to get a sense of the “sample” set available.
table(citizenf)
citizenf
au no uk us
2 2 2 3
By default, the factor levels are the sorted unique values found in the data. It is possible to specify only a subset of levels. Note how any value not among the specified levels becomes a missing value (NA).
[1] uk us <NA> <NA> uk us us <NA> <NA>
Levels: us uk
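The call that created `citizenf2` is not shown above. A call consistent with the printed output, assuming `citizenf2` was built from the same `citizen` vector with the `levels` argument of `factor()`, would be:

```r
citizen <- c("uk", "us", "no", "au", "uk", "us", "us", "no", "au")

# Assumed reconstruction: keep only "us" and "uk" as levels, in that order.
# Values outside the given levels ("no", "au") become NA.
citizenf2 <- factor(citizen, levels = c("us", "uk"))
citizenf2
```

The order given in `levels` is also the order in which the levels are stored and printed, which is why "us" comes before "uk" here.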
table(citizenf2)
citizenf2
us uk
3 2
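Missing values are excluded from the table by default. There is an option to override this setting: addNA() turns NA into an explicit factor level, so it is counted like any other category.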
addNA(citizenf2)
[1] uk us <NA> <NA> uk us us <NA> <NA>
Levels: us uk <NA>
table(addNA(citizenf2))
us uk <NA>
3 2 4
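An alternative that leaves the factor itself unchanged is the `useNA` argument of `table()`. The sketch below assumes `citizenf2` was created with `factor(citizen, levels = c("us", "uk"))`, which matches the output shown earlier.

```r
citizen <- c("uk", "us", "no", "au", "uk", "us", "us", "no", "au")
citizenf2 <- factor(citizen, levels = c("us", "uk"))  # assumed construction

# Count NA only when missing values are present
tab_ifany <- table(citizenf2, useNA = "ifany")

# Always include an NA column, even when its count is zero
tab_always <- table(citizenf2, useNA = "always")

tab_ifany
tab_always
```

For this data both calls report the same counts as `table(addNA(citizenf2))`; the two settings differ only when a vector has no missing values.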
Caution
This emphasizes that default settings may or may not be appropriate for your analysis. It’s important to know what those settings are and choose alternatives as necessary.