12  Factors

Author

Sean Davis & Lori Kern

Published

August 27, 2025

Modified

August 27, 2025

12.1 Factors

A factor is a special type of vector, normally used to hold a categorical variable (such as smoker/nonsmoker, state of residency, or zip code) in many statistical functions. Such vectors have class “factor”. Factors are primarily used in Analysis of Variance (ANOVA) or other situations where “categories” are needed. When a factor is used as a predictor variable in a model, R creates the corresponding indicator variables (more later).
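As a small preview of those indicator variables, the sketch below uses `model.matrix()` to show the 0/1 column that a two-level factor expands into. The `smoker` data are made up for illustration, and the column layout assumes R's default treatment contrasts.

```r
# Hypothetical data: smoking status for five subjects
smoker <- factor(c("no", "yes", "no", "yes", "yes"))

# model.matrix() builds the design matrix a model would use;
# the factor becomes an intercept plus one indicator column
design <- model.matrix(~ smoker)
design
```

The first level ("no") serves as the reference category, so only a `smokeryes` column appears next to the intercept.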

A note of caution: factors in R often look like character vectors when printed, but their values appear without double quotes. R stores a factor as integer codes plus a set of level labels, so a factor sometimes behaves like a numeric vector.

# create the character vector
citizen <- c("uk", "us", "no", "au", "uk", "us", "us", "no", "au")

# convert to factor
citizenf <- factor(citizen)
citizen
[1] "uk" "us" "no" "au" "uk" "us" "us" "no" "au"
citizenf
[1] uk us no au uk us us no au
Levels: au no uk us
# convert factor back to character vector
as.character(citizenf)
[1] "uk" "us" "no" "au" "uk" "us" "us" "no" "au"
# convert to numeric vector (returns the integer level codes)
as.numeric(citizenf)
[1] 3 4 2 1 3 4 4 2 1
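Because `as.numeric()` returns the level codes, it can be misleading when the labels themselves look like numbers. The following is a minimal sketch with a hypothetical `dose` factor (not part of the example data above); going through `as.character()` first recovers the values that the labels represent.

```r
# Hypothetical factor whose labels happen to be numbers
dose <- factor(c("10", "20", "5", "10"))

# integer codes that index the levels, not the doses themselves
as.numeric(dose)

# convert the labels to character first, then to numbers
as.numeric(as.character(dose))
#> [1] 10 20  5 10
```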

R stores many data structures as vectors with additional “attributes”, one of which is the “class” (shown here just so you have seen it).

attributes(citizenf)
$levels
[1] "au" "no" "uk" "us"

$class
[1] "factor"
class(citizenf)
[1] "factor"
# note that after unclassing, we can see the 
# underlying numeric structure again
unclass(citizenf)
[1] 3 4 2 1 3 4 4 2 1
attr(,"levels")
[1] "au" "no" "uk" "us"
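To make the link between codes and levels concrete, the sketch below rebuilds the original character values by indexing the level labels with the integer codes. The helper names `codes`, `labels`, and `rebuilt` are introduced here only for illustration.

```r
citizen <- c("uk", "us", "no", "au", "uk", "us", "us", "no", "au")
citizenf <- factor(citizen)

codes <- as.integer(citizenf)   # underlying integer codes
labels <- levels(citizenf)      # level labels, in level order

# indexing the labels by the codes rebuilds the original values
rebuilt <- labels[codes]
identical(rebuilt, citizen)
#> [1] TRUE

# number of distinct levels
nlevels(citizenf)
#> [1] 4
```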

Tabulating factors is a useful way to get a sense of the “sample” set available.

table(citizenf)
citizenf
au no uk us 
 2  2  2  3 

By default, the factor levels are the sorted unique values of the input. It is also possible to specify only a subset of those values as levels. Note how any value not included in the levels becomes a missing value (NA).

citizenf2 <- factor(citizen, levels=c("us", "uk"))
citizenf2
[1] uk   us   <NA> <NA> uk   us   us   <NA> <NA>
Levels: us uk
table(citizenf2)
citizenf2
us uk 
 3  2 
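Since values outside the specified levels are dropped silently, it can help to check what was lost. The following sketch counts the resulting missing values and lists the original values that were not covered by the levels.

```r
citizen <- c("uk", "us", "no", "au", "uk", "us", "us", "no", "au")
citizenf2 <- factor(citizen, levels = c("us", "uk"))

# how many observations became NA?
sum(is.na(citizenf2))
#> [1] 4

# which original values were not covered by the levels?
setdiff(unique(citizen), levels(citizenf2))
#> [1] "no" "au"
```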

Missing values are excluded from the table by default. Wrapping the factor in addNA() overrides this by making NA an explicit level.

addNA(citizenf2)
[1] uk   us   <NA> <NA> uk   us   us   <NA> <NA>
Levels: us uk <NA>
table(addNA(citizenf2))

  us   uk <NA> 
   3    2    4 
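An alternative that leaves the factor itself unchanged is the `useNA` argument of `table()`. This is a brief sketch on the same data; `"ifany"` adds an NA column only when missing values are present, while `"always"` adds it unconditionally.

```r
citizen <- c("uk", "us", "no", "au", "uk", "us", "us", "no", "au")
citizenf2 <- factor(citizen, levels = c("us", "uk"))

# count missing values without adding an NA level to the factor
counts <- table(citizenf2, useNA = "ifany")
counts
```

The counts match those from `table(addNA(citizenf2))`, but the levels of `citizenf2` remain `us` and `uk`.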
Caution

This emphasizes that default settings may or may not be appropriate for your analysis. It’s important to know what those settings are and choose alternatives as necessary.