A factor is a data structure used for fields that take only a predefined, finite number of values (categorical data).
For example, a data field such as marital status may contain only the values single, married, separated, divorced, or widowed.
In such a case, we know the possible values beforehand and these predefined, distinct values are called levels.
How to create a factor in R?
We can create a factor using the function factor(). Levels of a factor are inferred from the data if not provided.
x <- factor(c("single", "married", "married", "single"))
print(x)
x <- factor(c("single", "married", "married", "single"), levels = c("single", "married", "divorced"))
print(x)
Output
[1] single married married single
Levels: married single
[1] single married married single
Levels: single married divorced
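We can see from the above example that levels may be predefined even if they are not used. When levels are inferred from the data, they are sorted, which is why married is listed before single in the first output.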
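To inspect the levels of a factor directly, the base R functions levels(), nlevels() and table() can be used. The following is a minimal sketch that reuses the marital status factor from the example above.

```r
x <- factor(c("single", "married", "married", "single"),
            levels = c("single", "married", "divorced"))

lv <- levels(x)    # all levels, including the unused "divorced"
n  <- nlevels(x)   # number of levels
counts <- table(x) # counts per level; unused levels get a count of 0

print(lv)
print(n)
print(counts)
```

Because unused levels are kept, table() reports a zero count for divorced instead of omitting it.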
Factors are closely related to vectors. In fact, factors are stored as integer vectors. This can be seen clearly from the structure of a factor.
x <- factor(c("single", "married", "married", "single"))
print(x)
str(x)
Output
[1] single married married single
Levels: married single
 Factor w/ 2 levels "married","single": 2 1 1 2
We see that levels are stored in a character vector and the individual elements are actually stored as indices.
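The relationship between the integer codes and the levels can be checked by hand. This is a small sketch using only base R; the variable names codes and lev are chosen here for illustration.

```r
x <- factor(c("single", "married", "married", "single"))

codes <- as.integer(x)  # underlying integer codes, one per element
lev   <- levels(x)      # character vector of levels

# Indexing the levels by the codes rebuilds the original values,
# which is the same result that as.character(x) gives.
rebuilt <- lev[codes]

print(codes)
print(rebuilt)
```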
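Factors can also be created when we read non-numeric columns into a data frame.
Before R 4.0.0, the data.frame() function converted character vectors into factors by default, and we had to pass the argument stringsAsFactors = FALSE to suppress this behavior. Since R 4.0.0, the default is FALSE, so character vectors are converted only if we pass stringsAsFactors = TRUE.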
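Since the default value of stringsAsFactors changed in R 4.0.0, passing the argument explicitly gives the same result on any version. A short sketch, where the data frame and column names are made up for illustration:

```r
status <- c("single", "married", "married", "single")

# Keep the column as a character vector
df_chr <- data.frame(status = status, stringsAsFactors = FALSE)

# Convert the column to a factor
df_fct <- data.frame(status = status, stringsAsFactors = TRUE)

print(class(df_chr$status))
print(class(df_fct$status))
```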
How to access components of a factor?
Accessing components of a factor is very similar to accessing components of a vector.
x <- factor(c("single", "married", "married", "single"))
print(x)
print(x[3])
print(x[c(2, 4)])
print(x[-1])
print(x[c(TRUE, FALSE, FALSE, TRUE)])
Output
[1] single married married single
Levels: married single
[1] married
Levels: married single
[1] married single
Levels: married single
[1] married married single
Levels: married single
[1] single single
Levels: married single
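Note that every subset in the output keeps the full set of levels, even when some levels no longer occur. If that is not wanted, base R offers the drop = TRUE argument of the factor subsetting operator and the droplevels() function. A brief sketch:

```r
x <- factor(c("single", "married", "married", "single"))

y <- x[c(1, 4)]                  # only "single" remains, both levels are kept
kept <- levels(y)

dropped1 <- levels(x[c(1, 4), drop = TRUE])  # drop unused levels while subsetting
dropped2 <- levels(droplevels(y))            # drop unused levels afterwards

print(kept)
print(dropped1)
print(dropped2)
```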
How to modify a factor?
Components of a factor can be modified using simple assignments. However, we cannot choose values outside of its predefined levels.
x <- factor(c("single", "married", "married", "single"), levels = c("single", "married", "divorced"))
print(x)
x[2] <- "divorced"
print(x)
x[3] <- "widowed"
print(x)
Output
[1] single married married single
Levels: single married divorced
[1] single divorced married single
Levels: single married divorced
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "widowed") :
  invalid factor level, NA generated
[1] single divorced <NA> single
Levels: single married divorced
A workaround is to add the new value to the levels first.
x <- factor(c("single", "divorced", "widowed", "single"), levels = c("single", "married", "divorced"))
print(x)
levels(x) <- c(levels(x), "widowed")
x[3] <- "widowed"
print(x)
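The two steps of the workaround can be wrapped in a small function. The helper below, set_factor_value(), is a hypothetical name and not part of base R; it is only a sketch of the idea, assuming a single value is assigned at a time.

```r
# Hypothetical helper (not part of base R): assign `value` at position `i`,
# extending the levels first if `value` is not already one of them.
set_factor_value <- function(f, i, value) {
  if (!value %in% levels(f)) {
    levels(f) <- c(levels(f), value)
  }
  f[i] <- value
  f
}

x <- factor(c("single", "married", "married", "single"),
            levels = c("single", "married", "divorced"))
x <- set_factor_value(x, 3, "widowed")
print(x)
```

Appending to the existing levels keeps the integer codes of all other elements unchanged.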