When analyzing large or public datasets, it’s important to be aware of the kinds of missing values that might be present in the data and that can propagate as you compute summary statistics.
Many readers will already know about the NA value, which propagates into summary statistics like so:
x <- c(1,2,3,NA)
mean(x)
## [1] NA
sd(x)
## [1] NA
NA values can be removed from those statistics either through the function's na.rm argument or manually using is.na:
mean(x, na.rm=TRUE)
## [1] 2
sd(x, na.rm=TRUE)
## [1] 1
is.na(x)
## [1] FALSE FALSE FALSE TRUE
mean(x[!is.na(x)])
## [1] 2
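Before dropping missing values, it can help to check how many there are. The following is a minimal sketch using only base R; the variable names (has_missing, n_missing, frac_missing) are illustrative and not from the original example:

```r
x <- c(1, 2, 3, NA)
has_missing  <- anyNA(x)        # TRUE if at least one value is NA
n_missing    <- sum(is.na(x))   # count of missing values
frac_missing <- mean(is.na(x))  # fraction of values that are missing
has_missing
## [1] TRUE
c(n_missing, frac_missing)
## [1] 1.00 0.25
```

Summing or averaging the logical vector works because TRUE is treated as 1 and FALSE as 0.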
When a matrix has rows of constant values and we compute a statistic that divides by the standard deviation, such as a z-score, we can get two kinds of non-finite output: 0/0, which gives NaN (“not a number”), or a non-zero number divided by 0, which gives Inf or -Inf:
m <- matrix(c(0,0,0,1,1,1,1:9),ncol=3,byrow=TRUE)
m
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    1    1    1
## [3,]    1    2    3
## [4,]    4    5    6
## [5,]    7    8    9
rowMeans(m)
## [1] 0 1 2 5 8
apply(m, 1, sd)
## [1] 0 0 1 1 1
z <- apply(m, 1, function(x) mean(x)/sd(x))
z
## [1] NaN Inf 2 5 8
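The first two entries follow directly from R's floating-point division rules, which can be checked in isolation. In this small sketch the name ratios is only illustrative:

```r
# 0/0 is undefined, while a non-zero number over 0 is signed infinity
ratios <- c(0 / 0, 1 / 0, -1 / 0)
ratios
## [1]  NaN  Inf -Inf
```

In the matrix above, the first row has mean 0 and standard deviation 0 (giving 0/0), while the second row has mean 1 and standard deviation 0 (giving 1/0).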
We can identify the NaN and infinite values using is.nan and is.finite:
z[!is.nan(z)]
## [1] Inf 2 5 8
z[!is.nan(z) & is.finite(z)]
## [1] 2 5 8
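The predicates overlap in ways that are easy to forget, so a side-by-side comparison may be useful. This sketch uses a made-up vector v that contains one value of each kind:

```r
v <- c(1, NA, NaN, Inf, -Inf)
is.na(v)        # NaN also counts as NA
## [1] FALSE  TRUE  TRUE FALSE FALSE
is.nan(v)       # NA is not NaN
## [1] FALSE FALSE  TRUE FALSE FALSE
is.finite(v)    # FALSE for NA, NaN, Inf and -Inf
## [1]  TRUE FALSE FALSE FALSE FALSE
is.infinite(v)  # TRUE only for Inf and -Inf
## [1] FALSE FALSE FALSE  TRUE  TRUE

# Because is.finite is already FALSE for NaN, it is enough on its own
z <- c(NaN, Inf, 2, 5, 8)
z[is.finite(z)]
## [1] 2 5 8
```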
We could also deal with these cases by explicitly checking whether the denominator is equal to 0 and returning whatever value we choose (here, 0):
apply(m, 1, function(x) if (sd(x) == 0) 0 else mean(x)/sd(x))
## [1] 0 0 2 5 8
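Returning 0 is only one possible choice. As an alternative sketch, the check can be wrapped in a small helper that returns NA, so that the undefined rows stay flagged as missing and can be dropped later with na.rm. The helper name safe_ratio is hypothetical and not part of base R:

```r
# Hypothetical helper: mean/sd, or NA when the standard deviation is 0 or missing
safe_ratio <- function(x) {
  s <- sd(x)
  if (is.na(s) || s == 0) NA_real_ else mean(x) / s
}

m <- matrix(c(0,0,0,1,1,1,1:9), ncol=3, byrow=TRUE)
z <- apply(m, 1, safe_ratio)
z
## [1] NA NA  2  5  8
mean(z, na.rm=TRUE)
## [1] 5
```

Whether 0, NA, or some other value is appropriate depends on how the result is used downstream.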