When analyzing large or public datasets, it’s important to be aware of the kinds of missing or non-finite values that may be present in the data, and that may propagate as you compute summary statistics.

Many readers will already know the NA value, which propagates into summary statistics like so:

x <- c(1,2,3,NA)
mean(x)
## [1] NA
sd(x)
## [1] NA

NA values can be removed from those statistics either through the function’s na.rm argument or manually using is.na:

mean(x, na.rm=TRUE)
## [1] 2
sd(x, na.rm=TRUE)
## [1] 1
is.na(x)
## [1] FALSE FALSE FALSE  TRUE
mean(x[!is.na(x)])
## [1] 2
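
Before dropping NA values it can be useful to know how many there are. The following is a minimal sketch: because is.na returns a logical vector, sum and mean give the count and the fraction of missing entries. The data frame df is a made-up example used only for illustration.

```r
x <- c(1, 2, 3, NA)

# TRUE counts as 1, so sum() gives the number of NA values
n_missing <- sum(is.na(x))
n_missing
## [1] 1

# and mean() gives the fraction of NA values
frac_missing <- mean(is.na(x))
frac_missing
## [1] 0.25

# Hypothetical data frame: count the NA values in each column
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
na_per_column <- colSums(is.na(df))
na_per_column
## a b
## 1 2
```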

When a matrix contains rows of constant values and we divide by the standard deviation of each row (as we do when computing z-scores), the denominator is 0 and we can get two kinds of output: 0/0, which gives NaN (“not a number”), or any non-zero number divided by 0, which gives Inf or -Inf:

m <- matrix(c(0,0,0,1,1,1,1:9),ncol=3,byrow=TRUE)
m
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    1    1    1
## [3,]    1    2    3
## [4,]    4    5    6
## [5,]    7    8    9
rowMeans(m)
## [1] 0 1 2 5 8
apply(m, 1, sd)
## [1] 0 0 1 1 1
z <- apply(m, 1, function(x) mean(x)/sd(x))
z
## [1] NaN Inf   2   5   8
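
The same two cases can be seen directly with scalar arithmetic, as in this small illustration:

```r
# zero divided by zero is undefined
0/0
## [1] NaN

# a non-zero number divided by zero is infinite, with the sign of the numerator
1/0
## [1] Inf
-1/0
## [1] -Inf
```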

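We can identify the NaN and infinite values using is.nan and is.finite. Note that is.finite returns FALSE for NaN and NA as well as for Inf and -Inf, so the second filter below would give the same result using is.finite(z) alone: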

z[!is.nan(z)]
## [1] Inf   2   5   8
z[!is.nan(z) & is.finite(z)]
## [1] 2 5 8
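
To see how these predicates relate to one another, here is a short sketch applying each of them to a vector holding one value of every kind. The vector v is made up for illustration, and z is redefined with the values computed above so that the block runs on its own.

```r
v <- c(1, NA, NaN, Inf, -Inf)

# is.na is TRUE for both NA and NaN
is.na(v)
## [1] FALSE  TRUE  TRUE FALSE FALSE

# is.nan is TRUE only for NaN
is.nan(v)
## [1] FALSE FALSE  TRUE FALSE FALSE

# is.finite is TRUE only for ordinary numbers
is.finite(v)
## [1]  TRUE FALSE FALSE FALSE FALSE

# is.infinite is TRUE only for Inf and -Inf
is.infinite(v)
## [1] FALSE FALSE FALSE  TRUE  TRUE

# so a single is.finite filter keeps only the usable values
z <- c(NaN, Inf, 2, 5, 8)
z_ok <- z[is.finite(z)]
z_ok
## [1] 2 5 8
```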

We could also deal with these cases by explicitly checking whether the denominator is equal to 0 and returning whatever value we want in that case (here, 0):

apply(m, 1, function(x) if (sd(x) == 0) 0 else mean(x)/sd(x))
## [1] 0 0 2 5 8
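
With non-integer data, the standard deviation of a constant row may come out as a tiny non-zero number rather than exactly 0 because of floating-point rounding, so an exact comparison with 0 can miss it. One possible variation, sketched below, compares against a small tolerance instead. The helper name safe_ratio, its fallback argument, and the default tolerance are all assumptions made for illustration; an absolute tolerance like this one depends on the scale of the data and may need adjusting.

```r
# Hypothetical helper: mean/sd with a fallback when the spread is (near) zero
safe_ratio <- function(x, fallback = NA_real_,
                       tol = sqrt(.Machine$double.eps)) {
  s <- sd(x, na.rm = TRUE)
  # sd() returns NA when fewer than two non-missing values remain
  if (is.na(s) || s < tol) return(fallback)
  mean(x, na.rm = TRUE) / s
}

m <- matrix(c(0,0,0,1,1,1,1:9), ncol=3, byrow=TRUE)

# default: mark constant rows as missing
r_na <- apply(m, 1, safe_ratio)
r_na
## [1] NA NA  2  5  8

# or return 0 for constant rows, as in the example above
r_zero <- apply(m, 1, safe_ratio, fallback = 0)
r_zero
## [1] 0 0 2 5 8
```

Returning NA rather than 0 keeps the undefined rows visibly distinct from rows whose ratio really is 0. Which fallback is appropriate depends on the analysis.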