Detection of anomalous space-time clusters using the scan statistics methodology. Focuses on prospective surveillance of data streams, scanning for clusters with ongoing anomalies. Hypothesis testing is made possible by the generation of Monte Carlo p-values.

An R package for space-time anomaly detection using scan statistics.

To install the development version of this package, type the following:

devtools::install_github("benjak/scanstatistics")

Scan statistics are used to detect anomalous clusters in spatial or space-time data. The gist of the methodology, at least in this package, is this:

- Monitor one or more data streams at multiple
*locations*over intervals of time. - Form a set of space-time
*clusters*, each consisting of (1) a collection of locations, and (2) an interval of time stretching from the present to some number of time periods in the past. - For each cluster, compute a statistic based on both the observed and the expected responses. Report the clusters with the largest statistics.

: computes a scan statistic for data following a Poisson distribution.`scan_poisson`

: computes a scan statistic for data following a negative binomial distribution.`scan_negbin`

: computes a scan statistic for data following a zero-inflated Poisson distribution.`scan_zip`

: Creates a set of spatial`knn_zones`

*zones*(groups of locations) to scan for anomalies. Input is a matrix in which rows are the enumerated locations, and columns the (k) nearest neighbors. To create such a matrix, the following two functions are useful:: use`coords_to_knn`

`stats::dist`

to get the (k) nearest neighbors of each location into a format usable by`knn_zones`

.: use an already computed distance matrix to get the (k) nearest neighbors of each location into a format usable by`dist_to_knn`

`knn_zones`

.

: An alternative to`flexible_zones`

`knn_zones`

that uses the adjacency structure of locations to create a richer set of zones. The additional input is an adjacency matrix, but otherwise works as`knn_zones`

.

: Score each location by how likely it is to have an ongoing anomaly in it. This score is heuristically motivated, not theoretically so.`score_locations`

: Get the top (k) space-time clusters, either overlapping or non-overlapping in the spatial dimension.`top_clusters`

To demonstrate the scan statistics in this package, we will use a dataset of the annual number of brain cancer cases in the counties of New Mexico, for the years (1973-1991). This data was studied by Kulldorff et al. (1998), who detected a cluster of cancer cases in the counties Los Alamos and Santa Fe during the years (1986-1989), though the excess of brain cancer in this cluster was not deemed statistically significant. The data originally comes from the package *rsatscan* (Kleinman 2015), which provides an interface to the program SatScan, but it has been aggregated and extended for the *scanstatistics* package.

To get familiar with the counties of New Mexico, we begin by plotting them on a map using the data table `NM_map`

which comes with the *scanstatistics* package:

library(scanstatistics)#> Loading required package: data.tablelibrary(ggplot2) # Load map datadata(NM_map) # Place the county names at the centroid of the counties. The following function# calculates the geographical centroids from the polygons in NM_map.# See: https://en.wikipedia.org/wiki/Centroid#Centroid_of_polygonpolygon_centroid <- function(x0, y0) { x1 <- c(x0[-1], x0[1]) y1 <- c(y0[-1], y0[1]) A <- sum(x0 * y1 - x1 * y0) / 2 Cx <- sum((x0 + x1) * (x0 * y1 - x1 * y0)) / (6 * A) Cy <- sum((y0 + y1) * (x0 * y1 - x1 * y0)) / (6 * A) data.frame(long = Cx, lat = Cy)} # Calculate geographic centroids for each countycentroids <- NM_map[, polygon_centroid(long, lat), by = .(subregion)] # Plot map with labels at centroidsggplot() + geom_polygon(data = NM_map, mapping = aes(x = long, y = lat, group = group), color = "grey", fill = "white") + geom_text(data = centroids, mapping = aes(x = long, y = lat, label = subregion)) + ggtitle("Counties of New Mexico")

It should be noted that Cibola county was split from Valencia county in 1981, and cases in Cibola have been counted to Valencia in the data.

The anomalies considered in the *scanstatistics* package have both a temporal and a spatial component. The spatial component, called a zone, consists of one or more locations grouped together according to their similarity across features. In this example, the locations are the counties of New Mexico and the features are the coordinates of the county seats. These are made available in the data table `NM_geo`

. Similarity will be measured using the geographical distance between the seats of the counties, taking into account the curvature of the earth. A distance matrix is calculated using the `spDists`

function from the *sp* package, which is then passed to `dist_to_knn`

(with (k=15) neighbors) and on to `knn_zones`

:

library(sp)library(magrittr) data(NM_geo) zones <- NM_geo[, c("long", "lat"), with = FALSE] %>% as.matrix %>% spDists(x = ., y = ., longlat = TRUE) %>% dist_to_knn(k = 15) %>% knn_zones

The Poisson distribution is a natural first option when dealing with (practically) unbounded count data. The *scanstatistics* package provides the function `scan_poisson`

, which is an expectation-based[1] scan statistic for univariate Poisson-distributed data proposed by Neill et al. (2005).

For the expectation-based Poisson scan statistic, the null hypothesis of no anomaly states that at each location (i) and duration (t), the observed count is Poisson-distributed with expected value (\mu_{it}): [ H_0 ! : Y_{it} \sim \textrm{Poisson}(\mu_{it}), ] for locations (i=1,\ldots,m) and durations (t=1,\ldots,T), with (T) being the maximum duration considered. Under the alternative hypothesis, there is a space-time cluster (W) consisting of a spatial zone (Z \subset {1,\ldots,m}) and a time window (D = {1, 2, \ldots, d} \subseteq {1,2,\ldots,T}) such that the counts in (W) have their expected values inflated by a factor (q_W > 1) compared to the null hypothesis: [ H_1 ! : Y_{it} \sim \textrm{Poisson}(q_W \mu_{it}), ~~(i,t) \in W. ] For locations and durations outside of this window, counts are assumed to be distributed as under the null hypothesis. Calculating the scan statistic then involves three steps:

- For each space-time window (W), find the maximum likelihood estimate of (q_W), treating all (\mu_{it})'s as constants.
- Plug the estimated (q_W) into (the logarithm of) a likelihood ratio with the likelihood of the alternative hypothesis in the numerator and the likelihood under the null hypothesis (in which (q_W=1)) in the denominator, again for each (W).
- Take the scan statistic as the maximum of these likelihood ratios, and the corresponding window (W^*) as the most likely cluster (MLC).

The first argument to `scan_poisson`

should be a **data table** with columns 'location', 'duration', 'count' and 'mu'. The latter two columns contain the observed counts and the estimated Poisson expected value parameters respectively, and the table holds data for the period in which we want to detect anomalies. Locations should be encoded as the integers 1, 2, ..., which means that factor variables can be used for this purpose. The duration column counts time backwards, so that a duration of 1 is the most recent time interval, duration 2 is the second most recent, and so on.

We will create such a table by subsetting the `NM_popcas`

table, which holds the population and the number of brain cancer cases for each year between (1973-1991) and each county of New Mexico. Note that the population numbers are (perhaps poorly) interpolated from the censuses conducted in 1973, 1982, and 1991.

data(NM_popcas) tab <- NM_popcas[year >= 1986 & year < 1990, ]tab[, duration := max(year) - year + 1]tab[, location := county]

We still need to add the column 'mu', which should hold the predicted Poisson expected value parameter (\mu_{it}) for each location (i) and time interval (t). In this example we would like to detect a potential cluster of brain cancer in the counties of New Mexico during the years (1986-1989). Thus, we will use data from the years prior to 1986 to estimate the Poisson parameter for all counties in the years following. A simple generalized linear model (GLM) with a linear time trend and an offset for county population size will suffice to demonstrate the scan statistic. We fit such a model and create the needed column as follows:

mod_poisson <- glm(count ~ offset(log(population)) + 1 + I(year - 1985), data = NM_popcas[year < 1986, ], family = poisson(link = "log")) # Add the expected value parameter columntab[, mu := predict(mod_poisson, tab, type = "response")]

We can now calculate the Poisson scan statistic. To give us more confidence in our detection results, we will perform 99 Monte Carlo replications, by which data is generated using the parameters from the null hypothesis and a new scan statistic calculated. This is then summarized in a (p)-value, calculated as the proportion of times the replicated scan statistics exceeded the observed one. The output of `scan_poisson`

is an object of class "scanstatistic", which comes with the print method seen below.

set.seed(1)poisson_result <- scan_poisson(tab, zones, n_mcsim = 99)print(poisson_result)#> Data distribution: Poisson#> Type of scan statistic: Expectation-based#> Number of locations considered: 32#> Maximum duration considered: 4#> Number of spatial zones: 415#> Number of Monte Carlo replicates: 99#> p-value of observed statistic: 0.01#> Most likely event duration: 4#> ID of locations in most likely cluster: 15, 26

As we can see, the most likely cluster for an anomaly stretches from (1986-1989) and involves the locations numbered 15 and 26, which correspond to the counties

counties <- as.character(NM_geo$county)counties[c(15, 26)][1] "losalamos" "santafe"

These are the same counties detected by Kulldorff et al. (1998), though their analysis was retrospective rather than prospective as ours was. Ours was also data dredging (adjective) as we used the same study period with hopes of detecting the same cluster.

We can score each county according to how likely it is to be part of a cluster in a heuristic fashion using the function `score_locations`

, and visualize the results on a heatmap as follows:

# Calculate scores and add column with county namescounty_scores <- score_locations(poisson_result)county_scores[, county := counties] # Create a table for plottingscore_map_df <- merge(NM_map, county_scores, by = "county", all.x = TRUE)score_map_df[subregion == "cibola", relative_score := county_scores[county == "valencia", relative_score]] ggplot() + geom_polygon(data = score_map_df, mapping = aes(x = long, y = lat, group = group, fill = relative_score), color = "grey") + scale_fill_gradient(low = "#e5f5f9", high = "darkgreen", guide = guide_colorbar(title = "Relative\nScore")) + geom_text(data = centroids, mapping = aes(x = long, y = lat, label = subregion), alpha = 0.5) + ggtitle("County scores")

As we can see, this does not match up entirely with the previous results as Torrance was not part of the most likely cluster.

Finally, if we want to know not just the most likely cluster, but say the five top-scoring space-time clusters, we can use the function `top_clusters`

. The clusters returned can either be overlapping or non-overlapping in the spatial dimension, according to our liking.

top5 <- top_clusters(poisson_result, k = 5, overlapping = FALSE) # Find the counties corresponding to the spatial zones of the 5 clusters.top_counties <- top5$zone %>% purrr::map(get_zone, zones = zones) %>% purrr::map(function(x) counties[x]) # Add the counties corresponding to the zones as a columntop5[, counties := top_counties]

To get (p)-values for these clusters, the values of the cluster-specific statistics in the table above can be compared to the replicate scan statistics calculated earlier. These (p)-values will be conservative, since secondary clusters from the original data are compared to the most likely clusters from the replicate data sets.

top5[, pvalue := mc_pvalue(statistic, poisson_result$replicated)] top5#> zone duration statistic#> 1: 49 4 9.1806711#> 2: 3 2 6.8196550#> 3: 140 4 3.5377879#> 4: 10 4 3.4072029#> 5: 9 2 0.8372729#> counties pvalue#> 1: losalamos,santafe 0.01#> 2: chaves 0.01#> 3: bernalillo,lincoln,sierra,socorro,torrance,valencia 0.48#> 4: guadalupe 0.50#> 5: grant 1.00

For count data with overdispersion, the *scanstatistics* package provides the function `scan_negbin`

, which is the expectation-based scan statistic invented by Tango, Takahashi, and Kohriyama (2011). This scan statistic assumes that the data follows a negative binomial distribution parametrized by its expected value (\mu) and a parameter (\theta) such that a count (Y) has variance (\text{Var}(Y) = \mu + \mu^2 / \theta). The parameters (\mu) and (\theta) may vary over both location and time.

The negative binomial scan statistic comes in two versions, each with a different assumption of how an anomaly with manifest in the data. The first version makes the same assumption as the Poisson scan statistic in the previous section: an anomaly that occurs in a space-time window (W) will have the effect of increasing the expected value of the counts in that window by a factor (q_W>1) in comparison to what was predicted. This factor (q_W) is the same for all locations in (W) and constant over time. In the second version of the scan statistic, the factor (q_W) increases monotonically over time. The null and alternative hypotheses are otherwise as for the Poisson scan statistic, except that the negative binomial distribution is used instead. Further, the scan statistic is calculated using the score and Fisher information rather than a likelihood ratio.

Similar to `scan_poisson`

, the first argument to `scan_negbin`

should be a **data table** with columns 'location', 'duration', 'count', 'mu' and 'theta'. The second and third arguments specify the spatial zones and the number of Monte Carlo replications respectively. The fourth argument specifies the alternative hypothesis; the choices are `version = "ordinary"`

(default) and `version = "increasing"`

, with the implications described above.

To demonstrate the negative binomial scan statistic, we fit a negative binomial GLM to the data. It should be noted that this is purely for demonstrational purposes: the negative binomial distribution does not fit this data well at all. So if you happen to know of a dataset more suitable, please let me know!

suppressWarnings(mod_negbin <- MASS::glm.nb(count ~ offset(log(population)) + 1 + I(year - 1985), data = NM_popcas[year < 1986, ], link = log)) # Add the parameters as columnstab[, mu := predict(mod_negbin, tab, type = "response")]tab[, theta := mod_negbin$theta]

We can now use `scan_negbin`

to find the most likely cluster using the ordinary version of the scan statistic, again making 99 Monte Carlo replications to obtain a (p)-value:

set.seed(1)negbin_result1 <- scan_negbin(tab, zones, n_mcsim = 99, version = "ordinary")print(negbin_result1)#> Data distribution: negative binomial#> Type of scan statistic: expectation-based#> Number of locations considered: 32#> Maximum duration considered: 4#> Number of spatial zones: 415#> Number of Monte Carlo replicates: 99#> p-value of observed statistic: 0.07#> Most likely event duration: 4#> ID of locations in most likely cluster: 15, 26

The ordinary version of the negative binomial scan statistic evidently finds the same cluster as the Poisson scan statistic. The increasing version does not, however:

set.seed(1)negbin_result2 <- scan_negbin(tab, zones, n_mcsim = 99, version = "increasing")print(negbin_result2)#> Data distribution: negative binomial#> Type of scan statistic: expectation-based#> Number of locations considered: 32#> Maximum duration considered: 4#> Number of spatial zones: 415#> Number of Monte Carlo replicates: 99#> p-value of observed statistic: 0.04#> Most likely event duration: 2#> ID of locations in most likely cluster: 3

The cluster found here consists of Chaves county in the years 1988-1989, which was one of the secondary clusters found by Kulldorff et al. (1998). The results from the two negative binomial scans above could be further explored using the functions `score_locations`

and `top_clusters`

, just as for the Poisson scan statistic. Since these functions have already been demonstrated, we move on to a scan statistic for data with an excess of zeros.

For zero-inflated count data, the *scanstatistics* package provides the function `scan_zip`

, which an expectation-based scan statistic for zero-inflated Poisson (ZIP) data devised by Kjellson (2015). The ZIP distribution is parametrized by the expected value (\mu) of the Poisson component and the probability (p) of a structural zero.

The ZIP scan statistic makes a similar assumtion regarding outbreaks as the Poisson scan statistic does: an anomaly that occurs in a space-time window (W) will have the effect of increasing the Poisson expected value parameter of the counts in that window by a factor (q_W>1) in comparison to what was predicted. This factor (q_W) is the same for all locations in (W) and constant over the duration of the anomaly. For all windows (W) considered, (q_W) is estimated using the EM algorithm and a likelihood ratio statistic is computed. The scan statistic is the maximum of these statistics over all windows (W).

Similar to `scan_poisson`

, the first argument to `scan_zip`

should be a **data table** with columns 'location', 'duration', 'count', 'mu' and 'p'. The second and third arguments specify the spatial zones and the number of Monte Carlo replications respectively.

To demonstrate the ZIP scan statistic, we fit a zero-inflated Poisson regression model to the data. Just as for the negative binomial distribution, it should be noted that this is purely for demonstrational purposes: there are probably more suitable datasets out there, and if you happen to know of one, please let me know!

library(pscl, quietly = TRUE)mod_zip <- zeroinfl(count ~ offset(log(population)) + 1 + I(year - 1985), data = NM_popcas[year < 1986, ], dist = "poisson", link = "logit") # Add the parameters as columnstab[, mu := predict(mod_zip, tab, type = "count")]tab[, p := predict(mod_zip, tab, type = "zero")]

We can now use `scan_zip`

to find the most likely cluster, again making 99 Monte Carlo replications to obtain a (p)-value:

set.seed(1)zip_result <- scan_zip(tab, zones, n_mcsim = 99)print(zip_result)#> Data distribution: zero-inflated Poisson#> Type of scan statistic: Expectation-based#> Number of locations considered: 32#> Maximum duration considered: 4#> Number of spatial zones: 415#> Number of Monte Carlo replicates: 99#> p-value of observed statistic: 0.01#> Most likely event duration: 4#> ID of locations in most likely cluster: 15, 26

The zero-inflated Poisson statistic finds the same cluster as Kulldorff et al. (1998).

Kjellson, Benjamin. 2015. “Spatiotemporal Outbreak Detection: A Scan Statistic Based on the Zero-Inflated Poisson Distribution.” Master’s thesis, Sweden: Stockholm University, Division of Mathematical Statistics.

Kleinman, Ken. 2015. *Rsatscan: Tools, Classes, and Methods for Interfacing with SaTScan Stand-Alone Software*. https://CRAN.R-project.org/package=rsatscan.

Kulldorff, Martin, William F. Athas, Eric J. Feuer, Barry A. Miller, and Charles R. Key. 1998. “Evaluating Cluster Alarms: A Space-Time Scan Statistic and Brain Cancer in Los Alamos.” *American Journal of Public Health* 88 (9): 1377–80.

Neill, Daniel B., Andrew W. Moore, Maheshkumar Sabhnani, and Kenny Daniel. 2005. “Detection of Emerging Space-Time Clusters.” In *Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining*, 218–27. ACM.

Tango, Toshiro, Kunihiko Takahashi, and Kazuaki Kohriyama. 2011. “A Space-Time Scan Statistic for Detecting Emerging Outbreaks.” *Biometrics* 67 (1): 106–15.

[1] Expectation-based scan statistics use past non-anomalous data to estimate distribution parameters, and then compares observed cluster counts from the time period of interest to these estimates. In contrast, *population-based* scan statistics compare counts in a cluster tothose outside, only using data from the period of interest.