The 'anomalize' package enables a "tidy" workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it's quite simple to decompose time series, detect anomalies, and create bands separating the "normal" data from the anomalous data at scale (i.e. for multiple time series). The time_decompose() function removes trend and seasonal components using time series decomposition; the available methods are seasonal decomposition of time series by Loess ("stl") and seasonal decomposition by piecewise medians ("twitter"). The anomalize() function implements two methods for detecting anomalies in the residuals: the interquartile range ("iqr") and generalized extreme studentized deviation ("gesd"). These methods are based on those used in the 'forecast' package and the Twitter 'AnomalyDetection' package. Refer to the associated functions for specific references for these methods.
anomalize enables a tidy workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it’s quite simple to decompose time series, detect anomalies, and create bands separating the “normal” data from the anomalous data.
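To make the three steps concrete, here is a minimal base R sketch of the same decompose, detect, recompose idea on simulated data. It does not use anomalize itself: the simulated series, the call to stats::stl(), the variable names, and the 3x interquartile-range multiplier are all assumptions chosen for illustration, not the package's implementation.

```r
# Simulated daily series (assumption): weekly seasonality, linear trend, two spikes.
set.seed(42)
n <- 7 * 20
season <- rep(c(-2, 1, 2, 2, 1, 0, -4), length.out = n)
trend <- seq(10, 20, length.out = n)
observed <- trend + season + rnorm(n, sd = 0.5)
spike_idx <- c(40, 100)
observed[spike_idx] <- observed[spike_idx] + 15

# Step 1 (decompose): split the series into seasonal, trend, and remainder parts.
fit <- stl(ts(observed, frequency = 7), s.window = "periodic", robust = TRUE)
seas <- as.numeric(fit$time.series[, "seasonal"])
trnd <- as.numeric(fit$time.series[, "trend"])
rem <- as.numeric(fit$time.series[, "remainder"])

# Step 2 (detect): flag remainders far outside the interquartile range.
# The 3x multiplier is an illustrative choice.
q <- unname(quantile(rem, c(0.25, 0.75)))
limit_lower <- q[1] - 3 * (q[2] - q[1])
limit_upper <- q[2] + 3 * (q[2] - q[1])
anomaly <- ifelse(rem < limit_lower | rem > limit_upper, "Yes", "No")

# Step 3 (recompose): move the remainder limits back onto the observed scale.
lower_band <- seas + trnd + limit_lower
upper_band <- seas + trnd + limit_upper

# Positions flagged as anomalous
which(anomaly == "Yes")
```

Because the bands are just the remainder limits shifted by the seasonal and trend components, a point falls outside the bands exactly when its remainder falls outside the limits.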
You can install the development version with devtools or the most recent CRAN version with install.packages():

# devtools::install_github("business-science/anomalize")
install.packages("anomalize")
anomalize has three main functions:

time_decompose(): Separates the time series into seasonal, trend, and remainder components.
anomalize(): Applies anomaly detection methods to the remainder component.
time_recompose(): Calculates limits that separate the “normal” data from the anomalies!

Load the tidyverse and anomalize packages.

library(tidyverse)
library(anomalize)
Next, let’s get some data. anomalize ships with a data set called tidyverse_cran_downloads that contains the daily CRAN download counts for 15 “tidy” packages from 2017-01-01 to 2018-03-01.
tidyverse_cran_downloads %>%
    ggplot(aes(date, count)) +
    geom_point(color = "#2c3e50", alpha = 0.25) +
    facet_wrap(~ package, scales = "free_y", ncol = 3) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
    labs(title = "Tidyverse Package Daily Download Counts",
         subtitle = "Data from CRAN by way of cranlogs package")
Suppose we want to determine which daily download “counts” are anomalous. It’s as easy as using the three main functions (time_decompose(), anomalize(), and time_recompose()) along with a visualization function, plot_anomalies().
tidyverse_cran_downloads %>%
    # Data Manipulation / Anomaly Detection
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
    labs(title = "Tidyverse Anomalies", subtitle = "STL + IQR Methods")
If you’re familiar with Twitter’s AnomalyDetection package, you can implement that method by combining time_decompose(method = "twitter") with anomalize(method = "gesd"). Additionally, we’ll set trend = "2 months" to control the median spans, which is how Twitter’s decomposition method works.
# Get only lubridate downloads
lubridate_dloads <- tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>%
    ungroup()

# Anomalize!!
lubridate_dloads %>%
    # Twitter + GESD
    time_decompose(count, method = "twitter", trend = "2 months") %>%
    anomalize(remainder, method = "gesd") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "Twitter + GESD Methods")
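The idea of median spans can be pictured with a small base R sketch: cut the series into consecutive spans of a fixed length and use each span's median as a flat, outlier-resistant trend level. The helper piecewise_median() and its span argument are hypothetical names for illustration, and this is not the code that method = "twitter" runs.

```r
# Hypothetical helper: the median of each consecutive block of `span`
# observations, repeated so the result has the same length as `x`.
piecewise_median <- function(x, span) {
  block <- ceiling(seq_along(x) / span)  # block labels 1, 1, ..., 2, 2, ...
  ave(x, block, FUN = median)
}

x <- c(1, 2, 100, 4, 5, 6)               # 100 is an obvious outlier
trend <- piecewise_median(x, span = 3)
trend
#> [1] 2 2 2 5 5 5

# The spike barely moves the trend, so it survives in the detrended values.
x - trend
```

A mean over the first span would have been pulled up to about 34 by the spike; the median stays at 2, which is why a median-based trend leaves anomalies visible in the remainder.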
Last, we can compare to STL + IQR methods, which use different decomposition and anomaly detection approaches.
lubridate_dloads %>%
    # STL + IQR Anomaly Detection
    time_decompose(count, method = "stl", trend = "2 months") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "STL + IQR Methods")
There are several extra capabilities:

time_frequency() and time_trend() for generating frequency and trend spans using date and datetime information, which is more intuitive than selecting numeric values. Also, period = "auto" automatically selects frequency and trend spans based on the time scale of the data.

# Time Frequency
time_frequency(lubridate_dloads, period = "auto")
#> frequency = 7 days
#> [1] 7

# Time Trend
time_trend(lubridate_dloads, period = "auto")
#> trend = 91 days
#> [1] 91
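One way to picture the translation from a time-based span to a number of observations is to divide the span by the typical spacing between dates. The helper span_to_obs() below is hypothetical and only illustrates that arithmetic; it says nothing about how period = "auto" picks the span in the first place.

```r
# Hypothetical helper: how many observations fit in a span of `span_days` days,
# given the typical (median) spacing between consecutive dates.
span_to_obs <- function(dates, span_days) {
  step_days <- median(as.numeric(diff(dates), units = "days"))
  round(span_days / step_days)
}

# Daily dates covering the same range as tidyverse_cran_downloads
daily <- seq(as.Date("2017-01-01"), as.Date("2018-03-01"), by = "day")
span_to_obs(daily, 7)    # a 7-day span covers 7 daily observations
span_to_obs(daily, 91)   # a 91-day span covers 91 daily observations

# The same 91 days cover fewer observations when the data are weekly
weekly <- seq(as.Date("2017-01-01"), by = "week", length.out = 20)
span_to_obs(weekly, 91)
```

Expressing spans in time units means the same setting carries over when the sampling interval of the data changes.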
plot_anomaly_decomposition() for visualizing the inner workings of how the algorithm detects anomalies in the “remainder”.

tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>%
    ungroup() %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    plot_anomaly_decomposition() +
    labs(title = "Decomposition of Anomalized Lubridate Downloads")
Vector functions for anomaly detection: iqr() and gesd(). These are great for use on plain numeric data. Note that trend and seasonality should already be removed for time series data.

# Data with outliers
set.seed(100)
x <- rnorm(100)
idx_outliers <- sample(100, size = 5)
x[idx_outliers] <- x[idx_outliers] + 10

# IQR method
iqr(x, alpha = 0.05, max_anoms = 0.2)
#>   [1] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [12] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [23] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"
#>  [34] "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [45] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [56] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [67] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [78] "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [89] "No"  "Yes" "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"
#> [100] "No"
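For comparison with the IQR output above, here is a sketch of the textbook generalized ESD test (Rosner's procedure) in base R. The function name gesd_sketch() is hypothetical, and it uses the classic mean and standard deviation; the package's own gesd() may differ in its details, so treat this as an illustration of the technique rather than a reimplementation.

```r
# Hypothetical sketch of the textbook generalized ESD test.
# Returns "Yes"/"No" flags, one per element of `x`.
gesd_sketch <- function(x, alpha = 0.05, max_anoms = 0.2) {
  n <- length(x)
  k <- floor(max_anoms * n)            # upper bound on the number of outliers
  idx <- seq_len(n)                    # positions still in the sample
  removed <- integer(k)
  R <- numeric(k)                      # test statistics
  lambda <- numeric(k)                 # critical values
  for (i in seq_len(k)) {
    m <- length(idx)                   # sample size before this removal
    dev <- abs(x[idx] - mean(x[idx]))
    j <- which.max(dev)
    R[i] <- dev[j] / sd(x[idx])
    removed[i] <- idx[j]
    idx <- idx[-j]
    # Critical value from the t distribution for a sample of size m
    t_crit <- qt(1 - alpha / (2 * m), df = m - 2)
    lambda[i] <- (m - 1) * t_crit / sqrt((m - 2 + t_crit^2) * m)
  }
  # The number of outliers is the largest i with R[i] > lambda[i]
  n_out <- max(c(0, which(R > lambda)))
  flags <- rep("No", n)
  flags[removed[seq_len(n_out)]] <- "Yes"
  flags
}

# Same construction as the data with outliers above
set.seed(100)
x <- rnorm(100)
idx_outliers <- sample(100, size = 5)
x[idx_outliers] <- x[idx_outliers] + 10

flags <- gesd_sketch(x, alpha = 0.05, max_anoms = 0.2)
which(flags == "Yes")
```

Unlike a fixed IQR fence, this procedure removes the most extreme point one at a time and re-tests the remaining sample, which helps when several outliers would otherwise mask one another.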
By setting verbose = TRUE, we can return a nice report of useful information related to the outliers.

lubridate_dloads %>%
    time_decompose(count) %>%
    anomalize(remainder, verbose = TRUE)
#> $anomalized_tbl
#> # A time tibble: 425 x 8
#> # Index: date
#>    date       observed season trend remainder remainder_l1 remainder_l2
#>    <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
#>  1 2017-01-01     643. -2078. 2474.      246.       -3323.        3310.
#>  2 2017-01-02    1350.   518. 2491.    -1659.       -3323.        3310.
#>  3 2017-01-03    2940.  1117. 2508.     -685.       -3323.        3310.
#>  4 2017-01-04    4269.  1220. 2524.      525.       -3323.        3310.
#>  5 2017-01-05    3724.   865. 2541.      318.       -3323.        3310.
#>  6 2017-01-06    2326.   356. 2558.     -588.       -3323.        3310.
#>  7 2017-01-07    1107. -1998. 2574.      531.       -3323.        3310.
#>  8 2017-01-08    1058. -2078. 2591.      545.       -3323.        3310.
#>  9 2017-01-09    2494.   518. 2608.     -632.       -3323.        3310.
#> 10 2017-01-10    3237.  1117. 2624.     -504.       -3323.        3310.
#> # ... with 415 more rows, and 1 more variable: anomaly <chr>
#>
#> $anomaly_details
#> $anomaly_details$outlier
#>   [1] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [12] "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [23] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [34] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [45] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [56] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [67] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [78] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#>  [89] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [100] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No"
#> [111] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [122] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [133] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [144] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [155] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [166] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [177] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [188] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [199] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [210] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [221] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [232] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [243] "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No"
#> [254] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [265] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [276] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [287] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [298] "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"
#> [309] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "No"
#> [320] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [331] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "No"  "No"
#> [342] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [353] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"
#> [364] "No"  "No"  "Yes" "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"
#> [375] "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [386] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [397] "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "Yes" "Yes" "No"
#> [408] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"
#> [419] "Yes" "Yes" "No"  "No"  "No"  "No"  "No"
#>
#> $anomaly_details$outlier_idx
#>  [1] 419 405 370 420 303 406 250 366 318 244 338 317  12 378 339 404 403
#> [18] 361 109
#>
#> $anomaly_details$outlier_vals
#>  [1] -8518.886 -7779.522 -6293.275 -6218.430  5557.429 -5477.838  4619.824
#>  [8] -4553.173  4240.767 -4136.721  3804.789  3626.129 -3522.194  3494.339
#> [15]  3486.598  3477.376  3385.065 -3381.355  3347.284
#>
#> $anomaly_details$outlier_direction
#>  [1] "Down" "Down" "Down" "Down" "Up"   "Down" "Up"   "Down" "Up"   "Down"
#> [11] "Up"   "Up"   "Down" "Up"   "Up"   "Up"   "Up"   "Down" "Up"
#>
#> $anomaly_details$critical_limits
#> limit_lower limit_upper
#>   -3323.425    3310.268
#>
#> $anomaly_details$outlier_report
#> # A tibble: 85 x 7
#>     rank index  value limit_lower limit_upper outlier direction
#>    <dbl> <dbl>  <dbl>       <dbl>       <dbl> <chr>   <chr>
#>  1    1.  419. -8519.      -3323.       3310. Yes     Down
#>  2    2.  405. -7780.      -3323.       3310. Yes     Down
#>  3    3.  370. -6293.      -3323.       3310. Yes     Down
#>  4    4.  420. -6218.      -3323.       3310. Yes     Down
#>  5    5.  303.  5557.      -3323.       3310. Yes     Up
#>  6    6.  406. -5478.      -3323.       3310. Yes     Down
#>  7    7.  250.  4620.      -3323.       3310. Yes     Up
#>  8    8.  366. -4553.      -3323.       3310. Yes     Down
#>  9    9.  318.  4241.      -3323.       3310. Yes     Up
#> 10   10.  244. -4137.      -3323.       3310. Yes     Down
#> # ... with 75 more rows
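The shape of the outlier report can be imitated on a plain numeric vector. The sketch below ranks values by their distance from the midpoint of the two limits and labels each one; outlier_report_sketch() is a hypothetical helper, the ranking rule and the NA direction for non-outliers are assumptions, and the input values are made up for illustration.

```r
# Hypothetical helper: rank values by distance from the midpoint of the limits
# and label each as an outlier ("Yes"/"No") with a direction ("Up"/"Down").
outlier_report_sketch <- function(x, limit_lower, limit_upper) {
  centre <- (limit_lower + limit_upper) / 2
  ord <- order(abs(x - centre), decreasing = TRUE)
  value <- x[ord]
  is_out <- value < limit_lower | value > limit_upper
  data.frame(
    rank = seq_along(ord),
    index = ord,
    value = value,
    limit_lower = limit_lower,
    limit_upper = limit_upper,
    outlier = ifelse(is_out, "Yes", "No"),
    # Direction is left as NA for values inside the limits (assumption)
    direction = ifelse(!is_out, NA_character_,
                       ifelse(value > limit_upper, "Up", "Down")),
    stringsAsFactors = FALSE
  )
}

# Made-up remainders with one low and one high outlier
remainder <- c(246, -1659, -8519, 525, 5557, -588)
report <- outlier_report_sketch(remainder, limit_lower = -3323, limit_upper = 3310)

# Keep only the rows flagged as outliers
subset(report, outlier == "Yes")
```

Sorting by extremity puts the most suspicious points first, which matches how the report above lists its largest deviations at the top.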
Several other packages were instrumental in developing anomaly detection methods used in anomalize:

Twitter’s AnomalyDetection, which implements decomposition using median spans and the Generalized Extreme Studentized Deviation (GESD) test for anomalies.
forecast::tsoutliers() function, which implements the IQR method.

Fixed ggplot2 issues in plot_anomalies(). Solves "Error in FUN(X[[i]], ...) : object '.group' not found".
Fixed plot_anomaly_decomposition(). Solves "Error in -x : invalid argument to unary operator".
Added a NEWS.md file to track changes to the package.