This summer I had the opportunity of working as an intern for RStudio.
Among the available projects/mentors, I was fortunate to collaborate
with Max Kuhn developing the applicable
modeling R package now
available here. The package
comprises of a collection of applicability domain models to measure the
reliability of a given model’s prediction. These models are borrowed
from Chemistry, where they are widely used to measure the reliability of
QSARs models.
Applicability Domain Methods
Applicability domain (AD) models are unsupervised approaches to characterize the nature of a prediction’s extrapolation. They are based on the premise that a model’s prediction is reliable if the sample from which a prediction is made is similar to the training set. The following are examples of AD methods:
-
Simple distance measures can be used, e.g., the leverage or hat value of a data point.
-
Similarity measurements, which are preferred when the predictor space consists of all binary predictors. A common metric is the Jaccard index.
Let us dive into some of these methods through the applicable
package.
The applicable
package
The applicable package contains models to analyze binary and continuous data. The Jaccard index – a similarity statistics – is a specialized indicator when working with binary data. On the other hand, for continuous data, both principal component analysis and hat values are employed.
library(applicable)
Binary data: the Jaccard index
Similarity statistics can be used to compare data sets where all of the
predictors are binary. One of the most common measures is the Jaccard
index. Given sets A
and B
, the Jaccard similarity between A and B is
defined as
$$J(A, B) = \frac{\|A \cap B\|}{\|A \cup B\|}$$
For a training set of size n
, there are n
similarity statistics for
each new sample. These can be summarized via the mean statistic or a
quantile. In general, we want similarity to be low within the training
set (i.e., a diverse training set) and high for new samples to be
predicted.
To analyze the Jaccard metric, applicable
provides the following
methods:
-
apd_similarity
: analyzes samples in terms of similarity scores. For a training set of n samples, a new sample is compared to each, resulting in n similarity scores. These can be summarized into the median similarity. -
autoplot
: shows the cumulative probability versus the unique similarity values in the training set. -
score
: scores new samples using similarity methods. In particular, it calculates the similarity scores and ifadd_percentile = TRUE
, it also estimates the percentile of the similarity scores.
The example data is from two QSAR data sets where binary fingerprints are used as predictors.
data(qsar_binary)
Let us construct the model:
jacc_sim <- apd_similarity(binary_tr)
jacc_sim
## Applicability domain via similarity
## Reference data were 67 variables collected on 4330 data points.
## New data summarized using the mean.
As we can see below, this is a fairly diverse training set:
library(ggplot2)
# Plot the empirical cumulative distribution function for the training set
autoplot(jacc_sim)
We can compare the similarity between new samples and the training set:
# Summarize across all training set similarities
mean_sim <- score(jacc_sim, new_data = binary_unk)
mean_sim
## # A tibble: 5 x 2
## similarity similarity_pctl
## <dbl> <dbl>
## 1 0.376 49.8
## 2 0.284 13.5
## 3 0.218 6.46
## 4 0.452 100
## 5 0.0971 5.59
Samples 3 and 5 are definitely extrapolations based on these predictors. In other words, the new samples are not similar to the training set and so predictions on them may not be very reliable.
library(applicable)
Continuous data: PCA
When working with continuous data, applicable
provides the following
methods to analyze the applicability domain of your model:
- Principal component analysis
- Hat values statistics
To see applicable
in action, let us look at the principal component
analysis of the Ames IA housing data.
library(AmesHousing)
ames <- make_ames()
There are 2,930 properties in the data.
The Sale Price was recorded along with 81 predictors, including:
- Location (e.g., neighborhood) and lot information.
- House components (garage, fireplace, pool, porch, etc.).
- General assessments such as overall quality and condition.
- Number of bedrooms, baths, and so on.
More details can be found in De Cock (2011, Journal of Statistics Education).
The raw data are at http://bit.ly/2whgsQM but we will use a processed version
found in the AmesHousing
package.
applicable
also contains an update for these data for three new properties
(although fewer fields were collected on these).
To pre-process the training set, we will use the recipes package. We first tell the recipes that there is an additional value for the neighborhood in these data, then direct it to create dummy variables for all categorical predictors. In cases where there are no levels observed for a factor, we eliminate predictors with a single unique value, then estimate a transformation that will make the predictor distributions more symmetric. After these, the data are centered and scaled. These same transformations will be applied to the new data points using the statistics estimated from the training set.
library(recipes)
library(dplyr)
ames_cols <- names(ames_new)
ames_new$Neighborhood <- as.factor(ames_new$Neighborhood)
training_data <-
ames %>%
# For consistency, only analyze the data on new properties
dplyr::select(one_of(ames_cols)) %>%
mutate(
# There is a new neighborhood in ames_new
Neighborhood = as.character(Neighborhood),
Neighborhood = factor(Neighborhood, levels = levels(ames_new$Neighborhood))
)
training_recipe <-
recipe( ~ ., data = training_data) %>%
step_dummy(all_nominal()) %>%
# Remove variables that have the same value for every data point.
step_zv(all_predictors()) %>%
# Transform variables to be distributed as Gaussian-like as possible.
step_YeoJohnson(all_numeric()) %>%
# Normalize numeric data to have a mean of zero and
# standard deviation of one.
step_normalize(all_numeric())
The following functions in applicable
are used for principal component
analysis:
apd_pca
: computes the principal components that account for up to either 95% or the providedthreshold
of variability. It also computes the percentiles of the principal components and the mean of each principal component.autoplot
: plots the distribution function for pcas. You can also provide an optional set ofdplyr
selectors, such asdplyr::matches()
ordplyr::starts_with()
, for selecting which variables should be shown in the plot.score
: calculates the principal components of the new data and their percentiles as compared to the training data. The number of principal components computed depends on thethreshold
given at fit time. It also computes the multivariate distance between each principal component and its mean.
To see how these functions work, we will borrow the outputs of the shiny application we are currently developing for the package.
Let us apply apd_pca
modeling function to our data:
ames_pca <- apd_pca(training_recipe, training_data)
Since no threshold
was provided, the function computed the number of
principal components that accounted for at most 95% of the total
variance. Plotting the distribution function for the PCA scores is also
helpful. In the shiny
application, we can set the threshold as well as the subset of PCAs
displayed.
For illustration, setting threshold = 0.30
or 30%, we now need only 13
principal components:
We can also update the number of PCAs displayed:
Finally, the score
function compares the training data to new samples.
This is the output of pca_score <- score(ames_pca, ames_new)
:
Conclusion
The project was ambitious and complex in nature, which made it that much
more intereting. I learned a lot from Max and had an amazing experience
working for RStudio. As I mentioned before, an application using the
shiny
package is currently in development. You can take a pick at it
here.