Ultra-high-throughput screening workflow
A sorely-needed example of how concepts in drug discovery screening are put together at scale
The Setup
High-throughput screening (HTS) is a staple of any drug discovery company these days. The process uses robotics, liquid-handling devices, and sensitive instruments to conduct millions of chemical, pharmacological, or genetic tests simultaneously. Variants of this process are used for identifying active compounds, antibodies, or relevant genes in a pathway. There's enormous variety in the mechanical setup for HTS, but many of the statistical and analytical techniques are remarkably consistent. Despite this, there are very few readily-accessible tutorials on how to analyze high-throughput screening data. Making this even more complex is the existence of ultra-high-throughput screening (uHTS), which refers (circa 2008) to screening in excess of 100,000 compounds per day.
Suppose we already have a method for assay plate preparation. Suppose we've also procured the best instruments for observing whatever reaction we want to see. Suppose we've also properly automated this whole sample-handling pipeline and we have our data. What then?
This tutorial will guide you through some of the thought processes that go into assay quality control, hit selection, and predictive model creation within an HTS or uHTS pipeline.
Our data
What we have is a synthetic dataset that mimics values that we could expect an HTS platform to generate. Our tabular dataset contains the following columns:
FOV
 The field of view. Our plate scanner captures our HTS plate in a right-to-left, left-to-right serpentine fashion. In total, we can expect our plate reader to capture 792 FOVs.
Bead_ID
 Values range from 1 to 40,000 (there are 40,000 beads per FOV in a 200 × 200 layout), numbered in row-major order.
BB1
 Building Block #1. Indicates the first component of a molecule attached to one of our beads. We have 8 options, 6 of which are also included as options for BB2.
BB2
 Building Block #2. Indicates the second component of a molecule attached to one of our beads. We have 474 options, 6 of which are also included as options for BB1.
BB3
 Building Block #3. Indicates the third component of a molecule attached to one of our beads. We have 786 options, all of which are unique to the BB3 position.
BL
 Baseline fluorescence intensity. No protein is bound to the beads during this measurement.
81nM
 81-nanomolar fluorescence intensity. Our protein of interest is at a concentration of 81 nM. That's 81 nmol/L, or about $4.8779 \times 10^{16}$ proteins per liter of solvent.
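As a quick sanity check on that last number, the conversion from molar concentration to molecules per liter is just Avogadro's number times the concentration in mol/L:

```python
AVOGADRO = 6.02214076e23  # molecules per mole (the exact 2019 SI value)

conc_nM = 81
molecules_per_liter = conc_nM * 1e-9 * AVOGADRO  # nM -> mol/L -> molecules/L
print(f"{molecules_per_liter:.4e}")  # 4.8779e+16
```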
Our data contains 18,196,930 beads in total. If every bead position on our plate were filled we would have 31,680,000 beads (40,000 beads per FOV times 792 FOVs), but for a variety of reasons we might not use every single bead position on the chip. After all, it's better to make sure all the beads are embedded correctly than to try to fill the entire chip. As such, only about 57.4% of the available bead positions are being used. With our FOVs forming a $22 \times 36$ matrix, each with $200 \times 200$ positions, the full chip size is $4400 \times 7200$, bounded by corner coordinates [0,0], [4399,0], [0,7199], and [4399,7199].
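The bookkeeping from (FOV, Bead_ID) to global chip coordinates can be sketched as follows. The exact serpentine FOV numbering here is an assumption (even FOV rows left-to-right, odd rows right-to-left); adjust it to match your scanner:

```python
FOV_ROWS, FOV_COLS = 22, 36      # FOVs form a 22 x 36 matrix
BEAD_ROWS, BEAD_COLS = 200, 200  # 200 x 200 bead positions per FOV

def chip_coordinates(fov: int, bead_id: int) -> tuple:
    """Map (fov, bead_id) to global (row, col) chip coordinates.

    Assumes FOVs are numbered 0..791 in a serpentine scan over the
    22 x 36 grid, and bead_id is 1-based and row-major within each FOV.
    """
    fov_row, offset = divmod(fov, FOV_COLS)
    # Even FOV rows scan left-to-right; odd rows reverse direction.
    fov_col = offset if fov_row % 2 == 0 else FOV_COLS - 1 - offset
    bead_row, bead_col = divmod(bead_id - 1, BEAD_COLS)
    return fov_row * BEAD_ROWS + bead_row, fov_col * BEAD_COLS + bead_col

# The chip geometry checks out against the numbers above:
assert 40_000 * FOV_ROWS * FOV_COLS == 31_680_000
assert round(18_196_930 / 31_680_000, 3) == 0.574
assert chip_coordinates(0, 1) == (0, 0)        # first bead, first FOV
assert chip_coordinates(756, 40_000) == (4399, 7199)  # opposite corner
```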
For the dataset in question, we have this stored in a 1.24 GB HDF5 file (CSVs are for chumps).
Preprocessing
As sterile and controlled as any good HTS lab would be, we still need to think about preprocessing. Before we can even think of hit detection or building ML models on our data, we want to apply some kind of background correction. Luckily for us, we can use the baselines and/or nearest neighbors within our BL and 81nM columns to perform baseline/background corrections. Most papers and textbooks will reaffirm that chips with too much noise are better off discarded rather than contaminating our results. That being said, there is plenty we can do in the way of corrections before we even need to think of discarding the plates our poor robots worked so hard on.
Background subtraction can be simple or complex. We could easily dive into the literature on image processing or HTS-subtype-specific corrections. However, we can also build a relatively simple model from which background gradients can be subtracted. We can build a k-nearest neighbors regressor that takes in 4 variables: row, column, row * column, and fov. If we wanted to get exotic with our nearest neighbors algorithm we could try a radius neighbors regressor, but this technique has its drawbacks: radius neighbors has a far more variable and uncertain memory demand and runtime than its much more popular and reliable cousin, k-nearest neighbors.
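A minimal sketch of that kNN background model with scikit-learn, on synthetic data. The synthetic gradient, noise level, and `n_neighbors=50` are placeholder choices; in practice the features would come from the bead coordinates and the target from the BL or 81nM column:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for one intensity column: a smooth
# position-dependent background gradient plus noise.
n = 5_000
row = rng.integers(0, 4400, n)
col = rng.integers(0, 7200, n)
fov = (row // 200) * 36 + (col // 200)  # simplified FOV index
intensity = 0.002 * row + 0.001 * col + rng.normal(0.0, 1.0, n)

# The four features described above: row, column, row * column, and fov.
# Scaling matters, since row * column dwarfs the other features.
X = np.column_stack([row, col, row * col, fov]).astype(float)
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=50))
knn.fit(X, intensity)

# Subtract the fitted gradient to get background-corrected intensities.
corrected = intensity - knn.predict(X)
print(intensity.std(), corrected.std())  # spread shrinks after correction
```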
Some versions of background subtraction may involve substituting missing values with the median of the plate or FOV. This might be okay for chips that are nearly complete, but since our chip is only at ~57% capacity this would be a bad idea.
Once we calculate this position-dependent gradient, it's just a simple matter of subtracting it from the data. A correlation plot will reveal whether the background correction has been done right.
Quality Control and Hit Detection
Now that our data is cleaned up, we want to use it to identify the next blockbuster drug (or, failing that, promising hits for the next stage of drug discovery).
While our data is cleaner than what we started with, we still want to apply anomaly detection methods to remove noise and unphysical characteristics prior to hit detection.
We have a variety of statistical methods we can use to confidently classify hits. Which methods we use (and even which formulations of those methods) depends on the number of replicates, the number of controls, and whether experimental samples and controls are paired.
For assay quality control, we can assess the quality of paired tests with metrics such as the Z′-factor and the SSMD.
For hit detection, hits can be identified based on their z-scores/t-statistics, Z-factors, and strictly standardized mean differences (SSMDs).
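As an illustration, here are textbook formulations of the robust z-score and the SSMD for two unpaired groups (the method-of-moments estimate; the exact estimator should follow the recommendations for your replicate/control design). The synthetic "binder" and negative-control values are made up for the demo:

```python
import numpy as np

def ssmd(sample, control):
    """Strictly standardized mean difference for two unpaired groups
    (method-of-moments estimate)."""
    sample = np.asarray(sample, dtype=float)
    control = np.asarray(control, dtype=float)
    return (sample.mean() - control.mean()) / np.sqrt(
        sample.var(ddof=1) + control.var(ddof=1))

def robust_z(values):
    """Robust z-scores using the median and the MAD (scaled by 1.4826
    so the MAD estimates sigma for normally distributed data)."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = 1.4826 * np.median(np.abs(values - med))
    return (values - med) / mad

rng = np.random.default_rng(1)
negatives = rng.normal(0.0, 1.0, 200)  # negative-control beads
binder = rng.normal(5.0, 1.0, 20)      # a strongly binding compound

print(ssmd(binder, negatives))  # a large positive SSMD suggests a hit
```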
Some of our top hits include the following compounds:
| Rank | BB1    | BB2     | BB3    |
|------|--------|---------|--------|
| 1    | CP_981 | CP_262  | CP_80  |
| 2    | CP_763 | CP_261  | CP_194 |
| 3    | CP_781 | CP_31   | CP_194 |
| 4    | CP_763 | CP_1218 | CP_194 |
| 5    | CP_692 | CP_116  | CP_80  |
Since our approach to library-building is combinatorial, we might be interested in the prominent BB1s, BB2s, and BB3s as well. While we're at it, what about the building blocks in general? Which building block seems to be most responsible for binding? Answering this is complicated by the fact that there are usually multiple copies of the same molecule on a given chip. We could approach this by calculating the median absolute deviations from 0 for the coefficients of a simple one-hot-encoded linear regression model, with the predicted output being the SSMD.
For outlier detection, the plan was originally to test methods such as Isolation Forest, Mahalanobis distance, untrained autoencoders, and Kolmogorov-Smirnov tests. However, many of these would be less useful if we already knew the background subtraction had not been done as fully as possible.
Building a Regression Model from Our Hits
Screening for hits can be a time-consuming and expensive process, partly because the search space is truly enormous (on the order of $10^{20}$ possible molecules). Machine learning seems like a compelling antidote to this. After all, if we have a model that can predict binding activity just from looking at a molecule's shape, navigating molecule-space becomes much easier.
Dataset setup
Because of the skewed distribution (log2 values of fluorescence, not just the raw values), it seemed prudent to apply a power transform. This would take the distribution from something like a Poisson or log-normal distribution and make it resemble a less-skewed normal distribution. Out of the options for the power transform (Box-Cox, Yeo-Johnson, and quantile transformation) I went with Yeo-Johnson. Box-Cox is usually chosen only if one is absolutely certain that all values are positive.
Ideally we'd like to work with all 18M+ data points, but if that's not possible we could use a representative subset of the data and perform a random 80/20 split. (Whatever we do, we must make sure to set the random seed/state for the split, and for the regressors if applicable.)
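A sketch of the transform-and-split step with scikit-learn; the synthetic log-normal values stand in for the real fluorescence column:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(3)
# Skewed stand-in for the fluorescence values (log-normal-ish).
y = rng.lognormal(mean=2.0, sigma=0.8, size=10_000).reshape(-1, 1)

# Yeo-Johnson also handles zero/negative values; Box-Cox would require y > 0.
pt = PowerTransformer(method="yeo-johnson")  # standardizes by default
y_t = pt.fit_transform(y)
print(float(y_t.mean()))  # ~0 after the built-in standardization

# Representative subset + fixed seed so the split is reproducible.
idx = np.arange(len(y_t))
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=42)
print(len(train_idx), len(test_idx))  # 8000 2000
```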
Architecture Choice
Let’s take the data from above and build a regressor model based on the BB1, BB2, and BB3 positions. We could pick from countless approaches to architectures, but let’s keep things simple and choose from one of the following:
- ridge regression
- random forest regression
- support vector regression
- kernel ridge regression
- XGBoost regression
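For reference, here's how those candidates line up behind scikit-learn's shared estimator API (XGBoost lives in its own package), with a quick cross-validated comparison on synthetic data; the scores mean nothing outside this toy problem:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
# XGBoost is a separate dependency: from xgboost import XGBRegressor

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "support vector": SVR(kernel="rbf", C=1.0),
    "kernel ridge": KernelRidge(kernel="rbf", alpha=1.0),
}

# Toy regression problem standing in for the real BB1/BB2/BB3 features.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
scores = {name: cross_val_score(m, X, y, cv=3, scoring="r2").mean()
          for name, m in candidates.items()}
for name, r2 in scores.items():
    print(f"{name:>15}: R^2 = {r2:.3f}")
```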
Hyperparameter tuning
For hyperparameter tuning we can apply Tree-structured Parzen Estimator (TPE) optimization to the problem.
References
The overwhelming majority of my guidance for these 3 sections came from the book Optimal High-Throughput Screening: Practical Experimental Design and Data Analysis for Genome-Scale RNAi Research by Xiaohua Douglas Zhang (which actually has a lot to say about both RNAi and small molecule screens).
Part 1:

Background correction WITHOUT imputation of missing values:
- kNN regression (scikit-learn implementation)
  - Given how buggy many of the Python implementations of this are compared to the relevant R packages, kNN took their place. kNN in sklearn also has better support for smoothing in 2D and 3D than the StatsModels LOESS implementation.

Median Polishing

Background correction WITH imputation of missing values:

Nearest neighbors correction method
- sklearn k-nearest neighbors for images and regression
- kNN smoothing for single-cell RNA-Seq data
- sklearn radius nearest neighbors regression (while theoretically promising, this proved impractical due to its highly variable and often enormous runtime complexity for relatively small parameter values)

Tukey's median polish algorithm

Rolling ball background subtraction
- Tools in areas like scikit-image for background correction (e.g., this rolling ball approach)

For additional intensity-column-specific corrections, there are a few approaches that I explored:

Domain-agnostic image processing techniques:
- Nearest neighbors: nearest neighbors correction is often used for baseline corrections. This OpenCV example shows how kNN background subtraction is done in video.
- scikit-image implementation of regional maxima filtering

Techniques from two-color microarray background correction:
- A comparison of background correction methods for two-colour microarrays: as the name suggests, this covers two-color microarray background correction. However, there are still some tools we can try out.
- This method goes into further detail about maximum likelihood estimation (again for two-color arrays).
- The drLumi R package (PLoS paper) goes into detail about corrections for bead-based immunoassays. This is closer to what we want. While there's no immediately available Python package, we can look to the algorithms themselves.
Part 2:
The following is a list of papers and resources I used that are NOT Optimal High-Throughput Screening: Practical Experimental Design and Data Analysis for Genome-Scale RNAi Research (Xiaohua Douglas Zhang), the main resource I turned to.
- Statistical Methods for Analysis of High-Throughput RNA Interference Screens (an inspiration for many of the robust statistics formulations)
- High throughput screening
- Illustration of SSMD, z Score, SSMD*, z* Score, and t Statistic for Hit Selection in RNAi High-Throughput Screens. This was one of the earlier sources I tried out for alternative calculations of z-score and SSMD. Confirmed its usefulness after receiving an email from Nabre recommending the paper.
- Detecting and overcoming systematic bias in high-throughput screening technologies: a comprehensive review of practical issues and methodological solutions (for further details on the B-score than were available in the textbook)
- Robust statistical methods for hit selection in RNA interference high-throughput screening experiments
- Adaptation of robust Z′ factor for assay quality assessment in microelectrode array based screening using adult dorsal root ganglion neurons
- Novel Analytic Criteria and Effective Plate Designs for Quality Control in Genome-Scale RNAi Screens
- scikit-learn: robust vs. empirical covariance estimate
Part 3:

- Box-Cox, Yeo-Johnson, and quantile transformation, all options on the table for scaling the output log2 of the fluorescence in the model training (this goes beyond the basic one-hot encoding).
  - A source quote I refer to in my explanation about the importance of output target scaling even in non-neural-network models.

- For proof of random search's superiority to grid search, Bergstra and Bengio's Random Search for Hyper-Parameter Optimization.
  - A blog post pointing out that random search has a 95% probability of finding a combination of parameters within the top 5% optima with only 60 iterations. Also, compared to other methods, it doesn't bog down in local optima.

- For the main TPE implementation, I used the Hyperopt package.
  - Normally I would have done much more extensive testing on the other model types allowed (ridge, random forest, support vector regression, kernel ridge regression), but due to RAM constraints and dependency conflicts with scikit-optimize and hyperopt-sklearn, there wasn't enough time to test those.
Cited as:
@article{mcateer2021htsw,
title = "Ultra High Throughput Screening Workflow",
author = "McAteer, Matthew",
journal = "matthewmcateer.me",
year = "2021",
url = "https://matthewmcateer.me/blog/uhtsworkflow/"
}
If you notice mistakes and errors in this post, don't hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.
See you in the next post 😄