Ultra-High-throughput screening workflow

A sorely-needed example of how concepts in drug discovery screening are put together at scale

The Setup

High throughput screening (HTS) is a staple of any drug discovery company these days. This is a process that involves using robotics, liquid-handling devices, and sensitive instruments to conduct millions of chemical, pharmacological, or genetic tests simultaneously. Variants of this process are used for identifying active compounds, antibodies, or relevant genes in a pathway. Theres’s enormous variety in the mechanical setup for HTS, but many of the statistical and analytical techniques are remarkably consistent. Despite this, there are very few readily-accessible tutorials on how to do analyze high-throughput screening data. Making this even more complex is the existence of ultra-high-throughput-screening (uHTS). This refers (circa 2008) to screening in excess of 100,000 compounds per day.

Supose we already have a method for assay plate preparation. Suppose we’ve also procured the best instruments for observing whatever reaction we want to see. Suppose we’ve also properly automated this whole sample-handling pipeline and we have our data. What then?

This tutorial will guide you through some of the thought processes that go into assay quality control, hit selection, and predictive model creation within a HTS or uHTS pipeline.

You can follow along with this tutorial in this Colab notebook.

Our data

What we have is a synthetic dataset that mimics values that we could expect an HTS platform to generate. Our tabular dataset cointains the following columns:

  1. FOV - The field of view. Our plate-scanner captures our HTS plate in a right-to-left, left-to-right serpentine fashion. In total, we could expect our plate-reader to capture 792 FOVs in total.
  2. Bead_ID - The Bead_ID column values range from 1 to 40,000 (there are 40,000 beads per FOV in a 200 x 200 layout), and the numbering is in row-major order.
  3. BB1 - Building Block #1. Indicates the first component of a moleule attached to one of our beads. We have 8 options, 6 of which are included as options for BB2.
  4. BB2 - Building Block #2. Indicates the second component of a moleule attached to one of our beads. We have 474 options, 6 of which are included as options for BB1.
  5. BB3 - Building Block #3. Indicates the third component of a moleule attached to one of our beads. We have 786 options, all of which are unique to the BB3 position.
  6. BL - Baseline fluorescence intensity. No protein is bound to beads during this measurement.
  7. 81nM - 81-nanomolar fluorescence intensity. Our protein of interest is at a concentration of 81nM. Thats 81 nmol/L, or about 4.8779×10164.8779 \times 10^{16} proteins per liter of solvent.

Our data contains 18,196,930 beads in total. If every bead position on our plate were filled we would have 31,680,000 beads (40,000 beads per FOV times 792 FOVs), but for a variety of reasons we might not use every single bead position on the chip. After all, it’s better to make sure all the beads are embedded correctly rather than trying to fill the entire chip. As such, only about 57.4% of the full wells are being used. With our FOVs forming a 22×3622 \times 36 matrix, with 200×200200 \times 200 positions for each, the full chip size would be 4400×72004400 \times 7200, bounded by corner corrdinates at [0,0], [4399,0], [0,7199], and [4399,7199].

For our data in question, we would have this stored in a 1.24GB HDF5 file (CSVs are for chumps).


As sterile and controlled as any good HTS lab would be, we still need to think about pre-processing. Before we can even think of hit-detection or building ML models based on our data, we would want to do some kind of background corrections on our data. Luckily for us, we can incorporate the baselines and/or nearest-neighbors within our BL and 81nM columns to perform baseline/background corrections. Most papers and textbooks will reaffirm that chips with too much noise are better off being discarded rather than contaminating our results. That being said, there is plenty we can do when it comes to corrections before we even need to think of discarding the plates our poor robots worked so hard on.

Background subtraction can be simple or complex. We could easily dive into the literature on image-processing or HTS-subtype-specific corrections. However, we can also build a relatively simple linear model from which background gradients can be subtracted. We can build our k-nearest neighbors regressor to take in 4 variables: row, column, row * column, and fov. If we wanted to get exotic with our nearest neighbors algorithm we could try out a Radius neighbors regressor, however this technique has its drawbacks. If you use radius-neighbors, you’ll find that it has a far more variable and uncertain memory demand and runtime than its much more popular and reliable cousing k-nearest neighbors.

Some versions of background subtraction may involve substituting missing values with the median of the plate or FOV. This might be okay for chips that are nearly complete, but since our chip is only at ~50% capacity this would be a bad idea.

Once we calculate this position-dependent gradient, it’s just a simple matter of subtracting it from the data. A correlation plot will reveal whether this background correction has been done right.

Quality Control and Hit-detection

With our cleaned-up data, we want to use it to identify the next blockbuster drug (or if not that, identify promising hits for the next stage of drug discovery).

While our data is cleaner than what we started with, we still want to apply anomaly detection methods to remove noise and un-physical characteristics prior to hit-detection.

We have a variety of statistical methods we would use to confidently classify hits. Which methods we use (and even which formulations of those methods) depends on the numbers of replicates, the numbers of controls, and whether experimental samples and controls are paired.

For assay quality control, we can assess the quality of paired tests with

For hit-detection, hits can be identified based on their z-scores/t-statistics, Z-factors, and Strictly standardized mean differences (SSMDs).

Some of our top hits include the following compounds:


Since our approach to library-building is combinatorial, we might be interested in the prominent BB1s, BB2s, and BB3s as well. While we’re at it, what about the building blocks in general? What building block seems to be most responsible for binding? Answering this is complicated by the fact that there’s usually multiple copies of the same molecule on a given chip. We could approach this by calculating the median absolute deviations from 0 for the coefficients of a simple one-hot-encoded linear regression model, with the predicted output being the SSMD.

For outlier detection, the plan was originally to test methods such as Isolation Forest, Mahalanobis Distance, Untrained autoencoders, and Kolmogorov-Smirnov. However, much of these would be less useful if one already knew the background subtraction had not been done as fully as possible.

Building a Regression Model from our hits

Screening for hits can be a time-consuming and expensive process, partly because the search space is truly enormous (on the order of 10^20 possible molecules). Machine learning seems like a compelling antidote to this. After all, if we have a model that can predict binding activity just from looking at a molecule’s shape, that makes navigating molecule-space much easier.

Dataset setup

Because of the skewed distribution (log2 values of flourescence, not just the raw values), It seemed prudent to apply a power trandform. THis would take the distribution from something like a Poisson or Log-Normal distribution and make it resemble a less-skewed normal distribution. Out of the options for the power transform (Box-Cox, Yeo-Johnson, and Quantile transformation) I went with Yeo-Johnson. Box-Cox is usually chosen if one is absolutely certain that there are only positive values.

Ideally we’d like to work with all 18M+ data points, but if that’s not possible we could use a representative subset of the data and perform a random 80/20 split. (whatever we do we must make sure to set the random seed/state for the split and regressors if applicable).

Architecture Choice

Let’s take the data from above and build a regressor model based on the BB1, BB2, and BB3 positions. We could pick from countless approaches to architectures, but let’s keep things simple and choose from one of the following:

  • ridge regression
  • random forest regression
  • support vector regression
  • kernel ridge regression
  • XGBoost regression

Hyperparameter tuning

For hyperparameter tuning we can apply Tree Parzen estimation to the problem.

Additional Quality control

As with any drug discovery machine learning pipeline, it’s always important to double check that the top hits make sense. This would usually be done by checking whether the molecular structure for the compound in question fits into the binding pocket of the molecule of interest. One would use a tool like Mol* (pronounced mol-star) to do that. Of course, it’s worth remembering that some full mechanisms of action can take much longer to figure out.


The overwhelming majority of my guidance for these 3 sections came from the book Optimal High-Throughput Screening, Practical Experimental Design and Data Analysis for Genome-Scale RNAi Research – Xiaohua Douglas Zhang (which actually has a lot to say about both RNA and small molecule screens).

Part 1:

For additional intensity-column-specific corrections, there are a few approaches that I explored

Other packages used:

Part 2:

The following is a list of papers and resources I used that are NOT Optimal High-Throughput Screening, Practical Experimental Design and Data Analysis for Genome-Scale RNAi Research – Xiaohua Douglas Zhang, the main resource I turned to.

Part 3:

Other packages used:

Cited as:

    title = "Ultra High Throughput Screening Workflow",
    author = "McAteer, Matthew",
    journal = "matthewmcateer.me",
    year = "2021",
    url = "https://matthewmcateer.me/blog/uhts-workflow/"

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatily, you can follow me on Twitter and reach out to me there.

See you in the next post 😄

I write about AI, Biotech, and a bunch of other topics. Subscribe to get new posts by email!

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

At least this isn't a full-screen popup

That'd be more annoying. Anyways, subscribe to my newsletter to get new posts by email! I write about AI, Biotech, and a bunch of other topics.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.