Deploying and Scaling ML

Practical considerations of scaling and implementing ML in the real world

This post is Part 4 of the 4-part Machine Learning Research Interview Handbook series (you can see the rest of the series here).

Whether you’re an ML engineer or a researcher, you’ll need to know how to deploy machine learning jobs. If you’re an engineer, this could involve scaling ML-based applications across millions of users. If you’re in a research lab, you may need to scale training across months of GPU-hours. Here are a few useful questions for figuring out how well you know how to scale and deploy machine learning applications.

Table of Contents


What is Tensorflow? What is it good for?

(click the triangle to expand and reveal the full answer)

Google’s Tensorflow - the most popular Deep Learning system today. Google, Uber, Airbnb, Nvidia, and several other popular companies use it as their go-to tool. TF is turrently the most common DL frameword today, but frankly, it’s a rare case when popularity is equivalent to performance.

Python is the easiest language for working with TensorFlow. However, experimental interfaces are also available in C#, C++, Java, Scala, JavaScript, C++, Go, and Julia. TF was built not only with strong computing clusters in mind, but also iOS and Android compatibility. TF will require a lot of coding from you. It’s not going to give you a strong AI overnight, it is only a technique for deep learning research that might potentially make it a little less sluggish. You need to consider carefully about the neural network architecture, measure the size and amount of input and output data correctly. Tensorflow works with a static computational graph. In other words, first we describe the structure, then we run the calculations through it and, if we need adjustments are needed, we re-train the model. This design choice was picked with efficiency in mind, but other frameworks have been built response to this loss in learning speed as an alternative (e.g., PyTorch, the close #2 most common framework).

As for what it’s good for, it is useful for developing and playing with deep learning models, and its implementation is helpful for data integration (such as binding graph representations, SQL tables, and images together). Google has also invested a lot of time an effort into building and maintaining it, so they have an incentive to make sure it sticks around.

What is Sonnet? What is it good for?

(click the triangle to expand and reveal the full answer)

Sonnet deep learning framework built on top of TensorFlow. It is designed to create neural networks with a complex architecture by the world famous company DeepMind.

Sonnet is defined by it’s high-level object-oriented libraries that bring about abstraction when developing neural networks (NN) or other machine learning (ML) algorithms. The idea of Sonnet is to construct the primary Python objects corresponding to a specific part of the neural network. Further, these objects are independently connected to the computational TensorFlow graph. Separating the process of creating objects and associating them with a graph simplifies the design of high-level architectures. More information about these principles can be found in the framework documentation.

The main advantage of Sonnet, is you can use it to reproduce the research demonstrated in DeepMind’s papers with greater ease than Keras, since DeepMind will be using Sonnet themselves.

So all-in-all, it’s a flexible functional abstractions tool that is absolutely a worthy opponent for TF and PyTorch.

What is Keras? What is it good for?

(click the triangle to expand and reveal the full answer)

Keras is a machine learning framework that might be your new best friend if you have a lot of data and/or you’re after the state-of-the-art in AI: deep learning. Plus, it’s the most minimalist approach to using TensorFlow, Theano, or CNTK is the high-level Keras shell.

Keras is usable as a high-level API on top of other popular lower level libraries such as Theano and CNTK in addition to Tensorflow. Prototyping here is facilitated to the limit. Creating massive models of deep learning in Keras is reduced to single-line functions. But this strategy makes Keras a less configurable environment than low-level frameworks.

Keras is by far the best Deep Learning framework for those who are just starting out. It’s ideal for learning and prototyping simple concepts, to understand the very essence of the various models and processes of their learning. Keras is a beautifully written API. The functional nature of the API helps you completely and gets out of your way for more exotic applications. Keras does not block access to lower level frameworks. Keras results in a much more readable and succinct code. Keras model Serialization/Deserialization APIs, callbacks, and data streaming using Python generators are very mature.

It should be noted that one cannot really compare Keras and Tensorflow since they sit on different abstraction levels:

  • Tensorflow is on the Lower Level: This is where frameworks like MXNet, Theano, and PyTorch sit. This is the level where mathematical operations like Generalized Matrix-Matrix multiplication and Neural Network primitives like Convolutional operations are implemented.
  • Keras is on the higher Level. At this Level, the lower level primitives are used to implement Neural Network abstraction like Layers and models. Generally, at this level, other helpful APIs like model saving and model training are also implemented.

What is MXNet? What is it good for?

(click the triangle to expand and reveal the full answer)

MXNet is a highly scalable deep learning tool that can be used on a wide variety of devices. Although it does not appear to be as widely used as yet compared to TensorFlow, MXNet growth likely will be boosted by becoming an Apache project.

The framework initially supports a large number of languages (C ++, Python, R, Julia, JavaScript, Scala, Go, and even Perl). The main emphasis is placed on the fact that the framework is very effectively parallel on multiple GPUs and many machines. This, in particular, has been demonstrated by his work on Amazon Web Services.

MXNet supports multiple GPUs very well (with optimized computations and fast context switching). It also has a clean and easily maintainable codebase (Python, R, Scala, and other APIs). Although it is not as popular as Tensorflow, MXNet has detailed documentation and is easy to use, with the ability to choose between imperative and symbolic programming styles, making it a great candidate for both beginners and experienced engineers.

What is Gluon? What is it good for?

(click the triangle to expand and reveal the full answer)

Much like how Keras is a simplified version of Tensorflow, Gluon can be thought of as a simplified wrapper over MXNet. Like PyTorch, Gluon was built with ease-of-prototyping in mind, and also uses dynamic graphs to define network structures.

The main advantage of Gluon is the ease of use almost on the level of Keras, but with the added ability to create dynamic graph models. There are also performance benefits on AWS, along with preservation of the Python language’s control flow.

What is Chainer? What is it good for?

(click the triangle to expand and reveal the full answer)

Until the advent of DyNet at CMU, and PyTorch at Facebook, Chainer was the leading neural network framework for dynamic computation graphs or nets that allowed for input of varying length, a popular feature for NLP tasks.

The code is written in pure Python on top of the Numpy and CuPy libraries. Chainer is the first framework to use a dynamic architecture model (as in PyTorch). Chainer several times beat records on the effectiveness of scaling when modeling problems solved by neural networks.

By its own benchmarks, Chainer is notably faster than other Python-oriented frameworks, with TensorFlow the slowest of a test group that includes MxNet and CNTK. It has better GPU & GPU data center performance than TensorFlow. (TensorFlow is optimized for TPU architecture) Recently, Chainer became the world champion for GPU data center performance. It also has good Japanese support and an OOP-like programming style.

What is DL4J? What is it good for?

(click the triangle to expand and reveal the full answer)

Most of the frameworks described apply to languages like Python or Swift. DL4J (Deep Learning for Java) was built with Java and Scala in mind.

If you absolutely must use Java, DL4J is a great machine learning framework to use. DL4J splits neural network training across parallel clusters, and is supported by Hadoop and Spark. DL4J also allows for building deep learning applications for programs on Android devices.

What is ONNX? What is it good for?

(click the triangle to expand and reveal the full answer)

As the number of ML frameworks grew, so did the number of ways to store pre-trained model. ONNX was the result of a joint effort between Facebook AI and Microsot Research to create something akin to a universal standard for saving machine learning models. Using the ONNX format, you can pass saved models between programs written in different frameworks. Right now ONNX is supported by CNTK, PyTorch, MXNet, Caffe2, and as of recently Tensorflow and Keras (though the latter have slightly less coverage).

What is Jax? What is it good for?

(click the triangle to expand and reveal the full answer)

New framework out of Google Brain. It is built to resemble the numpy library, though it also has a bunch of built-in deep learning tools (as well as autodifferentiation). Great for researchers, as well as beginners.

What is Swift? What is it good for?

(click the triangle to expand and reveal the full answer)

Swift is a bit of a break from the rest of these since it’s not actually a framework. You’re probably familiar with Swift as the go-to tool for development for iOS or MacOS. It’s important to pay attention to in the deep learning space because two frameworks, Apple’s own CoreML and Swift for Tensorflow, integrate directly into it.

Swift has a few important features for machine learning practicioners. For example, like JAX, it has extensive support for autodifferentiation. You can take derivatives of any function or make differentiable data structures. As an added bonus, Swift can also be used on top of Jupyter, LLDB, and Google Colab. Swift is a good choice in general if dynamically typed languages are presenting too much trouble (e.g., training a model for a long time only to run into a type error at the very end).

Which framework should you use?

(click the triangle to expand and reveal the full answer)

If you are just starting out and want to figure out what’s what, the best choice is Keras. For research purposes, choose PyTorch (or possibly Jax). For production, you need to focus on the environment. So, for Google Cloud, the best choice is TensorFlow, for AWS - MXNet and Gluon. Android developers should pay attention to D4LJ. For iOS, a similar range of tasks is compromised by Core ML. Finally, ONNX will help with questions of interaction between different frameworks.

General-Purpose System Design Principles

A fairly typical research and deployment hybrid pipeline

What is “Scalability” with regards to system design?

(click the triangle to expand and reveal the full answer)

Scalability is the capability of a system, process, or a network to grow and manage increased demand. Any distributed system that can continuously evolve in order to support the growing amount of work is considered to be scalable. A system may have to scale because of many reasons like increased data volume or increased amount of work, e.g., number of transactions. A scalable system would like to achieve this scaling without performance loss.

Generally, the performance of a system, although designed (or claimed) to be scalable, declines with the system size due to the management or environment cost. For instance, network speed may become slower because machines tend to be far apart from one another. More generally, some tasks may not be distributed, either because of their inherent atomic nature or because of some flaw in the system design. At some point, such tasks would limit the speed-up obtained by distribution. A scalable architecture avoids this situation and attempts to balance the load on all the participating nodes evenly.

Horizontal vs. Vertical Scaling: Horizontal scaling means that you scale by adding more servers into your pool of resources whereas Vertical scaling means that you scale by adding more power (CPU, RAM, Storage, etc.) to an existing server.

With horizontal-scaling it is often easier to scale dynamically by adding more machines into the existing pool; Vertical-scaling is usually limited to the capacity of a single server and scaling beyond that capacity often involves downtime and comes with an upper limit.

Good examples of horizontal scaling are Cassandra and MongoDB as they both provide an easy way to scale horizontally by adding more machines to meet growing needs. Similarly, a good example of vertical scaling is MySQL as it allows for an easy way to scale vertically by switching from smaller to bigger machines. However, this process often involves downtime.

What is “Availability” with regards to system design?

(click the triangle to expand and reveal the full answer)

By definition, availability is the time a system remains operational to perform its required function in a specific period. It is a simple measure of the percentage of time that a system, service, or a machine remains operational under normal conditions. An aircraft that can be flown for many hours a month without much downtime can be said to have a high availability. Availability takes into account maintainability, repair time, spares availability, and other logistics considerations. If an aircraft is down for maintenance, it is considered not available during that time.

Reliability is availability over time considering the full range of possible real-world conditions that can occur. An aircraft that can make it through any possible weather safely is more reliable than one that has vulnerabilities to possible conditions.

Reliability Vs. Availability

If a system is reliable, it is available. However, if it is available, it is not necessarily reliable. In other words, high reliability contributes to high availability, but it is possible to achieve a high availability even with an unreliable product by minimizing repair time and ensuring that spares are always available when they are needed. Let’s take the example of an online retail store that has 99.99% availability for the first two years after its launch. However, the system was launched without any information security testing. The customers are happy with the system, but they don’t realize that it isn’t very reliable as it is vulnerable to likely risks. In the third year, the system experiences a series of information security incidents that suddenly result in extremely low availability for extended periods of time. This results in reputational and financial damage to the customers.

What is “Efficiency” with regards to system design?

(click the triangle to expand and reveal the full answer)

To understand how to measure the efficiency of a distributed system, let’s assume we have an operation that runs in a distributed manner and delivers a set of items as result. Two standard measures of its efficiency are the response time (or latency) that denotes the delay to obtain the first item and the throughput (or bandwidth) which denotes the number of items delivered in a given time unit (e.g., a second). The two measures correspond to the following unit costs:

  • Number of messages globally sent by the nodes of the system regardless of the message size.
  • Size of messages representing the volume of data exchanges.

The complexity of operations supported by distributed data structures (e.g., searching for a specific key in a distributed index) can be characterized as a function of one of these cost units. Generally speaking, the analysis of a distributed structure in terms of ‘number of messages’ is over-simplistic. It ignores the impact of many aspects, including the network topology, the network load, and its variation, the possible heterogeneity of the software and hardware components involved in data processing and routing, etc. However, it is quite difficult to develop a precise cost model that would accurately take into account all these performance factors; therefore, we have to live with rough but robust estimates of the system behavior.

What is “Manageability” with regards to system design?

(click the triangle to expand and reveal the full answer)

Another important consideration while designing a distributed system is how easy it is to operate and maintain. Serviceability or manageability is the simplicity and speed with which a system can be repaired or maintained; if the time to fix a failed system increases, then availability will decrease. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate (i.e., does it routinely operate without failure or exceptions?).

Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when the system experiences a system fault.

What is CAP Theorem?

(click the triangle to expand and reveal the full answer)

Machine learning has the “no free lunches” theorem. Basically, there’s no universal algorthm that can be applied to all problem. CAP theorem is similar. Namely it’s impossible to create a system that’s both Consistent, Available, and Partition-tolerant (you can only choose two). In more detail:

Consistency: All nodes see the same data at the same time. Consistency is achieved by updating several nodes before allowing further reads.

Availability: Every request gets a response on success/failure. Availability is achieved by replicating the data across different servers.

Partition tolerance: The system continues to work despite message loss or partial failure. A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

We cannot build a general data store that is continually available, sequentially consistent, and tolerant to any partition failures. We can only build a system that has any two of these three properties. Because, to be consistent, all nodes should see the same set of updates in the same order. But if the network suffers a partition, updates in one partition might not make it to the other partitions before a client reads from the out-of-date partition after having read from the up-to-date one. The only thing that can be done to cope with this possibility is to stop serving requests from the out-of-date partition, but then the service is no longer 100% available.

What system design choices can we make for various CAP Theorem tradeoffs?

(click the triangle to expand and reveal the full answer)

Consistency & Availability (C/A): Most types of Relational Database Management System (RDBMS) will qualify, such as MySQL, Microsoft SQL Server, IBM DB2, MariaDB, PostgreSQL, Sybase, and Oracle Database

Availability & Partition-Tolerance (A/P): CassandraDB, CouchDB, Riak, Aerospike, Voldemort, Dynamo DB, SimpleDB, Oracle NoSQL Database (depending on configuration)

Consistency & Partition Tolerance (C/P): Google BigTable, MongoDB, HBase, Redis, MemcacheDB, Oracle NoSQL Database (depending on configuration)

What are the differences between SQL and no-SQL?

(click the triangle to expand and reveal the full answer)

Storage: SQL stores data in tables where each row represents an entity and each column represents a data point about that entity; for example, if we are storing a car entity in a table, different columns could be ‘Color’, ‘Make’, ‘Model’, and so on.

NoSQL databases have different data storage models. The main ones are key-value, document, graph, and columnar. We will discuss differences between these databases below.

Schema: In SQL, each record conforms to a fixed schema, meaning the columns must be decided and chosen before data entry and each row must have data for each column. The schema can be altered later, but it involves modifying the whole database and going offline.

In NoSQL, schemas are dynamic. Columns can be added on the fly and each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’

Querying: SQL databases use SQL (structured query language) for defining and manipulating the data, which is very powerful. In a NoSQL database, queries are focused on a collection of documents. Sometimes it is also called UnQL (Unstructured Query Language). Different databases have different syntax for using UnQL.

Scalability: In most common situations, SQL databases are vertically scalable, i.e., by increasing the horsepower (higher Memory, CPU, etc.) of the hardware, which can get very expensive. It is possible to scale a relational database across multiple servers, but this is a challenging and time-consuming process.

On the other hand, NoSQL databases are horizontally scalable, meaning we can add more servers easily in our NoSQL database infrastructure to handle a lot of traffic. Any cheap commodity hardware or cloud instances can host NoSQL databases, thus making it a lot more cost-effective than vertical scaling. A lot of NoSQL technologies also distribute data across servers automatically.

Reliability or ACID Compliancy (Atomicity, Consistency, Isolation, Durability): The vast majority of relational databases are ACID compliant. So, when it comes to data reliability and safe guarantee of performing transactions, SQL databases are still the better bet.

Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability.

What is the difference between redundancy and replication?

(click the triangle to expand and reveal the full answer)

Redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance. For example, if there is only one copy of a file stored on a single server, then losing that server means losing the file. Since losing data is seldom a good thing, we can create duplicate or redundant copies of the file to solve this problem. Redundancy plays a key role in removing the single points of failure in the system and provides backups if needed in a crisis. For example, if we have two instances of a service running in production and one fails, the system can failover to the other one.

Replication means sharing information to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility. Replication is widely used in many database management systems (DBMS), usually with a master-slave relationship between the original and the copies. The master gets all the updates, which then ripple through to the slaves. Each slave outputs a message stating that it has received the update successfully, thus allowing the sending of subsequent updates.

What are some reasons to use SQL for data storage?

(click the triangle to expand and reveal the full answer)

Here are a few reasons to choose a SQL database:

  1. We need to ensure ACID compliance. ACID compliance reduces anomalies and protects the integrity of your database by prescribing exactly how transactions interact with the database. Generally, NoSQL databases sacrifice ACID compliance for scalability and processing speed, but for many e-commerce and financial applications, an ACID-compliant database remains the preferred option.
  2. Your data is structured and unchanging. If your business is not experiencing massive growth that would require more servers and if you’re only working with data that is consistent, then there may be no reason to use a system designed to support a variety of data types and high traffic volume.

What are some reasons to use NoSQL for data storage?

(click the triangle to expand and reveal the full answer)

When all the other components of our application are fast and seamless, NoSQL databases prevent data from being the bottleneck. Big data is contributing to a large success for NoSQL databases, mainly because it handles data differently than the traditional relational databases. A few popular examples of NoSQL databases are MongoDB, CouchDB, Cassandra, and HBase.

  1. Storing large volumes of data that often have little to no structure. A NoSQL database sets no limits on the types of data we can store together and allows us to add new types as the need changes. With document-based databases, you can store data in one place without having to define what “types” of data those are in advance.
  2. Making the most of cloud computing and storage. Cloud-based storage is an excellent cost-saving solution but requires data to be easily spread across multiple servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves you the hassle of additional software and NoSQL databases like Cassandra are designed to be scaled across multiple data centers out of the box, without a lot of headaches.
  3. Rapid development. NoSQL is extremely useful for rapid development as it doesn’t need to be prepped ahead of time. If you’re working on quick iterations of your system which require making frequent updates to the data structure without a lot of downtime between versions, a relational database will slow you down.

What are some examples of NoSQL types?

(click the triangle to expand and reveal the full answer)

Following are the most common types of NoSQL:

Key-Value Stores: Data is stored in an array of key-value pairs. The ‘key’ is an attribute name which is linked to a ‘value’. Well-known key-value stores include Redis, Voldemort, and Dynamo.

Document Databases: In these databases, data is stored in documents (instead of rows and columns in a table) and these documents are grouped together in collections. Each document can have an entirely different structure. Document databases include the CouchDB and MongoDB.

Wide-Column Databases: Instead of ‘tables,’ in columnar databases we have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front and each row doesn’t have to have the same number of columns. Columnar databases are best suited for analyzing large datasets - big names include Cassandra and HBase.

Graph Databases: These databases are used to store data whose relations are best represented in a graph. Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities). Examples of graph database include Neo4J and InfiniteGraph.

What is Load Balancing?

(click the triangle to expand and reveal the full answer)

Load Balancer (LB) is another critical component of any distributed system. It helps to spread the traffic across a cluster of servers to improve responsiveness and availability of applications, websites or databases. LB also keeps track of the status of all the resources while distributing requests. If a server is not available to take new requests or is not responding or has elevated error rate, LB will stop sending traffic to such a server.

Typically a load balancer sits between the client and the server accepting incoming network and application traffic and distributing the traffic across multiple backend servers using various algorithms. By balancing application requests across multiple servers, a load balancer reduces individual server load and prevents any one application server from becoming a single point of failure, thus improving overall application availability and responsiveness.

To utilize full scalability and redundancy, we can try to balance the load at each layer of the system. We can add LBs at three places:

  • Between the user and the web server
  • Between web servers and an internal platform layer, like application servers or cache servers
  • Between internal platform layer and database.

What are some Load Balancing Algorithms?

(click the triangle to expand and reveal the full answer)

How does the load balancer choose the backend server? Load balancers consider two factors before forwarding a request to a backend server. They will first ensure that the server they choose is actually responding appropriately to requests and then use a pre-configured algorithm to select one from the set of healthy servers. We will discuss these algorithms shortly.

Health Checks - Load balancers should only forward traffic to “healthy” backend servers. To monitor the health of a backend server, “health checks” regularly attempt to connect to backend servers to ensure that servers are listening. If a server fails a health check, it is automatically removed from the pool, and traffic will not be forwarded to it until it responds to the health checks again.

There is a variety of load balancing methods, which use different algorithms for different needs.

  • Least Connection Method - This method directs traffic to the server with the fewest active connections. This approach is quite useful when there are a large number of persistent client connections which are unevenly distributed between the servers.
  • Least Response Time Method - This algorithm directs traffic to the server with the fewest active connections and the lowest average response time.
  • Least Bandwidth Method - This method selects the server that is currently serving the least amount of traffic measured in megabits per second (Mbps).
  • Round Robin Method - This method cycles through a list of servers and sends each new request to the next server. When it reaches the end of the list, it starts over at the beginning. It is most useful when the servers are of equal specification and there are not many persistent connections.
  • Weighted Round Robin Method - The weighted round-robin scheduling is designed to better handle servers with different processing capacities. Each server is assigned a weight (an integer value that indicates the processing capacity). Servers with higher weights receive new connections before those with less weights and servers with higher weights get more connections than those with less weights.
  • IP Hash - Under this method, a hash of the IP address of the client is calculated to redirect the request to a server.

How do Indexes decrease write performance?

(click the triangle to expand and reveal the full answer)

An index can dramatically speed up data retrieval but may itself be large due to the additional keys, which slow down data insertion & update.

When adding rows or making updates to existing rows for a table with an active index, we not only have to write the data but also have to update the index. This will decrease the write performance. This performance degradation applies to all insert, update, and delete operations for the table. For this reason, adding unnecessary indexes on tables should be avoided and indexes that are no longer used should be removed. To reiterate, adding indexes is about improving the performance of search queries. If the goal of the database is to provide a data store that is often written to and rarely read from, in that case, decreasing the performance of the more common operation, which is writing, is probably not worth the increase in performance we get from reading.

ML Pipeline Construction

Toying with ML algorithms versus everything else in your system

Which processors can be used in machine learning?

(click the triangle to expand and reveal the full answer)

CPU & GPU are general purpose, but they can run into Von Neuman Bottlenecks. TPUs are basically ASICs for the multiplication and addition operations in MAchine learning (basicaly they can convert many matrix operations from O(n3)O(n^3) to O(3n2)O(3n - 2)).

What are Data Collection and Warehousing?

(click the triangle to expand and reveal the full answer)

Data collection is the process of acquiring and formatting (and possibly transforming). Data Warehousing typically refers to storing data in a Data Warehouse, though it can occasionally be used to storing data in a different structure like a Data Lake or a Data Mart.

What is a Data Warehouse?

(click the triangle to expand and reveal the full answer)

A data warehouse usually only stores data that’s already modeled/structured. A Data Warehouse is multi-purpose and meant for all different use-cases. It doesn’t take into account the nuances of requirements from a specific business unit or function. As an example, let’s take a Finance Department at a company. They care about a few metrics, such as Profits, Costs, and Revenues to advise management on decisions, and not about others that Marketing & Sales would care about. Even if there are overlaps, the definitions could be different.

What is a Data Lake?

(click the triangle to expand and reveal the full answer)

A data lake is the place where you dump all forms of data generated in various parts of your business: structured data feeds, chat logs, emails, images (of invoices, receipts, checks etc.), and videos. The data collection routines does not filter any information out; data related to canceled, returned, and invalidated transactions will also be captured, for instance.

A data lake is relevant in two contexts:

  1. Your organization is so big and your product does so many functions that there are many possible ways to analyze data to improve the business. Thus, you need a cheap way to store different types of data in large quantities. E.g., Twitter in the B2C space (They have text (Tweets), Images, Videos, Links, Direct Messages, Live Streams, etc.), and Square (B2B) (Transactions, Returns, Refunds, Customer Signatures, Logon IDs etc.).
  2. You don’t have a plan for what to do with the data, but you have a strong intent to use it at some point. Thus, you collect data first and analyze later. Also, the volume is so high that traditional DBs might take hours if not days to run a single query. So, having it in a Massively Parallel Processor (MPP) infrastructure helps you analyze the data comparatively quickly.

What is a Data Mart?

(click the triangle to expand and reveal the full answer)

While a data-warehouse is a multi-purpose storage for different use cases, a data-mart is a subsection of the data-warehouse, designed and built specifically for a particular department/business function. Some benefits of using a data-mart include:

Isolated Security: Since the data-mart only contains data specific to that department, you are assured that no unintended data access (finance data, revenue data) are physically possible.

Isolated Performance: Similarly, since each data-mart is only used for particular department, the performance load is well managed and communicated within the department, thus not affecting other analytical workloads.

What does Data Collection involve?

(click the triangle to expand and reveal the full answer)

Data collection can be automatic, but more often than not it requires a lot of human involvement. Data collection steps like cleaning, feature selection, labeling can often be redundant and time-consuming. For automating these steps, some work being done with Semi-supervised learning, but high computational costs (the same goes for approaches using GANs, Variational Autoencoders, or Autoregressive models).

What are the differences between a Data Warehouse and a Data mart?

(click the triangle to expand and reveal the full answer)

Data warehouse is an independent application system whereas a data mart is more specific to support decision application system. The data in a data warehouse is stored in a single, centralised archive. Compared to, data mart where data is stored decentrally in different user area. A data warehouse consists of a detailed form of data. Whereas, a data mart consists of a summarized and selected data. The development of data warehouse involves a top-down approach, while a data mart involves a bottom-up approach. A data warehouse is said to be more adjustable, information-oriented and longtime existing. However, with data mart it is said to be restricted, project-oriented and has a shorter existence.

What are the considerations behind one’s choice of data format?

(click the triangle to expand and reveal the full answer)

Data format is vital. In many data collection scenarios, we might have a lot of CSVs, but this might be less than Ideal. For the sake of data reliability, we probably shouldn’t have a massive 6 GB file called the data (perhaps client-side guarantees can be made better by providing them an API to upload the data to).

A good alternative to CSV files would be HDF5 files. If we cannot go to HDF5, we can use an S3 bucket that can provide versioning.

What are the approches for integrating heterogeneous databases?

(click the triangle to expand and reveal the full answer)

To integrate heterogeneous databases, we have two approaches

Query-driven Approach

  • Query-driven approach needs complex integration and filtering processes.
  • This approach is very inefficient, and It is very expensive for frequent queries.
  • This approach is also very expensive for queries that require aggregations.

Update-driven Approach

  • This approach provide high performance.
  • The data is copied, processed, integrated, annotated, summarized and restructured in semantic data store in advance.
  • Query processing does not require an interface to process data at local sources.

What are the steps of an ETL Pipeline?

(click the triangle to expand and reveal the full answer)

ETL stands for “extract, transform, load”.

The Extract step involves reading the data from the source (either a disk, a stream, or a network of peers). Extraction depends heavily on I/O devices (reading from disk, network, etc.);

The Transform step involves preprocessing transformations on the data. Examples of these transformations include scaling for images, or assembly for Genomic data or anonymization. The transformation usually depends on CPU.

The Load step (the final step) bridges between the working memory of the training model and the transformed data. Those two locations can be the same or different depending on what kind of devices we are using for training and transformation. Assuming that we are using accelerated hardware, loading depends on GPU/ASICs.

What must be done to the Model-training part of a pipeline to make it scalable?

(click the triangle to expand and reveal the full answer)

To make a model training scalable, one should take a “divide and conquer” approach to decompose the computations performed in these steps into granular ones that can be run independently of each other, and aggregated later on to get the desired result. Approaches to this can be classified as either functional decomposition and/or data decomposition.

Functional decomposition: Breaking the logic down to distinct and independent functional units, which can later be recomposed to get the results. This is also known as “Model parallelism”.

Data decomposition: Data is divided into chunks, and multiple machines perform the same computations on different data.

Utilizing both kinds of decomposition: Using both kinds of decomposition would involve using an ensemble learning model like random forest, which is conceptually a collection of decision trees. In the case of a random forest, decomposing the model into individual decision trees in functional decomposition. Further training the individual tree in parallel is known as data parallelism. This is also an example of what’s called “embarrassingly parallel tasks”.

Distributed Machine Learning

Describe how MapReduce works.

(click the triangle to expand and reveal the full answer)

MapReduce is based on a “split-apply-combine” strategy. It works in the following stages:

  1. The map function maps the data to zero or more key-value pairs.
  2. The execution framework groups these key-value pairs using a shuffle operation.
  3. The reduce function takes in those key-value groups and aggregates them to get the final result.

MapReduce takes care of running the Map and Reduce functions in a highly optimized and parallelized manner on multiple workers (aka cluster of nodes)

What are the components of a Distributed Machine Learning architecture?

(click the triangle to expand and reveal the full answer)

A distributed machine learning architecture consists of 1) the Data, 2) Partitioned Data, 3) the Driver, and 4) the Node Cluster (with multiple nodes).

The data is partitioned, and the driver node assigns tasks to the nodes in the cluster. The nodes might have to communicate among each other to propagate information, like the gradients. There are various arrangements possible for the nodes, and a couple of extreme ones include Async parameter server and Sync AllReduce.

How does an Async Parameter Server architecture work?

(click the triangle to expand and reveal the full answer)

In the Async parameter server architecture, as the name suggests, the transmission of information in between the nodes happens asynchronously.

You can see how a single worker can have multiple computing devices. The worker, labeled “master”, also takes up the role of the driver. Here the workers communicate the information (e.g. gradients) to the parameter servers, update the parameters (or weights), and pull the latest parameters (or weights) from the parameter server itself. One drawback of this kind of set up is delayed convergence, as the workers can go out of sync.

How does a Sync AllReduce architecture work?

(click the triangle to expand and reveal the full answer)

In a Sync AllReduce architecture, the focus is on the synchronous transmission of information between the cluster node.

Workers are mutually connected via fast interconnects. There’s no parameter server. This kind of arrangement is more suited for fast hardware accelerators. All workers have to be synced before a new iteration, and the communication links need to be fast for it to be effective.

Some distributed machine learning frameworks do provide high-level APIs for defining these arrangement strategies with little effort.

What are some examples of Distributed machine learning frameworks?

(click the triangle to expand and reveal the full answer)

Hadoop is most popular MapReduce tool. Hadoop stores the data in the Hadoop Distributed File System (HDFS) format and provides a Map Reduce API in multiple languages. The scheduler used by Hadoop is called YARN (Yet Another Resource Negotiator), which takes care of optimizing the scheduling of the tasks to the workers based on factors like localization of data.

Apache Spark uses immutable Resilient Distributed Datasets (RDDs) as the core data structure. In Spark, distributed machine learning algorithms can be written either in the MapReduce paradigm in a library like MLlib.

Apache Mahout is more focused on performing distributed linear-algebra computations (This is a great backup if something isn’t supported by Spark MLLib).

Message Passing Interface (MPI) is a paradigm that’s more of a general model and provides a standard for communication between the processes by message-passing. MPI gives more flexibility (and control) over inter-node communication in the cluster. For use cases w/ smaller datasets or more communication amongst the decomposed tasks, libraries based on MPI are better.

Tensorflow & PyTorch have built in APIs for data and model parallelism. Frameworks like Horovod and Elephas provide much higher-level model distribution.

What are some good design principles for deploying a machine learning model to a front-end?

(click the triangle to expand and reveal the full answer)

It’s important to properly manage and schedule ML processes so that there are no lags or bottlenecks. A good strategy is to use a first-in, first-out (FIFO) queue. The backend simply enqueues jobs. Workers pick and process jobs out of the queue, performing training or inference, and storing models or predictions to the database when done. With queueing libraries like Celery & MLQ, the following is literally all you need for a backend web server: an endpoint to enqueue a job, an endpoint to check the progress of a job, and an endpoint to serve up a job’s result if the job has finished. In short, use a queue, and don’t tie up your backend webserver. Separate any ML processes from the act of serving up assets and endpoints. Even better if you ensure that everything is stateless and able to operate in parallel.

If you still need perofrmance enhancements and have exhausted all options of using the backend, consider using frontend deployment with a framework like TensorflowJS.

What is Michelangelo? What is it good for?

(click the triangle to expand and reveal the full answer)

Michelangelo is a distributed deep learning framework created by Uber. The internal motivations at Uber to build the platform were the following : Limited impact of ML due to huge resources needed when translating a local model into production.

  • Unreliable ML and data pipelines
  • Engineering teams had to create a custom serving containers and systems to systems at hand
  • Inability to scale ML projects

High-Level Architecture of Michelangelo

The main steps of Michelangelo are the following:

  1. Data Management (Pallete): Data feature store. Can sort data based on recency, as well as use streaming, batch processing, and online learning.
  2. Model Training (Horovod): distributed model training with specialized tools and model-specific model monitoring
  3. Model evaulaution: Infrastructure for inspection, visualization
  4. Model depolyment: CI/CD, Rollbacks on metrics monitoring, usually served on Uber data clusters
  5. Prediction: Routes prediction data through a DSL detailing the model and input path.
  6. Prediction monitoring: Log predictions and join to actual outcomes, publish error metrics and aggregates, ongoing accuracy messages, data distribution monitoring

Some tradeoffs from the Java packaged (Michelangelo v1) approach vs the democratized hands-on approach with PYML include:

  • Slightly higher latency
  • Quicker prototyping and pilots in production for ML models
  • Delivery friendliness
  • When it makes sense it’s ported to the more highly scalable system
  • End to end data scientist ownership of the deployment process
  • Enables velocity from ideation to production

What is Feast? What is it good for?

(click the triangle to expand and reveal the full answer)

Feast (Feature Store) is a tool for managing and serving machine learning features. Feast is the bridge between models and data. Feast was built with the goals of “Providing a unified means of managing feature data from a single person to large enterprises”, “Providing scalable and performant access to feature data when training and serving models”, Providing consistent and point-in-time correct access to feature data”, and “Enable discovery, documentation, and insights into your features”. Feast is an open-source alternative to Pallete (Uber).

What is Cron? What is it good for?

(click the triangle to expand and reveal the full answer)

Cron is a utility that provides scheduling functionality for machines running the Linux operating system. You can Set up a scheduled task using the crontab utility and assign a cron expression that defines how frequently to run the command. Cron jobs run directly on the machine where cron is utilized, and can make use of the runtimes and libraries installed on the system.

There are a number of challenges with using cron in production-grade systems, but it’s a great way to get started with scheduling a small number of tasks and it’s good to learn the cron expression syntax that is used in many scheduling systems. The main issue with the cron utility is that it runs on a single machine, and does not natively integrate with tools such as version control. If your machine goes down, then you’ll need to recreate your environment and update your cron table on a new machine.

A cron expression defines how frequently to run a command. It is a sequence of 5 numbers that define when to execute for different time granularities, and it can include wildcards to always run for certain time periods.

What is Luigi? What is it good for?

(click the triangle to expand and reveal the full answer)

Luigi is an ETL-pipeline-building tool created by Spotify. Beyond just automating ETL pipelines, Luigi was built with monitoring Cron jobs, transferring data from one place to another, automating DevOps operations, periodically fetching data from websites and update databases, data processing for recommendation-based systems, and machine learning pipelines (to a certain extent). A Luigi pipeline is usually made of the following components:

  • Target: Holds the output of a task; could be a local(e.g: a file), HDFS or RDBMS (MySQL etc)
  • Task: Where the actual work takes place; could be independent or dependent. Example of a dependent task is dumping the data into a file or database. Before loading the data the data must be there by any mean(scraping, API, etc). Each task is represented as a Python Class which contains certain mandatory member functions. A task function contains the following methods:

    • requires(): This member function of the task class contains all the task instances that must be executed before the current task. In the example I shared above, a task, named ScrapeData, will be included in the requires() method, hence make a task a dependant task.
    • output(): This method contains the target where the task output will be stored. This could contain one or more target objects.
    • run(): This method contains the actual logic to run a task.

Luigi doesn’t offer quite the same level of flexibility compared to ETL pipeline tools like Apache Spark, And the dashboard is less intuitive than tools like MLFlow and Airflow. It is stilluseful for building out ETL pipelines with complex functionalities fast.

What is MLflow? What is it good for?

(click the triangle to expand and reveal the full answer)

MLflow is built as a “ML lifecyle management” tool. It is used primarily in cases where models are being deployed from development to production. MLFlow is useful for having integrations for Tensorflow, Pytorch, keras, apache spark, Scikit learn,, python, R, java, MLeap, ONNX, Gluon, XGboost, LightGBM, Conda, Kubernetes, Docker, Amazon Sagemaker, Azure machine Learning, and Google Cloud. It can be a very useful framework for avoiding tangled pipelines that can pile up in machine learning projects.

The main downsides are occasional instability with certain builds (due to some customization of the software by different organizations), and early learning curve for getting around crashes.

What is Pentaho Kettle? What is it good for?

(click the triangle to expand and reveal the full answer)

Pentaho Kettle is Hitachi’s ML workflow framework. It’s main usefulness is its’ usability with less coding. It is entirely possible to manage an ETL pipeline using. The downside is it’s comparative lack of versatility compared to many other ETL pipeline frameworks, at least in part due to it’s proprietary and closed-source nature. New functionailites are even added through a top-down-controlled “store”.

What is Apache Airflow? What is it good for?

(click the triangle to expand and reveal the full answer)

Airflow is an open source workflow tool that was originally developed by Airbnb and publically released in 2015. It helps solve a challenge that many companies face, which is scheduling tasks that have many dependencies. One of the core concepts in this tool is a graph that defines the tasks to perform and the relationships between these tasks.

In Airflow, a graph is referred to as a DAG, which is an acronym for directed acyclic graph. A DAG is a set of tasks to perform, where each task has zero or more upstream dependencies. One of the constraints is that cycles are not allowed, where two tasks have upstream dependencies on each other.

DAGs are set up using Python code, which is one of the differences from other workflow tools such as Pentaho Kettle which is GUI focused. The Airflow approach is called “configuration as code”, because a Python script defines the operations to perform within a workflow graph. Using code instead of a GUI to configure workflows is useful because it makes it much easier to integrate with version control tools such as GitHub.

Which distributed machine learning/Deployment frameworks/tools should you use?

(click the triangle to expand and reveal the full answer)

If you’re deploying a lot of ML models from development to production very frequently, use Michaelangelo.

If you’re trying to create a dashboard for tracking your model progress, use Airflow.

If you’re managing FIFO queues of ML jobs for Flask apps, go with Celery (MLQ if you don’t expect the project to advance beyond prototype).

If you’re integrating your ML models into your JavaScript code, use TensorflowJS

Open-ended questions

Not far off from what your actual system design interviews will feel like

Ultimately a systems design interview will come down to designing a full machine learning pipeline. There are many ways of approaching each question, and you will have to weigh the different options. Ultimately, you will need to stick with one choice rather than dwelling on it forever (this is one of the few times where being opinionated almost to the point of being arrogant helps more than it hurts).

Here are a few curated questions asked at actual companies, along with sample answers (which you won’t find anywhere else).

Duolingo is a platform for language learning. When a student is learning a new language, Duolingo wants to recommend increasingly difficult stories to read. How would you measure the difficulty level of a story?

(click the triangle to expand and reveal the full answer)

We can interpret this language difficulty problem as a text complexity measurement problem, i.e., how difficult it is to understand a document. Text complexity can be defined by two concepts: legibility and readability.

  • Legibility ranges from character perception to all the nuances of formatting such as bold, font style, font size, italic, word spacing, etc.
  • Readability focuses on textual content such as lexical, semantical, syntactical and discourse cohesion analysis. It is usually computed as an approximation from other metrics, such as using average sentence length (in characters or words) and average word length (in characters or syllables) in sentences.

There are other text complexity features that do not depend on the text itself

  • Reader’s intent (homework, learning, leisure, etc.)
  • Cognitive focus (which can be influenced by ambient noise, stress level, or any other type of distraction).

WIth the text dependent features, we can frame this as a supervised NLP problem. For each story, we can clean and tokenize the text into sentences and words. From this, we can create feature vectors that represent higher-level qualities of the text such as

  • Mean number of words per sentence
  • Mean number of syllables per word
  • Number of polysyllables (more than 3 syllables)
  • Part-of-Speech (POS) tags count per book
  • Mean number of words considered “difficult” in a sentence (a word is “difficult” if it is not part of an “easy” words reference list)
  • Readability formulas such as Flesch-Kincaid

All of these features would be scaled to the same [1,+1][-1, +1] range. Further feature selection would be done through a combination of LASSO regression and k-fold Cross-validation. For each feature set a regression function is fitted using our training data. Then a correlation is computed (using a metric such as Person correlation coefficient, Kendall-Tau, or Chi-Square test between each set’s regression function and the readability score. Feature sets are ranked by correlation performance and the best one is selected.

In addition to accuracy of the model, we want humans to be able to easily understand exactlywhy a text is difficult to read. With that in mind a decision-tree-based algorithm like XGBoost regression would be an ideal choice. Once the model outputs a readability score, the readability can be converted to a more meaningful grade-level representation representaiton using the formula lg=exp(sr)0.002l_g = \exp(s_r)^{0.002}, where lg=grade levell_g = \text{grade level} and sr=readability scores_r = \text{readability score}. The progress of students’ reading comprehension in a given language could then be split into levels based on grade levels (or some similar staging strategy).

Overview and examples of the proposed workflow, alongside example grade levels and corresponding scores

This supervised approach should be able to get high accuracy given the wide availability of language corpuses. That being said, language agnostic unsupervised approaches also exist for evaluating text readability such as using neural language models as comprehension systems to infill Cloze tests (text chunks with blank words).

Given a Duolingo story, how would you edit it to make it easier or more difficult?

(click the triangle to expand and reveal the full answer)

Given certain phrases or words, we could set up a graph-specific structure (such as one based on Word2Vec) to find words/phrases with similar meanings. These could then be replaced with other words or phrases that result in changes in features like syllablecount or wordcount. This could also be combined with specific reference lists of “difficult” or “easy” words (which DuoLingo likely has). If we use a decision tree-based model for the difficulty score, we could be recommended a few branches where changing one feature of the language changes the readability more to our liking.

Given a dataset of credit card purchases information, each record is labelled as fraudulent or safe, how would you build a fraud detection algorithm?

(click the triangle to expand and reveal the full answer)

In this problem, we want to take in details of transactions such as amount, merchant, location, time and other features, and output a “safe” or “fraud” label. Hackers and crooks are fast enough in updating their techniques to render hard-coded rule-based systems obselete before they get to market, hence why we’re choosing a learning system.

While it’s straightforward to identify this problem as one of binary classification, it does suffer from an enormous imbalance between “safe” and “fraud” classes. This makes the problem much more difficult. Suppose we have less than 0.1% of the dataset being “fraud” labelled. We have three main methods at our disposal: oversampling, undersampling, and combined class methods.

oversampling refers to artificially creating observations in our data set belonging to the class that is under represented in our data. One common technique for this is SMOTE (Synthetic Minority Over-sampling Technique), which does the following:

  1. Finding the k-nearest-neighbors for minority class observations (finding similar observations)
  2. Randomly choosing one of the k-nearest-neighbors and using it to create a similar, but randomly tweaked, new observations.

Undersampling refers to randomly selecting a handful of samples from the class that is overrepresented. imbalanced-learn’s RandomUnderSampler class works by performing k-means clustering on the majority class and removing data points from high-density centroids.

One of the issues with SMOTE is that it can generate noisy samples by interpolating new points between marginal outliers and inliers. This issue can be solved by cleaning the resulted space obtained after over-sampling. To do this, we can use SMOTE together with edited nearest-neighbours (ENN). ENN is used as the cleaning method after SMOTE has been run. This technique can be run with imblearn’s SMOTEENN class.

For the model itself, we can use a technique like a random forest or neural network. Regardless of our choice, it should be incredibly easy to achieve 100% precision for the imbalanced negative class label. Recall of course looks much worse (i.e., predicting the positive “fraud” labels).

This was expected since we’re dealing with imbalanced data, so for the model it’s easy to notice that predicting everything as negative class will reduce the error. Precision obviously would look good, but recall will be much worse than precision for the positive class (fraud). We can measure additional performance characteristics of our model such as the Area Under the Receiver-Operating Characteristic (AUROC) metric. Depending on how these metrics perform in the classification, we can improve the performance further with wither SMOTE, RandomUnderSampler, or SMOTE + ENN. Which one is better will depend heavily on the specific situation.

You run an e-commerce website. Sometimes, users want to buy an item that is no longer available. Build a recommendation system to suggest replacement items.

(click the triangle to expand and reveal the full answer) It's not enough for a recommender system to be able to recommend the most relevant item. A simple solution to this problem would be to make a recommender system that returns the top N items. In addition to this list of N items being calculated by a filtering technique, new features need to be calculated for all items with a rating above a certain threshold (e.g., semantic similarity, likability, overlap in users, variance in ratings). This new feature calculation then re-ranks the top candidates, with only the top $N$ items being recommended.

With an item-replacement technique, we need to have some control of the number of distinct items (i.e., diversity) of the list of top NN items LN(u)L_N(u) (uu representing a user, UU being the set of all users), while also accurately predicting relevant items. The two optimization metrics for our system are accuracy (uUcorrect(LN(u))uU(LN(u))\frac{\sum_{u \in U} |\text{correct}(L_N(u))| }{\sum_{u \in U}|(L_N(u))|}) and diversity (uULN(u)\bigcup_{u \in U} L_N(u)).

The basic idea behind our iterative Item Replacement recommender system is as follows. First, we use a standard-ranking approach (algorithm could be anything from neighbor-based to probabilistic matrix factorization). Once this is done, one of the already-recommended itms is replaced by an item that has not been recommended to any other user. The item that will be replaced (ioldi_\text{old}) will be an item with a known rating from the user (umaxu_\text{max}, or whomever rated the item most highly) that will be replaced by a comparatively unseen item inewi_\text{new} with a predicted rating. Not only does this increase the predicted rating for the new item, it also increases the diversity score by exactly one unit. This continues until the diversity reaches a desired threshold or until there are no more items that can be recommended for replacement.

Our general architecture is as follows: We have a database (or multiple) with all the products, along with details about all the registered users on the site. The user details include information like user ID, login info, interests, and so on. For the sake of scaling this example, we can specify that our relevance-prediciton algorithm R(u,i)R*(u, i) is a neighbor-based algorithm, since this scales better than probabilistic matrix factorization. Since we’re more concerned with Consistency and Partition tolerance for our database, we can imagine our architecture making use of a set of CasandraDB databases with load balancers.

general architecture

The interactions between the users, item-replacement expert system, and database(s) are as follows:

  1. Initializing top-NN recommendation list
  2. Replacing already recommended item ioldi_\text{old} with the never recommended item inewi_\text{new}
  3. Repeat steps 1 and 2 until maximum diversity is reached.

In a nutshell, our first recommendation of N items is generated using the Standard Ranking Algorithm (Rank standard (i)=R(u,i)1\text{Rank standard }(i) = R*(u,i)^{-1} where RR represents ratings, uu user, II item), and items are gradually replaced by infrequently-recommended items. Not only is the user recommended items similar to the one missing, but many of the items will not be advertised as perfect copies (i.e., less likely for the list to be populated with counterfeits).

For any user on Twitter, how would you suggest who they should follow? What do you do when that user is new? What are some of the limitations of data-driven recommender systems?

(click the triangle to expand and reveal the full answer)

We have the choice between a collaborative filtering recommenders system (recommends based on user history) or a product-focused (using demographic data about users) recommender system. This strategy works best when there is a lot of data on a given user. If the user is new, one possible strategy would be to suggest some of the most-followed users on the cite (such as celebrities or public figures). Not only does this give a starting point for the user to start using the site, but it also gives you a starting point for figuring out consumer habits or opinions that would shape their behavior on the site. Given that we have less of a history to work with, a product-focused recommender system (using data like location) might be more useful here.

Data Driven recommender systems have a variety of flaws. Some of the top immediately obvious ones are the following:

  • The cold-start problem: Collaborative filtering systems are based on the action of available data from similar users. If you are building a brand new recommendation system, you would have no user data to start with. You can use content-based filtering first and then move on to the collaborative filtering approach.
  • Scalability: As the number of users grow, the algorithms suffer scalability issues. If you have 10 million customers and 100,000 movies, you would have to create a sparse matrix with one trillion elements.
  • The lack of right data: Input data may not always be accurate because humans are not perfect at providing ratings. User behavior is more important than ratings. Item-based recommendations provide a better answer in this case.
(click the triangle to expand and reveal the full answer)

Much like related questions on other sites, a related searches list would be a clustering (or topic modelling) problem. For a site as large as google, we also need to consider approaches that aren’t completely compute- and memory-inefficient. As a modelling technique, we want to be able to pick one that can make clusters of similar queries AND is highly scalable.

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.

Latent semantic indexing (a type of multivariate correspondence analysis used to create “contingency tables”) from documents. It’s called ”latent semantic indexing” because of its ability to correlate semantically related terms that are latent in a collection of text. The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don’t share a specific word or words with the search criteria. There are a ton of benefits to LSI, including but not limited to the following:

  • LSI is also used to perform automated document categorization. In fact, several experiments have demonstrated that there are a number of correlations between the way LSI and humans process and categorize text. Document categorization is the assignment of documents to one or more predefined categories based on their similarity to the conceptual content of the categories. LSI uses example documents to establish the conceptual basis for each category. During categorization processing, the concepts contained in the documents being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the documents based on the similarities between the concepts they contain and the concepts that are contained in the example documents.
  • LSI helps overcome synonymy by increasing recall, one of the most problematic constraints of Boolean keyword queries and vector space models. Synonymy is often the cause of mismatches in the vocabulary used by the authors of documents and the users of information retrieval systems. As a result, Boolean or keyword queries often return irrelevant results and miss information that is relevant.
  • Dynamic clustering based on the conceptual content of documents can also be accomplished using LSI. Clustering is a way to group documents based on their conceptual similarity to each other without using example documents to establish the conceptual basis for each cluster. This is very useful when dealing with an unknown collection of unstructured text.
  • Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri. LSI can also perform cross-linguistic concept searching and example-based categorization. For example, queries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.
  • LSI is not restricted to working only with words. It can also process arbitrary character strings. Any object that can be expressed as text can be represented in an LSI vector space. For example, tests with MEDLINE abstracts have shown that LSI is able to effectively classify genes based on conceptual modeling of the biological information contained in the titles and abstracts of the MEDLINE citations.
  • LSI automatically adapts to new and changing terminology, and has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.). This is especially important for applications using text derived from Optical Character Recognition (OCR) and speech-to-text conversion. LSI also deals effectively with sparse, ambiguous, and contradictory data.
  • Text does not need to be in sentence form for LSI to be effective. It can work with lists, free-form notes, email, Web-based content, etc. As long as a collection of text contains multiple terms, LSI can be used to identify patterns in the relationships between the important terms and concepts contained in the text.
  • LSI has proven to be a useful solution to a number of conceptual matching problems. The technique has been shown to capture key relationship information, including causal, goal-oriented, and taxonomic information

Build a system that return images associated with a query like in Google Images.

(click the triangle to expand and reveal the full answer)

The image search system would behave very similarly to document searches in Google. However, the categories that would be used for doing the search would be different. Many images will have enough associated metadata (or even surrounding text) to be placed within a graph created by a word clustering tool like word2vec or GloVe. Many of these images will have tags added onto them through this surrounding context information. Still, for some images Google could also run label-detection on many images, and then cache them for easier search. Said images would likely have multiple categories and labels sorted by specificity. This would allow a heirarchal structure to the image-references, allowing for more specific or more general queries for images.

(click the triangle to expand and reveal the full answer)

Our goal is to provide a dashboard widget that recommends content on Twitter (the microblogging site if you somehow didn’t know already) that is becoming popular in a given window of time (however small that may be). Given the usefulness of this tool to marketting and advertising, the company also wants to be able to predict popular hashtags in the near future. This problem can be modelled as a classification task, where the goal is to classify trending and non-trending hashtags with high precision and recall.

For data, you could likely use a small subset of tweets within a certain region (such as 0.5%) over the course of 1.25 days and use that for training and test data. After all at Twitter’s scale, it would be impractical to use the entiredy of the site’s content for training data. Once this is done, and the trending and non-trending hashtags are to be identified, the trending hashtags from the first 24 hours can be used as training data to predict the trending hashtags in the first hours of the second day. To increase the predictive power of an arbitrary hashtag hh in the first day, the following features will be added to the feature vectors:

  • lifetime\text{lifetime}: The lifetime of a hashtag hh is the number of hours during which the hashtag hh appeared in at least one tweet posted during that hour.
  • ntweetn_\text{tweet}: This is the total number of tweets containing the particular hashtag hh during the lifetime of hh.
  • nusern_\text{user}: This is the total number of distinct users who posted at least one tweet containing hashtag h during the lifetime of hh.

The first three properties – lifetime\text{lifetime}, ntweetn_\text{tweet}, and nusern_\text{user} – help to estimate the global level of popularity of a hashtag.

  • velocity\text{velocity}: This feature captures the rate of change of the number of tweets containing a particular hashtag. Let the number of tweets containing hashtag hh that are posted during a particular hour kk be denoted by NkhN_k^h . Then the hh velocity of hashtag hh during the hour kk is Nk+1NkhN_{k+1} - N_k^h, i.e., the difference between the number of tweets containing hh posted during hour kk and the next hour. Note that velocity can be negative as well – for instance, if a hashtag is included in lesser number of tweets during hour k+1k + 1 than during hour kk, then its velocity during hour kk will be negative.
  • acceleration\text{acceleration}: This is analogous to the rate of change of velocity at a particular hour. This is computed in a way similar to the Velocity feature described above. Let the velocity of hashtag hh during the hour kk be denoted by VkhV_k^h (computed in the way described above). Then the acceleration of hh during hh hour kk is computed as Vk+1VkhV_{k+1} - V_k^h. Similar to velocity, acceleration can be negative as well.

The last two properties, velocity and acceleration, help to estimate how quickly a hashtag is gaining (or losing) popularity. Even if the hashtag has a short lifespan, it shouls still be considered trending if its popularity is rapidly rising during that period. Maximum values of the velocity and acceleration will also help distinguish trending from non-trending hashtags. In short, this gives us a new feature vector of length 5 ([lifetime[\text{lifetime}, ntweetn_\text{tweet}, nusern_\text{user}, velocitymax\text{velocity}_\text{max}, accelerationmax]\text{acceleration}_\text{max}]) for a given hour during the hashtag’s lifetime.

There are a variety of classifiers that can be used for this, but this seems like the kind of task for which a neural network or tree-based model (e.g., random forest or boosted tree) would be most approprate, rather than an SVM. For a service as large as Twitter, we probably need to use a tool like Apache Spark (or a proprietary DataBricks version of Spark) or Michaelangeo to scale up the learning,

Each question on Quora often gets many different answers. How do you create a model that ranks all these answers? How computationally intensive is this model?

(click the triangle to expand and reveal the full answer)

We want to rank answers on a page according to how helpful they are. The goal of ranking is to put the most comprehensive, trustworthy answers at the top of the page, so that they are easily accessible for people who have a question. We have a variety of factors we can use to judge how helpful questions are:

  • upvotes and downvotes on the answer, in part based on how trustworthy voters are in a given topic
  • the previous answers written by the author
  • whether the author is an expert in the subject
  • the contents of the answer, including the type and quality of content (words, photos, links)
  • other signals, including ones that help us prevent gaming of ranking through votes

One way of combining these would be to normalize and scale (i.e., on a log scale) certain features, combine them into one large feature vector, and reduce this vector to a scalar score through dimensionality reduction (e.g., via PCA or t-SNE or UMAP)

Once the scores are generated for each answer, the sorting system arranges them from highest to lowest score (i.e, worst-case runtime complexity of O(nlogn)O(n \log n)). Our of all the score-generating methods at our disposal, UMAP has the advantage of being less computationally intensive and more reproducible.

How to you build a system to display top 10 results when a user searches for rental listings in a certain location on Airbnb?

(click the triangle to expand and reveal the full answer)

What we need is functionally a location-based recommender system. This is more complex than just listing places that are geometrically close (like one would with the start of a Lyft or Uber recommender system), but also needs to take into account rental listing ratings. We can imagine the ranking system for a given location (be it a new zipcode the service has expanded to or even a new country) as follows:

In the early stage the main priority would be to collect as much relevant user data. The best data would be in the form of search logs from users that make bookings. The training data would then need to be labelled according to which listings were booked fully, and which were clicked but ultimately not booked. For improving predictions for this classification task, relevant features would need to be aggregated from each booking and listing (e.g., duration of stay, price, price/hour, reviews, numbers of bookings, occupancy rate, maximum number of occupants, and CTR). This should be enough to train a machine learning model such as a neural network (MLP) or a tree-based model (like a gradient-boosted tree), though many features will need to be converted to ratios rather than raw counts before being entered (e.g., fractions of bookings, relative to listing views). The performance on held-out test data can be measured with AUC and NDGC. Given the simplicity of such a setup using only standard listing vectors for all users, it’s possible to use a service like Apache Airflow to gt all the rankings for a given region in a short time period.

Once the first stage is working, the next task would be to add personallization to it. This involves switching to a user-based recommendation system. We can add new features to our feature vectors (Booked Home location, Trip dates, rip length, Number of guests, Trip price relative to market, Type of trip, First trip or returning to location, Domestic / International trip, Lead days) to improve how customizable the search rankings are. Given a user’s short search history, it should also be possible to predict the categories of stays to recommend, as well as the user’s time-of-day availability. The personalization features would increase the computational load on the machine learning pipeline. With Apache Airflow, the pipeline could experience up to one-day latency (which is acceptable for a booking site), but it would eventually need to be moved from offline to online learning.

Moving from offline to online learning increases the number of features that can be added to the input data for the ML pipeline. These features mainly refer to those of the real-time user queries (location, number of guests, and dates. For example, we can create a feature from the distance between the entered location and the listing location, and factor that into the rankings. Ultimately this brings the number of total features for our model up to nearly 100, and this data will be spread across thousands of listings and millions of users. With this scale, it would make sense to train multiple ML models: a model for logged-in users (makes use of query features for user-personalization), and a model for logged-out traffic (also uses query and location features, but not user-features. Online logging allows us not only to train both types of models, but to also train on the logged-in user data far more frquently (no more need for pre-computed personallized rankings)personalization signals were available for a particular user id, else we fall back to using logged-out model. At this scale, our testing on held-out data would more resemble what’s typically called A/B testing.

The actual infrastructure for this behemoth ML pipeline will be divided into 3 components: 1) getting model input from various places in real time, 2) model deployment to production, and 3) model scoring. The models make use of three types of signals: those for listings, users, and user-queries. Since there are millions of users, user features need to be stored in an online key-value store and search-server boxes. listing-features are only in the thousands, and can easily be stored in the search-server boxes. These two features will be updated daily in the Airflow pipeline. query-features do not need to be stored at all, as they can just be acted on as they come in. The model files themselves can be converted to an H5 file format, that can be deployed to a spark or Java architecture. This will reduce the load-time for deploying the model. Which model to be used will be quickly determined by whether the user is logged in or not. This will determine which features to bring together for a final ranking of the listings in a given area.

For the monitoring of this system (mandatory for being able to give hosts feedback as well as helping the engineering team), one can use tools like Apache Superset and Airflow to create two dashboards:

  1. Dashboard that tracks rankings of specific listings in their market over time, as well as values of features used by the ML model.
  2. Dashboard that shows overall ranking trends for different regions or groups of listings (e.g. how 5-star listings rank in their market).

Autocompletion: how would you build an algorithm to finish your sentence when you text?

(click the triangle to expand and reveal the full answer)

Autocomplete is commonly found on search engines and messaging apps as a tool that speeds up interactions by suggesting what you’ll search. It must be fast and update the list of suggestions immediately after the user types the next letter. Autocomplete can only suggest words it knows about, whether these be English dictionary words or distinct words on Wikipedia (or words it’s learned). This search space grows larger if it incorporates multi-word phrases or entity names. The search space is reduced by the fact that the Autocomplete takes in a prefix and finds kk words in the vocabulary that have it.

One of the more useful data structures we can use for this problem is a Trie data retrieval data structure. It reduces search complexities and improves optimality and speed. Trie can search the key in O(M)O(M) time as long as it has proper data storage, which can be in-memory cache (Redis or Memcached), a database, or even a file. Assume SS is a set of kk strings. In other words, S=s1,s2,...,skS = {s_1, s_2, ..., s_k}. We model set SS as a rooted tree TT in such a way that each path from the root of TT to any of its nodes corresponds to a prefix of at least one string of SS. Consider an example set where S={het,hel,hi,car,cat}S = \{ het, hel, hi, car, cat \} and ϵ\epsilon corresponds to an empty string. Then a trie tree for SS looks like this:

Trie tree for S

Each edge of an internal node to its child is marked by a letter denoting the extension of the represented string. Each node corresponding to a string from SS is marked with color. If hehe is a prefix of another string from SS, then a node corresponding to hehe is an internal node of TT, else it’s a leaf. Sometimes, it is useful to have all nodes corresponding to strings from SS as leaves, and it is very common to append to each string from SS a character that is guaranteed not to be in Σ\Sigma. In our case, denote $ as this special character. Duch a modified trie looks like this:

Our modified Trie

Since now there is no string in SS, which is a prefix and another string from SS, all nodes in TT corresponding to strings from SS are leaves. In order to create trie TT, we just start with one node as a root representing the empty string ϵ\epsilon. If you want to insert a string ee to TT, start at the root of TT and consider the first character hh of ee. If there is an edge marked with hh from the current node to any of its children, you consume the character hh and get down to this child. If at some point there is no such an edge and child, you have to create them, consume hh, and continue the process until the whole ee is done.

Now, we can simply search a string ee in TT. We just have to iterate through the characters of ee and follow corresponding edges to these characters in TT starting from the root. If at some point there is no transition to children or if you consume all the letters of e, but the node in which you end the process does not correspond to any string from SS, then ee does not belong to SS either. Otherwise, you end the process in a node corresponding to ee, so ee belongs to SS.

Another version of the Trie approach is the Suffix-Tree Algorithm. A suffix-tree is a compressed trie of all the suffixes of a given string. Suffix trees are useful in solving a lot of string-related problems like pattern matching, finding distinct substrings in a given string, finding the longest palindrome, etc.. A suffix tree TT is an improvement over trie data structure used in pattern matching problems. The one defined over a set of substrings of a string ss. This is a trie that can have a long path without branches. The better approach is reducing these long paths into one path, thereby greatly reducing the size of the trie significantly. Consider the suffix tree TT for a string s=abakans = abakan. A word abakanabakan has 6 suffixes{abakan,bakan,akan,kan,an,n}\{ abakan, bakan, akan, kan, an, n \} and its suffix tree looks like this:

Esko Ukkonen demonstrated in 1995 that it’s possible to make a suffix tree for s in linear time

Suffix trees can solve many complicated problems because it contains so many data about the string itself. For example, in order to know how many times a pattern PP occurs in ss, it is sufficient to find PP in TT and return the size of a subtree corresponding to its node. Another well-known application is finding the number of distinct substrings of ss, and it can be solved easily with a suffix tree. The completion of a prefix is found by first following the path defined by the letters of the prefix. This will end up in some inner node. For example, in the pictured prefix tree, the prefix corresponds to the path of taking the left edge from the root and the sole edge from the child node. The completions can then be generated by continuing the traversal to all leaf nodes that can be reached from the inner node. Searching in a prefix tree is extremely fast. The number of needed comparison steps to find a prefix is the same as the number of letters in the prefix. Particularly, this means that the search time is independent of the vocabulary size. Therefore, prefix trees are suitable even for large vocabularies. Prefix trees provide substantial speed improvements to over-ordered lists. The improvement is realized because each comparison is able to prune a much larger fraction of the search space.

Can we do better than prefix trees? Prefix trees handle common prefixes efficiently, but other shared word parts are still stored separately in each branch. For example, suffixes, such as -ing\text{-ing} and -ion\text{-ion}, are common in the English language. Fortunately, there is an approach to save shared word parts more efficiently. A prefix tree is an instance of a class of more general data structures called acyclic deterministic finite automata (DFA). There are algorithms for transforming a DFA into an equivalent DFA with fewer nodes. Minimizing a prefix tree DFA reduces the size of the data structure. A minimal DFA fits in the memory even when the vocabulary is large. Avoiding expensive disk accesses is key to lightning-fast autocomplete.

The Myhill-Nerode theorem gives us a theoretical representation of the minimal DFA in terms of string equivalence classes. Saying that two states are indistinguishable means that they both run to final states or both to non-final states for all strings. Obviously, we do not test all the strings. The idea is to compute the indistinguishability equivalence classes incrementally. We say pp and qq are kk-distinguishable if they are distinguishable by a string of lengthk\text{length} ≤ k

It’s easy to understand the inductive property of this relation: pp and qq are kk-distinguishable if they are (k1)(k-1)-distinguishable, or δ(p,σ)\delta(p, \sigma) and δ(q,σ)\delta(q, \sigma) are (k1k-1)-distinguishable for some symbol σΣ\sigma \in \Sigma

The construction of the equivalence classes starts like this: pp and qq are 00-indistinguishable if they are both final or both non-final. So we start the algorithm with the states divided into two partitions: final and non-final.

Within each of these two partitions, pp and qq are 1-distinguishable if there is a symbol σ\sigma so that δ(p,σ)\delta(p, \sigma) and δ(q,σ)\delta(q, \sigma) are 0-distinguishable. For example, one is final and the other is not. By doing so, we further partition each group into sets of 1-indistinguishable states.

The idea then is to keep splitting these partitioning sets as follows: pp and qq within a partition set are k-distinguishable if there is a symbol σ\sigma so that δ(p,σ)\delta(p, \sigma) and δ(q,σ)\delta(q, \sigma) are (k1)(k-1)-distinguishable.

At some point, we cannot subdivide the partitions further. At that point, terminate the algorithm because no further step can produce any new subdivision. When we terminate, we have the indistinguishability equivalence classes, which form the states of the minimal DFA. The transition from one equivalence class to another is obtained by choosing an arbitrary state in the source class, applying the transition, and then taking the entire state:

State diagram for minimal DFA

Start by distinguishing final and non-final: {q1,q3}\{q_1,q_3\}, {q2,q4,q5,q6}\{q_2,q_4,q_5,q_6\} Distinguish states within the groups, to get the next partition: {q1,q3}\{q_1,q_3\}, {q4,q6}\{q_4,q_6\}, {q5}\{q_5\}, {q2}\{q_2\}

  • bb distinguishes q2,q4:δ(q2,b){q1,q3}q_2, q_4: \delta(q_2,b) \in \{q_1,q_3\}, δ(q4,b){q2,q4,q5,q6}\delta(q_4,b) \in \{q_2,q_4,q_5,q_6\}
  • bb distinguishes q2,q5:δ(q2,b){q1,q3}q_2, q_5: \delta(q_2,b) \in \{q_1,q_3\}, δ(q5,b){q2,q4,q5,q6}\delta(q_5,b) \in \{q_2,q_4,q_5,q_6\}
  • aa distinguishes q4,q5:δ(q4,a){q1,q3}q_4, q_5: \delta(q_4,a) \in \{q_1,q_3\}, δ(q5,a){q2,q4,q5,q6}\delta(q_5,a) \in \{q_2,q_4,q_5,q_6\}
  • neither aa nor bb distinguishes (q4,q6)(q_4,q_6)

We cannot split the two non-singleton groups further; the algorithm terminates. The minimal DFA has start state {q1,q3}\{q_1,q_3\}, single final state {q4,q6}\{q_4,q_6\} with transition function:


new minimal DFA

For designing the actual system, we need to consider components beyond just the basic trie structure. We need to design a system that is fast and scalable. At a high level, our autocomplete gets the request of a prefix and then sends it to the API. In front of the API server, we need to have a load balancer. The load balancer distributes the prefix to a node. The node is a microservice that is responsible for checking cache if related data of the prefix is there or not. If yes, then return back to the API, else check zookeeper in order to find a right suffix-tree server. Zookeeper defines the availability of the suffix-tree server. For example, in zookeeper, we define a $ s-1 . It means for aa to $ that indicates the end of suffix, server s1 is responsible.Background processors will be needed to take a bunch of strings and aggregate them in databases in order to apply on suffix-tree servers. In these, we get streams of phrases and weights (these streams can be obtained from a glossary, dictionary, etc.). The weights are provided based on data mining techniques that can be different in each system. We need to hash our weight and phrase and then send them to aggregators that aggregate the database on their similar terms, created time, and the sum of their weights to databases. The advantage of this approach is that we can recommend data based on relevance and weights.

Initial system setup

The current system is also poorly equiped to handle more than a thousand simultaneous requests. We need to improve it with vertical scaling but all while reduce the number of single failure points (i.e., we need horizontal scaling too). A round-robin algorithm is recommended for equally distributing traffic between systems. We also need to add more cache servers while guaranteeing that the data on each is equal. We can simply use a hashing algorithm to decide which data should be inserted in which server (though this system could break if one server fails). WIth this design requirement, we need to use consistent hashing to allow servers and objects to scale without affecting the overall system. Also, our zookeeper needs some changes, and as we add more servers of suffix-tree, we need to change the definition of zookeeper to a-k s1 l-m s2 m-z s3 $. This will help node servers to fetch the right data of suffix-tree.

The full autocomplete system

These data structures can be extended in many ways to improve performance, as there are often many more possible completions than can be displayed by the user interface. It’s also important that the existence of search engines like Elasticsearch & [Solr](( obviate building your own autocomplete system from scratch each time.

When you type a question on StackOverflow, you’re shown a list of similar questions to make sure that your question hasn’t been asked before. How do you build such a system?

(click the triangle to expand and reveal the full answer)

Clustering (or unsupervised learning in general), would be a useful tool for building this system. Each question becomes a part of a cluster, and other questions in the same cluster (probably in the order of similarity measure\text{similarity measure} distance) are displayed as similar questions. There are many features we can use for the clustering, including but not limited to question tags, words in the heading, words in the text (with less weighting than the heading), and/or hyperlinks to other questions or pages. Other features could be generated from the data using techniques like text summarization or sentiment analysis.

From this clustering, we could select the NN top closest questions by feature-vector-closeness and display those.

How would you design an algorithm to match pool riders for Lyft or Uber?

(click the triangle to expand and reveal the full answer)

In Uber pool or Lyft Line, we want to match riders on similar routes or similar destinations. The challenge is to create a matching system that would result in we accurately predicting whether two, three, or four passengers together would enjoy the selected route.

An initial naive approach could be to try matching riders to other nearby riders, and see who matches in under 1 minute before the ride arrives. There could also be a bucketed matching system that asigns riders to smaller clusters based on which 10-minute period of the day they called the ride in. Unfortunately, this would cause a supply shock, as it would require drivers being available for every single 10-minute interval in a 24-hour day (and it’s not guaranteed that there would be any requests in many of these intervals).

For a matching system, we could start with a a simple greedy haversine matchmaking system. Haversine distances are straight-line distances between two points and are multiplied by the region’s average speed to get a time estimate. A haversine implementation is quick to implement and with tight enough constraints (e.g. detours, time until pickup, etc.) would make sure users had a good match. Such a system would compare each rider with every other rider in the system (O(n2)O(n^2)), and whichever riders were the immediate closest that satisfied certain constraints (e.g., rating, maximum distance) would be paired. This in turn would be combined with an interval-matching problem making use of the departing and arriving distances/times. The problem is that this could result in drivers taking a lot of detours (since this is evaluating passengers based on straight-line distances). This distance judgement could be improved with geohashing (location-based bucketing), and adding a modified A* algorithm to more accurately measure distances to the passengers’ desired routes. More data collection from ride histories would eventually allow for optimized geohash sizes, based on relative demand in each one.

Geohashing example

This is still a very simplified system that would need to be optimized further to handle triple-matching (this makes the interval-matching problem a much harder combinatorial explosion), longitudinal sorting (going east-west vs. north-sourth may be more difficult in some regions than others, such as in the SF bay area), and rerouting (would require many calculations to be redone).

On social networks like Facebook, users can choose to list their high schools. Can you estimate what percentage of high schools listed on Facebook are real? How do we find out, and deploy at scale, a way of finding invalid schools?

(click the triangle to expand and reveal the full answer)

We have three main approaches we can take: finding out from user data, finding out from available public data, and finding out from user votes or reports.

If we have a lot of user data already, we can find similar users based on places born, grow up, street names, tags, also their friends. If in a circle no one or very few people have listed that high school name, it is likely a fake high school. If we need to use an algorithm, we can use K-Nearest-Neighbors, find all the similar users with the person with potentially fake high school, then list out and rank of similar users’ high school.

If we have easy access to public data (in a country like the United States, for example), we can get a list of high school names (from government or trust worthy organizations) and match high school names.

If we already have an abuse-reporting or downvoting system in place, we can root out the fake schools from people who report scams or downvote fake high school names.

How would you build a trigger word detection algorithm to spot the word “activate” in a 10 second long audio clip?

(click the triangle to expand and reveal the full answer)

We want to set up a model that can do a batch-evaluation of whether a 10-second audio clip contains the trigger word (in our case, the word “activate”). Ideally the training data would contain both instances of the trigger word and other words (along with background noise and silence), to resemble the production environment. While more labelled training data helps, it should be possible to artificially generate training data by combining recordings of different background audios, 1-second recordings of the trigger word in different tones or voices (all given positive labels), and 1-second recordings of non-trigger words. The negatively and positively labeled words can then be overlaid onto the background audio clips (with about 0-5 trigger word examples and 0-3 negative word examples per clip). Overlaying gives us more realistic data since we want our model to identify when someone has finished saying the trigger word, not just whether it exists in the sound clip.

We first initialize all timesteps of the output labels to 0s. Then for each “activate” we overlayed, we also update the target labels by assigning the subsequent 50 timesteps to 1s. The reason we have 50 timesteps 1s is because if we only set 1 timestep after the “activate” to 1, there will be too many 0s in the target labels. It creates a very imbalanced training set. It is a little bit of a hack to have 50 1 but could make them a little bit easy to train the model. Here is an illustration to show you the idea.

target label diagram

For a clip which we have inserted “activate”, “innocent”, activate”, “baby.” Note that the positive labels 1 are associated only with the positive words. In the spectrogram that represents our input data (the frequency representation of the audio wave over time), he x-axis is the time and y-axis is frequencies. Bright color indicates how loud a certain frequency is at that time. bright the color is the more certain frequency is active (loud). For reading this input data, we have the following model architecture:

trigger word model

The 1D convolutional step inputs 5511 timesteps of the spectrogram (10 seconds), outputs a 1375 step output. It extracts low-level audio features similar to how 2D convolutions extract image features. This also speeds up the model by reducing the number of timesteps. The two GRU layers read the sequence of inputs from left to right, then ultimately use a dense plus a sigmoid layer to make a prediction. The final sigmoid layer outputs the range of each label between 0 and 1, with 1 corresponding to the user having just said the trigger word.

If you were to build a Netflix clone, how would you build a system that predicts when a user stops watching a TV show, whether they are tired of that show or they’re just taking a break?

(click the triangle to expand and reveal the full answer)

This is a modified customer chrun modelling problem.

Churn is when a user stops using the platform - in the case of Robinhood this could be defined as when a user’s account value falls below a minimum threshold for some number of days (say 28). We can build a classifier to predict user churn by incorporating features such as time horizon of the user (i.e. long term investor or short term investor), how much they initially deposited, time spent on the app, etc.

To assess feature importance, we can either run a model like logistic regression and look for weights with a large magnitude, or we can run a decision tree or random forest and look at the feature importance there (which is calculated as the decrease in node impurity weighed by the probability of reaching that node).

Facebook would like to develop a way to estimate the month and day of people’s birthdays, regardless of whether people give us that information directly. What methods would you propose, and data would you use, to help with that task?

(click the triangle to expand and reveal the full answer)

In some cases, this might be as simple of a task as looking at a person’s timeline and seeing when people start sending tons of congratulatory messages. This will be especially telling if these are people that they person does not normally interact with (or family members) that are making the posts. Pictures of birthday celebrations are also a helpful indicator.These are all useful if the person actively chooses not to put their birthday details on Facebook.

It is unlikely that trying to cluster people by interests would yield much correlation with birth month or birth day. An exception to this might be people who indicate preferences correlating with a strong belief in the efficacy of astrology. Such individuals may have self-reinforcing preferences based around their self image of how they should act based on their zodiac sign (in whatever zodiac system they choose to believe in). If the users are initially filtered based on their astrology beliefs, it should be possible to make reliable month estimates for at least a subset of users based on interest groups.

If it’s just a matter of the person forgetting to put the information there, this could be acquired through design testing, a birthday wishes function (designing a box of birthday wishes and friends can anonymously send gifts), questionaires, or sending a friendly reminder. If all else fails some of this information could be imported from another site.

Build a system to predict the language a text is written in.

(click the triangle to expand and reveal the full answer)

This would most likely involve building a classifier system that takes in text, and then outputs a class label corresponding to several of the most likely languages to be used in an app (e.g., 0 for English, 1 for French, 2 for Spanish, etc.). As a supervised learning problem, there is a relatively straightforward process to creating training data, with just samples of text along with a few hundred thousand (at least) examples in each language category.

One way to do this would be to take metadata such as relative character frequencies and develop a simple multi-layer perceptron that can do the classification. Howerver, we may want to try an approach that would work better with shorter lengths of text. This would require using a character-level tokenizer, which we could then feed into a network such as a tree-based model (like random forest or XGBoost) or neural network model (Stacked LSTM/GRU, Bi-directional LSTM, transformer-based model like BERT).

If building a system from scratch isn’t an option due to time constraints, the library langdetect (ported from Google’s language-detection) is available on pypi. It works on 55 different languages.

Predict the house price for a property listed on Zillow. Use that system to predict whether we invest on buying more properties in a certain city.

(click the triangle to expand and reveal the full answer)

For training, we can create a training dataset of properties with known prices, along with a feature vector containing information on known home. We can look at a distributon of house price vs livable surface area, and use this to eliminate any outliers (e.g., condemned buildings) from the data. We then want to create a probability plot for the sale price. Since this is likely going to be right-skwed (e.g., prices between more expensive houses being more different than prices between less expensive houses), we need to do a log-transform on the target variable. We also want to tailor our data based on how many features are missing from large chunks of the listed homes (e.g., unless otherwise specified, many listings are probably not listing whether or not they have a pool if they don’t have one). For this missing data, we will just set the label to ‘None’ (e.g., ‘no alley access’, ‘no fence’, ‘no misc feature’. A lot of features that are numerical (e.g., such as year of construction) will actually be categorical instead of continuous, so they will need to be formatted as such. Highly skewed features will need to undergo a box-cox transformation before any model training is done.

Speaking of modelling, we would be wise to choose from a variety of model types (such as elasticNet regression, XGboost regression, and LightGBM). These models would be trained on the data and evaluated based on root mean squared error (RMSE). With our base model scores we can ensemble them by averaging the scores, though we can do better by leveraging encapsulation. We can then build a powerful stacked regressor as our final trained model.

For prediting whether to invest in a certain city, this combined the model outputs with future forecasting. Since not everything is known about the future, some of the details of the future could be modeled with random variables. These random variables could then be fit into the model across hundreds of samples representing a distribution of outcomes for a certain city. Some features, such as the floor size, are unlikely to change, and will be set as constants for given houses. Other details, such as demand and city legal changes, will need to be factored in as samples from bernoilli or categorical distributions. A hamiltonian monte carlo method could then be used to feed these distributions into .

Given the difficulty of scaling up a hamiltonian monte carlo with compute equipment, such batch processing would need to be repeated separately on different devices for different cities. Such compute management could be managed with a service like Horovod.

Predicting the house prices accurately is not enough. We also want to use this tool to decide whether or not to invest in properties in a certain city. Once the algorithm runs correctly, we can track differences in projected home prices vs actual real-world home prices, and use a positive delta beyond a certain theshold as a ‘buy’ signal.

Imagine you were working on iPhone. Everytime users open their phones, you want to suggest one app they are most likely to open first with 90% accuracy. How would you do that?

(click the triangle to expand and reveal the full answer)

We want to be able to predict, given a period of app usage, what the first app label within that usage log will be. We also want to do this with 90% Accuracy. Ultimately we want to build a recommender system with this information.

Your would likely have lists of apps that are used over the course of the day. The phone will probably log some kinf of information related to when an app was opened, closed, running in the background, or whether the phone was closed or turned off. This logging has a time-series element to it, but it could also be featurized into other descriptors such as morning vs afternoon, time from wakeup hour or meal, or day of the week. We could also add features such as previously-used-5-apps. All of this information wold then be combined into a classification network (with the ranks of the logits being used to identify the top NN most likely next apps if needed, though a single output will meet our functional requirements in this case).

If we have no other information about the user (e.g., the user just got their phone), then we could draw upon user-specific data for our recommender system. If the user registered their phone, there is at least name, address, and payment information. We can use these information to find out what the most probable popular apps will be (by usage) among certain demographics (provided those apps are installed on their phone).

For the sake of memory and CPU-time conservation, such a recommendation system could likely be constructed from a random forest model. In fact, some parts of the resulting decision tree could even be hard-coded to take into account what to do given a lack of information. While tools like XGBoost are great for constructing such trees in normal circumstances, Apple’s own CoreML framework can be leveraged for this kind of decision tree work. If we wanted to leverage greater compute power, we could use federated learning (only recently been implemented in CoreML). This would involve updating a model partly on a phone’s on-device data, sending part of the model (either trees in a random forest or weights from a neural network) to a central oracle when the phone is not in use (such as at night). This would allow a centralized model to learn from all the app usage data while preserving user privacy (an important selling point for the iPhone).

How do you map nicknames (Pete, Andy, Nick, Rob, etc) to real names?

(click the triangle to expand and reveal the full answer)

At a high level, we want to extract data from various sources and create a statistical approach. We need to think about how this will be used, and design an experiment to test it. Assuming the goal is to map nicknames to real names for social media users, our possible approaches could resemble one or more of the following:

  • Use an open-source corpus to map names (many NLP libraries may have something like this).
  • Set up quick questionnaires for people seemingly using nicknames (e.g. is this your real name Andy?) or users connections.
  • Extract data from comments within a profile.
  • Detect how a users’ friends friends or connections them in comments, posts, and birthday greetings.
  • From comments in his network, detect from comments how their close connections or friends address them.
  • Extract data from conversations (assuming this method will be harder, since people who interact with the users are more likely to be friends and family members that will call the nicknames).
  • Extract data from external sources (other social media network or websites). If such information is provided during sign-up, this can be found using first names, last names, and profile pics to search for external sources for possible real names.

Building on the methods above, we would want to do A/B testing and see people’s reactions to our attempts at mapping nicknames to real names.

An e-commerce company is trying to minimize the time it takes customers to purchase their selected items. As a machine learning engineer, what can you do to help them?

(click the triangle to expand and reveal the full answer)

It doesn’t matter if you own an e-commerce or a supermarket. It doesn’t matter if it is a small shop or a huge company such as Amazon or Netflix, it’s better to know your customers. In this case, we want to be able to know our customers’ behavior during the checkout process (we’re referring to selected items and not searching for items), and figure out the bottlenecks.

First, we want to make sure we’re measuring the time it takes for orders to be processed. This will be the time from the clicking of the checkout button, to the completion of the payment and shipping details. An e-comerce system will probably already have a hashing system for tagging customer receipts. A similar mechanism can be used to create customer-specific hashes corresponding to logged events such as moving from the payment page to the shipping page. It’s entirely possible that some customers may not complete the process, so the information should be stored in a local cache until the checkout is complete.

Given all this information, we can set up an A/B testing strategy. We can create variants of the layouts to be displayed to certain subsets of the population, and use this information to figure out which layouts make navigation and payment easier. Once the time-stamps are collected for the interval lengths for each page (as well as the total time it takes to go through checkout), this can be reduced to a time-series problem. Given the limited size of the website features that would be incorporated into an A/B test, a relatively simple model like random forest regression should be able to identify the factors on the website that are contributing to longer checkout times.

Build a chatbot to help people book hotels.

(click the triangle to expand and reveal the full answer)

The word “chatbot” is so ubiquitous that we should quickly stop to define what we’re building. We’re referring to a machine learning system that is used to generate natural language to simulate conversations with end users. This output may come through websites, messages on mobile apps, or through voice over phone calls. The system takes in input from a user, processes the input (such as by extracting questions from the input), and delivers the most relevant response to the user.

Any hotel-booking chatbot will need to be trained with actual chat data. Companies which already make use of human-powered chat have logs of the previous conversations (these can be structured as pairs of text with the users’ words being labelled as the context, and the bots’ response being the correct label). At least several thousand logs will be needed to properly train the network.

These logs are used to analyze what people are asking and what they intend to convey. With a combination of machine learning models (including models based off of transformer models like BERT) and an ensemble of other tools (the training may involve a multi-input network that takes in both tokenized text and conversation metadata), the user’s questions are matched with their intents. Models are trained in such a way that the bot is able to connect similar questions to a correct intent and produce the correct answer.

At the end of each conversation with a chatbot, the question, ‘Did we help you?’ is asked and if we answer ‘No,’ the conversation is forwarded to human support. The same way when a chatbot is unable to understand a question provided by a human, it will forward the conversation by itself to human customer support. The chatbot will need to be able to continuously learn throughout its lifetime (and not drift into inappropriate behavior)

Once the bot is set-up and starts interacting with customers, smart feedback loops are implemented. When customers ask questions, chatbots give them certain options in order to understand their questions and needs more clearly. This information is used to retrain the model thereby ensuring the accuracy of the chatbot. There are also guarding systems which have been implemented to make sure that the bot doesn’t change based on a few replies. This is also ensured by making sure that the bot simply just doesn’t rephrase what people tell it but is actually taught to answer questions the way its owner (the company) wants it to answer. The engineers at the developing company can decide the extent to which it wants to expand or shrink its understanding.

Example high-level framework for our chatbot

How would you design a question answering system that can extract an answer from a large collection of documents given a user query?

(click the triangle to expand and reveal the full answer)

The basic idea here will be to treat a document set covering many different topics as a kind of knowledge base to which we can submit questions and receive exact answers. The fact that the documents may have varying levels of authenticity or objectivity is irrelevant.

First, we would want to build a preprocessing system for the documents. If these are physical documents we would set up an OCR system. If this is just text, we skip the OCR and move onto removing headers, footers, and quotes from the documents. We can do this easily with a package like sklearn.

With a dataset in hand, we will first need to create a search index. The search index will allow us to quickly and easily retrieve documents that contain words present in the question. Such documents are likely to contain the answer and can be analyzed further to extract candidate answers. We will first initialize the search index and then add documents from the Python list to the index. For document sets that are too large to be loaded into a Python list, we can set our indexer to crawl a folder and index all plain text documents found.

Next, we will create a QA instance, which is largely a wrapper around a pretrained BertForQuestionAnswering model from the excellent transformers library.. We can train this using utilities from the ktrain python library

And that’s all that’s needed for the basic functionality. In roughly 3 lines of code, one can construct an end-to-end Question-Answering system that is ready to receive questions.

Even though the text may be structured, the’s no set limit to how complex the questions can get. Facebook Research open-sourced their code for decomposing large, harder questions into smaller, easier to answer sub-questions. This is an unsupervised technique that starts by learning to generate useful decompositions from questions with unsupervised sequence-to-sequence learning.

How would you train a model to predict whether the word “jaguar” in a sentence refers to the animal or the car?

(click the triangle to expand and reveal the full answer)

This is a classic Entity Disambiguation problem. A system could conceivably take the following high-level steps (borrowed heavily from OpenAI’s research into this problem):

  1. Extract every Wikipedia-internal link to determine, for each word, the set of conceivable entities it can refer to. For example, when encountering the link [jaguar]( in a Wikipedia page, we conclude that is one of the meanings of jaguar.
  2. Walk the Wikipedia category tree (using the Wikidata knowledge graph) to determine, for each entity, the set of categories it belongs to. For example, at the bottom of’s Wikipedia page, are the following categories (which themselves have their own categories, such as Automobiles):jaguar

  1. Pick a list of ~100 categories to be your “type” system, and optimize over this choice of categories so that they compactly express any entity. We know the mapping of entities to categories, so given a type system, we can represent each entity as a ~100-dimensional binary vector indicating membership in each category.
  2. Using every Wikipedia-internal link and its surrounding context, produce training data mapping a word plus context to the ~100-dimensional binary representation of the corresponding entity, and train a neural network to predict this mapping. This chains together the previous steps: Wikipedia links map a word to an entity, we know the categories for each entity from step 2, and step 3 picked the categories in our type system.
  3. At test time, given a word and surrounding context, our neural network’s output can be interpreted as the probability that the word belongs to each category. If we knew the exact set of category memberships, we would narrow down to one entity (assuming well-chosen categories). But instead, we must play a probabilistic 20 questions: use Bayes’ theorem to calculate the chance of the word disambiguating to each of its possible entities.

OpenAI’s Paper “DeepType: Multilingual Entity Linking by Neural Type System Evolution” goes into more detail on how this can be done.

Suppose you’re building a software to manage the stock portfolio of your clients. You manage X amount of money. Imagine that you’ve converted all that amount into stocks, and find a stock that you definitely must buy. How do you decide which of your currently owned stocks to drop so that you can buy this new stock?

(click the triangle to expand and reveal the full answer)

It’s not as simple as just picking the lowest-trading stock. If we were deciding whether or not to sell a stock without being forced, there are a variety of different options we could choose. For example, we could sell shares of a stock when its 50-day moving average goes below the 200-day moving average.

However in this case, we’re trying to decide which of our current portfolio to trade for another stock. We could choose modified versions of several algorithmic trading strategies.

  • Trend following strategy is one of the more common algorithmic stock-trading strategies. We could simply select which of the current stocks of ours has the downward-projected trend with the greatest magnitude (if a different stock is a must-buy, it’s a safe assumption this means it will be a better performer than our current worst-performing stock).
  • Mean reversion strategy is based on the concept that the high and low prices of an asset are a temporary phenomenon that revert to their mean value (average value) periodically. Identifying and defining a price range and implementing an algorithm based on it allows trades to be placed automatically when the price of an asset breaks in and out of its defined range. In this case, we could determine which of the stocks in our portfolio have had the smallest ranges around the 50-day moving averages (compared to their all-time largest ranges).

How would you create a model to recognize whether an image is a triangle, a circle, or a square?

(click the triangle to expand and reveal the full answer)

There are a few approaches to this problem. Many people’s first instinct would be to reply “classify with a Convolutional Neural Network”. This process would most likely involve gathering a bunch of training data for the various shapes, and training a multi-class classifier on the data. Some may change this approach by recognizing that that there might be multiple shapes in one image, and their instinct would be to switch to a semgentation architecture like PSPNet or YOLO.

It should be worth noting that for a shape as simple as a Triangle, Circle, or square, a technique like a Haar Cascade could be used. Such a computer vision technique would be much easier and faster to train (and depending on the implementation, could even be built from scratch). Like the segmentation network, this would also have the benefit of identifying multiple shapes in one image.

Another approach is to calculate the best-fit bounding rectangle of each object, such that the bounding rectangle can be oriented at an arbitrary angle. We can take the following approach:

  1. Identify the boundary (i.e. perimeter) of each object.
  2. Calculate the smallest rectangular bounding box that contains this perimeter.
  3. Calculate the width, height, and area of each bounding rectangle. The aspect ratio (ratio between the width and height) can be used to determine whether it is a square (aspect ratio 1.0\sim 1.0) or rectangle.
  4. For a rectangle or square, the filled area of the object (from regionprops()) should be almost the same as the area of its bounding rectangle, whereas for a triangle it should be substantially less.
  5. For the circularity condition, one can use the ratio between perimeter and area.

It should be noted that OpenCV has many shape-detection tools. All one would need to do is convert the image to greyscale, smooth the image with a gaussian blur, threshold it, and use the built-in shape detection utility to fin the contours.

Given only CIFAR-10 dataset, how to build a model to recognize if an image is in the 10 classes of CIFAR-10 or not?

(click the triangle to expand and reveal the full answer)

See the original Bayesian Neural Network post for an in-depth code walkthrough

Typical image-recognition neural networks are great at assigning correct classifications on datasets guaranteed to contain data of one of the classification categories. However at the moment what we need is a network that, when given data far outside the distribution of data it was trained on, refuse to make a classification decision. The network in question would be a Bayesian Convolutional Neural Network. Much of the network architecture would be superficially similar (2D Convolution and Dense layers replaced with 2D Convolutional flipout and dense flipout layers). The main differences are that the weights and biases in these layers would be prior distirbutions instead of set initialized values. As the network is trained via variational Bayes, the priors update. To get a classification evaluation, all the distributions in the network are sampled (let’s say 100 times for our purposes) based on the input data. Each class will get a set classification probability, but they will not necessarily add up to a certain amount. All the probabilities can be incredibly small. We can set a probability boundary (like 0.2) where if none of the class probabilities are above this, the network will refuse to make a decision (i.e., it’s not part of a CIFAR-10 class in this case) rather than producing a label from the argmax\arg \max of the logits.

Let’s say that you are the first person working on the Facebook News Feed. What metrics would you track and how would you improve those metrics?

(click the triangle to expand and reveal the full answer)

Being the first person on the project, you would have two tasks: figuring out what metrics to track and how to improve them, and which items to promote in the beginning with little to no data.

The goal of this optimization is increasing user engagement (i.e., we want users to stay engaged more), which can be defined as time spent on the newsfeed and/or actions on the feed (e.g., likes, posting activity, shares, reactions, commenting). Some of the features that would be relevant to keeping the newsfeed up-to-date and personalized include but are not limited to the following: reaction types, placement in various ranking models that can more accurately predict how likely a user is to engage with posts, etc. All of these can be either A/B tested or Blue/Green tested for their effects on the engagement metrics (e.g., number of interactions or length of engagement). It is of course important to not optimize for only these metrics, as it could incentivize promoting clickbait content.

For someone new to the news feed, some of the initial choices of posts could be pulled from posts that friends in the users’ network (with the posts) have been engaging in.

References and Resources

In the ML community, it’s important to give credit where credit is due, and not just slap our name on someone else’s hard work cough.. Here is a list of the references and resources I used for various equations, concepts, explanations, examples, and inspiration for visualizations:

Cited as:

  title   = "Deploying and Scaling ML",
  author  = "McAteer, Matthew",
  journal = "",
  year    = "2019",
  url     = ""

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatily, you can follow me on Twitter and reach out to me there.

See you in the next post 😄

I write about AI, Biotech, and a bunch of other topics. Subscribe to get new posts by email!

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

At least this isn't a full-screen popup

That'd be more annoying. Anyways, subscribe to my newsletter to get new posts by email! I write about AI, Biotech, and a bunch of other topics.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.