Private ML Marketplaces

Fixing tradeoffs between various private ML strategies

Introduction

One of the most exciting areas in machine learning right now is private machine learning. At its core, private machine learning is concerned with balancing two competing interests:

  1. ML model owners and developers want to further improve their models with additional training data. This is what machine learning engineers do for a living, and it is the main strategy for AI-as-a-service (AIaaS) companies: develop a model on some initial data, then use the proprietary model in a product.
  2. Data owners want their data to be used fairly. If they do not directly benefit from the model, or if they would face some sort of risk in revealing the data for free, they have no incentive to cooperate with the model owners. In other words, the data providers want to be compensated fairly.

This post summarizes and discusses some of the various approaches previously proposed in this space (e.g., smart contracts, data encryption, transformation and approximation, and federated learning). In addition, this post proposes a way to address this balance using a model-data efficacy approach based on model approximation (with an example using model extraction). This approach has three main objectives:

1. Transacting additional data between the model owner and the data owner.
2. Fairly pricing the transaction.
3. Preserving the model and data details.

Problem Setup

To establish a common nomenclature for this post, let’s define the problem in more detail beyond our two points above:

  • $T$ is a trained model with parameters $\Theta$, owned by the model-owner.
  • Details regarding $T$ and $\Theta$ are valuable.
  • The data-owner(s) own additional training data, $D$, that may or may not improve $T$.
  • The data-owner(s) want to protect data details, lest they be shared.
  • $\Delta \Theta_T(D)$, the resulting update, is a proxy for the benefits $T$ gets from additional training data $D$.

These are the main properties shared among all the approaches explored, and each approach’s pros and cons can be framed in terms of this problem setup.

Protecting models with Homomorphic Encryption

The AIaaS approach depends on some sort of model staying proprietary. For example, many drug discovery companies publish high-level details of their models as press releases, but few of those documents contain implementable details. One of the proposed approaches for protecting these models in use-cases like on-device machine learning is homomorphic encryption.

Consider the inference process $I$ with respect to model $T$, $I_T$. Homomorphic encryption takes the operations that make up the model and maps them into a separate but analogous algebraic group (i.e., a homomorphism). Encrypting all operations like this conceals the model; $\mathcal{H}(I_T)$ can perform inference on the data ($D$) and updates on the model. A fully homomorphic encryption $\mathcal{H}$ on $I$ preserves computational correctness without revealing model details, at the expense of efficiency:

$\mathcal{H}(I_T)(D) = I_T(D)$

Additionally, a scaling function on $I_T(D)$ can be overlaid to facilitate fair pricing and secure transactions (i.e., the OpenMined protocol approach):

$\mathcal{H}(P(I_T(D))) = P(I_T(D))$

In principle, this all sounds pretty useful. The caveat? The encryption and computation are still too slow to be practical. Outside of incredibly simple models, this would likely depend on foundational advances in encryption and possibly specialized ASICs.
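To make the model-protection idea concrete, below is a minimal toy sketch that uses additively homomorphic (Paillier) encryption via the `phe` package rather than fully homomorphic encryption: the model owner encrypts the weights of a linear model, the data owner evaluates an encrypted dot product on plaintext inputs, and only the model owner can decrypt the result. The model, values, and workflow here are illustrative assumptions, not the OpenMined protocol itself.

```python
# Toy sketch: protecting a linear model's weights with additively
# homomorphic (Paillier) encryption via the `phe` package (pip install phe).
# This is only an illustration; it is not the fully homomorphic setting above.
from phe import paillier

# --- Model owner ---
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
weights = [0.8, -1.3, 2.1]                      # plaintext parameters (Theta)
bias = 0.5
enc_weights = [public_key.encrypt(w) for w in weights]
enc_bias = public_key.encrypt(bias)

# --- Data owner ---
# The data owner sees only ciphertexts, yet can still compute an encrypted
# score because ciphertext + ciphertext and ciphertext * plaintext are allowed.
x = [1.0, 2.0, -0.5]                            # plaintext data point d in D
enc_score = enc_bias
for ew, xi in zip(enc_weights, x):
    enc_score = enc_score + ew * xi             # stays inside the encrypted group

# --- Model owner ---
# Only the holder of the private key can read the result.
score = private_key.decrypt(enc_score)
print(score)   # ~ 0.8*1.0 - 1.3*2.0 + 2.1*(-0.5) + 0.5
```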

Data: Encryption or Approximation

Homomorphic encryption can also work the other way: mapping the data itself into some other algebraic group using encryption. Unlike the case of model-operation encryption, we also have the option of substituting an approximation of the data (i.e., differential privacy):

$D' \sim D \mid \Delta \Theta_T(D') \sim \Delta \Theta_T(D)$
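For intuition, here is a minimal sketch of producing a perturbed $D'$ from $D$ with the Laplace mechanism. The sensitivity and $\epsilon$ values are illustrative assumptions, and real deployments (e.g., PATE) are considerably more involved.

```python
# Toy sketch: produce a noised D' from D with the Laplace mechanism.
# Sensitivity and epsilon are illustrative; a deployed system would derive
# sensitivity from the feature/query bounds and budget epsilon carefully.
import numpy as np

rng = np.random.default_rng(0)

def laplace_perturb(D, sensitivity=1.0, epsilon=0.5):
    """Return D' = D + Laplace(0, sensitivity / epsilon) noise, elementwise."""
    scale = sensitivity / epsilon
    return D + rng.laplace(loc=0.0, scale=scale, size=D.shape)

D = rng.normal(size=(100, 4))        # original (private) data
D_prime = laplace_perturb(D)         # released approximation D'

# The model owner trains on D' instead of D; the hope is that the update
# Delta-Theta_T(D') stays close to Delta-Theta_T(D) while individual rows are noisy.
print(np.abs(D - D_prime).mean())
```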

In cases where compliance is a concern, differential privacy and/or homomorphically encrypted data are usually seen as the ideal. Unlike model encryption, this approach can also be used for pure inference.

However, the privacy this approach offers for black-box models tends to break down if the model updates are visible. If one can track the updates to the model, even if the training is distributed, one can easily reconstruct the data. The approximation strategy, while technically simpler than the encryption, also requires specialized network architectures that are custom-built for the task (e.g., Cleverhans’ PATE).

Example reconstruction of an image using a variant of a final-layer attack

Federated Machine Learning

One of the more popular approaches to separating models and data is federated machine learning. This is a training strategy featuring distributed, collaborative learning across multiple nodes in a network, with one or multiple models being updated in pieces. It is even possible to combine the update-aggregation step with differential privacy. This approach has already been deployed in real-world products, such as Android devices sharing location data with models designed to learn traffic patterns.

It’s not a free lunch, though. Of all the approaches described so far, this may be the most complicated to set up. Even without differential privacy on top, integrating multiple classifiers and regressors necessitates customized protocol design and optimization. There’s also a reason it’s rarely used outside of large companies like Alphabet: it requires many users for privacy to be enforced (algorithms like random rotation, for example, cannot be applied effectively for privacy purposes with fewer than a hundred users).

Rough overview of most federated machine learning strategies. Model architectures, updating algorithms, and aggregation strategies can vary. Mostly useful for simpler architectures like tree-based models.
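To make the aggregation step concrete, here is a minimal FedAvg-style sketch in NumPy: each client takes a few local gradient steps on its own private shard, and the server only ever averages parameter vectors, never raw data. The linear-regression setup and plain averaging are illustrative assumptions, not a description of any production deployment.

```python
# Minimal FedAvg-style sketch: clients train locally on private shards,
# the server averages parameter updates and never sees raw data.
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])

# Each client holds a private data shard (X_k, y_k).
def make_shard(n=200):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

shards = [make_shard() for _ in range(5)]

def local_update(w, X, y, lr=0.1, epochs=5):
    """A few local gradient steps on one client's shard (squared loss)."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Server loop: broadcast global weights, collect local weights, average.
w_global = np.zeros(3)
for round_ in range(20):
    local_weights = [local_update(w_global, X, y) for X, y in shards]
    w_global = np.mean(local_weights, axis=0)   # the aggregation step
    # (A DP variant would clip and add noise to each update before averaging.)

print(w_global)   # close to true_w, without the server ever touching X_k, y_k
```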

Model approximation

Model approximation is often researched in terms of its security risks. There are plenty of categories of attacks that can be used to steal information about black-box models, given access to information like the final-layer logits. However, model approximation also provides an opportunity to address the data-pricing problem. The following relationship describes, at a high level, how we price the data:

$T' \sim T \mid \Delta \Theta_{T'}(D) \sim \Delta \Theta_T(D)$

A pricing function $P(T') : D \longrightarrow \mathbb{R}^+$ is composed on top of the approximation $T'$.

Below is pseudocode for using model extraction as a Model-Data Efficacy (MDE) strategy. This approach can derive the needed properties from any model; black-box models can be handled in escrow.


Data:

  • black-box $T$
  • $\Theta$
  • MDE $f$
  • data $D_{\text{train}}$
  • data $D_{\text{test}}$
  • additional data $D$
  • ideal model size $\tau$.

Algorithm:

Let $T' \leftarrow f(T)$ // learn a decision tree as in [2]

while not $\left(\forall d^{[i]} \in D_{\text{train}}: \Delta \mathcal{L}_{\text{test}}(d^{[i]}, T) < \epsilon\right)$ do

  $\Theta \leftarrow \Theta + \Delta \Theta_T(d)$ // train with $d \in D_{\text{train}}$

end

while $\text{size of}(T') > \tau$ do

  trim or compress $T'$ // for optional encryption

end

Result:

  • Price of $D$ w.r.t. $T$

This approach also lends itself to interpretability and model testing. The pricing algorithm trades accuracy for size; if the resulting model is sufficiently small, it can be encrypted as well.
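As a concrete (and entirely hypothetical) sketch of the loop above, the snippet below treats a random forest as the black-box $T$, extracts a small decision tree $T'$ from its predictions, and prices candidate data points by how much they reduce the surrogate’s test loss. All function names, thresholds, and the choice of scikit-learn models are illustrative assumptions rather than a published implementation.

```python
# Hypothetical sketch of the Model-Data Efficacy (MDE) pricing loop:
# extract a small surrogate T' from a black-box T, then price new data
# by its marginal effect on the surrogate's test loss.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)

# Stand-in "black box" T: in practice this would be query-only access.
X_all = rng.normal(size=(600, 5))
y_all = X_all[:, 0] - 2 * X_all[:, 1] ** 2 + 0.1 * rng.normal(size=600)
black_box = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_all[:400], y_all[:400])
query_T = black_box.predict                      # the only access we assume to T

X_train, X_test = X_all[400:500], X_all[500:]    # data available to the pricer

# Step 1: model extraction -- fit T' to T's outputs (as in Bastani et al. [2]);
# max_leaf_nodes plays the role of the size budget tau.
surrogate = DecisionTreeRegressor(max_leaf_nodes=64, random_state=0)
surrogate.fit(X_train, query_T(X_train))

# Step 2: price candidate data D by the change in T'-vs-T test loss
# when each point is added to the extraction set.
def price_point(x, y, base_price=1.0):
    y_ref = query_T(X_test)
    loss_before = mean_squared_error(y_ref, surrogate.predict(X_test))
    X_aug = np.vstack([X_train, x.reshape(1, -1)])
    y_aug = np.concatenate([query_T(X_train), [y]])
    candidate = DecisionTreeRegressor(max_leaf_nodes=64, random_state=0).fit(X_aug, y_aug)
    loss_after = mean_squared_error(y_ref, candidate.predict(X_test))
    # Useful data lowers the surrogate's loss; useless or duplicate data does not.
    return base_price * max(0.0, loss_before - loss_after)

x_new = rng.normal(size=5)
print(price_point(x_new, x_new[0] - 2 * x_new[1] ** 2))
```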

Extension to Data Marketplaces

So, we’ve discussed a variety of approaches to the tradeoff problem described at the beginning. On the plus side, we can make a clear case that useful data can be priced accordingly, and vice versa.

One downside of this approach is that, due to mismatches in representation between training data and test data (e.g., insufficient data), we could easily end up with a market in which duplicate data is priced as if it reduced error.

Solution

$\forall d^{[i]} \in D_{\text{train}}: \Delta \mathcal{L}_{\text{test}}(d^{[i]}, T) < \epsilon$

That is, we can overfit to $T$ with $D_{\text{train}}$ until the resulting approximation does not price duplicate data.
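Continuing in the same hypothetical vein, here is a sketch of that stopping criterion: keep growing the surrogate on $D_{\text{train}}$ until every training point’s marginal effect on the test loss falls below $\epsilon$, at which point a re-submitted duplicate of a training point prices out at (near) zero. The data, thresholds, and size schedule are illustrative assumptions.

```python
# Sketch of the duplicate-data guard: grow the surrogate until every training
# point's marginal effect on the test loss is below epsilon, so duplicates of
# already-seen points are priced at ~zero.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X_train = rng.normal(size=(100, 5))
y_train = X_train[:, 0] - 2 * X_train[:, 1] ** 2
X_test = rng.normal(size=(100, 5))
y_test = X_test[:, 0] - 2 * X_test[:, 1] ** 2

def marginal_effect(X, y, i, leaves):
    """Change in test loss when training point i is duplicated in the fit."""
    base = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(X, y)
    X_dup = np.vstack([X, X[i:i + 1]])
    y_dup = np.concatenate([y, y[i:i + 1]])
    dup = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(X_dup, y_dup)
    return abs(mean_squared_error(y_test, base.predict(X_test))
               - mean_squared_error(y_test, dup.predict(X_test)))

epsilon = 1e-3
for leaves in (8, 16, 32, 64, 128, 256):
    worst = max(marginal_effect(X_train, y_train, i, leaves) for i in range(len(X_train)))
    if worst < epsilon:
        print(f"stop growing at {leaves} leaves; duplicates now price near zero")
        break
    # otherwise, keep overfitting the surrogate to D_train
else:
    print("criterion not met within the size budget")
```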

In summary, Model-Data Efficacy approaches based on approximating black-box models can be used to trade additional training data fairly and practically. More specifically, these approaches price the data without training the original model on it.

Approximating the effect of data on the model through model approximation (Model-Data Efficacy) is a moderately practical way to preserve both model and data privacy. Model extraction, for example, can be used for fair pricing: useless data can be priced minimally while useful data can be priced high.

| Approach | $D$ Leakage | $T$ Leakage | Practicality | Fairness | Examples |
| --- | --- | --- | --- | --- | --- |
| Giving up data | High | Low | High | Low | Default ML |
| Giving up model | Low | High | High | Low | Most academic researchers |
| Escrow smart contract | Medium | Medium | Low | High | Numerai, Enigma |
| Encrypting the model | High | Low | Low | N/A | Corti, PySyft |
| Encrypting the data | Medium | Low | Low | Medium | Microsoft SEAL |
| Federated learning | Low | Low | Low | High | Google (for Android data) |
| Model-Data Efficacy | Low | Low | Medium | High | DeMoloch |

Against black-box models, encrypting or approximating the data has flaws regarding privacy. While federated learning with differential privacy achieves privacy for both the model owner and the data owner, it is less practical for one-time transactions.

Future work

There are still plenty of ways to refine this approach. For example, pre-training data synthesized from the existing $D_{\text{train}}$ would eliminate tuning on $D_{\text{test}}$. This would also have the added benefit of refining usefulness into a metric for novelty.

As with any ML approach (especially one in which markets would be involved), there will need to be some kind of defense against adversarial attacks on the model owner. From the ML perspective, there are a variety of tools for adding adversarial robustness (e.g., Cleverhans, Mr. Ed, model pruning). One of the best forms of security, however, would be some sort of transactional safeguard that makes adversarial attacks prohibitively expensive to carry out in most instances.

References

  1. Aono, Yoshinori, et al. “Privacy-preserving deep learning via additively homomorphic encryption.” IEEE Transactions on Information Forensics and Security 13.5 (2017): 1333-1345.
  2. Bastani, Osbert, Carolyn Kim, and Hamsa Bastani. “Interpretability via model extraction.” arXiv preprint arXiv:1706.09773 (2017).

Cited as:

@article{mcateer2019pmlmarket,
  title   = "Private ML Marketplaces",
  author  = "McAteer, Matthew",
  journal = "matthewmcateer.me",
  year    = "2019",
  url     = "https://matthewmcateer.me/blog/private-ml-marketplaces/"
}

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I would be very happy to correct them right away!

See you in the next post 😄
