import os
import tqdm
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from xgboost.sklearn import XGBClassifier
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
The other day, I was very inspired by the blog post Sentiment Analysis on Encrypted Data with Homomorphic Encryption co-written by Zama and HuggingFace. Zama has created an excellent encrypted machine learning library, Concrete-ML, based on fully homomorphic encryption (FHE). Concrete-ML enables data scientists to easily turn their machine learning models into a homomorphic equivalent in order to perform inference on encrypted data. In the blog post, the authors demonstrate how you can easily perform sentiment analysis on encrypted data with this library. As you can imagine, sometimes you will need to perform sentiment analysis on text containing sensitive information. With FHE, the data always remains encrypted during computation, which enables data scientists to provide a machine learning service to a user while maintaining data confidentiality.
For the last several years, I have been very fortunate to also work at the intersection of machine learning and cryptography. One of my collaborations with Morten Dahl, Jason Mancuso, Dragos Rotaru, and Lex Verona that I am very excited about is Moose. Moose is a distributed dataflow framework for encrypted machine learning and data processing. Moose's cryptographic protocol is based on secure multi-party computation (MPC). Depending on the scenario, FHE and MPC have different pros and cons. Currently, MPC generally tends to be more performant; however, the protocol requires two or three non-colluding parties (e.g., a data owner and a data scientist) willing to perform computations together. If you want to learn about MPC in the context of machine learning, I highly recommend this very comprehensive blog post where Morten implements an MPC protocol from scratch for deep learning.
In the rest of this blog post, I will show how you can perform encrypted inference with Moose using the sentiment analysis use case from Zama and HuggingFace’s blog post.
Model Training
The sentiment analysis model will be trained on the Twitter US Airline Sentiment dataset from Kaggle. To train the model, we will use the code provided in the blog post. The sentiment model consists of a RoBERTa (Liu et al., 2019) transformer to extract features from the text, and an XGBoost model on top of it to classify the tweets into positive, negative, or neutral classes.
Let’s first load the dataset.
if not os.path.isfile("local_datasets/twitter-airline-sentiment/Tweets.csv"):
    raise ValueError("Please launch the `download_data.sh` script to get datasets")

train = pd.read_csv("local_datasets/twitter-airline-sentiment/Tweets.csv", index_col=0)
text_X, y = train["text"], train["airline_sentiment"]
y = y.replace(["negative", "neutral", "positive"], [0, 1, 2])

pos_ratio = y.value_counts()[2] / y.value_counts().sum()
neg_ratio = y.value_counts()[0] / y.value_counts().sum()
neutral_ratio = y.value_counts()[1] / y.value_counts().sum()
print(f"Proportion of positive examples: {round(pos_ratio * 100, 2)}%")
print(f"Proportion of negative examples: {round(neg_ratio * 100, 2)}%")
print(f"Proportion of neutral examples: {round(neutral_ratio * 100, 2)}%")
Proportion of positive examples: 16.14%
Proportion of negative examples: 62.69%
Proportion of neutral examples: 21.17%
As you can see, the tweets are classified into three categories: positive, negative, and neutral.
For the feature extractor, the authors of the blog post use a RoBERTa transformer pre-trained on tweets.
= "cuda:0" if torch.cuda.is_available() else "cpu"
device
# Load the tokenizer (converts text to tokens)
= AutoTokenizer.from_pretrained(
tokenizer "cardiffnlp/twitter-roberta-base-sentiment-latest"
)
# Load the pre-trained model
= AutoModelForSequenceClassification.from_pretrained(
transformer_model "cardiffnlp/twitter-roberta-base-sentiment-latest"
)
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The function below will be responsible for extracting the features from the tweets.
# Function that transforms a list of texts to their representation
# learned by the transformer.
def text_to_tensor(
    list_text_X_train: list,
    transformer_model: AutoModelForSequenceClassification,
    tokenizer: AutoTokenizer,
    device: str,
) -> np.ndarray:
    # Tokenize each text in the list one by one
    tokenized_text_X_train_split = []
    for text_x_train in list_text_X_train:
        tokenized_text_X_train_split.append(
            tokenizer.encode(text_x_train, return_tensors="pt")
        )

    # Send the model to the device
    transformer_model = transformer_model.to(device)
    output_hidden_states_list = []

    for tokenized_x in tqdm.tqdm(tokenized_text_X_train_split):
        # Pass the tokens through the transformer model and get the hidden states
        # Only keep the last hidden layer state for now
        output_hidden_states = transformer_model(
            tokenized_x.to(device), output_hidden_states=True
        )[1][-1]
        # Average over the tokens axis to get a representation at the text level.
        output_hidden_states = output_hidden_states.mean(dim=1)
        output_hidden_states = output_hidden_states.detach().cpu().numpy()
        output_hidden_states_list.append(output_hidden_states)

    return np.concatenate(output_hidden_states_list, axis=0)
We are now ready to run the feature extractor on the training and testing set, then train the XGBoost model on the feature extractor’s output.
# Split in train test
text_X_train, text_X_test, y_train, y_test = train_test_split(
    text_X, y, test_size=0.1, random_state=42
)

# Let's vectorize the text using the transformer
list_text_X_train = text_X_train.tolist()
list_text_X_test = text_X_test.tolist()

X_train_transformer = text_to_tensor(
    list_text_X_train, transformer_model, tokenizer, device
)
X_test_transformer = text_to_tensor(
    list_text_X_test, transformer_model, tokenizer, device
)
# Let's build our model
model = XGBClassifier()

# A gridsearch to find the best parameters
parameters = {
    "max_depth": [1],
    "n_estimators": [10, 30, 50],
    "n_jobs": [-1],
}

# Now we have a representation for each tweet, we can train a model on these.
grid_search = GridSearchCV(model, parameters, cv=3, n_jobs=1, scoring="accuracy")
grid_search.fit(X_train_transformer, y_train)

# Check the accuracy of the best model
print(f"Best score: {grid_search.best_score_}")

# Check best hyperparameters
print(f"Best parameters: {grid_search.best_params_}")

# Extract best model
best_model = grid_search.best_estimator_
# Compute the metrics for each class
y_proba = best_model.predict_proba(X_test_transformer)

# Compute the accuracy
y_pred = np.argmax(y_proba, axis=1)
accuracy_transformer_xgboost = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy_transformer_xgboost:.4f}")

y_pred_positive = y_proba[:, 2]
y_pred_negative = y_proba[:, 0]
y_pred_neutral = y_proba[:, 1]

ap_positive_transformer_xgboost = average_precision_score(
    (y_test == 2), y_pred_positive
)
ap_negative_transformer_xgboost = average_precision_score(
    (y_test == 0), y_pred_negative
)
ap_neutral_transformer_xgboost = average_precision_score((y_test == 1), y_pred_neutral)

print(
    f"Average precision score for positive class: "
    f"{ap_positive_transformer_xgboost:.4f}"
)
print(
    f"Average precision score for negative class: "
    f"{ap_negative_transformer_xgboost:.4f}"
)
print(
    f"Average precision score for neutral class: "
    f"{ap_neutral_transformer_xgboost:.4f}"
)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|██████████| 13176/13176 [11:32<00:00, 19.02it/s]
100%|██████████| 1464/1464 [01:17<00:00, 18.78it/s]
Best score: 0.844869459623558
Best parameters: {'max_depth': 1, 'n_estimators': 50, 'n_jobs': -1}
Accuracy: 0.8559
Average precision score for positive class: 0.9015
Average precision score for negative class: 0.9675
Average precision score for neutral class: 0.7517
Excellent, we have a sentiment analysis model with an 85% accuracy. We can run the model on a sample tweet.
= ["AirFrance is awesome, almost as much as Zama!"]
tested_tweet = text_to_tensor(tested_tweet, transformer_model, tokenizer, device)
X_tested_tweet "data/x_tested_tweet.npy", X_tested_tweet)
np.save(= best_model.predict_proba(X_tested_tweet)
clear_proba print(f"Proba prediction in plaintext {clear_output}")
100%|██████████| 1/1 [00:00<00:00, 10.14it/s]
Clear_proba [[0.02582786 0.02599407 0.94817805]]
Encrypted Inference with Moose
Now that we have a trained model, we are ready to serve encrypted inference with Moose. For simplicity, we will start by prototyping the computation happening between the different parties locally, using pm.LocalMooseRuntime.
To serve encrypted inference, we will have to perform the following steps:
- Convert the trained model to ONNX format.
- Convert the model from ONNX to a Moose computation.
- Run encrypted inference by evaluating the Moose computation.
Let’s get started!
from onnxmltools.convert import convert_xgboost
from skl2onnx.common import data_types as onnx_dtypes
import pymoose as pm
Convert to ONNX
We can convert the XGBoost model into an ONNX proto using the convert_xgboost function from onnxmltools.
n_features = X_test_transformer[0].shape[0]
initial_type = ("float_input", onnx_dtypes.FloatTensorType([None, n_features]))
onnx_proto = convert_xgboost(best_model, initial_types=[initial_type])
Convert ONNX to Moose Predictor
PyMoose provides several predictor classes to translate an ONNX model into a PyMoose DSL program. Because the trained model is an XGBoost model, we can use the tree_ensemble.TreeEnsembleClassifier class. Its from_onnx method parses the ONNX proto and returns a callable object; when called, it computes the forward pass of the XGBoost model.
predictor = pm.predictors.TreeEnsembleClassifier.from_onnx(onnx_proto)
Define Moose Computation
To express this computation, Moose offers a Python DSL (internally referred to as the eDSL, i.e. the "embedded" DSL). As you will notice, the syntax is very similar to the scientific computing library NumPy.
The main difference is the notion of placements. There are two types of placements: host placements and replicated placements. With Moose, every operation under a host placement context is computed on plaintext values (not encrypted). Every operation under a replicated placement is performed on secret-shared values (encrypted).
We will compute the inference between three different players, each of them representing a host placement: a data owner, a data scientist, and a third party. The three players are grouped under the replicated placement to perform the encrypted computation. Currently, the MPC protocol of Moose expects three parties, but other MPC schemes can work with two. In practice, the third party could be a secure enclave that the data scientist and the data owner can't access.
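Before diving into the full inference computation, here is a minimal sketch of my own (not from the original post) to make the two placement types concrete: two hosts each load a value in plaintext, the values are secret shared and added under the replicated placement, and only the first host learns the result. The pm.load, pm.cast, and pm.save calls mirror the ones used later in this post; the pm.host_placement, pm.replicated_placement, and pm.add calls, as well as the placement names, are assumptions based on the PyMoose examples.
# Illustrative sketch only: standalone placements, independent of the predictor used below.
owner0 = pm.host_placement(name="owner0")
owner1 = pm.host_placement(name="owner1")
helper = pm.host_placement(name="helper")
rep = pm.replicated_placement(name="rep", players=[owner0, owner1, helper])

@pm.computation
def encrypted_addition():
    # Plaintext on each owner's host placement: load and convert to fixed-point
    with owner0:
        a = pm.load("a", dtype=pm.float64)
        a = pm.cast(a, dtype=pm.predictors.predictor_utils.DEFAULT_FIXED_DTYPE)
    with owner1:
        b = pm.load("b", dtype=pm.float64)
        b = pm.cast(b, dtype=pm.predictors.predictor_utils.DEFAULT_FIXED_DTYPE)
    # The inputs get secret shared when moving to the replicated placement;
    # the addition runs on encrypted values
    with rep:
        c = pm.add(a, b)
    # Reveal the result only to owner0, convert back to float, and store it
    with owner0:
        c = pm.cast(c, dtype=pm.float64)
        res = pm.save("c", c)
    return res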
When we instantiated the pm.predictors.TreeEnsembleClassifier class, three host placements were instantiated under the hood: alice, bob, and carole. For our use case, alice will represent the data owner, bob the model owner, and carole the third party.
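You can verify this by listing the predictor's host placements, using the same host_placements attribute we will rely on later when setting up the runtime (the print below is a small addition of mine):
# List the host placements created by the predictor (illustrative check)
print([plc.name for plc in predictor.host_placements])
# Expected to show the three players, e.g. ['alice', 'bob', 'carole']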
The Moose computation below performs the following steps:
- Loads the tweet's features (extracted earlier by the transformer) in plaintext from alice's (the data owner's) storage.
- Secret shares (encrypts) the data.
- Computes the XGBoost inference on the secret shared data.
- Reveals the prediction only to alice (the data owner) and saves it into her storage.
@pm.computation
def moose_predictor_computation():
    # Alice (the data owner) loads her data in plaintext
    # Then the data gets converted from float to fixed-point
    with predictor.alice:
        x = pm.load("x", dtype=pm.float64)
        x_fixed = pm.cast(x, dtype=pm.predictors.predictor_utils.DEFAULT_FIXED_DTYPE)

    # The data gets secret shared when moving from the host placement
    # to the replicated placement.
    # Then compute the XGBoost inference on the secret shared data
    with predictor.replicated:
        y_pred = predictor(x_fixed, pm.predictors.predictor_utils.DEFAULT_FIXED_DTYPE)

    # The prediction gets revealed only to Alice (the data owner)
    # Convert the data from fixed-point to floats and save it in the storage
    with predictor.alice:
        y_pred = pm.cast(y_pred, dtype=pm.float64)
        y_pred = pm.save("y_pred", y_pred)
    return y_pred
Evaluate the computation
For simplicity, we will use pm.LocalMooseRuntime to locally simulate this computation running across hosts. To do so, we need to provide: the Moose computation, the list of host identities to simulate, and a mapping of the data stored by each simulated host.
Since the data owner is represented by alice, we will place the tweet's extracted features in alice's storage.
Once you have instantiated pm.LocalMooseRuntime with the identities and the storage mapping, and set the runtime as the default, you can simply call the Moose computation to evaluate it.
executive_storage = {
    "alice": {"x": X_tested_tweet.astype(np.float64)},
    "bob": {},
    "carole": {},
}
identities = [plc.name for plc in predictor.host_placements]

runtime = pm.LocalMooseRuntime(identities, storage_mapping=executive_storage)
runtime.set_default()

_ = moose_predictor_computation()
Once the computation is done, we can extract the results. The predictions have been stored in alice's storage. We can extract the value from the storage with read_value_from_storage.
y_pred = runtime.read_value_from_storage("alice", "y_pred")
print(f"Plaintext Prediction: {y_pred}")
print(f"Moose Prediction: {y_pred}")
Plaintext Prediction: [[0.02581358 0.02598119 0.94782831]]
Moose Prediction: [[0.02581358 0.02598119 0.94782831]]
Excellent! As you can see, Moose returns essentially the same prediction as the plaintext XGBoost model; the tiny differences come from the fixed-point encoding used during the encrypted computation. With Moose, we were able to compute the inference on the data owner's data while keeping the data encrypted during the entire process!
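If you prefer a programmatic check over eyeballing the printouts, you can assert that the two predictions agree up to the precision lost to fixed-point encoding (a small addition of mine; the tolerance is an arbitrary choice):
# Compare the encrypted inference against the plaintext XGBoost prediction.
# The absolute tolerance accounts for fixed-point rounding; it is a chosen value.
np.testing.assert_allclose(clear_proba, y_pred, atol=1e-3)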
If you want to learn about how to run Moose over the network with gRPC, you can check out this tutorial.
Conclusion
I hope that thanks to this tutorial you have a better idea of how you can perform encrypted inference with Moose. Thanks to libraries like Concrete-ML and Moose, we’re entering an exciting time where data scientists and machine learning engineers can maintain the confidentiality of sensitive datasets using encryption, without having to become experts in cryptography.
Thank you to the Moose team for this amazing contribution and for reviewing this blog post.