Literature parsing for mutation classification
Automating the work of Grad Students
Originally posted on Kaggle
Background
Ever since the completion of the Human Genome Project, many have predicted the advent of an era in which personalized medicine will be able to beat cancer. It sounds simple enough: by knowing precisely which mutations a patient carries in their now-sequenceable genome, one could understand right down to the molecular level how the cancer came about, how it works, and how to fight it.

Unfortunately, there are still two major obstacles to this vision: the genomic diversity of cancer, and the manpower limitations of current research.

Cancer arises when mutations occur that disrupt growth inhibition, promote telomerase-driven immortalization, confer angiogenic ability, and so on. The problem is that it's not just cancer cells that accumulate mutations. Our bodies' cells do not retain completely unchanged copies of the 3-billion-bp genome throughout our lives; they carry countless harmless (or neutral) mutations that have no visible effect.

The default method for categorizing these mutations is very time-inefficient. With few exceptions, mutation interpretation is done manually: a researcher has to identify, review, and classify each mutation based on evidence from increasingly verbose and SEO-optimized scientific papers.

Given this massive state space of unique mutations, it makes sense that the manual approach has not yet delivered this Holy Grail of personalized medicine.

Memorial Sloan Kettering Cancer Center (MSKCC) is making available an expert-annotated knowledge base in which world-class researchers and oncologists have manually annotated thousands of mutations. MSKCC has asked for help developing a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.
The Dataset
This dataset is from the MSKCC-sponsored Kaggle competition "Personalized Medicine: Redefining Cancer Treatment."
Walkthrough
First, let's start with the requisite imports:
from sklearn import preprocessing, pipeline, feature_extraction, decomposition, model_selection, metrics, cross_validation, svm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import normalize, Imputer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
import sklearn
import pandas as pd
import numpy as np
import xgboost as xgb
import datetime
Then, we can load the dataset:
train = pd.read_csv('../input/training_variants')
test = pd.read_csv('../input/test_variants')
# the text files are '||'-delimited, one row of article text per ID
trainx = pd.read_csv('../input/training_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
testx = pd.read_csv('../input/test_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
train = pd.merge(train, trainx, how='left', on='ID').fillna('')
y = train['Class'].values  # the 9 class labels we want to predict
train = train.drop(['Class'], axis=1)
test = pd.merge(test, testx, how='left', on='ID').fillna('')
pid = test['ID'].values  # test IDs, saved for the submission file
train.head()
|   | ID | Gene | Variation | Text |
|---|----|------|-----------|------|
| 0 | 0 | FAM58A | Truncating Mutations | Cyclin-dependent kinases (CDKs) regulate a var… |
| 1 | 1 | CBL | W802* | Abstract Background Non-small cell lung canc… |
| 2 | 2 | CBL | Q249E | Abstract Background Non-small cell lung canc… |
| 3 | 3 | CBL | N454D | Recent evidence has demonstrated that acquired… |
| 4 | 4 | CBL | L399V | Oncogenic mutations in the monomeric Casitas B… |
For our labels, we get a flat array of classes 1–9:
y
array([1, 2, 2, ..., 1, 4, 4])
And test.head() shows that the test set has the same structure, minus the class labels:

test.head()

|   | ID | Gene | Variation | Text |
|---|----|------|-----------|------|
| 0 | 0 | ACSL4 | R570S | 2. This mutation resulted in a myeloproliferat… |
| 1 | 1 | NAGLU | P521L | Abstract The Large Tumor Suppressor 1 (LATS1)… |
| 2 | 2 | PAH | L333F | Vascular endothelial growth factor receptor (V… |
| 3 | 3 | ING1 | A148D | Inflammatory myofibroblastic tumor (IMT) is a … |
| 4 | 4 | TMEM216 | G77A | Abstract Retinoblastoma is a pediatric retina… |
And the test IDs we saved for the submission file:

pid
array([ 0, 1, 2, ..., 5665, 5666, 5667])
Next, we concatenate train and test and engineer two simple features: how many words of the Gene and Variation names appear in the article text.

df_all = pd.concat((train, test), axis=0, ignore_index=True)
# count how many words of the gene/variation name appear in the text
df_all['Gene_Share'] = df_all.apply(lambda r: sum([1 for w in r['Gene'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all['Variation_Share'] = df_all.apply(lambda r: sum([1 for w in r['Variation'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all.head()
|   | ID | Gene | Variation | Text | Gene_Share | Variation_Share |
|---|----|------|-----------|------|------------|-----------------|
| 0 | 0 | FAM58A | Truncating Mutations | Cyclin-dependent kinases (CDKs) regulate a var… | 1 | 1 |
| 1 | 1 | CBL | W802* | Abstract Background Non-small cell lung canc… | 1 | 1 |
| 2 | 2 | CBL | Q249E | Abstract Background Non-small cell lung canc… | 1 | 1 |
| 3 | 3 | CBL | N454D | Recent evidence has demonstrated that acquired… | 1 | 1 |
| 4 | 4 | CBL | L399V | Oncogenic mutations in the monomeric Casitas B… | 1 | 1 |
We also collect every unique gene and variation string, keeping only the single-word ones:

gen_var_lst = sorted(list(train.Gene.unique()) + list(train.Variation.unique()))
print(len(gen_var_lst))
3260
gen_var_lst = [x for x in gen_var_lst if len(x.split(' '))==1]
print(len(gen_var_lst))
3091
i_ = 0
Next, label-encode the string columns and add length and word-count features:

for c in df_all.columns:
    if df_all[c].dtype == 'object':
        if c in ['Gene','Variation']:
            # label-encode, and record string length and word count
            lbl = preprocessing.LabelEncoder()
            df_all[c+'_lbl_enc'] = lbl.fit_transform(df_all[c].values)
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' ')))
        elif c != 'Text':
            lbl = preprocessing.LabelEncoder()
            df_all[c] = lbl.fit_transform(df_all[c].values)
        if c=='Text':
            # for the article text, keep only length and word count here
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' ')))
train = df_all.iloc[:len(train)]
test = df_all.iloc[len(train):]
train.head()
|   | ID | Gene | Variation | Text | Gene_Share | Variation_Share | Gene_lbl_enc | Gene_len | Gene_words | Variation_lbl_enc | Variation_len | Variation_words | Text_len | Text_words |
|---|----|------|-----------|------|------------|-----------------|--------------|----------|------------|-------------------|---------------|-----------------|----------|------------|
| 0 | 0 | FAM58A | Truncating Mutations | Cyclin-dependent kinases (CDKs) regulate a var… | 1 | 1 | 447 | 6 | 1 | 7654 | 20 | 2 | 39672 | 6105 |
| 1 | 1 | CBL | W802* | Abstract Background Non-small cell lung canc… | 1 | 1 | 216 | 3 | 1 | 8255 | 5 | 1 | 36691 | 5783 |
| 2 | 2 | CBL | Q249E | Abstract Background Non-small cell lung canc… | 1 | 1 | 216 | 3 | 1 | 5191 | 5 | 1 | 36691 | 5783 |
| 3 | 3 | CBL | N454D | Recent evidence has demonstrated that acquired… | 1 | 1 | 216 | 3 | 1 | 4572 | 5 | 1 | 36238 | 5625 |
| 4 | 4 | CBL | L399V | Oncogenic mutations in the monomeric Casitas B… | 1 | 1 | 216 | 3 | 1 | 3958 | 5 | 1 | 41308 | 6248 |
test.head()
|      | ID | Gene | Variation | Text | Gene_Share | Variation_Share | Gene_lbl_enc | Gene_len | Gene_words | Variation_lbl_enc | Variation_len | Variation_words | Text_len | Text_words |
|------|----|------|-----------|------|------------|-----------------|--------------|----------|------------|-------------------|---------------|-----------------|----------|------------|
| 3321 | 0 | ACSL4 | R570S | 2. This mutation resulted in a myeloproliferat… | 0 | 1 | 28 | 5 | 1 | 6404 | 5 | 1 | 49829 | 7495 |
| 3322 | 1 | NAGLU | P521L | Abstract The Large Tumor Suppressor 1 (LATS1)… | 0 | 1 | 852 | 5 | 1 | 5005 | 5 | 1 | 31326 | 4762 |
| 3323 | 2 | PAH | L333F | Vascular endothelial growth factor receptor (V… | 0 | 1 | 950 | 3 | 1 | 3915 | 5 | 1 | 75282 | 11191 |
| 3324 | 3 | ING1 | A148D | Inflammatory myofibroblastic tumor (IMT) is a … | 0 | 1 | 657 | 4 | 1 | 85 | 5 | 1 | 53996 | 8439 |
| 3325 | 4 | TMEM216 | G77A | Abstract Retinoblastoma is a pediatric retina… | 0 | 1 | 1376 | 7 | 1 | 2780 | 4 | 1 | 76967 | 11226 |
train.shape, test.shape
((3321, 14), (5668, 14))
Training
For the pipeline, we need two small custom transformers: one that strips out the raw string columns (leaving only the engineered numeric features), and one that selects a single text column.

class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        # drop the raw string columns, keeping only the engineered numeric features
        x = x.drop(['Gene', 'Variation', 'ID', 'Text'], axis=1).values
        return x

class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        # select a single column as strings
        return x[self.key].apply(str)
print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs=-1,
        transformer_list=[
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([
                ('Gene', cust_txt_col('Gene')),
                ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))),
                ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([
                ('Variation', cust_txt_col('Variation')),
                ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))),
                ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            #('pi3', pipeline.Pipeline([('Text', cust_txt_col('Text')), ('tfidf_Text', feature_extraction.text.TfidfVectorizer(ngram_range=(1, 2))), ('tsvd3', decomposition.TruncatedSVD(n_components=50, n_iter=25, random_state=12))]))
        ]
    ))
])
train = fp.fit_transform(train); print(train.shape)
test = fp.transform(test); print(test.shape)
Pipeline...
(3321, 50)
(5668, 50)
That gives us 50 features: the 10 numeric columns from cust_regression_vals, plus 20 SVD components each for the Gene and Variation character n-grams. The full pipeline below also adds hashed TF-IDF features over the article text, reduced to 300 SVD components. (Note that it re-fits from scratch, so it should be run on the original train/test dataframes rather than on the arrays produced above.)

print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs=-1,
        transformer_list=[
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([
                ('Gene', cust_txt_col('Gene')),
                ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))),
                ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([
                ('Variation', cust_txt_col('Variation')),
                ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))),
                ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            # the text features, previously commented out for Kaggle's resource limits
            ('pi3', pipeline.Pipeline([
                ('Text', cust_txt_col('Text')),
                ('hv', feature_extraction.text.HashingVectorizer(decode_error='ignore', n_features=2 ** 16, non_negative=True, ngram_range=(1, 3))),
                ('tfidf_Text', feature_extraction.text.TfidfTransformer()),
                ('tsvd3', decomposition.TruncatedSVD(n_components=300, n_iter=25, random_state=12))])),
        ]
    ))
])
train = fp.fit_transform(train)
print(train.shape)

# transform the test set in chunks of 200 rows to keep memory usage down
test_t = np.empty([0, train.shape[1]])
step = 200
for i in range(0, len(test), step):
    step_end = min(i + step, len(test))
    _test = fp.transform(test.iloc[i:step_end])
    test_t = np.vstack((test_t, _test))
test = test_t
print(test.shape)
Validation
Now we want to see how our training actually does. First, let's fix our labels: XGBoost expects classes numbered 0–8, not 1–9.

y = y - 1  # shift labels from 1-9 to 0-8 for XGBoost
denom = 0
fold = 5
for i in range(fold):
    params = {
        'eta': 0.02,
        'max_depth': 6,
        'objective': 'multi:softprob',
        'eval_metric': 'mlogloss',
        'num_class': 9,
        'seed': i,
        'silent': True
    }
    # hold out 15% for validation, with a different split (and seed) per fold
    x1, x2, y1, y2 = model_selection.train_test_split(train, y, test_size=0.15, random_state=i)
    watchlist = [(xgb.DMatrix(x1, y1), 'train'), (xgb.DMatrix(x2, y2), 'valid')]
    model = xgb.train(params, xgb.DMatrix(x1, y1), 1000, watchlist, verbose_eval=50, early_stopping_rounds=100)
    score1 = metrics.log_loss(y2, model.predict(xgb.DMatrix(x2), ntree_limit=model.best_ntree_limit), labels=list(range(9)))
    print(score1)
    # accumulate test predictions across folds
    if denom != 0:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds += pred
    else:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds = pred.copy()
    denom += 1
    # also write out a per-fold submission for inspection
    submission = pd.DataFrame(pred, columns=['class'+str(c+1) for c in range(9)])
    submission['ID'] = pid
    submission.to_csv('submission_xgb_fold_' + str(i) + '.csv', index=False)

# average the fold predictions for the final submission
preds /= denom
submission = pd.DataFrame(preds, columns=['class'+str(c+1) for c in range(9)])
submission['ID'] = pid
submission.to_csv('submission_xgb.csv', index=False)
[0] train-mlogloss:2.15782 valid-mlogloss:2.16479
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50] train-mlogloss:1.24063 valid-mlogloss:1.47325
[100] train-mlogloss:0.912934 valid-mlogloss:1.25467
[150] train-mlogloss:0.734402 valid-mlogloss:1.1651
[200] train-mlogloss:0.613828 valid-mlogloss:1.11797
[250] train-mlogloss:0.532413 valid-mlogloss:1.09372
[300] train-mlogloss:0.459742 valid-mlogloss:1.0782
[350] train-mlogloss:0.40078 valid-mlogloss:1.07044
[400] train-mlogloss:0.352744 valid-mlogloss:1.06854
[450] train-mlogloss:0.31163 valid-mlogloss:1.06999
[500] train-mlogloss:0.276759 valid-mlogloss:1.07347
Stopping. Best iteration:
[402] train-mlogloss:0.35102 valid-mlogloss:1.06827
1.06826797168
[0] train-mlogloss:2.15929 valid-mlogloss:2.16676
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50] train-mlogloss:1.25902 valid-mlogloss:1.45637
[100] train-mlogloss:0.927344 valid-mlogloss:1.22604
[150] train-mlogloss:0.747399 valid-mlogloss:1.12135
[200] train-mlogloss:0.621636 valid-mlogloss:1.07068
[250] train-mlogloss:0.536327 valid-mlogloss:1.04974
[300] train-mlogloss:0.466379 valid-mlogloss:1.03689
[350] train-mlogloss:0.407413 valid-mlogloss:1.03094
[400] train-mlogloss:0.359413 valid-mlogloss:1.02829
[450] train-mlogloss:0.319446 valid-mlogloss:1.02665
[500] train-mlogloss:0.284909 valid-mlogloss:1.02779
[550] train-mlogloss:0.254682 valid-mlogloss:1.03102
Stopping. Best iteration:
[460] train-mlogloss:0.312175 valid-mlogloss:1.0263
1.02630396074
[0] train-mlogloss:2.15922 valid-mlogloss:2.16414
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50] train-mlogloss:1.24657 valid-mlogloss:1.43581
[100] train-mlogloss:0.915174 valid-mlogloss:1.20246
[150] train-mlogloss:0.724838 valid-mlogloss:1.10001
[200] train-mlogloss:0.61148 valid-mlogloss:1.04971
[250] train-mlogloss:0.526611 valid-mlogloss:1.02218
[300] train-mlogloss:0.459829 valid-mlogloss:1.00654
[350] train-mlogloss:0.403125 valid-mlogloss:0.994575
[400] train-mlogloss:0.356882 valid-mlogloss:0.988575
[450] train-mlogloss:0.314969 valid-mlogloss:0.983562
[500] train-mlogloss:0.276921 valid-mlogloss:0.983204
[550] train-mlogloss:0.247534 valid-mlogloss:0.983713
[600] train-mlogloss:0.220643 valid-mlogloss:0.986902
Stopping. Best iteration:
[517] train-mlogloss:0.26601 valid-mlogloss:0.982651
0.98265060772
[0] train-mlogloss:2.15922 valid-mlogloss:2.16511
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50] train-mlogloss:1.24867 valid-mlogloss:1.44642
[100] train-mlogloss:0.916804 valid-mlogloss:1.22938
[150] train-mlogloss:0.738456 valid-mlogloss:1.13725
[200] train-mlogloss:0.617969 valid-mlogloss:1.08818
[250] train-mlogloss:0.53522 valid-mlogloss:1.06121
[300] train-mlogloss:0.464503 valid-mlogloss:1.04386
[350] train-mlogloss:0.40767 valid-mlogloss:1.03352
[400] train-mlogloss:0.361219 valid-mlogloss:1.02747
[450] train-mlogloss:0.320972 valid-mlogloss:1.02707
[500] train-mlogloss:0.286176 valid-mlogloss:1.03045
Stopping. Best iteration:
[422] train-mlogloss:0.342379 valid-mlogloss:1.02614
1.02614357069
[0] train-mlogloss:2.15883 valid-mlogloss:2.16265
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50] train-mlogloss:1.25701 valid-mlogloss:1.41193
[100] train-mlogloss:0.926052 valid-mlogloss:1.18233
[150] train-mlogloss:0.739056 valid-mlogloss:1.08336
[200] train-mlogloss:0.60938 valid-mlogloss:1.03559
[250] train-mlogloss:0.520795 valid-mlogloss:1.01434
[300] train-mlogloss:0.449827 valid-mlogloss:1.00456
[350] train-mlogloss:0.3923 valid-mlogloss:0.999805
[400] train-mlogloss:0.342235 valid-mlogloss:0.999826
[450] train-mlogloss:0.301923 valid-mlogloss:1.00284
[500] train-mlogloss:0.266097 valid-mlogloss:1.00746
Stopping. Best iteration:
[408] train-mlogloss:0.334443 valid-mlogloss:0.999366
0.999366185342
A validation log loss of roughly 0.98–1.07 across the folds is pretty good (though not as stellar as the ~0.3 training loss; the gap suggests some overfitting).

Since we're training gradient-boosted decision trees, we actually have a reasonably interpretable model. We can visualize how much weight the model gave to each feature:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (9.0, 9.0)
xgb.plot_importance(booster=model)
plt.show()
We've now got a nice visualization of the relative feature importances. What does one of these decision trees actually look like, then?
plt.rcParams['figure.figsize'] = (50.0, 50.0)
xgb.plot_tree(booster=model)
plt.show()
Beautiful.
Test Evaluation
So how did this do on the Kaggle leaderboards?
Future directions
This model certainly has use beyond Kaggle. Here are some of my ideas for expansion beyond the scope of this competition:
- Incorporating text data from the *Encyclopedia of Molecular Cell Biology and Molecular Medicine* (16 volumes, Wiley, 2006), *Cell and Molecular Biology: Concepts and Experiments* (7th edition), and the [Comparative Toxicogenomics Database](http://ctdbase.org/).
- Applying pre-trained biomedical NER taggers to the task, including tmVar, TaggerOne, and GNormPlus.
- Mapping the variations to RSIDs and searching for related articles in PubMed or PMC (see the first sketch after this list).
- Adding functionality from the mygene API, which contains even more information about genes (such as family and overall description); see the second sketch after this list.
- Incorporating amino-acid sequences (in FASTA format) for the genes in the original dataset.
- Including allele frequency information from gnomAD or ExAC
- Improving named entity recognition models by adding data from terminologies such as ICD, SNOMED CT, the NCI Thesaurus, CPT, and MedDRA (both as a corpus for feature generation and for training task-specific NLP tools).
- Incorporating pretrained word2vec models from biomedical natural language processing resources (see the final sketch after this list).
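To make the PubMed idea concrete, here's a minimal sketch using Biopython's Entrez module. The contact email and query format are placeholders of my choosing, and mapping variations to RSIDs would need a separate lookup (e.g., against dbSNP):

from Bio import Entrez

# NCBI asks for a contact email with every request (placeholder)
Entrez.email = "you@example.com"

def search_pubmed(gene, variation, retmax=20):
    """Return PMIDs of PubMed articles mentioning a gene/variation pair."""
    term = '"{}" AND "{}"'.format(gene, variation)
    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

# e.g., pull candidate literature for one row of the training set
print(search_pubmed("CBL", "W802*"))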
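Likewise, a minimal sketch of pulling gene-level annotations through the mygene Python client. The particular fields requested here (name, summary, genomic position) are a hand-picked example from the MyGene.info schema, not something the competition data requires:

import mygene

mg = mygene.MyGeneInfo()

# look up gene symbols that appear in the dataset
genes = ["FAM58A", "CBL", "ACSL4"]
# fields are a hand-picked example from the MyGene.info schema
results = mg.querymany(genes, scopes="symbol",
                       fields="name,summary,genomic_pos", species="human")

for hit in results:
    # each hit is a dict; 'summary' can be missing for obscure genes
    print(hit["query"], "->", hit.get("summary", "(no summary)")[:80])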
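And finally, a sketch of folding in pretrained biomedical word vectors with gensim, averaging word vectors over each article to get dense document features. The model file name is a placeholder for whichever pretrained binary you download, and train_df is a hypothetical name for the original dataframe with the Text column (since train was overwritten by the pipeline above):

import numpy as np
from gensim.models import KeyedVectors

# load pretrained biomedical embeddings (placeholder file name)
wv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def doc_vector(text):
    """Average the vectors of all in-vocabulary tokens in a document."""
    tokens = [t for t in text.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

# train_df: hypothetical name for the original dataframe with the Text column;
# these rows could be stacked alongside the pipeline features above
text_features = np.vstack([doc_vector(t) for t in train_df['Text']])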
Cited as:
@article{mcateer2017cancermut,
title = "Literature parsing for mutation classification",
author = "McAteer, Matthew",
journal = "matthewmcateer.me",
year = "2017",
url = "https://matthewmcateer.me/blog/literature-parsing-for-mutation-classification/"
}
If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me]
and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.
See you in the next post 😄