Literature parsing for mutation classification

Automating the work of Grad Students

Originally posted on Kaggle

Background

Ever since the completion of human genome project, many have been projecting the advent of an era where personalized medicine will be able to beat cancer. It sounds simple enough; by knowing precisely the mutations one carries in their now sequencable genome, one could understand right down to the molecular level how the cancer came about, how it works, and how to fight it.

Unfortunately, there are still two major obstacles to this vision: The genomic diversity of cancer, and the manpower limitations of the current research.

Cancer arises when mutations occur that disrupt growth inhibition, promote telomerase immortalization, confer angiogenic ability, etc. The problem is that it’s not just cancer cells that accumulate mutations. Our bodies’ cells to not retain completely unchanged copies of the 3-billion-bp genome throughout our lives, they have countless harmless (or neutral) mutations that do not show up.

The default method for categorizing these methods is very time-inefficient. With few exceptions, most mutation interpreatations are done manually. This requires a researcher to manually identify, review, and classify any and all mutations based on increasingly verbose and SEO-optimized scientific papers.

Given this massive state space of unique mutations, it makes sense that the manual approach has not yielded this Holy Grail of personalized medicine yet.

Memorial Sloan Kettering Cancer Center (MSKCC) has asked for help to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations. MSKCC is making available an expert-annotated knowledge base where world-class researchers and oncologists have manually annotated thousands of mutations.

The Dataset

This dataset is from the MSK Redefining Cancer Treatment Kaggle Competition.

Walkthrough

First, let’s start with the requisite iimports

from sklearn import preprocessing, pipeline, feature_extraction, decomposition, model_selection, metrics, cross_validation, svm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import normalize, Imputer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
import sklearn
import pandas as pd
import numpy as np
import xgboost as xgb
import datetime

Then, we can import the dataset:

train = pd.read_csv('../input/training_variants')
test = pd.read_csv('../input/test_variants')
trainx = pd.read_csv('../input/training_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
testx = pd.read_csv('../input/test_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])

train = pd.merge(train, trainx, how='left', on='ID').fillna('')
y = train['Class'].values
train = train.drop(['Class'], axis=1)

test = pd.merge(test, testx, how='left', on='ID').fillna('')
pid = test['ID'].values
train.head()
ID Gene Variation Text
0 0 FAM58A Truncating Mutations Cyclin-dependent kinases (CDKs) regulate a var…
1 1 CBL W802* Abstract Background Non-small cell lung canc…
2 2 CBL Q249E Abstract Background Non-small cell lung canc…
3 3 CBL N454D Recent evidence has demonstrated that acquired…
4 4 CBL L399V Oncogenic mutations in the monomeric Casitas B…

For our testing output, we get something like this:

y
array([1, 2, 2, ..., 1, 4, 4])
ID Gene Variation Text
0 0 ACSL4 R570S 2. This mutation resulted in a myeloproliferat…
1 1 NAGLU P521L Abstract The Large Tumor Suppressor 1 (LATS1)…
2 2 PAH L333F Vascular endothelial growth factor receptor (V…
3 3 ING1 A148D Inflammatory myofibroblastic tumor (IMT) is a …
4 4 TMEM216 G77A Abstract Retinoblastoma is a pediatric retina…
pid
array([   0,    1,    2, ..., 5665, 5666, 5667])
df_all = pd.concat((train, test), axis=0, ignore_index=True)
df_all['Gene_Share'] = df_all.apply(lambda r: sum([1 for w in r['Gene'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all['Variation_Share'] = df_all.apply(lambda r: sum([1 for w in r['Variation'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all.head()
ID Gene Variation Text Gene_Share Variation_Share
0 0 FAM58A Truncating Mutations Cyclin-dependent kinases (CDKs) regulate a var… 1 1
1 1 CBL W802* Abstract Background Non-small cell lung canc… 1 1
2 2 CBL Q249E Abstract Background Non-small cell lung canc… 1 1
3 3 CBL N454D Recent evidence has demonstrated that acquired… 1 1
4 4 CBL L399V Oncogenic mutations in the monomeric Casitas B… 1 1
gen_var_lst = sorted(list(train.Gene.unique()) + list(train.Variation.unique()))
print(len(gen_var_lst))
3260
gen_var_lst = [x for x in gen_var_lst if len(x.split(' '))==1]
print(len(gen_var_lst))
i_ = 0
3091
for c in df_all.columns:
    if df_all[c].dtype == 'object':
        if c in ['Gene','Variation']:
            lbl = preprocessing.LabelEncoder()
            df_all[c+'_lbl_enc'] = lbl.fit_transform(df_all[c].values)  
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' ')))
        elif c != 'Text':
            lbl = preprocessing.LabelEncoder()
            df_all[c] = lbl.fit_transform(df_all[c].values)
        if c=='Text': 
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' '))) 

train = df_all.iloc[:len(train)]
test = df_all.iloc[len(train):]
train.head()
ID Gene Variation Text Gene_Share Variation_Share Genelblenc Gene_len Gene_words Variationlblenc Variation_len Variation_words Text_len Text_words
0 0 FAM58A Truncating Mutations Cyclin-dependent kinases (CDKs) regulate a var… 1 1 447 6 1 7654 20 2 39672 6105
1 1 CBL W802* Abstract Background Non-small cell lung canc… 1 1 216 3 1 8255 5 1 36691 5783
2 2 CBL Q249E Abstract Background Non-small cell lung canc… 1 1 216 3 1 5191 5 1 36691 5783
3 3 CBL N454D Recent evidence has demonstrated that acquired… 1 1 216 3 1 4572 5 1 36238 5625
4 4 CBL L399V Oncogenic mutations in the monomeric Casitas B… 1 1 216 3 1 3958 5 1 41308 6248
test.head()
ID Gene Variation Text Gene_Share Variation_Share Genelblenc Gene_len Gene_words Variationlblenc Variation_len Variation_words Text_len Text_words
3321 0 ACSL4 R570S 2. This mutation resulted in a myeloproliferat… 0 1 28 5 1 6404 5 1 49829 7495
3322 1 NAGLU P521L Abstract The Large Tumor Suppressor 1 (LATS1)… 0 1 852 5 1 5005 5 1 31326 4762
3323 2 PAH L333F Vascular endothelial growth factor receptor (V… 0 1 950 3 1 3915 5 1 75282 11191
3324 3 ING1 A148D Inflammatory myofibroblastic tumor (IMT) is a … 0 1 657 4 1 85 5 1 53996 8439
3325 4 TMEM216 G77A Abstract Retinoblastoma is a pediatric retina… 0 1 1376 7 1 2780 4 1 76967 11226
train.shape, test.shape
((3321, 14), (5668, 14))

Training

class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        x = x.drop(['Gene', 'Variation','ID','Text'],axis=1).values
        return x

class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x[self.key].apply(str)
print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs = -1,
        transformer_list = [
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([('Gene', cust_txt_col('Gene')), 
                                       ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([('Variation', cust_txt_col('Variation')), 
                                       ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            #('pi3', pipeline.Pipeline([('Text', cust_txt_col('Text')), ('tfidf_Text', feature_extraction.text.TfidfVectorizer(ngram_range=(1, 2))), ('tsvd3', decomposition.TruncatedSVD(n_components=50, n_iter=25, random_state=12))]))
        ])
    )])

train = fp.fit_transform(train); print(train.shape)
test = fp.transform(test); print(test.shape)
Pipeline...
(3321, 50)
(5668, 50)
print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs = -1,
        transformer_list = [
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([('Gene', cust_txt_col('Gene')), 
                                       ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([('Variation', cust_txt_col('Variation')), 
                                       ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            #commented for Kaggle Limits
             ('pi3', pipeline.Pipeline([('Text', cust_txt_col('Text')), 
                                        ('hv', feature_extraction.text.HashingVectorizer(decode_error='ignore', n_features=2 ** 16, non_negative=True, ngram_range=(1, 3))),
                                        ('tfidf_Text', feature_extraction.text.TfidfTransformer()), 
                                        ('tsvd3', decomposition.TruncatedSVD(n_components=300, n_iter=25, random_state=12))]))
        ])
    )])

train = fp.fit_transform(train)
print (train.shape)

test_t = np.empty([0, train.shape[1]])
step = 200
for i in range(0, len(test), step):
    step_end = i+step
    step_end = step_end if step_end < len(test) else len(test)
    _test = fp.transform(test.iloc[i:step_end])
    test_t = np.vstack((test_t, _test))
test = test_t
print (test.shape)

Validation

Now we want to see how ou training actually did. First, let’s fix our training data

y = y - 1 #fix for zero bound array
denom = 0
fold = 5 
for i in range(fold):
    params = {
        'eta': 0.02,
        'max_depth': 6, 
        'objective': 'multi:softprob',
        'eval_metric': 'mlogloss',
        'num_class': 9,
        'seed': i,
        'silent': True
    }
    x1, x2, y1, y2 = model_selection.train_test_split(train, y, test_size=0.15, random_state=i)
    watchlist = [(xgb.DMatrix(x1, y1), 'train'), (xgb.DMatrix(x2, y2), 'valid')]
    model = xgb.train(params, xgb.DMatrix(x1, y1), 1000,  watchlist, verbose_eval=50, early_stopping_rounds=100)
    score1 = metrics.log_loss(y2, model.predict(xgb.DMatrix(x2), ntree_limit=model.best_ntree_limit), labels = list(range(9)))
    print(score1)
    #if score < 0.9:
    if denom != 0:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds += pred
    else:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds = pred.copy()
    denom += 1
    submission = pd.DataFrame(pred, columns=['class'+str(c+1) for c in range(9)])
    submission['ID'] = pid
    submission.to_csv('submission_xgb_fold_'  + str(i) + '.csv', index=False)
preds /= denom
submission = pd.DataFrame(preds, columns=['class'+str(c+1) for c in range(9)])
submission['ID'] = pid
submission.to_csv('submission_xgb.csv', index=False)
[0]	train-mlogloss:2.15782	valid-mlogloss:2.16479
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]	train-mlogloss:1.24063	valid-mlogloss:1.47325
[100]	train-mlogloss:0.912934	valid-mlogloss:1.25467
[150]	train-mlogloss:0.734402	valid-mlogloss:1.1651
[200]	train-mlogloss:0.613828	valid-mlogloss:1.11797
[250]	train-mlogloss:0.532413	valid-mlogloss:1.09372
[300]	train-mlogloss:0.459742	valid-mlogloss:1.0782
[350]	train-mlogloss:0.40078	valid-mlogloss:1.07044
[400]	train-mlogloss:0.352744	valid-mlogloss:1.06854
[450]	train-mlogloss:0.31163	valid-mlogloss:1.06999
[500]	train-mlogloss:0.276759	valid-mlogloss:1.07347
Stopping. Best iteration:
[402]	train-mlogloss:0.35102	valid-mlogloss:1.06827

1.06826797168
[0]	train-mlogloss:2.15929	valid-mlogloss:2.16676
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]	train-mlogloss:1.25902	valid-mlogloss:1.45637
[100]	train-mlogloss:0.927344	valid-mlogloss:1.22604
[150]	train-mlogloss:0.747399	valid-mlogloss:1.12135
[200]	train-mlogloss:0.621636	valid-mlogloss:1.07068
[250]	train-mlogloss:0.536327	valid-mlogloss:1.04974
[300]	train-mlogloss:0.466379	valid-mlogloss:1.03689
[350]	train-mlogloss:0.407413	valid-mlogloss:1.03094
[400]	train-mlogloss:0.359413	valid-mlogloss:1.02829
[450]	train-mlogloss:0.319446	valid-mlogloss:1.02665
[500]	train-mlogloss:0.284909	valid-mlogloss:1.02779
[550]	train-mlogloss:0.254682	valid-mlogloss:1.03102
Stopping. Best iteration:
[460]	train-mlogloss:0.312175	valid-mlogloss:1.0263

1.02630396074
[0]	train-mlogloss:2.15922	valid-mlogloss:2.16414
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]	train-mlogloss:1.24657	valid-mlogloss:1.43581
[100]	train-mlogloss:0.915174	valid-mlogloss:1.20246
[150]	train-mlogloss:0.724838	valid-mlogloss:1.10001
[200]	train-mlogloss:0.61148	valid-mlogloss:1.04971
[250]	train-mlogloss:0.526611	valid-mlogloss:1.02218
[300]	train-mlogloss:0.459829	valid-mlogloss:1.00654
[350]	train-mlogloss:0.403125	valid-mlogloss:0.994575
[400]	train-mlogloss:0.356882	valid-mlogloss:0.988575
[450]	train-mlogloss:0.314969	valid-mlogloss:0.983562
[500]	train-mlogloss:0.276921	valid-mlogloss:0.983204
[550]	train-mlogloss:0.247534	valid-mlogloss:0.983713
[600]	train-mlogloss:0.220643	valid-mlogloss:0.986902
Stopping. Best iteration:
[517]	train-mlogloss:0.26601	valid-mlogloss:0.982651

0.98265060772
[0]	train-mlogloss:2.15922	valid-mlogloss:2.16511
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]	train-mlogloss:1.24867	valid-mlogloss:1.44642
[100]	train-mlogloss:0.916804	valid-mlogloss:1.22938
[150]	train-mlogloss:0.738456	valid-mlogloss:1.13725
[200]	train-mlogloss:0.617969	valid-mlogloss:1.08818
[250]	train-mlogloss:0.53522	valid-mlogloss:1.06121
[300]	train-mlogloss:0.464503	valid-mlogloss:1.04386
[350]	train-mlogloss:0.40767	valid-mlogloss:1.03352
[400]	train-mlogloss:0.361219	valid-mlogloss:1.02747
[450]	train-mlogloss:0.320972	valid-mlogloss:1.02707
[500]	train-mlogloss:0.286176	valid-mlogloss:1.03045
Stopping. Best iteration:
[422]	train-mlogloss:0.342379	valid-mlogloss:1.02614

1.02614357069
[0]	train-mlogloss:2.15883	valid-mlogloss:2.16265
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
  
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]	train-mlogloss:1.25701	valid-mlogloss:1.41193
[100]	train-mlogloss:0.926052	valid-mlogloss:1.18233
[150]	train-mlogloss:0.739056	valid-mlogloss:1.08336
[200]	train-mlogloss:0.60938	valid-mlogloss:1.03559
[250]	train-mlogloss:0.520795	valid-mlogloss:1.01434
[300]	train-mlogloss:0.449827	valid-mlogloss:1.00456
[350]	train-mlogloss:0.3923	valid-mlogloss:0.999805
[400]	train-mlogloss:0.342235	valid-mlogloss:0.999826
[450]	train-mlogloss:0.301923	valid-mlogloss:1.00284
[500]	train-mlogloss:0.266097	valid-mlogloss:1.00746
Stopping. Best iteration:
[408]	train-mlogloss:0.334443	valid-mlogloss:0.999366

0.999366185342

0.99 validation loss, that’s pretty good (though it’s not as stellar as the ~0.33 training loss)

Since we’re training a decision tree, we actually have an interpretable model. We can visualize how much the model weighed the improtance of each feature:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams['figure.figsize'] = (9.0, 9.0)
xgb.plot_importance(booster=model)
plt.show()

We’ve now got a great visualization of the relative importance of the features. What does this decision tree actually look like, then?

plt.rcParams['figure.figsize'] = (50.0, 50.0)
xgb.plot_tree(booster=model)
plt.show()

decision tree beautiful

Test Evaluation

So how did this do on the Kaggle leaderboards?

Future directions

This model certainly has use beyond Kaggle. Here are some of my ideas for expansion beyond the scope of this competition:

  1. Incorporating text data from EncyclopediaofMolecularCellBiologyandMolecularMedicine16volumesWiley2006, Cell-Molecular-Biology-Concepts-Experiments 7th edition, [ctdbase](http://ctdbase.org/), and the Comparative Toxicogenomics Database.
  2. Allowing pre-trained Bio-NER taggers in the task, including tmVar , TaggerOne, GNormPlus
  3. Mapping the variations to RSIDs and search related articles in PubMed or PMC.
  4. Adding on functionality for themygene api, which contains even more information about genes (such as family, overall description, etc)
  5. Incorporating aminoacid sequences for the genes in the original dataset, in FASTA
  6. Including allele frequency information from gnomAD or ExAC
  7. Guiding Named Entity Recognition modles could be improved (Just for corpus for feature generation and training specific NLP tools.) by adding data from the following: ICD, SNOMED CT,NCI Thesaurus, CPT, MedDRA, SNOMED CT.
  8. Incorporating pretrained word2vec models from Biomedical natural language processing

Subscribe to know whenever I post new content. I don't spam!


At least this isn't a full screen popup

That would be more annoying. Anyways, if you like what you're reading, consider subscribing to my newsletter! I'll notify you when I publish new posts - no spam.