Literature parsing for mutation classification

Automating the work of Grad Students

Originally posted on Kaggle

Background

Ever since the completion of the Human Genome Project, many have been projecting the advent of an era in which personalized medicine will beat cancer. It sounds simple enough: by knowing precisely which mutations one carries in one's now-sequenceable genome, one could understand right down to the molecular level how a cancer came about, how it works, and how to fight it.

Unfortunately, there are still two major obstacles to this vision: the genomic diversity of cancer, and the manpower limitations of current research.

Cancer arises when mutations occur that disrupt growth inhibition, promote telomerase immortalization, confer angiogenic ability, etc. The problem is that it's not just cancer cells that accumulate mutations. Our bodies' cells do not retain completely unchanged copies of the 3-billion-bp genome throughout our lives; they accumulate countless harmless (or neutral) mutations that never manifest as disease.

The default method for categorizing these mutations is very time-inefficient. With few exceptions, most mutation interpretations are done manually. A researcher must identify, review, and classify each mutation based on evidence from increasingly verbose and SEO-optimized scientific papers.

Given this massive state space of unique mutations, it makes sense that the manual approach has not yielded this Holy Grail of personalized medicine yet.

Memorial Sloan Kettering Cancer Center (MSKCC) is making available an expert-annotated knowledge base in which world-class researchers and oncologists have manually annotated thousands of mutations. MSKCC has asked for help developing a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

The Dataset

This dataset is from the MSK Redefining Cancer Treatment Kaggle Competition.

Walkthrough

First, let's start with the requisite imports:

from sklearn import preprocessing, pipeline, feature_extraction, decomposition, model_selection, metrics, svm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import normalize, Imputer  # Imputer was replaced by SimpleImputer in scikit-learn 0.22+
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
import sklearn
import pandas as pd
import numpy as np
import xgboost as xgb
import datetime

Then, we can import the dataset:

train = pd.read_csv('../input/training_variants')
test = pd.read_csv('../input/test_variants')
trainx = pd.read_csv('../input/training_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
testx = pd.read_csv('../input/test_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
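
A quick note on that separator: the text files delimit the ID and the article text with a literal ||, and pandas treats multi-character separators as regular expressions, so the pipes have to be escaped. A minimal illustration (made-up row):

from io import StringIO

raw = "ID,Text\n0||Cyclin-dependent kinases (CDKs) regulate a variety of processes...\n"
pd.read_csv(StringIO(raw), sep="\|\|", engine='python',
            header=None, skiprows=1, names=["ID", "Text"])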

With everything loaded, we merge the variant tables with their text and split off the labels:

train = pd.merge(train, trainx, how='left', on='ID').fillna('')
y = train['Class'].values  # the labels: integers 1 through 9
train = train.drop(['Class'], axis=1)

test = pd.merge(test, testx, how='left', on='ID').fillna('')
pid = test['ID'].values  # kept aside for the submission file
train.head()
ID Gene Variation Text
0 0 FAM58A Truncating Mutations Cyclin-dependent kinases (CDKs) regulate a var…
1 1 CBL W802* Abstract Background Non-small cell lung canc…
2 2 CBL Q249E Abstract Background Non-small cell lung canc…
3 3 CBL N454D Recent evidence has demonstrated that acquired…
4 4 CBL L399V Oncogenic mutations in the monomeric Casitas B…

Peeking at the labels, the test set, and the IDs we'll need for the submission file:

y
array([1, 2, 2, ..., 1, 4, 4])
test.head()
ID Gene Variation Text
0 0 ACSL4 R570S 2. This mutation resulted in a myeloproliferat…
1 1 NAGLU P521L Abstract The Large Tumor Suppressor 1 (LATS1)…
2 2 PAH L333F Vascular endothelial growth factor receptor (V…
3 3 ING1 A148D Inflammatory myofibroblastic tumor (IMT) is a …
4 4 TMEM216 G77A Abstract Retinoblastoma is a pediatric retina…
pid
array([   0,    1,    2, ..., 5665, 5666, 5667])
Next, we concatenate train and test and engineer two simple overlap features: how many words of the gene name and of the variation name appear verbatim in the article text.

df_all = pd.concat((train, test), axis=0, ignore_index=True)
df_all['Gene_Share'] = df_all.apply(lambda r: sum([1 for w in r['Gene'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all['Variation_Share'] = df_all.apply(lambda r: sum([1 for w in r['Variation'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all.head()
ID Gene Variation Text Gene_Share Variation_Share
0 0 FAM58A Truncating Mutations Cyclin-dependent kinases (CDKs) regulate a var… 1 1
1 1 CBL W802* Abstract Background Non-small cell lung canc… 1 1
2 2 CBL Q249E Abstract Background Non-small cell lung canc… 1 1
3 3 CBL N454D Recent evidence has demonstrated that acquired… 1 1
4 4 CBL L399V Oncogenic mutations in the monomeric Casitas B… 1 1
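
As a minimal sketch of what these features capture (a made-up row, not from the dataset):

toy = pd.DataFrame({
    'Gene': ['CBL'],
    'Variation': ['Truncating Mutations'],
    'Text': ['Truncating Mutations in CBL impair its E3 ubiquitin ligase activity.']
})
# count how many words of the variation name appear verbatim in the text
toy['Variation_Share'] = toy.apply(
    lambda r: sum(1 for w in r['Variation'].split(' ') if w in r['Text'].split(' ')), axis=1)
print(toy['Variation_Share'].iloc[0])  # 2 -- both 'Truncating' and 'Mutations' match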
gen_var_lst = sorted(list(train.Gene.unique()) + list(train.Variation.unique()))
print(len(gen_var_lst))
3260
gen_var_lst = [x for x in gen_var_lst if len(x.split(' '))==1]
print(len(gen_var_lst))
3091
for c in df_all.columns:
    if df_all[c].dtype == 'object':
        if c in ['Gene','Variation']:
            # integer-encode the raw strings, and record their length in characters and in words
            lbl = preprocessing.LabelEncoder()
            df_all[c+'_lbl_enc'] = lbl.fit_transform(df_all[c].values)
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' ')))
        elif c != 'Text':
            lbl = preprocessing.LabelEncoder()
            df_all[c] = lbl.fit_transform(df_all[c].values)
        if c=='Text':
            # for the article text, keep only the length features; the text itself is vectorized later
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' ')))

train = df_all.iloc[:len(train)]
test = df_all.iloc[len(train):]
train.head()
ID Gene Variation Text Gene_Share Variation_Share Gene_lbl_enc Gene_len Gene_words Variation_lbl_enc Variation_len Variation_words Text_len Text_words
0 0 FAM58A Truncating Mutations Cyclin-dependent kinases (CDKs) regulate a var… 1 1 447 6 1 7654 20 2 39672 6105
1 1 CBL W802* Abstract Background Non-small cell lung canc… 1 1 216 3 1 8255 5 1 36691 5783
2 2 CBL Q249E Abstract Background Non-small cell lung canc… 1 1 216 3 1 5191 5 1 36691 5783
3 3 CBL N454D Recent evidence has demonstrated that acquired… 1 1 216 3 1 4572 5 1 36238 5625
4 4 CBL L399V Oncogenic mutations in the monomeric Casitas B… 1 1 216 3 1 3958 5 1 41308 6248
test.head()
ID Gene Variation Text Gene_Share Variation_Share Gene_lbl_enc Gene_len Gene_words Variation_lbl_enc Variation_len Variation_words Text_len Text_words
3321 0 ACSL4 R570S 2. This mutation resulted in a myeloproliferat… 0 1 28 5 1 6404 5 1 49829 7495
3322 1 NAGLU P521L Abstract The Large Tumor Suppressor 1 (LATS1)… 0 1 852 5 1 5005 5 1 31326 4762
3323 2 PAH L333F Vascular endothelial growth factor receptor (V… 0 1 950 3 1 3915 5 1 75282 11191
3324 3 ING1 A148D Inflammatory myofibroblastic tumor (IMT) is a … 0 1 657 4 1 85 5 1 53996 8439
3325 4 TMEM216 G77A Abstract Retinoblastoma is a pediatric retina… 0 1 1376 7 1 2780 4 1 76967 11226
train.shape, test.shape
((3321, 14), (5668, 14))

Training

class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Pass through the engineered numeric columns, dropping the raw string columns."""
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        x = x.drop(['Gene', 'Variation', 'ID', 'Text'], axis=1).values
        return x

class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Select a single column from the data frame and return it as strings."""
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x[self.key].apply(str)
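
As a quick sanity check, these transformers behave like this on a toy frame (hypothetical values):

demo = pd.DataFrame({'ID': [0], 'Gene': ['CBL'], 'Variation': ['W802*'],
                     'Text': ['...'], 'Gene_Share': [1], 'Variation_Share': [1]})
print(cust_txt_col('Gene').fit_transform(demo))    # a Series containing 'CBL'
print(cust_regression_vals().fit_transform(demo))  # array([[1, 1]]) -- only the numeric columns survive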
These feed into a FeatureUnion that stacks the engineered numeric features alongside truncated-SVD-compressed character n-gram counts of the gene and variation strings (the text branch is commented out here to stay within Kaggle's kernel limits):

print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs = -1,
        transformer_list = [
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([('Gene', cust_txt_col('Gene')), 
                                       ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([('Variation', cust_txt_col('Variation')), 
                                       ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            #('pi3', pipeline.Pipeline([('Text', cust_txt_col('Text')), ('tfidf_Text', feature_extraction.text.TfidfVectorizer(ngram_range=(1, 2))), ('tsvd3', decomposition.TruncatedSVD(n_components=50, n_iter=25, random_state=12))]))
        ])
    )])

train = fp.fit_transform(train); print(train.shape)
test = fp.transform(test); print(test.shape)
Pipeline...
(3321, 50)
(5668, 50)
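
Why character n-grams for strings like W802*? Point-mutation names share structure at the character level (wild-type residue, position, substituted residue) that word-level tokens would miss entirely. A toy sketch:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='char', ngram_range=(1, 3))
X = cv.fit_transform(['W802*', 'W802G', 'Q249E'])
print(X.shape)  # a small vocabulary of character n-grams, many shared between W802* and W802G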
That gives 50 features per example without touching the article text. For the full model, we rebuild the union with a third branch that hashes the text into term-frequency features, applies TF-IDF weighting, and compresses the result with truncated SVD. Note that this version has to be fit on the original data frames rather than the arrays produced above.

print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs = -1,
        transformer_list = [
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([('Gene', cust_txt_col('Gene')), 
                                       ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([('Variation', cust_txt_col('Variation')), 
                                       ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            # this text branch was commented out above to fit within Kaggle's kernel limits
             ('pi3', pipeline.Pipeline([('Text', cust_txt_col('Text')), 
                                        ('hv', feature_extraction.text.HashingVectorizer(decode_error='ignore', n_features=2 ** 16, non_negative=True, ngram_range=(1, 3))),  # non_negative became alternate_sign=False in later scikit-learn releases
                                        ('tfidf_Text', feature_extraction.text.TfidfTransformer()), 
                                        ('tsvd3', decomposition.TruncatedSVD(n_components=300, n_iter=25, random_state=12))]))
        ])
    )])

train = fp.fit_transform(train)
print (train.shape)

# transform the test set in 200-row chunks to bound peak memory usage
test_t = np.empty([0, train.shape[1]])
step = 200
for i in range(0, len(test), step):
    step_end = min(i + step, len(test))
    _test = fp.transform(test.iloc[i:step_end])
    test_t = np.vstack((test_t, _test))
test = test_t
print (test.shape)
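
The same pattern, written as a reusable helper (a sketch; the 200-row batch size is arbitrary):

def transform_in_chunks(fitted_pipeline, df, batch_size=200):
    # apply an already-fitted transformer one slice at a time, then stack the results
    parts = [fitted_pipeline.transform(df.iloc[i:i + batch_size])
             for i in range(0, len(df), batch_size)]
    return np.vstack(parts)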

Validation

Now we want to see how our training actually did. First, let's fix our labels (XGBoost expects classes numbered from zero):

y = y - 1  # shift the labels from 1-9 to 0-8 for XGBoost
denom = 0
fold = 5  # five random train/valid splits, ensembled by averaging
for i in range(fold):
    params = {
        'eta': 0.02,
        'max_depth': 6, 
        'objective': 'multi:softprob',
        'eval_metric': 'mlogloss',
        'num_class': 9,
        'seed': i,
        'silent': True
    }
    x1, x2, y1, y2 = model_selection.train_test_split(train, y, test_size=0.15, random_state=i)
    watchlist = [(xgb.DMatrix(x1, y1), 'train'), (xgb.DMatrix(x2, y2), 'valid')]
    model = xgb.train(params, xgb.DMatrix(x1, y1), 1000,  watchlist, verbose_eval=50, early_stopping_rounds=100)
    score1 = metrics.log_loss(y2, model.predict(xgb.DMatrix(x2), ntree_limit=model.best_ntree_limit), labels = list(range(9)))
    print(score1)
    # accumulate test predictions across folds; predicting slightly past the
    # best iteration (+80 trees) is a common hedge with early stopping
    if denom != 0:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds += pred
    else:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds = pred.copy()
    denom += 1
    submission = pd.DataFrame(pred, columns=['class'+str(c+1) for c in range(9)])
    submission['ID'] = pid
    submission.to_csv('submission_xgb_fold_'  + str(i) + '.csv', index=False)
preds /= denom  # average the predictions across folds
submission = pd.DataFrame(preds, columns=['class'+str(c+1) for c in range(9)])
submission['ID'] = pid
submission.to_csv('submission_xgb.csv', index=False)
[0] train-mlogloss:2.15782  valid-mlogloss:2.16479
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.24063  valid-mlogloss:1.47325
[100]   train-mlogloss:0.912934 valid-mlogloss:1.25467
[150]   train-mlogloss:0.734402 valid-mlogloss:1.1651
[200]   train-mlogloss:0.613828 valid-mlogloss:1.11797
[250]   train-mlogloss:0.532413 valid-mlogloss:1.09372
[300]   train-mlogloss:0.459742 valid-mlogloss:1.0782
[350]   train-mlogloss:0.40078  valid-mlogloss:1.07044
[400]   train-mlogloss:0.352744 valid-mlogloss:1.06854
[450]   train-mlogloss:0.31163  valid-mlogloss:1.06999
[500]   train-mlogloss:0.276759 valid-mlogloss:1.07347
Stopping. Best iteration:
[402]   train-mlogloss:0.35102  valid-mlogloss:1.06827

1.06826797168
[0] train-mlogloss:2.15929  valid-mlogloss:2.16676
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.25902  valid-mlogloss:1.45637
[100]   train-mlogloss:0.927344 valid-mlogloss:1.22604
[150]   train-mlogloss:0.747399 valid-mlogloss:1.12135
[200]   train-mlogloss:0.621636 valid-mlogloss:1.07068
[250]   train-mlogloss:0.536327 valid-mlogloss:1.04974
[300]   train-mlogloss:0.466379 valid-mlogloss:1.03689
[350]   train-mlogloss:0.407413 valid-mlogloss:1.03094
[400]   train-mlogloss:0.359413 valid-mlogloss:1.02829
[450]   train-mlogloss:0.319446 valid-mlogloss:1.02665
[500]   train-mlogloss:0.284909 valid-mlogloss:1.02779
[550]   train-mlogloss:0.254682 valid-mlogloss:1.03102
Stopping. Best iteration:
[460]   train-mlogloss:0.312175 valid-mlogloss:1.0263

1.02630396074
[0] train-mlogloss:2.15922  valid-mlogloss:2.16414
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.24657  valid-mlogloss:1.43581
[100]   train-mlogloss:0.915174 valid-mlogloss:1.20246
[150]   train-mlogloss:0.724838 valid-mlogloss:1.10001
[200]   train-mlogloss:0.61148  valid-mlogloss:1.04971
[250]   train-mlogloss:0.526611 valid-mlogloss:1.02218
[300]   train-mlogloss:0.459829 valid-mlogloss:1.00654
[350]   train-mlogloss:0.403125 valid-mlogloss:0.994575
[400]   train-mlogloss:0.356882 valid-mlogloss:0.988575
[450]   train-mlogloss:0.314969 valid-mlogloss:0.983562
[500]   train-mlogloss:0.276921 valid-mlogloss:0.983204
[550]   train-mlogloss:0.247534 valid-mlogloss:0.983713
[600]   train-mlogloss:0.220643 valid-mlogloss:0.986902
Stopping. Best iteration:
[517]   train-mlogloss:0.26601  valid-mlogloss:0.982651

0.98265060772
[0] train-mlogloss:2.15922  valid-mlogloss:2.16511
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.24867  valid-mlogloss:1.44642
[100]   train-mlogloss:0.916804 valid-mlogloss:1.22938
[150]   train-mlogloss:0.738456 valid-mlogloss:1.13725
[200]   train-mlogloss:0.617969 valid-mlogloss:1.08818
[250]   train-mlogloss:0.53522  valid-mlogloss:1.06121
[300]   train-mlogloss:0.464503 valid-mlogloss:1.04386
[350]   train-mlogloss:0.40767  valid-mlogloss:1.03352
[400]   train-mlogloss:0.361219 valid-mlogloss:1.02747
[450]   train-mlogloss:0.320972 valid-mlogloss:1.02707
[500]   train-mlogloss:0.286176 valid-mlogloss:1.03045
Stopping. Best iteration:
[422]   train-mlogloss:0.342379 valid-mlogloss:1.02614

1.02614357069
[0] train-mlogloss:2.15883  valid-mlogloss:2.16265
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
  
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.25701  valid-mlogloss:1.41193
[100]   train-mlogloss:0.926052 valid-mlogloss:1.18233
[150]   train-mlogloss:0.739056 valid-mlogloss:1.08336
[200]   train-mlogloss:0.60938  valid-mlogloss:1.03559
[250]   train-mlogloss:0.520795 valid-mlogloss:1.01434
[300]   train-mlogloss:0.449827 valid-mlogloss:1.00456
[350]   train-mlogloss:0.3923   valid-mlogloss:0.999805
[400]   train-mlogloss:0.342235 valid-mlogloss:0.999826
[450]   train-mlogloss:0.301923 valid-mlogloss:1.00284
[500]   train-mlogloss:0.266097 valid-mlogloss:1.00746
Stopping. Best iteration:
[408]   train-mlogloss:0.334443 valid-mlogloss:0.999366

0.999366185342

Validation losses in the 0.98–1.07 range across the folds are pretty good (though the gap from the ~0.3 training losses suggests a fair amount of overfitting).
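
For reference, the metric in these logs is multi-class log loss, which is simple enough to compute directly (a minimal sketch):

# mlogloss = -(1/N) * sum_i log(p_i[y_i]), where p_i[y_i] is the predicted
# probability assigned to example i's true class
def mlogloss(y_true, probs, eps=1e-15):
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))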

Since we're training gradient-boosted decision trees, we actually have a reasonably interpretable model. We can visualize how heavily the model weighed the importance of each feature:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams['figure.figsize'] = (9.0, 9.0)
xgb.plot_importance(booster=model)
plt.show()

We've now got a great visualization of the relative importance of the features. What does one of these trees actually look like, then?

plt.rcParams['figure.figsize'] = (50.0, 50.0)
xgb.plot_tree(booster=model)
plt.show()

beautiful
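
By default xgb.plot_tree draws the first tree in the ensemble; the num_trees argument selects a different one:

xgb.plot_tree(booster=model, num_trees=5)  # the sixth tree
plt.show()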

Test Evaluation

So how did this do on the Kaggle leaderboards?

Future directions

This model certainly has use beyond Kaggle. Here are some of my ideas for expansion beyond the scope of this competition:

  1. Incorporating text data from the Encyclopedia of Molecular Cell Biology and Molecular Medicine (16 volumes, Wiley, 2006), Cell and Molecular Biology: Concepts and Experiments (7th edition), and the [Comparative Toxicogenomics Database](http://ctdbase.org/).
  2. Applying pre-trained Bio-NER taggers to the task, including tmVar, TaggerOne, and GNormPlus.
  3. Mapping the variations to RSIDs and searching for related articles in PubMed or PMC.
  4. Adding functionality from the mygene API, which contains even more information about genes (such as family and overall description); see the sketch after this list.
  5. Incorporating amino acid sequences (in FASTA format) for the genes in the original dataset.
  6. Including allele frequency information from gnomAD or ExAC.
  7. Improving Named Entity Recognition models (both as a corpus for feature generation and for training task-specific NLP tools) by adding data from ICD, SNOMED CT, the NCI Thesaurus, CPT, and MedDRA.
  8. Incorporating pretrained word2vec models from biomedical natural language processing research.
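
As a sketch of idea 4, using the mygene Python package (the field names here are assumptions based on the MyGene.info API):

import mygene

mg = mygene.MyGeneInfo()
hits = mg.querymany(['CBL', 'FAM58A'], scopes='symbol',
                    fields='name,summary', species='human')
for h in hits:
    print(h['query'], '-', h.get('name'))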

Cited as:

@article{mcateer2017cancermut,
  title   = "Literature parsing for mutation classification",
  author  = "McAteer, Matthew",
  journal = "matthewmcateer.me",
  year    = "2017",
  url     = "https://matthewmcateer.me/blog/literature-parsing-for-mutation-classification/"
}

If you notice mistakes or errors in this post, don't hesitate to contact me at [contact at matthewmcateer dot me]; I'd be very happy to correct them right away!

See you in the next post 😄
