Literature parsing for mutation classification

Automating the work of Grad Students

Originally posted on Kaggle

Background

Ever since the completion of the Human Genome Project, many have been projecting the advent of an era in which personalized medicine will be able to beat cancer. It sounds simple enough: by knowing precisely which mutations a person carries in their now-sequenceable genome, one could understand, right down to the molecular level, how the cancer came about, how it works, and how to fight it.

Unfortunately, there are still two major obstacles to this vision: the genomic diversity of cancer, and the manpower limitations of current research.

Cancer arises when mutations occur that disrupt growth inhibition, promote telomerase immortalization, confer angiogenic ability, and so on. The problem is that it’s not just cancer cells that accumulate mutations. Our bodies’ cells do not retain completely unchanged copies of the 3-billion-bp genome throughout our lives; they accumulate countless harmless (or neutral) mutations that have no visible effect. Telling the disease-driving mutations apart from this neutral background is the hard part.

The default method for categorizing these mutations is very time-inefficient. With few exceptions, most mutation interpretations are done manually. This requires a researcher to manually identify, review, and classify each mutation based on evidence buried in increasingly verbose and SEO-optimized scientific papers.

Given this massive state space of unique mutations, it makes sense that the manual approach has not yet delivered this Holy Grail of personalized medicine.

Memorial Sloan Kettering Cancer Center (MSKCC) has made available an expert-annotated knowledge base in which world-class researchers and oncologists have manually annotated thousands of mutations. They have asked for help developing a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

The Dataset

This dataset is from the MSK Redefining Cancer Treatment Kaggle Competition.

Walkthrough

First, let’s start with the requisite imports:

from sklearn import preprocessing, pipeline, feature_extraction, decomposition, model_selection, metrics, svm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import normalize, Imputer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
import sklearn
import pandas as pd
import numpy as np
import xgboost as xgb
import datetime

Then, we can import the dataset:

# the *_variants files are plain CSVs; the *_text files use a literal "||" delimiter
train = pd.read_csv('../input/training_variants')
test = pd.read_csv('../input/test_variants')
trainx = pd.read_csv('../input/training_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
testx = pd.read_csv('../input/test_text', sep="\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
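
The escaped separator and skiprows=1 reflect the raw layout of the text files: a single header line followed by ID||Text records, roughly like this (illustrative, reconstructed from the read options above rather than copied from the files):

ID,Text
0||Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes...
1||Abstract Background  Non-small cell lung cancer...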

train = pd.merge(train, trainx, how='left', on='ID').fillna('')
y = train['Class'].values
train = train.drop(['Class'], axis=1)

test = pd.merge(test, testx, how='left', on='ID').fillna('')
pid = test['ID'].values
train.head()
   ID    Gene             Variation                                               Text
0   0  FAM58A  Truncating Mutations  Cyclin-dependent kinases (CDKs) regulate a var…
1   1     CBL                 W802*  Abstract Background  Non-small cell lung canc…
2   2     CBL                 Q249E  Abstract Background  Non-small cell lung canc…
3   3     CBL                 N454D  Recent evidence has demonstrated that acquired…
4   4     CBL                 L399V  Oncogenic mutations in the monomeric Casitas B…

We can also look at the training labels, the test set, and the test IDs:

y
array([1, 2, 2, ..., 1, 4, 4])
test.head()
   ID     Gene Variation                                               Text
0   0    ACSL4     R570S  2. This mutation resulted in a myeloproliferat…
1   1    NAGLU     P521L  Abstract The Large Tumor Suppressor 1 (LATS1)…
2   2      PAH     L333F  Vascular endothelial growth factor receptor (V…
3   3     ING1     A148D  Inflammatory myofibroblastic tumor (IMT) is a …
4   4  TMEM216      G77A  Abstract Retinoblastoma is a pediatric retina…
pid
array([   0,    1,    2, ..., 5665, 5666, 5667])
Next, we combine the train and test sets and add two simple features: how many words of the gene name and of the variation name actually appear in the article text.

df_all = pd.concat((train, test), axis=0, ignore_index=True)
df_all['Gene_Share'] = df_all.apply(lambda r: sum([1 for w in r['Gene'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all['Variation_Share'] = df_all.apply(lambda r: sum([1 for w in r['Variation'].split(' ') if w in r['Text'].split(' ')]), axis=1)
df_all.head()
   ID    Gene             Variation                                               Text  Gene_Share  Variation_Share
0   0  FAM58A  Truncating Mutations  Cyclin-dependent kinases (CDKs) regulate a var…           1                1
1   1     CBL                 W802*  Abstract Background  Non-small cell lung canc…           1                1
2   2     CBL                 Q249E  Abstract Background  Non-small cell lung canc…           1                1
3   3     CBL                 N454D  Recent evidence has demonstrated that acquired…           1                1
4   4     CBL                 L399V  Oncogenic mutations in the monomeric Casitas B…           1                1
# collect unique single-word gene and variation names (not used further in this walkthrough)
gen_var_lst = sorted(list(train.Gene.unique()) + list(train.Variation.unique()))
print(len(gen_var_lst))
3260
gen_var_lst = [x for x in gen_var_lst if len(x.split(' '))==1]
print(len(gen_var_lst))
3091
i_ = 0
# label-encode the categorical columns and add simple length-based features
for c in df_all.columns:
    if df_all[c].dtype == 'object':
        if c in ['Gene','Variation']:
            # encode the label, plus character and word counts of the name
            lbl = preprocessing.LabelEncoder()
            df_all[c+'_lbl_enc'] = lbl.fit_transform(df_all[c].values)
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' ')))
        elif c != 'Text':
            lbl = preprocessing.LabelEncoder()
            df_all[c] = lbl.fit_transform(df_all[c].values)
        if c=='Text':
            # for the article text, keep only character and word counts
            df_all[c+'_len'] = df_all[c].map(lambda x: len(str(x)))
            df_all[c+'_words'] = df_all[c].map(lambda x: len(str(x).split(' ')))

train = df_all.iloc[:len(train)]
test = df_all.iloc[len(train):]
train.head()
   ID    Gene             Variation                                               Text  Gene_Share  Variation_Share  Gene_lbl_enc  Gene_len  Gene_words  Variation_lbl_enc  Variation_len  Variation_words  Text_len  Text_words
0   0  FAM58A  Truncating Mutations  Cyclin-dependent kinases (CDKs) regulate a var…           1                1           447         6           1               7654             20                2     39672        6105
1   1     CBL                 W802*  Abstract Background  Non-small cell lung canc…           1                1           216         3           1               8255              5                1     36691        5783
2   2     CBL                 Q249E  Abstract Background  Non-small cell lung canc…           1                1           216         3           1               5191              5                1     36691        5783
3   3     CBL                 N454D  Recent evidence has demonstrated that acquired…           1                1           216         3           1               4572              5                1     36238        5625
4   4     CBL                 L399V  Oncogenic mutations in the monomeric Casitas B…           1                1           216         3           1               3958              5                1     41308        6248
test.head()
      ID     Gene Variation                                               Text  Gene_Share  Variation_Share  Gene_lbl_enc  Gene_len  Gene_words  Variation_lbl_enc  Variation_len  Variation_words  Text_len  Text_words
3321   0    ACSL4     R570S  2. This mutation resulted in a myeloproliferat…           0                1            28         5           1               6404              5                1     49829        7495
3322   1    NAGLU     P521L  Abstract The Large Tumor Suppressor 1 (LATS1)…           0                1           852         5           1               5005              5                1     31326        4762
3323   2      PAH     L333F  Vascular endothelial growth factor receptor (V…           0                1           950         3           1               3915              5                1     75282       11191
3324   3     ING1     A148D  Inflammatory myofibroblastic tumor (IMT) is a …           0                1           657         4           1                 85              5                1     53996        8439
3325   4  TMEM216      G77A  Abstract Retinoblastoma is a pediatric retina…           0                1          1376         7           1               2780              4                1     76967       11226
train.shape, test.shape
((3321, 14), (5668, 14))

Training

First, we define two small helper transformers for the feature pipeline: cust_regression_vals drops the raw Gene/Variation/ID/Text columns and passes the remaining numeric features through, while cust_txt_col selects a single text column for the downstream vectorizers.

class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        x = x.drop(['Gene', 'Variation','ID','Text'],axis=1).values
        return x

class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x[self.key].apply(str)
print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs = -1,
        transformer_list = [
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([('Gene', cust_txt_col('Gene')), 
                                       ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([('Variation', cust_txt_col('Variation')), 
                                       ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            #('pi3', pipeline.Pipeline([('Text', cust_txt_col('Text')), ('tfidf_Text', feature_extraction.text.TfidfVectorizer(ngram_range=(1, 2))), ('tsvd3', decomposition.TruncatedSVD(n_components=50, n_iter=25, random_state=12))]))
        ])
    )])

train = fp.fit_transform(train); print(train.shape)
test = fp.transform(test); print(test.shape)
Pipeline...
(3321, 50)
(5668, 50)
The full pipeline also vectorizes the article text, using a hashing vectorizer followed by TF-IDF weighting and SVD. Note that this version is meant to be fit on the raw dataframes in place of the first pipeline above, not on its output:

print('Pipeline...')
fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion(
        n_jobs = -1,
        transformer_list = [
            ('standard', cust_regression_vals()),
            ('pi1', pipeline.Pipeline([('Gene', cust_txt_col('Gene')), 
                                       ('count_Gene', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd1', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            ('pi2', pipeline.Pipeline([('Variation', cust_txt_col('Variation')), 
                                       ('count_Variation', feature_extraction.text.CountVectorizer(analyzer=u'char', ngram_range=(1, 8))), 
                                       ('tsvd2', decomposition.TruncatedSVD(n_components=20, n_iter=25, random_state=12))])),
            # previously commented out above to stay within Kaggle's kernel limits
             ('pi3', pipeline.Pipeline([('Text', cust_txt_col('Text')), 
                                        ('hv', feature_extraction.text.HashingVectorizer(decode_error='ignore', n_features=2 ** 16, non_negative=True, ngram_range=(1, 3))),
                                        ('tfidf_Text', feature_extraction.text.TfidfTransformer()), 
                                        ('tsvd3', decomposition.TruncatedSVD(n_components=300, n_iter=25, random_state=12))]))
        ])
    )])

train = fp.fit_transform(train)
print (train.shape)

# transform the test set in chunks of `step` rows to keep memory usage down
test_t = np.empty([0, train.shape[1]])
step = 200
for i in range(0, len(test), step):
    step_end = i+step
    step_end = step_end if step_end < len(test) else len(test)
    _test = fp.transform(test.iloc[i:step_end])
    test_t = np.vstack((test_t, _test))
test = test_t
print (test.shape)

Validation

Now we want to see how our training actually did. First, let’s fix our training labels:

y = y - 1  # shift labels from 1-9 to 0-8 for xgboost
denom = 0
fold = 5  # average predictions over 5 differently seeded train/validation splits
for i in range(fold):
    params = {
        'eta': 0.02,
        'max_depth': 6, 
        'objective': 'multi:softprob',
        'eval_metric': 'mlogloss',
        'num_class': 9,
        'seed': i,
        'silent': True
    }
    x1, x2, y1, y2 = model_selection.train_test_split(train, y, test_size=0.15, random_state=i)
    watchlist = [(xgb.DMatrix(x1, y1), 'train'), (xgb.DMatrix(x2, y2), 'valid')]
    model = xgb.train(params, xgb.DMatrix(x1, y1), 1000,  watchlist, verbose_eval=50, early_stopping_rounds=100)
    score1 = metrics.log_loss(y2, model.predict(xgb.DMatrix(x2), ntree_limit=model.best_ntree_limit), labels = list(range(9)))
    print(score1)
    #if score < 0.9:
    if denom != 0:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds += pred
    else:
        pred = model.predict(xgb.DMatrix(test), ntree_limit=model.best_ntree_limit+80)
        preds = pred.copy()
    denom += 1
    submission = pd.DataFrame(pred, columns=['class'+str(c+1) for c in range(9)])
    submission['ID'] = pid
    submission.to_csv('submission_xgb_fold_'  + str(i) + '.csv', index=False)
preds /= denom
submission = pd.DataFrame(preds, columns=['class'+str(c+1) for c in range(9)])
submission['ID'] = pid
submission.to_csv('submission_xgb.csv', index=False)
[0] train-mlogloss:2.15782  valid-mlogloss:2.16479
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.24063  valid-mlogloss:1.47325
[100]   train-mlogloss:0.912934 valid-mlogloss:1.25467
[150]   train-mlogloss:0.734402 valid-mlogloss:1.1651
[200]   train-mlogloss:0.613828 valid-mlogloss:1.11797
[250]   train-mlogloss:0.532413 valid-mlogloss:1.09372
[300]   train-mlogloss:0.459742 valid-mlogloss:1.0782
[350]   train-mlogloss:0.40078  valid-mlogloss:1.07044
[400]   train-mlogloss:0.352744 valid-mlogloss:1.06854
[450]   train-mlogloss:0.31163  valid-mlogloss:1.06999
[500]   train-mlogloss:0.276759 valid-mlogloss:1.07347
Stopping. Best iteration:
[402]   train-mlogloss:0.35102  valid-mlogloss:1.06827

1.06826797168
[0] train-mlogloss:2.15929  valid-mlogloss:2.16676
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.25902  valid-mlogloss:1.45637
[100]   train-mlogloss:0.927344 valid-mlogloss:1.22604
[150]   train-mlogloss:0.747399 valid-mlogloss:1.12135
[200]   train-mlogloss:0.621636 valid-mlogloss:1.07068
[250]   train-mlogloss:0.536327 valid-mlogloss:1.04974
[300]   train-mlogloss:0.466379 valid-mlogloss:1.03689
[350]   train-mlogloss:0.407413 valid-mlogloss:1.03094
[400]   train-mlogloss:0.359413 valid-mlogloss:1.02829
[450]   train-mlogloss:0.319446 valid-mlogloss:1.02665
[500]   train-mlogloss:0.284909 valid-mlogloss:1.02779
[550]   train-mlogloss:0.254682 valid-mlogloss:1.03102
Stopping. Best iteration:
[460]   train-mlogloss:0.312175 valid-mlogloss:1.0263

1.02630396074
[0] train-mlogloss:2.15922  valid-mlogloss:2.16414
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.24657  valid-mlogloss:1.43581
[100]   train-mlogloss:0.915174 valid-mlogloss:1.20246
[150]   train-mlogloss:0.724838 valid-mlogloss:1.10001
[200]   train-mlogloss:0.61148  valid-mlogloss:1.04971
[250]   train-mlogloss:0.526611 valid-mlogloss:1.02218
[300]   train-mlogloss:0.459829 valid-mlogloss:1.00654
[350]   train-mlogloss:0.403125 valid-mlogloss:0.994575
[400]   train-mlogloss:0.356882 valid-mlogloss:0.988575
[450]   train-mlogloss:0.314969 valid-mlogloss:0.983562
[500]   train-mlogloss:0.276921 valid-mlogloss:0.983204
[550]   train-mlogloss:0.247534 valid-mlogloss:0.983713
[600]   train-mlogloss:0.220643 valid-mlogloss:0.986902
Stopping. Best iteration:
[517]   train-mlogloss:0.26601  valid-mlogloss:0.982651

0.98265060772
[0] train-mlogloss:2.15922  valid-mlogloss:2.16511
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.24867  valid-mlogloss:1.44642
[100]   train-mlogloss:0.916804 valid-mlogloss:1.22938
[150]   train-mlogloss:0.738456 valid-mlogloss:1.13725
[200]   train-mlogloss:0.617969 valid-mlogloss:1.08818
[250]   train-mlogloss:0.53522  valid-mlogloss:1.06121
[300]   train-mlogloss:0.464503 valid-mlogloss:1.04386
[350]   train-mlogloss:0.40767  valid-mlogloss:1.03352
[400]   train-mlogloss:0.361219 valid-mlogloss:1.02747
[450]   train-mlogloss:0.320972 valid-mlogloss:1.02707
[500]   train-mlogloss:0.286176 valid-mlogloss:1.03045
Stopping. Best iteration:
[422]   train-mlogloss:0.342379 valid-mlogloss:1.02614

1.02614357069
[0] train-mlogloss:2.15883  valid-mlogloss:2.16265
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.
  
Will train until valid-mlogloss hasn't improved in 100 rounds.
[50]    train-mlogloss:1.25701  valid-mlogloss:1.41193
[100]   train-mlogloss:0.926052 valid-mlogloss:1.18233
[150]   train-mlogloss:0.739056 valid-mlogloss:1.08336
[200]   train-mlogloss:0.60938  valid-mlogloss:1.03559
[250]   train-mlogloss:0.520795 valid-mlogloss:1.01434
[300]   train-mlogloss:0.449827 valid-mlogloss:1.00456
[350]   train-mlogloss:0.3923   valid-mlogloss:0.999805
[400]   train-mlogloss:0.342235 valid-mlogloss:0.999826
[450]   train-mlogloss:0.301923 valid-mlogloss:1.00284
[500]   train-mlogloss:0.266097 valid-mlogloss:1.00746
Stopping. Best iteration:
[408]   train-mlogloss:0.334443 valid-mlogloss:0.999366

0.999366185342

A validation log-loss of about 1.0 (averaging roughly 1.02 across the five folds) is pretty good, though it’s nowhere near the ~0.3 training loss, so the model is clearly overfitting.

Since we’re training gradient-boosted decision trees, we actually have a fairly interpretable model. We can visualize how much importance the model assigned to each feature:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams['figure.figsize'] = (9.0, 9.0)
xgb.plot_importance(booster=model)
plt.show()

We’ve now got a great visualization of the relative importance of the features. What does one of these boosted trees actually look like, then?

plt.rcParams['figure.figsize'] = (50.0, 50.0)
xgb.plot_tree(booster=model)
plt.show()

beautiful

Test Evaluation

So how did this do on the Kaggle leaderboards?
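
The competition is scored on multi-class log loss, so before uploading a submission we can estimate leaderboard performance by scoring held-out predictions the same way. A minimal sketch, reusing model, x2, and y2 from the final fold above:

from sklearn.metrics import log_loss

# class-probability matrix for the held-out split (shape: n_samples x 9)
val_probs = model.predict(xgb.DMatrix(x2), ntree_limit=model.best_ntree_limit)

# multi-class log loss, the same metric used on the Kaggle leaderboard
print(log_loss(y2, val_probs, labels=list(range(9))))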

Future directions

This model certainly has use beyond Kaggle. Here are some of my ideas for expansion beyond the scope of this competition:

  1. Incorporating text data from the Encyclopedia of Molecular Cell Biology and Molecular Medicine (16 volumes, Wiley, 2006), Cell and Molecular Biology: Concepts and Experiments (7th edition), and the Comparative Toxicogenomics Database ([ctdbase](http://ctdbase.org/)).
  2. Applying pre-trained Bio-NER taggers to the task, including tmVar, TaggerOne, and GNormPlus.
  3. Mapping the variations to rsIDs and searching for related articles in PubMed or PMC.
  4. Adding functionality for the mygene API, which provides even more information about genes (such as family, overall description, etc.); see the sketch after this list.
  5. Incorporating amino-acid sequences (in FASTA format) for the genes in the original dataset.
  6. Including allele frequency information from gnomAD or ExAC.
  7. Improving named entity recognition models (both as a corpus for feature generation and for training task-specific NLP tools) by adding data from vocabularies such as ICD, SNOMED CT, the NCI Thesaurus, CPT, and MedDRA.
  8. Incorporating pre-trained word2vec models from the biomedical NLP literature.
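
As a sketch of idea 4, the mygene Python client can pull gene-level annotations for the symbols in the dataset (the symbols and fields queried here are illustrative choices, not from the original post):

import mygene

mg = mygene.MyGeneInfo()

# look up annotations for a few gene symbols from the training set
results = mg.querymany(['FAM58A', 'CBL', 'PAH'],
                       scopes='symbol',        # match queries against gene symbols
                       fields='name,summary',  # annotation fields to return
                       species='human')

for hit in results:
    print(hit['query'], '->', hit.get('name'))

The returned fields could then be merged into df_all by gene symbol, just like the hand-built length features above.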

Cited as:

@article{mcateer2017cancermut,
  title   = "Literature parsing for mutation classification",
  author  = "McAteer, Matthew",
  journal = "matthewmcateer.me",
  year    = "2017",
  url     = "https://matthewmcateer.me/blog/literature-parsing-for-mutation-classification/"
}

If you notice mistakes or errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
