Building a better Patent Classifier

Building a classifier to tell you whether your LIDAR patent will be approved


Suppose we wanted to build a classifier that could tell us whether a given patent was likely to be approved or denied. Suppose we also had only a small subset of patents in the subfield we were interested in (i.e., far fewer than the millions of patents that the patent office actually has in its databases). How would we go about building our classifier?


import numpy as np
import pandas as pd
import re
from pandas import DataFrame
from sqlalchemy import create_engine
from matplotlib import pyplot as plt
import matplotlib as mpl
import csv
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn import linear_model, decomposition, datasets
from sklearn import svm
from sklearn.metrics import roc_curve, auc
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from matplotlib import rcParams
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Data Acquisition

Data for this classifier came from a USPTO API. This API provides tabular information on each of the patent invalidation cases brought before the PTAB (the Patent Trial and Appeal Board), with key trial data fields associated with each case. We acquired the data as JSON containing information on each case, including the patent application number and final patent number (when the patent is granted), the prosecution status, and the filing date. The result was a trial dataset of 358 patents. While this is much smaller than the roughly 2M-patent USPTO datasets out there, it allowed us to focus on a specific training application.

Data downloading

First, we pull down a JSON of patents that have been brought before PTAB, and join them all with patent text.

import json

# Read the raw records (assumes uspto.json holds one JSON object per line)
patent_applications = []
with open('uspto.json', 'r') as f:
    for line in f:
        patent_applications.append(json.loads(line))

# Flatten each record into the fields we care about, keyed by row index
patent_dict = {}
for i, application in enumerate(patent_applications):
    obj = application['object']
    patent_dict[i] = {
        'id': obj['id'],
        'title': obj['title'],
        'summary': obj['summary'],
        'status': obj['status'],
        'filingDate': obj['filingDate'],
        'hasPublicationNumber': 'publicationNumber' in obj,
        'publicationDate': obj['publicationDate'],
        'type': obj['objectType'],
        'actors': application['actor'],
        'num_actors': len(application['actor']),
        'action': application['action']['verb'],
        'provider': application['provider'],
        # Fall back to a sentinel string when the record has no 'published' field
        'published': application.get('published', 'NO PUBLISHED INFO'),
    }

For the sake of simplicity and visualization, we’re going to frame our dataset as a Pandas dataframe. Ideally, we would want to construct a MySQL table for productionizing the data.

We can clean this up a little bit. For example, for the provider, action, and type columns, how much do we actually gain from these columns?

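import pandas as pd
patent_df = pd.DataFrame.from_dict(patent_dict, orient='index')
# The format arguments were lost in the original; this is a reconstruction.
# astype(str) lets us count unique values even if a column holds dicts.
for column in ['provider','action','type']:
    print("'{}' Null values: {}, \t Unique values: {}".format(
        column,
        patent_df[column].isnull().sum(),
        patent_df[column].astype(str).nunique()))
# Check for repeated patent IDs
patent_df.duplicated(subset='id', keep='first').value_counts()
False 358
dtype: int64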

This should provide us with a neatly organized dictionary containing all the patent objects itemized by ID. We’ve also confirmed that there are no duplicates.

Our task is to predict the acceptance of the patents. Each of the patent objects had a “status” tag. What does this look like?

For the denials prediction algorithm, the dependent variable was the prosecution status, i.e., whether it is listed as “denied” or not. Many of these do not actually contain the requisite denial status, so we had to use certain heuristics to infer the status.

We could certainly construct a heuristic based on which of the patents in here have no published info.

print(patent_df[patent_df['published'] == 'NO PUBLISHED INFO'].shape[0])
print(patent_df[patent_df['publicationDate'] == ''].shape[0])
print(patent_df[patent_df['hasPublicationNumber'] == False].shape[0])

However, it would be much more reliable to look at the actual counts of the different ‘status’ values. In fact, had we used the published/not-published heuristic, we would have missed a crucial fact: some of the accepted patents don’t have published info. We’re probably better off omitting these columns altogether.
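As a minimal sketch of that check, the snippet below tallies status strings with the standard library. The short `statuses` list is a made-up stand-in for the real `status` column, not actual data from the dataset.

```python
from collections import Counter

# Hypothetical sample standing in for patent_df['status'].tolist()
statuses = [
    "Patented Case",
    "Patented Case",
    "Final Rejection Mailed",
    "Provisional Application Expired",
    "Patented Case",
]

# Tally each distinct status, most common first
status_counts = Counter(statuses)
for status, count in status_counts.most_common():
    print("{:<35} {}".format(status, count))
```

On the real DataFrame, `patent_df['status'].value_counts()` should give the same kind of tally in one call.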

So, with that out of the way, we can define our labelling conversion strategy as follows:

| Status Notes in USPTO Data | Granted | Non-Granted | Notes | Count Code |
|---|---|---|---|---|
| "Provisional Application Expired" | False | True | Provisional Doesn’t count | 0.0 |
| "Patented Case" | True | False | Base Case for “Granted” | 1.0 |
| "Application Undergoing Preexam Processing" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Docketed New Case - Ready for Examination" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Patent Expired Due to NonPayment of Maintenance Fees Under 37 CFR 1.362" | False | True | Going to classify this as Not granted. | 0.0 |
| "Application Dispatched from Preexam, Not Yet Docketed" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Awaiting TC Resp., Issue Fee Not Paid" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Publications -- Issue Fee Payment Verified" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Expressly Abandoned -- During Examination" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Examiner's Answer to Appeal Brief Mailed" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Sent to Classification contractor" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Final Rejection Mailed" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Advisory Action Mailed" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Appeal Brief (or Supplemental Brief) Entered and Forwarded to Examiner" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "RO PROCESSING COMPLETED-PLACED IN STORAGE" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "PCT - International Search Report Mailed to IB" | False | True | Doesn’t count. Not granted yet | 0.0 |
| "Abandoned -- Failure to Respond to an Office Action" | False | True | Not Granted due to action of filer | 0.0 |
| "Response to Non-Final Office Action Entered and Forwarded to Examiner" | False | True | Official notification of non-granted status | 0.0 |

patent_df['granted'] = patent_df['status'].apply(lambda x: 1.0  if x == "Patented Case"  else  0.0)
patent_df['nongranted'] = patent_df['status'].apply(lambda x: 0.0  if x == "Patented Case"  else  1.0)

Nice, what about the authors?

Every single patent has at least 2 actors, and in some cases 10.

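all_names = []
all_places = []
for list_item in patent_df['actors'].values.tolist():
    for dictionary in list_item:
        # The original loop body was lost; the key names below are assumptions
        all_names.append(dictionary.get('displayName'))
        all_places.append(dictionary.get('location'))
print("number_of_authors: {}".format(len(set(all_names))))
print("number_of_places: {}".format(len(set(all_places))))
number_of_authors: 620
number_of_places: 375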

Feature engineering

For each patent, we can use word frequencies to create thousands of numeric features, and repeat the process for bigrams (ordered pairs of words), trigrams (ordered triplets of words), and tetragrams (ordered quadruplets).

Each increase in the n-gram length enlarges the feature set: the number of possible word combinations grows exponentially with the length of the n-gram, and with it the complexity of the model.
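To make the growth concrete, here is a small sketch that counts the distinct word n-grams in two made-up sentences as the maximum n-gram length increases. The helper name `count_distinct_ngrams` and the toy documents are illustrative assumptions, not part of the original pipeline.

```python
def count_distinct_ngrams(texts, max_n):
    """Count distinct word n-grams of length 1..max_n across all texts."""
    seen = set()
    for text in texts:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n words across the document
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
    return len(seen)

# Toy documents (made up for illustration, not from the patent data)
docs = [
    "laser scanning system for range detection",
    "range detection system using a rotating laser",
]

for max_n in range(1, 5):
    print("n-grams up to length {}: {}".format(
        max_n, count_distinct_ngrams(docs, max_n)))
```

On a corpus this tiny the growth is bounded by the document lengths; on real patent text, with a vocabulary of thousands of words, the feature count climbs much faster.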

This “bag of words” is far too naive, as it can give too much weight to common but non-informative words. As such, we can use term frequency-inverse document frequency (TF-IDF) to normalize the frequencies. We can also strip out the unnecessary “stop words” and reduce words to their common root (stemming) by using NLTK.
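A rough sketch of the difference, on three made-up sentences: a plain count vectorizer keeps every token, while a TF-IDF vectorizer with a stop-word list drops filler words and gives rarer terms a larger inverse-document-frequency weight. This sketch uses scikit-learn's built-in English stop-word list rather than NLTK's, and skips stemming, purely to stay self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, made up for illustration
docs = [
    "the lidar sensor measures the distance to the target",
    "the camera sensor records the image of the target",
    "the lidar unit rotates",
]

# Plain bag of words: every token is kept, including "the"
bow = CountVectorizer().fit(docs)

# TF-IDF with scikit-learn's built-in English stop-word list
tfidf = TfidfVectorizer(stop_words="english").fit(docs)

raw_vocab = set(bow.vocabulary_)
filtered_vocab = set(tfidf.vocabulary_)
print("Dropped as stop words:", sorted(raw_vocab - filtered_vocab))

# Inverse document frequency per term: rarer terms get larger weights
idf = {term: tfidf.idf_[col] for term, col in tfidf.vocabulary_.items()}
for term in sorted(idf, key=idf.get):
    print("{:<10} {:.3f}".format(term, idf[term]))
```

Here "rotates" appears in one document and "lidar" in two, so "rotates" ends up with the larger IDF weight.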

Latent semantic analysis (truncated singular value decomposition applied to the TF-IDF matrix) was used to further reduce the dimensionality.
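As a minimal sketch of latent semantic analysis, the snippet below projects a TF-IDF matrix onto two components with `TruncatedSVD`. The four documents and the choice of two components are assumptions for illustration; the real experiment sweeps much larger component counts.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration
docs = [
    "lidar sensor measures distance with a pulsed laser",
    "pulsed laser range finder for distance measurement",
    "neural network classifies images from a camera",
    "camera images are labeled by a neural network",
]

# Sparse TF-IDF matrix: one row per document, one column per term
X_tfidf = TfidfVectorizer().fit_transform(docs)

# Project onto 2 latent components (random_state fixed for repeatability)
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X_tfidf)

print("TF-IDF shape:", X_tfidf.shape)
print("LSA shape:   ", X_lsa.shape)
```

`TruncatedSVD` works directly on the sparse matrix, which is why it is the usual choice over PCA for text features.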

After these featurization techniques, we split the data by setting aside 80% of it as our training dataset and leaving aside the remaining 20% for evaluation (our test set).
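The split itself can be sketched as follows on synthetic data of the same size as our dataset (358 rows). The random features and labels are placeholders, and the `stratify` argument is an addition not used in the original code; it keeps the class ratio similar in both splits, which may help with a dataset this small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real features and labels (358 rows)
rng = np.random.RandomState(0)
X_demo = rng.rand(358, 5)
y_demo = (rng.rand(358) < 0.4).astype(float)

# 80/20 split; stratify is an assumption, not part of the original code
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo)

print("Training rows:", X_train.shape[0])
print("Testing rows: ", X_test.shape[0])
```

With 358 rows and `test_size=0.2`, scikit-learn rounds the test set up to 72 rows and leaves 286 for training.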

patent_df["fulltext"] = patent_df['title'] + ' ' + patent_df['summary']

Removing stop words and applying stemmer

# Making sure to download the stopwords first
import nltk
nltk.download("stopwords")
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/
# .to_numpy() replaces .as_matrix(), which was removed from pandas
X = patent_df["fulltext"].to_numpy()
y = patent_df['granted'].to_numpy()
ps = SnowballStemmer('english')
stop = set(stopwords.words('english'))
X_mod = []
for idx, claim in enumerate(X):
    # Drop stop words, then stem what remains
    text = ' '.join([ps.stem(word) for word in claim.split() if word not in stop])
    # Variant without stemming:
    # text = ' '.join([word for word in claim.split() if word not in stop])
    X_mod.append(text)

Plotting number of features by featurization method

Since we have so few instances of the data, and so many possible features, one way of prioritizing which features are useful is the use of word groupings (e.g., bigrams, trigrams, tetragrams, etc.)

tfidf_unigram = TfidfVectorizer(ngram_range=(1, 1))
tfidf_bigram = TfidfVectorizer(ngram_range=(1, 2))
tfidf_trigram = TfidfVectorizer(ngram_range=(1, 3))
tfidf_tetragram = TfidfVectorizer(ngram_range=(1, 4))
X_unigram = tfidf_unigram.fit_transform(X_mod)
X_bigram = tfidf_bigram.fit_transform(X)
X_trigram = tfidf_trigram.fit_transform(X)
X_tetragram = tfidf_tetragram.fit_transform(X)

num_features = [feature_matrix.shape[1] for feature_matrix in [X_unigram, X_bigram, X_trigram, X_tetragram]]
# Plotting number of features
from matplotlib import pyplot as plt
pos = list(range(len(num_features)))
width = 0.3
fig, ax = plt.subplots()
fig.tight_layout()
# Bar call reconstructed; align='edge' keeps the ticks below centered on the bars
ax.bar([p + width for p in pos], num_features, width, align='edge',
        label='Number of features')
ax.set_ylabel('Number of features (10s of thousands)')
ax.set_xlabel('Featurization method')
ax.set_title('Number of features by featurization method', fontsize=12)
plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))
ax.set_xticks([p + 1.5 * width for p in pos])
ax.set_xticklabels(['Unigrams', 'Bigrams', 'Trigrams', 'Tetragrams'])

As we can see, depending on the featurization method, our number of features can grow explosively.

With this many features, we may want to whittle it down to just the most information-dense features. We can do this with a technique like singular value decomposition.

Singular value decomposition: effects of reducing the number of features

import sklearn
import sklearn.model_selection
from sklearn import linear_model, decomposition, datasets
from sklearn import svm
from matplotlib import pyplot as plt

# Simple function to prettify chart axes
# (body reconstructed: hides the top and right spines)
def simpleaxis(ax):
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

def plot_accuracies(train_acc, test_acc, figure, classifier_name):
    pos = list(range(len(train_acc)))
    width = 0.2
    fig, ax = plt.subplots()
    fig.tight_layout()
    # Side-by-side bars (calls reconstructed from the surviving arguments)
    ax.bar([p + width for p in pos], train_acc, width, align='edge',
            label='Training accuracy')
    ax.bar([p + 2 * width for p in pos], test_acc, width, align='edge',
            label = 'Testing accuracy')
    ax.set_xlabel('Number of features')
    ax.set_title('Training and testing accuracy by number of features, \n{0}'.format(classifier_name), fontsize=12)
    ax.set_xticks([p + 2 * width for p in pos])
    plt.legend(bbox_to_anchor=(1, 1.02), loc='upper left', ncol=1)
num_svd_features = [50, 100, 500, 1000, 1500, 2000, 2500]
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X_unigram, y, test_size=0.2, random_state=20)
svc = svm.LinearSVC(C=10)
clfs = [('support vector classification', svc)]
for idx, (name, clf) in enumerate(clfs):
    training_accuracies = []
    testing_accuracies = []
    for n_components in num_svd_features:
        print("Working on components {0}".format(n_components))
        svd = decomposition.TruncatedSVD(n_components=n_components)
        # Fit the SVD on the training split only, then project both splits
        X_train_transformed = svd.fit_transform(X_train)
        X_test_transformed = svd.transform(X_test)
        clf.fit(X_train_transformed, y_train)
        training_accuracies.append(clf.score(X_train_transformed, y_train))
        testing_accuracies.append(clf.score(X_test_transformed, y_test))
    plot_accuracies(training_accuracies, testing_accuracies, idx, name)

Results of Singular Value Decomposition

Primary classification results

Now that we’ve gotten our feature engineering out of the way, we want to test which models are actually useful for this kind of classification. Given the small scope of the data, we can easily turn to non-NN models such as random forests and SVCs.

from sklearn.linear_model import LogisticRegression,RidgeClassifier
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import matplotlib
import csv
# Utility function: split the data, fit the classifier, and report metrics
def train_model(X, y, classifier):
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, test_size=0.2, random_state=20)
    model = classifier.fit(X_train, y_train)
    # Predict once and reuse the predictions for both metrics
    y_pred = model.predict(X_test)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    print("Training accuracy is {0}".format(model.score(X_train, y_train)))
    print("Testing accuracy is {0}".format(model.score(X_test, y_test)))
    print("Precision is {0}".format(precision))
    print("Recall is {0}".format(recall))
    return model, X_test, y_test, X_train, y_train
# Metrics
from sklearn.metrics import roc_curve, auc
# Classifier  
# classification algorithms
classifier_list = [ ("Linear SVC, C=10", SVC(C=10, kernel='linear')),
                    ("Linear SVC, C=1" , SVC(C=1, kernel='linear')),
                    ("Linear SVC, C=0.1", SVC(C=0.1, kernel='linear')),
                    ("Polynomial SVC, C=10", SVC(C=10, kernel='poly')),
                    ("RBF SVC, C=10", SVC(C=10, kernel='rbf')),
                    ("Random forest, 10", RandomForestClassifier(max_features=10, max_depth=10)),
                    ("Random forest, 20", RandomForestClassifier(max_features=10, max_depth=20)),
                    ("Random forest, 30", RandomForestClassifier(max_features=10, max_depth=30)),
                    ("Random forest, 60", RandomForestClassifier(max_features=10, max_depth=60)),
                    ("bagging classifier", BaggingClassifier()),
                    ("gradient boosting", GradientBoostingClassifier()),
                    ("adaboost", AdaBoostClassifier()),
                    ("KNeighborsClassifier", KNeighborsClassifier())]
def get_results(clfs, filename):
    # Open the file once so each classifier adds a row instead of overwriting the file
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Model', 'Training accuracy', 'Testing accuracy'])
        for (name, classifier) in clfs:
            model, X_test, y_test, X_train, y_train = train_model(X_unigram, y, classifier)
            # model, X_test, y_test, X_train, y_train = train_model(X_mod, y, classifier)
            writer.writerow([name, model.score(X_train, y_train), model.score(X_test, y_test)])

get_results(classifier_list, "non_svc_classifier_accuracies.csv")

This function call will compute the accuracies for all our models and write the results to a nice and neat little CSV.

However, classification accuracy alone is not enough for a binary classifier. We also want to take the ROC curves into account:

## ROC Curves
SVC_clfs = [("Linear SVC, C=30", SVC(C=30, kernel='linear')),
            ("Linear SVC, C=27.5", SVC(C=27.5, kernel='linear')),
            ("Linear SVC, C=25", SVC(C=25, kernel='linear')),
            ("Linear SVC, C=22.5", SVC(C=22.5, kernel='linear')),
            ("Linear SVC, C=20", SVC(C=20, kernel='linear')),
            ("Linear SVC, C=15", SVC(C=15, kernel='linear')),
            ("Linear SVC, C=10", SVC(C=10, kernel='linear')),
            ("Linear SVC, C=5", SVC(C=5, kernel='linear')),
            ("Linear SVC, C=1" , SVC(C=1, kernel='linear')),
            ("Linear SVC, C=0.1", SVC(C=0.1, kernel='linear')),
            ("Polynomial SVC, C=10", SVC(C=10, kernel='poly')),
            ("RBF SVC, C=10", SVC(C=10, kernel='rbf'))]
fprs = {}
tprs = {}
for (name, clf) in SVC_clfs:
    model, X_test, y_test, X_train, y_train = train_model(X_unigram, y, clf)
    # Signed distance from the decision boundary serves as the ranking score
    y_score = model.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_score)
    fprs[name] = fpr
    tprs[name] = tpr

ax = plt.subplot()
for name, _ in SVC_clfs:
    tpr = tprs[name]
    fpr = fprs[name]
    plt.plot(fpr, tpr, label=name)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")

ROC curves for the various models
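One way to reduce each ROC curve to a single comparable number is the area under the curve (AUC), using the `auc` function already imported alongside `roc_curve`. The sketch below uses four made-up labels and scores rather than our model outputs.

```python
from sklearn.metrics import roc_curve, auc

# Made-up labels and classifier scores for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Sweep the decision threshold to trace out the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: 0.5 is chance level, 1.0 is a perfect ranking
roc_auc = auc(fpr, tpr)
print("AUC: {:.2f}".format(roc_auc))
```

In this toy case three of the four positive/negative pairs are ranked correctly, so the AUC comes out to 0.75. In the loop above, the same call could be applied to each stored `fpr`/`tpr` pair.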


So now we have accuracies, as well as precision and recall, for our models. Which one performed the best out of all the ones we tried?

| Model | Training accuracy | Testing accuracy | Precision | Recall |
|---|---|---|---|---|
| Linear SVC, C=30 | 0.9755 | 0.7916 | 0.8181 | 0.6207 |
| Linear SVC, C=27.5 | 0.9720 | 0.7916 | 0.8181 | 0.6207 |
| Linear SVC, C=25 | 0.9720 | 0.7916 | 0.8181 | 0.6207 |
| Linear SVC, C=22.5 | 0.9720 | 0.7916 | 0.7916 | 0.6552 |
| Linear SVC, C=20 | 0.9720 | 0.8055 | 0.8 | 0.6897 |
| Linear SVC, C=15 | 0.9720 | 0.8055 | 0.8 | 0.6897 |
| Linear SVC, C=10 | 0.9685 | 0.7777 | 0.7826 | 0.6206 |
| Linear SVC, C=1 | 0.9265 | 0.7638 | 0.875 | 0.4827 |
| Linear SVC, C=0.1 | 0.5979 | 0.5972 | 0.0 | 0.0 |
| Polynomial SVC, C=10 | 0.5979 | 0.5972 | 0.0 | 0.0 |
| RBF SVC, C=10 | 0.5979 | 0.5972 | 0.0 | 0.0 |
| Random forest, 10 | 0.7902 | 0.6806 | 0.75 | 0.3103 |
| Random forest, 20 | 0.8286 | 0.6806 | 0.875 | 0.2414 |
| Random forest, 30 | 0.8951 | 0.6806 | 0.8 | 0.2758 |
| Random forest, 60 | 0.9265 | 0.7222 | 0.7368 | 0.4827 |
| bagging classifier | 0.9755 | 0.7638 | 0.8333 | 0.5172 |
| gradient boosting | 0.9790 | 0.7777 | 0.8421 | 0.5517 |

It may surprise you, but our best model was a simple Support Vector Classifier with a linear kernel.

But how did the classifiers get to these conclusions? What were the most impactful words from the perspectives of the classifiers?

best_clf = SVC(C=10, kernel='linear', probability=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X_mod, y, test_size=0.2, random_state=20)
tfidf = TfidfVectorizer()
# Fit the vocabulary on the training text only, then reuse it for the test text
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
model = best_clf.fit(X_train, y_train)
#vals = df.invalidated.value_counts()
vals = patent_df['nongranted'].value_counts()
# get_feature_names_out() replaces the removed get_feature_names()
names = tfidf.get_feature_names_out()
# For a linear SVC trained on sparse input, coef_ is a sparse 1 x n_features matrix
coeffs = model.coef_
nonzeros = coeffs.nonzero()[1]
# Pair each nonzero coefficient with its word stem
out = []
for idx in nonzeros:
    out.append((coeffs[0, idx], names[idx]))
# Sort once so labels and coefficients stay aligned
sorted_pairs = sorted(out)
sorted_labels = [name for (coef, name) in sorted_pairs]
sorted_coeffs = [coef for (coef, name) in sorted_pairs]

# Plotting
import numpy as np
# Nine most positive and nine most negative coefficients.
# The model was trained on the 'granted' label, so positive weights favor a grant.
x_vals = sorted_coeffs[-9:] + sorted_coeffs[0:9]
least_likely_denied = sorted_labels[-9:]
most_likely_denied = sorted_labels[0:9]
pos = np.arange(len(x_vals)) # the bar centers on the y axis
f, axarr = plt.subplots(2, sharex=True)
plt.suptitle('Relative influences on probability of denial \n (top and bottom word stems from TFIDF)')
for ax in axarr:
    # Loop body was lost in the original; assumed to be the axis-prettifying helper
    simpleaxis(ax)
x_1 = sorted_coeffs[-9:]
x_1_pos = np.arange(len(x_1))
a = axarr[0]
a.barh(x_1_pos, x_1)
# Customize minor tick labels
a.set_yticks(x_1_pos + 0.5, minor=True)
a.set_yticklabels(least_likely_denied, minor=True)
a.tick_params(axis='both', which='both',length=0)
a = axarr[1]
x_2 = sorted_coeffs[0:9]
x_2_pos = np.arange(len(x_2))
a.barh(x_2_pos, x_2)
# Customize minor tick labels
a.set_yticks(x_2_pos + 0.5, minor=True)
a.set_yticklabels(most_likely_denied, minor=True)
a.tick_params(axis='y', which='both',length=0)
plt.xlabel('Regression coefficient \n (positive means word stem lowers likelihood of denial, \nnegative means stem raises it)')

Most influential words


1. How did our best model end up performing?

Not only did this model have the top F1 score and accuracy on the test dataset, it also had some of the highest precision and recall scores. In terms of comparison to the state of the art, this is comparable to the 0.80 micro-averaged F1 score achieved by SVM ensembles using a combination of words and characters as features in Benites et al. (2018).

2. What methodology did we use to evaluate our model?

Evaluation and model search were performed on data containing words and characters, represented as encoded features. For labels, each patent case was given a 1 if the patent was granted and a 0 otherwise.

For the model selection process, a variety of SVM, tree-based, ensemble, and nearest-neighbor algorithms were evaluated. These were chosen instead of deep learning or neural-network-based methods because, with fewer than 400 instances in the data, there was unlikely to be enough data to train a robust neural network.

The models were compared according to several metrics: Training Accuracy, Testing Accuracy, Precision, and Recall.

Across the SVMs, random forest models, bagging classifiers, boosting classifiers, and nearest-neighbor classifiers, support vector classifiers with linear kernels had the best output. The SVC parameters were chosen according to a simplified grid search.
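For reference, a more systematic version of that search could look like the sketch below, which uses scikit-learn's `GridSearchCV`. This is not the procedure used above: the synthetic data from `make_classification`, the 5-fold cross-validation, and the F1 scoring are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and granted/non-granted labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, random_state=0)

# Candidate regularization strengths, echoing the range tried in the table
param_grid = {"C": [0.1, 1, 10, 20, 30]}

# 5-fold cross-validated search over C, scored by F1 (both are assumptions)
search = GridSearchCV(SVC(kernel="linear"), param_grid, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["C"])
print("Best cross-validated F1: {:.3f}".format(search.best_score_))
```

Cross-validation is worth considering with only a few hundred examples, since a single 80/20 split leaves just 72 test cases and the ranking between nearby values of C can easily flip.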

3. Which metric did we choose to optimize? Why did we choose this metric over others?

The F1 score was chosen as the primary evaluation metric, alongside precision and recall. Given that the patent approval prediction was reduced to a supervised binary classification problem, precision and recall seemed especially important: the models in the table that never predict a grant (precision and recall of 0.0) still reach about 60% testing accuracy, so accuracy alone can be misleading.
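Since the results table lists precision and recall but not F1, here is a small sketch that derives F1 (the harmonic mean of the two) for a few rows. The helper name `f1_from` is made up for this example; the precision and recall values are copied from the table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the results table
rows = {
    "Linear SVC, C=20": (0.8, 0.6897),
    "Linear SVC, C=10": (0.7826, 0.6206),
    "gradient boosting": (0.8421, 0.5517),
    "Linear SVC, C=0.1": (0.0, 0.0),
}

f1_scores = {name: f1_from(p, r) for name, (p, r) in rows.items()}
for name, score in sorted(f1_scores.items(), key=lambda kv: -kv[1]):
    print("{:<20} F1 = {:.4f}".format(name, score))
```

Among these rows, the linear SVC with C=20 comes out on top, at an F1 of roughly 0.74.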

Cited as:

  title   = "Building a Better Patent Classifier",
  author  = "McAteer, Matthew",
  journal = "",
  year    = "2019",
  url     = ""

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
