tf-idf vectorizer fit and transform

I experienced a bit of a breakthrough messing around with scikit-learn’s tf-idf vectorizer under the guidance of my Springboard mentor, Hobson Lane. In this example I am working with data that I scraped from the Mercatus Center website. I originally scraped a corpus of over 5,000 documents containing all of the Center’s published output (excluding charts, and including multi-label duplicates). For this example, however, I worked with a smaller subset of 27 documents to avoid waiting 10 minutes for something to finish running.

Repository in case you are interested in running this yourself.

Setting up

The first cell of my Jupyter notebook handles imports and sets matplotlib to inline mode:

from sklearn.datasets import load_files
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline

Then, I used the load_files function to load my multi-label corpus, which is just files placed in directories whose names are the label names:

trainer = load_files('data_short')
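
For reference, load_files expects a layout where each label is a folder and each document is a file inside it. Mine looked roughly like this (the folder and file names here are invented for illustration):

data_short/
    finance/
        smith--some-paper.txt
        jones--another-paper.txt
    technology/
        lee--a-tech-brief.txt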

You can use the following two attributes of trainer to get an idea of what you have:

print(len(trainer.target_names))
print(len(trainer.filenames))
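
It can also help to peek at trainer.target, which stores each file’s label as an integer code that indexes into trainer.target_names (the exact numbers you see will depend on your own folders):

print(trainer.target[:5])                        # e.g. [0 2 1 1 0] -- integer label codes
print(trainer.target_names[trainer.target[0]])   # the label name of the first file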

Finally, we should initialize a dataframe and add a few useful columns to it.

df = pd.DataFrame()
df['filename'] = trainer.filenames
df['text'] = [open(f, 'r').read() for f in trainer.filenames]
df['label'] = [trainer.target_names[x] for x in trainer.target]
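# grab the author from the filename: the part before the first '__' or '--', with the directory path stripped off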
df['author'] = [x.split('__')[0].split('--')[0].split('/')[-1] for x in df['filename']]

Fitting the tf-idf vectorizer to your data

Term frequency – inverse document frequency, or tf-idf, is a neat little statistic that tells you how important a particular word is to a document, relative to all the other documents in a corpus. As the name suggests, the formula is roughly the number of times the word appears in the document, scaled down by how many documents in the corpus contain it (the real math is here but you don’t need it). A word like “and” will therefore have a very low tf-idf score everywhere, while a word like “finance” will have a very high tf-idf score in a finance article if the rest of the corpus is about cooking. This is nice, as it gives us a quantified way of saying that there is something very special about this one article’s topic, compared with the rest of the articles!
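
If you’re curious about the mechanics, here is a rough back-of-the-envelope version of the calculation. This is just a sketch of the idea; scikit-learn’s actual default adds smoothing and length normalization on top of this:

import math

def tfidf_score(word, doc, corpus):
    tf = doc.count(word)                          # how often the word shows up in this document
    df = sum(1 for d in corpus if word in d)      # how many documents contain the word at all
    idf = math.log(len(corpus) / df)              # words that are rare across the corpus get a bigger boost
    return tf * idf

corpus = [['finance', 'stocks', 'and', 'bonds'],
          ['cooking', 'and', 'recipes'],
          ['more', 'cooking', 'and', 'food']]
print(tfidf_score('finance', corpus[0], corpus))  # high score: only this document mentions it
print(tfidf_score('and', corpus[0], corpus))      # score of 0: every document contains it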

The first order of business is to initialize a tf-idf vectorizer, which we can then use to attack our corpus.

tfidf = TfidfVectorizer(min_df=3, max_df=0.9)

min_df indicates the minimum number of documents a word must appear in to be counted at all; this is a way to avoid counting proper nouns and other rare words that do not tell us much about a document’s topic. max_df indicates the maximum proportion of documents a word can appear in before it is excluded; this is how we avoid counting words like “and”, which would slow down our processing times and contribute little to understanding individual documents.

Now we are ready to “fit” the vectorizer to our list of document texts, as stored in df['text']. Fitting is the process of building the vocabulary: every word in the corpus is assigned a column index, and these word-to-index pairs are stored in a Python dictionary (scikit-learn also learns the idf weight for each word at this step).

tfidf = tfidf.fit(df['text'])

This was a bit of a breakthrough for me, as I never understood exactly what “fit” meant, despite probably reading a definition a dozen times. You can see your new dictionary by typing in the following:

tfidf.vocabulary_
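
The full dictionary can be quite large, so if you only want to spot-check a few entries, something like this works (the indices you see will differ from mine):

for word in list(tfidf.vocabulary_)[:5]:
    print(word, tfidf.vocabulary_[word])    # the word and the column index it was assigned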

Pretty cool stuff!

Using tf-idf transform

My goal here is simply educational, so I haven’t done anything too fancy yet. On the advice of my Springboard mentor, I created some “topic” lists with key terms that could define each topic, then calculated each document’s cumulative tf-idf score for those topic words. You can then put these lists of scores in the dataframe we created and do some cool stuff!

First, creating the topic lists:

finance_words = ['finance', 'economics', 'financial', 'stocks', 'dollars', 'equity', 'bond', 'bonds', 'commodity']
tech_words = ['technology', 'tech', 'machine', 'computer', 'software', 'internet', 'hardware']

These are pretty good guesses of what might be in the average Mercatus paper. I completely made these lists up, so don’t take this as a literal topic mapping!

Next, we need to loop through a transform() of the text column in our dataframe, df['text']. Each x in the tfidf.transform(df['text']) object below represents a document in our corpus, represented as a sparse row matrix whose columns correspond to the word indices in tfidf.vocabulary_ and whose values are tf-idf scores. In this way, each x holds a tf-idf score for every word in a document. Pretty cool!

topic = []
for x in tfidf.transform(df['text']):
    # sum the tf-idf scores of any finance words that made it into the vocabulary
    fin = np.sum([x[0, tfidf.vocabulary_[w]] for w in tfidf.vocabulary_ if w in finance_words])
    # and the same for the tech words
    tec = np.sum([x[0, tfidf.vocabulary_[w]] for w in tfidf.vocabulary_ if w in tech_words])
    topic.append([fin, tec])

df_topic = pd.DataFrame(topic, columns=['finance', 'tech'])

The body of this for loop defines a fin and a tec variable, each of which is the sum of tf-idf scores for the finance and tech words that appear in that particular document. We then take these two numbers and append them as a pair to the topic list that we initialized at the beginning of this code block. Once the for loop has run through every document in our transform, we can make a dataframe with this list of pairs as two columns, 'finance' and 'tech'.
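
As an aside, the same numbers can be computed without the inner loops over the whole vocabulary by looking up the topic words’ column indices once and summing those columns directly. This is just an optional variation on the loop above, not something the rest of the post depends on:

# column indices for the topic words that actually made it into the vocabulary
fin_idx = [tfidf.vocabulary_[w] for w in finance_words if w in tfidf.vocabulary_]
tec_idx = [tfidf.vocabulary_[w] for w in tech_words if w in tfidf.vocabulary_]

tf_matrix = tfidf.transform(df['text'])
df_topic = pd.DataFrame({
    'finance': np.asarray(tf_matrix[:, fin_idx].sum(axis=1)).ravel(),
    'tech': np.asarray(tf_matrix[:, tec_idx].sum(axis=1)).ravel(),
})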

Plotting the results

What we have now is a dataframe called df_topic that has two columns with some very cool data: scores for how related each document is to the topics of finance and technology, as we defined them in our finance_words and tech_words lists above.

This is perfect for a scatterplot! As usual, pandas makes the task very simple:

df_topic.plot.scatter(x='finance', y='tech', xlim=(0,0.31), ylim=(0,0.31))

Here we had to define what the x-axis and y-axis should be, along with some limits on those axes. The result is pretty cool:

As we can see, every paper seems to have at least a little something to say about technology, but many papers have absolutely nothing to do with finance. However, there is definitely a skew towards finance-related papers, with one paper having a large focus on the topic.

Conclusion

Obviously this is only modestly useful in application, but the exercise gave me a sense of how the nuts and bolts of tf-idf fits and transforms work. For more insight, I suggest creating a transform() and storing it in a variable so that you can play around with the results. Something like this:

tf_transform = tfidf.transform(df['text'])

will allow you to peek into the x’s that we iterated through earlier:

for x in tf_transform[26]:
    print(x)
    print(type(x))
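
One thing I found handy is mapping a row’s nonzero entries back to actual words. This is a small sketch I put together; the reverse lookup of tfidf.vocabulary_ is something I’m adding here, not something used earlier in the post:

row = tf_transform[26]                                       # the last of the 27 documents
index_to_word = {i: w for w, i in tfidf.vocabulary_.items()}
for idx, score in zip(row.indices, row.data):                # nonzero columns and their tf-idf scores
    print(index_to_word[idx], score)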

Good luck and I hope you enjoy playing around with this stuff as much as I do!