Since you're reading this post, you may already be familiar with TF*IDF — at least to some extent. There are plenty of well-informed articles and even some SEO tools out there that are worth your attention. But what's more interesting is the research and the math, and understanding how these fit together.
I want to show you what we've built and how we're using it. But this post isn't just to showcase a tool — it's to start a conversation around optimizing content for SEO by focusing on topic relevance.
Considerations Beyond Information Architecture
I've found that topic modeling and optimizing content to speak to specific concepts holds more sway with Google these days than even URL structure and information architecture in some respects. Google's approach to ranking pages based on topical relevance and intent has fundamentally changed.
The libraries for TF*IDF, semantic NLP, and even Word2Vec are not new at this point — though still pretty cutting edge when put into practice from an SEO perspective. Most of these beautiful database-driven libraries are not only accessible but freely available for us to use, process, and build upon.
What is TF*IDF?
TF*IDF stands for Term Frequency × Inverse Document Frequency.
It's a numerical statistic used in information retrieval to represent how important a specific word or phrase is to a given document. Wikipedia defines it as:
The TF-IDF value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
TF*IDF is most often used as part of latent semantic indexing (LSI), a natural language processing (NLP) technique that allows systems to rank documents by relevance against a specific term or topic.
The goal is to make sense of a population of unstructured content to score what it's about and how strongly it represents that topic versus other documents in the sample population.
The Three Factors
TF*IDF places weights on terms in a document as determined by three factors:
Term Frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than one with just one mention.
Inverse Document Frequency
How often does the term appear across all documents in the collection? The more often, the lower the weight. Common terms like "and" or "the" contribute little to relevance as they appear everywhere, while uncommon terms like "hreflang" or "attribution" help zoom in on the most relevant documents.
Field Length
How long is the field? The shorter the field, the higher the weight. A term appearing in a title field carries more relevance signal than the same term buried in a long body section.
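To make those three factors concrete, here's a minimal Python sketch. The exact formulas vary by implementation; the field-length norm of 1/sqrt(length) below follows Lucene's convention and is an assumption on my part, not something the classic TF*IDF definition dictates:

```python
import math

def term_weight(term, field_tokens, corpus):
    """Score one term in one field using the three factors above."""
    # 1. Term frequency: more mentions in this field -> higher weight
    tf = field_tokens.count(term) / len(field_tokens)
    # 2. Inverse document frequency: rarer across the corpus -> higher weight
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log10(len(corpus) / max(docs_with_term, 1))
    # 3. Field-length norm: shorter fields -> higher weight
    #    (1 / sqrt(length) is Lucene's convention -- an assumption here)
    norm = 1 / math.sqrt(len(field_tokens))
    return tf * idf * norm
```

Notice how each factor pushes the weight in the direction described above: frequent in the field, rare in the corpus, and found in a short field all raise the score.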
TF*IDF Example
Consider a document containing 100 words wherein the word "SEO" appears 3 times.
- Term frequency: 3 / 100 = 0.03
- Assume 10 million documents and "SEO" appears in 1,000 of them.
- Inverse document frequency: log(10,000,000 / 1,000) = 4 (using log base 10)
- TF*IDF weight: 0.03 × 4 = 0.12
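You can verify the arithmetic in a few lines of Python:

```python
import math

word_count = 100        # words in the document
term_count = 3          # occurrences of "SEO"
total_docs = 10_000_000
docs_with_term = 1_000

tf = term_count / word_count                   # 0.03
idf = math.log10(total_docs / docs_with_term)  # 4.0
weight = tf * idf                              # 0.12
```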
Understanding N-Grams
An N-Gram is a contiguous sequence of co-occurring words within a given text, computed by moving a fixed-size window one word forward at a time.
For TF*IDF calculations, terms are usually calculated as:
- Unigrams — single-word terms
- Bigrams — 2-word terms
- Trigrams — 3-word terms
For example, the sentence "The SEO needs more links to rank the page" produces these bigrams:
- The SEO
- SEO needs
- needs more
- more links
- links to
- to rank
- rank the
- the page
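That "moving one word forward" is just a sliding window. A quick Python sketch that reproduces the bigram list above, and works for unigrams and trigrams too:

```python
def ngrams(text, n):
    """Return all n-word windows, advancing one word at a time."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The SEO needs more links to rank the page"
bigrams = ngrams(sentence, 2)
# ['The SEO', 'SEO needs', 'needs more', 'more links',
#  'links to', 'to rank', 'rank the', 'the page']
```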
When it comes to natural language processing for SEO, topics seem to be best represented by bigrams and trigrams, so understanding the distinction matters.
Why Are TF*IDF and LSI Important for SEO?
An over-simplified answer: these techniques are among the building blocks of search engines. They're part of how Google scores your pages and associates them with keywords related to the document's content.
Google has billions of pages to crawl and score for relevance on topics surrounding a user's submitted query. Not all documents will contain all the terms relevant to the query, and some terms are more important than others. The relevance score of a document depends, at least in part, on the weight of each term that appears in it.
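That paragraph is the vector-space retrieval model in miniature: score each document as the sum of the TF*IDF weights of the query terms it contains, then rank by score. A toy Python sketch, illustrative only and certainly not how Google actually ranks:

```python
import math

def score(query, doc, corpus):
    """Relevance of doc to query: sum of TF*IDF weights of query terms."""
    total = 0.0
    for term in query.split():
        tf = doc.count(term) / len(doc)
        docs_with_term = sum(1 for d in corpus if term in d)
        if docs_with_term:  # terms absent from the corpus contribute nothing
            total += tf * math.log10(len(corpus) / docs_with_term)
    return total

corpus = [
    "seo tools for tf idf analysis".split(),
    "tf idf weighting in search engines".split(),
    "recipes for sourdough bread".split(),
]
ranked = sorted(corpus, key=lambda d: score("tf idf seo", d, corpus), reverse=True)
```

Documents missing some query terms still get scored on the terms they do contain, which is exactly the point made above.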
What To Do With This Data
We built a tool that analyzes the term population and frequency of the top 20 organic ranking URLs in Google, calculating the TF*IDF score for each term. The purpose is to see how your current content uses the terms that appear on pages Google has deemed worthy of a top ranking for the same target keyword.
From here you can adjust your content to include more of the terms Google may be expecting to see, in the frequency it expects to see them.
If you haven't yet built the page, that's fine too. Just run the report for a target keyword to see what topics you should be addressing in your content before you write a single word.
Ideally you'd have an editor with a live view so you could rework your content to better build out the focus of topics and terms Google is expecting to see. That's exactly the direction this kind of tooling is heading.
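The tool itself isn't public, but the comparison it performs can be sketched. The function below is a hypothetical illustration, not the actual implementation: average how often each term appears across the top-ranking pages, then flag the terms your page under-uses relative to that average.

```python
from collections import Counter

def term_gaps(your_page, top_pages):
    """Terms the top-ranking pages use more often than your page does.

    your_page: list of tokens; top_pages: list of token lists.
    Returns {term: (your_count, average_count_on_top_pages)} for
    every term you use less often than the competitors' average.
    """
    totals = Counter()
    for page in top_pages:
        totals.update(page)
    gaps = {}
    for term, total in totals.items():
        expected = total / len(top_pages)
        have = your_page.count(term)
        if have < expected:
            gaps[term] = (have, expected)
    return gaps
```

In practice you'd tokenize rendered page content, drop stopwords, and extend the counts to bigrams and trigrams as discussed above, but the core idea is this simple: see which terms Google's top results use that you don't, and in what frequency.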