Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked


Learn what you always wished you knew about Google’s algorithms

Google, if you’re reading this, the cat’s already out of the bag. 😉 Rolls up sleeves. Let’s dive in.

Internal documentation for Google Search’s Content Warehouse API has leaked, and there’s a treasure trove of information hidden within. This leak gives us an unprecedented look into the inner workings of Google’s ranking systems. Here’s everything you need to know about it.

The Leak: A Brief Overview

Recently, internal documentation for Google’s microservices, similar in form to the public API references Google Cloud Platform offers, was accidentally published to a code repository. The material included the internal version of the now-deprecated Document AI Warehouse, exposing details meant for internal eyes only. An external automated service captured the documentation before the mistake was corrected.

For liability reasons, I won’t link directly to the leaked documentation. However, since it was published under the Apache 2.0 license, anyone who found it has the right to use, modify, and distribute it. I’ve reviewed the API reference docs and contextualized them with past Google leaks and testimonies, combined with my extensive patent and whitepaper research for my upcoming ebook, The Secrets of SEO.

Key Takeaways from the Leak

The leaked documents do not detail Google’s scoring functions but provide a wealth of information about the data stored for content, links, and user interactions. These insights offer a clearer picture of what factors are considered in Google’s algorithms, which can help SEO practitioners focus their efforts more effectively.

Caveats to Keep in Mind

Before diving into the juicy details, here are a few caveats:

Limited Time and Context: I have had limited time to analyze the documents and have only scratched the surface. As with the Yandex leak last year, the information is incomplete.
No Scoring Functions: We don’t have information on how features are weighted in scoring functions. Some features may be deprecated or used differently than we assume.
Current Information: As of March 2024, the information seems current, but Google could have made changes since then.
Correlation Is Not Causation: While the information is insightful, we must avoid jumping to conclusions without thorough analysis.

The 14,000 Ranking Features

There are 2,596 modules with 14,014 attributes (features) documented in the API. These modules cover components of YouTube, Assistant, Books, video search, links, web documents, crawl infrastructure, an internal calendar system, and the People API.

Key Features and Their Impact

Site Authority: Although Google has publicly denied using a domain authority metric, the documentation contains a feature called “siteAuthority,” indicating that Google does calculate some form of site-wide authority.
Clicks and Rankings: Contrary to Google’s public statements, clicks do influence rankings through systems like NavBoost and Glue, which consider user behavior data.
Sandbox: Google’s documentation mentions a “hostAge” attribute used to sandbox fresh spam, confirming the existence of a sandbox.
Chrome Data: Despite Google’s denials, data from Chrome is used in ranking algorithms.

Google’s Ranking Systems Architecture

Conceptually, Google’s ranking systems consist of numerous microservices where features are preprocessed and made available at runtime. Based on the documentation, over a hundred different ranking systems might be in play.

Notable Modules and Their Functions

Here are some key systems identified in the documentation:

Trawler: Web crawling system managing crawl queues and rates.
Alexandria: Core indexing system.
HtmlrenderWebkitHeadless: Renders JavaScript pages.
Mustang: Primary scoring, ranking, and serving system.
Ascorer: Main ranking algorithm.
NavBoost: Re-ranking system based on user clicks.
FreshnessTwiddler: Re-ranking system for freshness.

Twiddlers: Re-Ranking Functions

Twiddlers are functions that adjust rankings after the initial algorithm processes results. They operate similarly to filters in WordPress, modifying information retrieval scores or changing the ranking of a document right before presenting it to the user. Notable twiddlers include:

NavBoost
QualityBoost
RealTimeBoost
WebImageBoost
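
To make the idea concrete, here is a minimal Python sketch of a twiddler chain: each function takes a scored result and returns an adjusted copy, and the list is re-sorted once every adjustment has been applied. Everything in it is an assumption for illustration. The Result class, the click_boost and freshness_boost functions, the thresholds, and the multipliers are hypothetical and do not come from the leaked documentation, which does not describe scoring functions.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Result:
    url: str
    score: float       # initial retrieval score (hypothetical scale)
    clicks: int = 0    # hypothetical click count
    age_days: int = 0  # hypothetical document age in days


# A twiddler takes one result and returns an adjusted copy of it.
Twiddler = Callable[[Result], Result]


def click_boost(r: Result) -> Result:
    # Hypothetical rule: heavily clicked results get a 10% lift.
    factor = 1.1 if r.clicks >= 100 else 1.0
    return replace(r, score=r.score * factor)


def freshness_boost(r: Result) -> Result:
    # Hypothetical rule: documents under a week old get a 20% lift.
    factor = 1.2 if r.age_days <= 7 else 1.0
    return replace(r, score=r.score * factor)


def rerank(results: List[Result], twiddlers: List[Twiddler]) -> List[Result]:
    """Apply every twiddler to every result, then sort by adjusted score."""
    adjusted = []
    for r in results:
        for twiddle in twiddlers:
            r = twiddle(r)
        adjusted.append(r)
    return sorted(adjusted, key=lambda r: r.score, reverse=True)


results = [
    Result("a.example", 1.0, clicks=5, age_days=400),
    Result("b.example", 0.9, clicks=500, age_days=3),
]
reranked = rerank(results, [click_boost, freshness_boost])
print([r.url for r in reranked])
```

In this toy run, b.example starts with the lower score but ends up first because both hypothetical boosts apply to it. That is the general pattern the documentation suggests: the initial ranking is not the final word.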

Key Revelations for SEO Practitioners

Panda Algorithm Insights

Panda is a scoring modifier based on user behavior and external links, applied at the domain, subdomain, or subdirectory level. To recover from Panda-related penalties, focus on driving more qualified traffic and earning diverse links.

Authors and E-E-A-T

Google explicitly stores authorship information, associating authors with documents. This strengthens the importance of E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) in content.

Demotions and Penalties

The documentation lists various algorithmic demotions, including:

Anchor Mismatch: Links whose anchor text does not match the target site are demoted.
SERP Demotion: Based on user dissatisfaction signals from the SERP.
Nav Demotion: Likely related to poor navigation practices.
Exact Match Domains Demotion: Demotion of exact match domains.
Product Review Demotion: Related to the product reviews update.
Location Demotions: Demotions for “global” pages not associated with a specific location.

Links and Their Importance

Links remain crucial in Google’s algorithms, with several insights from the leak:

Indexing Tier Impact: Links from higher-tier pages are more valuable.
Link Spam Velocity: Google can measure link velocity to identify spam.
Homepage PageRank: Homepage PageRank is considered for all pages.
Homepage Trust: Trust level of the homepage influences link value.
Font Size: Font size of terms and links matters in ranking.
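
As a rough illustration of what measuring link velocity could look like, the Python sketch below counts newly discovered links per day and flags days that far exceed the typical rate. This is an assumption-laden toy, not Google’s method: the spiky_days function, the median baseline, and the multiplier of 5 are all hypothetical, and the leak does not say how any threshold is set.

```python
from collections import Counter
from datetime import date
from statistics import median
from typing import Iterable, List


def spiky_days(first_seen: Iterable[date], multiplier: float = 5.0) -> List[date]:
    """Return days whose new-link count exceeds multiplier x the median day.

    first_seen holds one date per link (the day the link was first seen).
    Days with zero new links are absent from the input and therefore do
    not pull the baseline down; that is a simplification.
    """
    counts = Counter(first_seen)
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(day for day, n in counts.items() if n > multiplier * baseline)


# Hypothetical data: a steady trickle of links, then a one-day burst.
dates = (
    [date(2024, 3, 1)] * 2
    + [date(2024, 3, 2)] * 3
    + [date(2024, 3, 3)] * 40
    + [date(2024, 3, 4)] * 2
)
flagged = spiky_days(dates)
print(flagged)
```

You can run the same kind of check against your own backlink exports to spot bursts that might look unnatural.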

Content and Metadata

Document Truncation: Google counts the number of tokens, and documents are truncated if they exceed a certain limit.
Short Content: Scored for originality.
Page Titles: Still measured against queries for relevance.
Dates: Freshness of content is crucial, with multiple date-related attributes tracked.
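
The truncation point above is easy to picture with a small Python sketch. It assumes whitespace tokenization and a caller-supplied limit. Both are placeholders, since the leak names neither the tokenizer nor the actual limit, and truncate_tokens is a hypothetical helper.

```python
from typing import Tuple


def truncate_tokens(text: str, max_tokens: int) -> Tuple[str, int, bool]:
    """Keep only the first max_tokens whitespace-separated tokens.

    Returns the kept text, the original token count, and whether
    anything was cut off.
    """
    tokens = text.split()
    kept = tokens[:max_tokens]
    return " ".join(kept), len(tokens), len(tokens) > max_tokens


kept, total, was_truncated = truncate_tokens("one two three four five", 3)
print(kept, total, was_truncated)
```

The practical takeaway is the same whatever the real limit is: if anything beyond the cutoff is ignored, the most important content belongs early in the document.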

Domain Registration and Small Sites

Domain Registration: Information is stored and may affect sandboxing.
Small Personal Sites: Identified and potentially treated differently.

Open Questions and Further Research

Helpful Content Update: Is it known as Baby Panda?
NSR: Does it stand for Neural Semantic Retrieval?

Actionable Advice

Apologize to Rand Fishkin: Rand’s insights about click experiments and domain authority have been validated.
Create Great Content and Promote It: Focus on quality content and promotion for the best impact.
Correlation Studies: Revive vertical-specific correlation studies based on new insights.
Test and Learn: Continuously experiment to see what works for your website.
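
For the correlation studies suggested above, a common starting point is Spearman’s rank correlation between a page-level feature and ranking position. The sketch below implements it in plain Python. The sample data is invented, and the feature used (referring domains) is only an example of something you might test, not a signal named in the leak.

```python
from statistics import mean
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input has no variation.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented example: SERP positions and referring-domain counts for five URLs.
positions = [1, 2, 3, 4, 5]
referring_domains = [120, 95, 60, 40, 10]
rho = spearman(positions, referring_domains)
print(round(rho, 3))
```

A strongly negative value here would mean that pages with more referring domains tend to hold better (numerically lower) positions in this sample. As the caveats earlier note, that is a correlation, not proof of causation.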

Conclusion

This leak provides invaluable insights that validate many long-held SEO practices and beliefs. While the exact workings of Google’s algorithms remain undisclosed, understanding these factors can help you refine your SEO strategies.

Sanam Munshi

Sanam Munshi, the driving force behind Conquerra Digital, is more than just our founder – he’s a digital marketing maestro with 14 years of diverse experience. From playing key roles at top US agencies and SaaS companies to co-founding the Australian digital powerhouse Skyward Digital & Pracxcel, Sanam’s career is a journey of breakthrough achievements. His expertise lies in guiding teams to shatter performance records in the most competitive online arenas.

Sanam embraces a simple yet profound philosophy: Embrace change, because it’s the only constant. His positive outlook and wide-ranging experience make him a master not just in business, but in building lasting relationships across industries. At Collab Lab, Sanam does more than manage operations – he nurtures growth, cultivates client partnerships, and steers us towards uncharted territories of success.