Google Search API Leak

Google Internal Search Leak: Source Shares Thousands of Google API Documents

On Friday, May 24th 2024, Rand Fishkin, the co-founder of SparkToro and Snackbar Studio, was invited to a Zoom call by an anonymous individual who claimed to have access to leaked API documentation from inside Google’s Search division. This article was informed by Rand Fishkin’s blog post; you can read it here. 

 

The Google Internal Search Leak

The email sent from the anonymous source to Rand Fishkin claims that ex-Google employees and many others have confirmed the documents are authentic. The documents’ claims directly contradict public statements made by Google over the years. 

These contradictions include the use of click-centric user signals, the denial that subdomains are considered separately in rankings, the denial of a sandbox for newer websites, the denial that a domain’s age is collected or considered, and more.

Rand spoke to the source and, to verify their legitimacy, discussed mutual ex-colleagues and events that both parties had previously attended before the contact showed him 2,500 pages of API documentation.

These pages contained 12,014 attributes allegedly from Google’s internal Content API Warehouse. Almost every Google team hosts documentation like this in an API warehouse; it explains the various API attributes and modules to help those working on a project familiarise themselves with the available data elements.

The leak matches other published GitHub entries and public Google Cloud API documentation, which use the same notation style, formatting, and process/module/feature names and references. Think of it as instructions for members of Google’s search engine team. 

The commit history shows the code was uploaded to GitHub on March 27th, 2024, and was not removed until May 7th, 2024. The documents in question do not show the weight of particular elements in the search ranking algorithm, nor do they prove which elements are used in the ranking systems. 

“Some of the claims made by the anonymous source were: 

  • In their early years, Google’s search team recognised a need for full clickstream data (every URL visited by a browser) for many web users to improve their search engine’s result quality.
  • A system called “NavBoost” initially gathered data from Google’s Toolbar PageRank, and the desire for more clickstream data served as the key motivation for the creation of the Chrome browser (launched in 2008).
  • NavBoost uses the number of searches for a given keyword to identify trending search demand, the number of clicks on a search result and long clicks versus short clicks.
  • Google utilises cookie history, logged-in Chrome data, and pattern detection (referred to in the leak as “unsquashed” clicks versus “squashed” clicks) as effective means for fighting manual & automated click spam.
  • NavBoost also scores queries for user intent. For example, certain thresholds of attention and clicks on videos or images will trigger video or image features for that query and related NavBoost-associated queries.
  • Google examines clicks and engagement on searches both during and after the main query (referred to as a “NavBoost query”), boosting subsequent clicked search results for the previous keyword. 
  • NavBoost’s data is used at the host level to evaluate a site’s overall quality. This evaluation can result in a boost or a demotion.
  • Other minor factors such as penalties for domain names that exactly match unbranded search queries (e.g. mens-luxury-watches.com or newcastle-homes-for-sale.net), a newer “BabyPanda” score, and spam signals are also considered during the quality evaluation process.
  • NavBoost geo-fences click data, considering country and state/province levels and mobile versus desktop usage. However, if Google lacks data for certain regions or user agents, they may apply the process universally to the query results.
  • During the Covid-19 pandemic, Google employed whitelists for websites that could appear high in the results for Covid-related searches.
  • Similarly, during democratic elections, Google employed whitelists for sites that should be shown (or demoted) for election-related information.”

 

Is it Legit?

Throughout this process, Rand took it upon himself to look further into the API leak, its authenticity and whether we can trust it. So he used his network, reached out to some ex-Google friends, showed them the document and asked them what they thought.

Three ex-Googlers wrote back: one said they felt uncomfortable looking at or commenting on it. The other two shared the following anonymously:

  • “I didn’t have access to this code when I worked there. But this certainly looks legit.”
  • “It has all the hallmarks of an internal Google API.”
  • “It’s a Java-based API. And someone spent a lot of time adhering to Google’s own internal standards for documentation and naming.”
  • “I’d need more time to be sure, but this matches internal documentation I’m familiar with.”
  • “Nothing I saw in a brief review suggests this is anything but legit.”

 

What Now? Does Google Use Everything in the API Docs?

The simple answer is that it’s up for debate. Google could have retired some of these features, used some only for testing or internal projects, or made API features available that were never employed.

The documentation references deprecated features and specific notes, some of which explicitly say they should no longer be used. This suggests that those not marked with such details were still in use as of March 2024, when the leak occurred. 

Rand has also investigated the recency of the documentation; the most recent date referenced in the API docs is August 2023. It is worth considering that Google ships new updates every year, and its much-publicised AI updates and considerations do not appear in this leak. So, which of these items are actively used today in Google’s ranking systems? It’s all speculative. Although this leak contains some pretty juicy information, it is impossible to know. 

 


 

ROAR is committed to discovering and studying the latest developments within the SEO industry. If you’re looking for a data-driven strategic SEO agency to help you turn clicks into customers, contact our team of specialists. 
