Google Bayesian Spam Filtering Problem?

An anticensorware investigation by Seth Finkelstein

Abstract: This report describes a possible explanation for recent changes in Google search results, where long-time high-ranking sites have disappeared. It is hypothesized that the changes are a result of the implementation of a "Bayesian spam filtering" algorithm, which is producing unintended consequences.

Google vs Spam

"Google" search results have long been under attack by search-engine spammers. As Google handles more and more searching, the temptation to scam the search results has grown ever stronger, leading to "Google-spam". Various advertising commission programs can make it extremely lucrative to lure people to pages designed to do nothing but generate ad revenue for the search-engine spammer.

During October 2003, Google implemented anti-spam measures which seemed to cause the search results to crash whenever a "poison spam site" would have been displayed: the results listing halted upon encountering such a site. See "Google Spam Filtering Gone Bad" (Seth Finkelstein) for extensive technical discussion of Google's suppression mechanism.

In the following weeks, this problem was reduced, possibly by hand-excluding some spam sites. A potential crash moved further out in the search results, from perhaps the first page to many pages later. But it was not completely fixed. Crashes from poison spam sites could still be detected, though they were now much less of a concern to the typical user.

In mid-November 2003, Google introduced a major update, which had puzzling results. Many long-time high-ranking sites, which were not spammers, appeared to fall precipitously in ranking or disappear entirely from the search results. Many theories have been put forth to account for this. See, for example, "Been Gazumped by Google? Trying to make Sense of the "Florida" Update!" (Barry LLoyd).

A Theory And Evidence

I conjecture the new search results arise from Google's implementation of "Bayesian spam filtering". While too complicated to fully explain here, a "Bayesian spam filter" is a method for probabilistically estimating the likelihood that material is spam. Think of it as a measure of "spamminess", with reference to a set of known spam: in essence, it determines how much something "looks spammy". This method has been very popular in spam-fighting.
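To make the idea of a "spamminess" score concrete, here is a minimal sketch of a naive Bayes text scorer in Python. It is a generic illustration of the technique, not Google's method: the function names, the add-one smoothing, and the 50% prior are all assumptions chosen for simplicity.

```python
import math
from collections import Counter

def train(docs):
    """Count word occurrences across a list of documents (plain strings)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def spamminess(text, spam_counts, ham_counts, prior_spam=0.5):
    """Estimate the probability (between 0 and 1) that `text` is spam.

    Naive Bayes with add-one smoothing, accumulated as log-odds.
    `prior_spam` is an assumed prior, not a measured value.
    """
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for word in text.lower().split():
        # The extra +1 in the denominator reserves mass for unseen words.
        p_word_spam = (spam_counts[word] + 1) / (spam_total + vocab_size + 1)
        p_word_ham = (ham_counts[word] + 1) / (ham_total + vocab_size + 1)
        log_odds += math.log(p_word_spam / p_word_ham)
    # Numerically stable conversion from log-odds to probability.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Trained on a small set of spam and non-spam text, `spamminess` returns a higher value for text whose words appear mostly in the spam set. Note that a legitimate page built around a word common in spam would also score high, which is the kind of false positive discussed in this report.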

Given that Google is under extreme spam siege, it makes sense that it would adopt more complicated anti-spam measures. Unfortunately, the current implementation seems to have extreme unintended consequences. One problem with probabilistic methods is that, given very large datasets (such as the entire web), even a small error rate can produce a significant number of false positives.
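The scale effect can be shown with simple arithmetic. The sketch below uses purely hypothetical numbers (the index size, spam fraction, and false-positive rate are illustrative assumptions, not measurements of Google) to show how a filter that is wrong only rarely can still misclassify a very large number of legitimate pages.

```python
def expected_false_positives(total_pages, spam_fraction, false_positive_rate):
    """Expected number of legitimate pages wrongly flagged as spam.

    All three inputs are hypothetical illustration values.
    """
    legitimate_pages = total_pages * (1.0 - spam_fraction)
    return legitimate_pages * false_positive_rate

# Hypothetical: one billion pages, 10% of them spam, and a filter that
# wrongly flags one legitimate page in a thousand.
wrongly_flagged = expected_false_positives(1_000_000_000, 0.10, 0.001)
print(f"{wrongly_flagged:,.0f} legitimate pages wrongly flagged")
```

Under those assumed figures, roughly 900,000 legitimate pages would be flagged, even though the filter is right about 99.9% of legitimate pages.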

At the moment, Google seems to have adopted the following rule:

If a simple search has spam-related keywords, penalize high-spam-scoring results

Here, "simple search" now seems to mean roughly a search based only on a keyword or minimal combinations of keywords. Many people have noted in their investigations that making the search more complex, such as by trying to exclude a nonsense string, deactivates the new results algorithm (I believe this means deactivating the Bayesian anti-spam mechanism). It turns out, in fact, that any complexity, such as filetype or site-based searches, will currently have this deactivation effect. While some have thought this simple carelessness on Google's part, a little reflection will show it may be a side effect of not wanting to apply the Bayesian, probabilistic, spam penalizing of results to every type of search. If this probabilistic suppression mechanism were applied to searching within websites, a major part of Google's services, it could wreak havoc with the results there.
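The hypothesized rule can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the keyword set, the threshold, the function names, and the choice to drop penalized results outright rather than merely demote them. It models the conjecture in this report, not any known Google code.

```python
# Hypothetical values, chosen only to illustrate the conjectured rule.
SPAM_KEYWORDS = {"bracelet"}
SPAM_SCORE_THRESHOLD = 0.9

def is_simple_search(query):
    """A query is 'simple' if it has only plain keywords:
    no exclusions (-term) and no operators such as site: or filetype:."""
    tokens = query.split()
    return bool(tokens) and all(
        not token.startswith("-") and ":" not in token for token in tokens
    )

def filter_results(query, results):
    """Apply the conjectured rule to `results`, a rank-ordered list of
    (url, spam_score) pairs, and return the URLs that survive."""
    keywords = {token.strip('"').lower() for token in query.split()}
    if is_simple_search(query) and keywords & SPAM_KEYWORDS:
        # Simplification: penalized results are dropped entirely.
        return [url for url, score in results if score < SPAM_SCORE_THRESHOLD]
    return [url for url, _ in results]
```

With this sketch, a query such as `bracelet` triggers the filter, while `bracelet -site:www.google.com` or `charm` leaves the result list untouched, matching the behavior described above.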

Some evidence may clarify this explanation. Consider a search on the word "bracelet", compared to the search on "bracelet" without bayesianspamfiltering (that is, excluding a nonsense string). Note the site "www.charmbracelet.org" appears in the second, more complicated search (as result #4), but not the first, simpler one (other changes are apparent, but let's follow this one). This is not a spam site, but a thoroughly legitimate organization. Any slightly more complex search which does not affect ranking has that site appearing as result #4, such as "bracelet" without XLS file types, or "bracelet" not on site www.google.com.
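For readers who want to see the form of these comparison searches, the sketch below builds them as query URLs in Python. The exact operator spellings (`-term`, `-filetype:xls`, `-site:`) and the URL pattern are my assumptions about how the searches above were written; the helper name is hypothetical.

```python
from urllib.parse import urlencode

def google_search_url(query):
    """Build a search URL for `query` (assumed URL pattern)."""
    return "https://www.google.com/search?" + urlencode({"q": query})

# The simple search, followed by the "more complex" variants that
# should not, in principle, change the ranking of legitimate sites.
comparison_queries = [
    "bracelet",                         # simple search
    "bracelet -bayesianspamfiltering",  # exclude a nonsense string
    "bracelet -filetype:xls",           # exclude XLS file types
    "bracelet -site:www.google.com",    # exclude results on www.google.com
]

for query in comparison_queries:
    print(google_search_url(query))
```

Comparing the position of a given site across the first query and any of the others is the test used throughout this section.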

So why has it disappeared from the simple search? I hypothesize that "bracelet" is marked as a suspicious spam keyword. The site "www.charmbracelet.org" ranks highly for the keyword "bracelet". But when the anti-spam mechanism is in place, the very fact of ranking highly for "bracelet" hypothetically also gives it a high spam score. Since it has few other words to offset this high spam score, it's killed.

To see that this effect does not stem from a general overall re-ranking, consider a word often associated with "bracelet", such as "charm". The site "www.charmbracelet.org" also ranks highly for a search of "charm" not on site google.com, at result #10. And it retains this position on a simple search for the word "charm", remaining at result #10. Thus "charm" is not considered a suspicious spam keyword, while "bracelet" triggers the anti-spam mechanism.

And this is further reinforced by combining the two words - apparently a spam keyword together with a non-spam keyword still activates the anti-spam system. The site "www.charmbracelet.org" used to rank at result #2 for a combined search; see "charm" "bracelet" not on site google.com. But with the suppression in place, it's penalized for a simple search of "charm" "bracelet".

Overall, whenever the anti-spam mechanism is applied to a search, sites will be penalized in the results to the extent that they match a "spam-profile". This can be quite mysterious and confusing.

Conclusion

Once more, spam is a plague on search engines as well as email. But technical solutions may have unintended consequences.

Version 1.0 November 26 2003

Support

This work was not funded by anyone, and has no connection to any organization. In fact, if anyone is providing financial support for such projects, the author would like to know.

Small update January 6 2004: Google has now implemented the spam suppression system throughout its searching. The comparison methods described above no longer work.

Mail comments to: Seth Finkelstein <sethf@sethf.com>

For future information: subscribe to Seth Finkelstein's Infothought list or read the Infothought blog

See more of Seth Finkelstein's Censorware Investigations