BESS vs The Google Search Engine (Cache, Groups, Images)

An anticensorware investigation by Seth Finkelstein

Abstract: This report examines how N2H2's censorware deals with archives of large amounts of information. Three features of the Google search engine are examined (Cache, Groups, Images). N2H2/BESS is found to ban the cached pages everywhere, pass porn in Groups, and consider all image searching to be pornography. The general problems of censorware versus large archives are discussed (i.e., why censorware is impelled into situations such as banning the Google cache).

Introduction


Nothing is so difficult but that it may be found out by seeking.

-- Terence


N2H2 is a company which makes censorware (aka "filters"). Censorware is software which is designed and optimized for use by an authority to prevent another person from sending or receiving information. N2H2's product is sometimes sold under the name BESS, The Internet Retriever. This report examines how N2H2's censorware treats several features provided by Google, a popular and advanced search engine. Google has a huge cache of web pages, a large archive of netnews groups, and experimental image searching. Each feature turns out to be treated differently by N2H2.

Google Cache - Cache Me If You Can


Among Google's most popular special features are:

Cache: Provides a snapshot Google took of each page in your results. Click on it if the original page is unavailable or to see the page faster. ...

-- Google tour


While there are many search engines, one of Google's most popular and innovative aspects is its cache. This is an excellent feature whereby copies of web pages are stored locally on Google's servers (only text pages are stored, no images or other binaries). It's a kind of temporary archive. Thus, a version of a web page matching a search (as of the time Google last visited that page) can be retrieved directly from Google's servers. This can sometimes be faster than retrieving the page from that page's own web server (especially if that server is in another country or on a slow connection). And the cached copies of pages can be viewed even if the original page has been deleted (this works around many frustrations stemming from page-not-found errors for old pages). The Google cache is a wonderful advance in making search engines easier and better to use.

But, let's consider the implications of that cache feature for a moment ... stored locally on Google. Ponder this from a censorware point of view. All the cached pages, all of them, are sent from Google's server. Every cached page, literary, artistic, political, scientific or not, is seen as originating from the same domain (google.com).

In the language of BESS and N2H2, this makes the Google cache a LOOPHOLE. As discussed in my earlier report, BESS's Secret LOOPHOLE (censorware vs. privacy & anonymity), a LOOPHOLE is a means by which a reader could possibly escape the necessary absolute control of censorware. LOOPHOLE is a secret (err, undocumented) blacklist category in BESS for sites which are always banned, everywhere. This blacklist category is not visible to administrators and cannot be deactivated. Indeed, from the point of view of censorware, allowing such sites would undermine the whole control of the program.

Thus, BESS bans all use of the Google cache. No matter how useful this feature is in searching and retrieving web pages, the fact that it has the potential to be used to escape the control of censorware means it cannot be permitted.

Google Cache - The Best Laid Bans ...


The best-laid plans of mice and men go oft astray

-- idiom derived from a Robert Burns poem


While BESS bans the Google cache, it turns out that the mechanism of the ban itself is extremely easy to defeat. A Google cache URL looks like, for example:

http://www.google.com/search?q=cache:k0u5Jf7nQ00:www.sethf.com/anticensorware/

Despite all the information there, it's just a complicated label for a web page from Google. Now, BESS does not ban the Google domain itself as a LOOPHOLE. So this means it can't base the ban on http://www.google.com. It doesn't ban Google searches themselves as a LOOPHOLE, so it can't use the part of the URL beginning http://www.google.com/search (which indicates a search).

It turns out that BESS bans the Google cache as a LOOPHOLE by looking for URLs starting with

http://www.google.com/search?q=cache

This can be verified by using N2H2's single-site blacklist checking form.

Type in anything starting with http://www.google.com/search?q=cache, and the result should come back as, e.g.:

The Site: http://www.google.com/search?q=cacheMeIfYouCan
is categorized by N2H2 as:
Loop Hole Sites

That's it. It's a very simple string search. Changing this string in even the smallest ways will fool BESS. It's that stupid.
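The behavior above can be sketched in a few lines. This is a hypothetical reconstruction of the ban test as observed from outside, not N2H2's actual code; the function name `bess_bans` is my own invention.

```python
# Hypothetical reconstruction of the LOOPHOLE ban test described above:
# a plain prefix match on the URL string (not N2H2's actual code).
BANNED_PREFIX = "http://www.google.com/search?q=cache"

def bess_bans(url: str) -> bool:
    """Return True if the URL would be classified as a LOOPHOLE."""
    return url.startswith(BANNED_PREFIX)

# Any URL starting with the magic string is banned, even nonsense:
print(bess_bans("http://www.google.com/search?q=cacheMeIfYouCan"))  # True

# A real Google cache URL matches the same prefix:
real = "http://www.google.com/search?q=cache:k0u5Jf7nQ00:www.sethf.com/anticensorware/"
print(bess_bans(real))  # True

# An ordinary search does not:
print(bess_bans("http://www.google.com/search?q=rabbits"))  # False
```

A prefix match of this kind is sensitive to the exact position of `q=cache` in the URL, which is exactly what the bypasses below exploit.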

For example, using any country-specific version of Google will bypass BESS, since the banned string names the www.google.com domain specifically.

Putting any sort of nonsense parameter between the "?" character and the "q" character in the URL will also cause BESS to go astray. This is done by adding a character sequence such as "somestring=anotherstring&" between the "?" character and the "q" character. Use your imagination.
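The parameter-insertion bypass can be demonstrated concretely. Again, the prefix check is a hypothetical reconstruction of the observed behavior, and the helper names are my own:

```python
# Sketch of the parameter-insertion bypass described above.
# The prefix check is a reconstruction of observed behavior,
# not N2H2's actual code.
BANNED_PREFIX = "http://www.google.com/search?q=cache"

def bess_bans(url: str) -> bool:
    """Return True if the URL would be classified as a LOOPHOLE."""
    return url.startswith(BANNED_PREFIX)

def insert_dummy_param(url: str) -> str:
    """Insert a meaningless query parameter between '?' and 'q='.
    The extra parameter shifts 'q=cache' away from the position
    the naive prefix match expects, so the ban test fails."""
    return url.replace("?q=", "?somestring=anotherstring&q=", 1)

cache_url = "http://www.google.com/search?q=cache:k0u5Jf7nQ00:www.sethf.com/anticensorware/"
print(bess_bans(cache_url))                      # True  -- banned
print(bess_bans(insert_dummy_param(cache_url)))  # False -- bypassed
```

Since web servers generally ignore query parameters they don't recognize, the modified URL still retrieves the same cached page from Google while sliding past the string match.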

Even stranger, some versions of N2H2's censorware (e.g. the Microsoft ISA Server version) seem to have a bug where N2H2 wants to ban the Google cache, but the server fails to do so. It's not for lack of trying, since a nonsense string such as http://www.google.com/search?q=cacheMeIfYouCan will still be banned. But the real, complicated Google cache labels fail the ban-matching test. I believe I know the internal reason for the banning problem, but N2H2 doesn't pay me to debug their buggy software.

I've only seen this bug on a Microsoft (... quelle surprise ...) server-based version of N2H2's censorware. That's found more often in corporations. N2H2's original server censorware, widely marketed to schools and libraries, uses a Linux-based server. The home version, even though running on Microsoft machines, seems to use a Linux server behind it.

The bottom line is that if you are reading this report behind N2H2 server censorware (especially in a corporation), and find that BESS doesn't ban the Google cache, try testing with a simple nonsense string after the q=cache part of the URL. That is, something along the lines of

http://www.google.com/search?q=cachetestnonsensestring

N2H2 sometimes can't even get a simple text-based banning mechanism to work correctly. It amazes me that people believe censorware can make any sort of sophisticated intellectual judgment.

Google Groups - DejaNews reincarnated


Google is attempting to compile the most complete archive possible of Usenet posts, ...

-- Google Usenet archive request


Google Groups is a massive archive of text messages (no binaries). It consists of several years' worth of articles from a huge number of Usenet groups. The entire database can be searched and any matching messages retrieved.

So again, consider the problem from a censorware point of view. There's a huge database, with articles ranging from science to sex, and no way to distinguish among the results of a search. While the various groups have useful labels such as alt.sex or sci.crypt, which indicate discussion topics, a search can return results from any group, and a message can also belong to several groups. The messages themselves are tagged by Google only with uninformative ID strings such as selm=9a7t12%242s5%241%40pencil.math.missouri.edu

Interestingly, BESS does not consider this archive to be a LOOPHOLE. It only blacklists Google Groups in the category "Message/Bulletin Board". The Message/Bulletin Board blacklist is in fact even allowed in the configuration BESS calls "Typical School Filtering".

This means any message, even the most prurient, can be retrieved from the Google Groups database even under their "Typical School Filtering" setting. Remember, there is no way BESS can tell the content of a retrieved message. It's all or nothing in terms of retrieving messages.

So, even though everyone is prohibited by N2H2 from using the Google cache, with the correct search terms a person can read all the (text) porn they desire. Go figure.

Of course, bans on certain keywords will stop the most obvious sex searching. But computers have no intelligence. For an in-depth examination of this point, see Jonathan Wallace's article N2H2's Weak AI. So consider, for example, searches for various acronyms. Censorware can't distinguish whether a search for "ASSM" seeks material related to the group "alt.sex.stories.moderated" or the organization "Association of State Supervisors of Mathematics" (whereas a human searcher can rapidly tell which one was intended).

But were N2H2 to change the blacklisting here to be more restrictive, it would be denying access to much information useful for research in many areas.

Google Image Search - and if thine eye offend thee ...


WARNING: The results you see with this feature may contain adult content.

-- Google's warning about image searching


But if thine eye be evil, thy whole body shall be full of darkness.

-- Matthew 6:23


Google has an experimental feature that searches for images. Note, to Google's credit, this is not the hoary claim of image-recognition. That is, Google does not imply an understanding of anything from the pixels in the image. Rather, the searching is based on the text name given to the image file, the text of the page, and other purely prosaic text factors.

N2H2 quite straightforwardly blacklists all Google Image Search as "Pornography". Since some results might contain "adult content", they handle that by forbidding everything related to image searching.

Perhaps since this deals with pictures and not text, the all-or-nothing dilemma is resolved here as "all" instead of "nothing".

Conclusion


Three statisticians go out hunting together. After a while they spot a solitary rabbit. The first statistician takes aim and overshoots. The second aims and undershoots. The third shouts out "We got him!"

-- From the January 89 issue of Unix/Review


It should be stressed that the dilemma here is intrinsic to censorware. Given a large undifferentiated archive, either the archive will have to be banned entirely, or a reader may be able to retrieve items which would otherwise be banned. No matter which approach any given censorware takes in a particular case, the result is bound to be either too little or too much in terms of the concept of "filtering". There is no magic blocking program, only crude and ill-functioning attempts at control of information.


Version 1.0, Sep 4 2001

See also: BESS vs Image Search Engines


Mail comments to: Seth Finkelstein <sethf@sethf.com>

For future information: subscribe to Seth Finkelstein's Infothought list or read the Infothought blog

(if you subscribed a few months ago, please resubscribe due to a crash)

See more of Seth Finkelstein's Censorware Investigations