Wednesday, December 16, 2009

Why would more than 10,000 URLs be a problem?

I'm going to preface this (and all other censorship/filtering related posts) with a disclaimer:

I believe that mandatory censorship and filtering is wrong, inappropriate and risky.

That said, I'd like others to better understand the various technical issues behind implementing a filter. My hope is that people come to grips with the actual technical issues rather than simply restating others' potentially misguided opinions.

The "10,000 URL" limit is an interesting one. Since the report doesn't mention the specifics behind this figure, and I can't find anything about it with some simple web searching, I'm going to take a stab in the dark.

Many people who implement filters using open source tools such as Squid will typically do so as a check against a list of URLs. This search can be implemented via two main methods:
  1. Building a list of matches (regular expressions, exact-match strings, etc) which are checked one at a time; and
  2. Building a tree/hash/etc which can be matched against in one pass.
Squid implements the former for regular expression matching and the latter for dstdomain/IP address matching.
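
To make the difference concrete, here is a minimal Python sketch of the two approaches. The patterns and domain names are purely illustrative, and a Python set stands in for whatever tree or hash structure the real software uses:

    import re

    # Approach 1: a flat list of patterns checked one by one (the way
    # regular expression ACLs behave).  The cost grows with the list size.
    regex_rules = [re.compile(p) for p in (r"badsite\.example", r"/forbidden/", r"\.exe$")]

    def regex_blocked(url):
        return any(r.search(url) for r in regex_rules)

    # Approach 2: a structure matched in one pass (here a hash set; Squid
    # uses a tree for dstdomain/IP).  The cost is roughly constant no
    # matter how many entries are loaded.
    blocked_domains = {"badsite.example", "other.example"}

    def domain_blocked(host):
        # Walk up the labels so subdomains of a blocked domain also match.
        labels = host.split(".")
        return any(".".join(labels[i:]) in blocked_domains for i in range(len(labels)))

    print(regex_blocked("http://host.example/files/tool.exe"))  # True
    print(domain_blocked("www.badsite.example"))                # True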

What this unfortunately means is that the cost of full URL matching with regular expressions depends not only on the complexity of each regular expression, but also on the number of entries: the software checks each entry in the list in turn.

So when Squid (or similar software) is used to filter a large set of URLs using regular expression matching, it is quite possible there will be a practical limit on how many URLs can be included before performance degrades noticeably.
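
A rough way to see that linear cost, assuming Python's re module as a stand-in for whatever regex library a proxy would use (the generated patterns here are meaningless filler):

    import random
    import re
    import string
    import timeit

    def random_pattern():
        # Meaningless filler patterns that won't match the test URL.
        return "".join(random.choice(string.ascii_lowercase) for _ in range(12))

    url = "http://www.example.com/some/long/path/that/matches/nothing"

    for n in (100, 1000, 10000):
        rules = [re.compile(random_pattern()) for _ in range(n)]
        t = timeit.timeit(lambda: any(r.search(url) for r in rules), number=100)
        print("%6d patterns: %.3f seconds for 100 lookups" % (n, t))

The per-lookup time should grow roughly in proportion to the number of patterns, which is exactly the degradation described above.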

So, how would one work around it?

It is possible to combine regular expression matches into one larger rule rather than checking against many smaller ones. Technical details - instead of /a/, /b/, /c/, one may use /(a|b|c)/. Unfortunately, not all regular expression libraries handle very long regular expressions well, so for portability reasons this is not always done.
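
As a sketch of how that combination might be built (again Python, with made-up patterns), the single alternation gives the same answers as checking the individual rules in turn, at the cost of one very long expression that, as noted, some libraries handle poorly:

    import re

    patterns = [r"badsite\.example", r"/forbidden/", r"\.exe$"]

    # Many small rules, checked one at a time.
    individual = [re.compile(p) for p in patterns]

    # One combined rule: /(a|b|c)/, built by joining the parts with "|".
    combined = re.compile("|".join("(?:%s)" % p for p in patterns))

    url = "http://host.example/files/tool.exe"
    assert any(r.search(url) for r in individual) == bool(combined.search(url))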

Squid at least doesn't make it easy to match on the full URL without using regular expressions. Exact-match and glob-style matching (eg, http://foo.com/path/to/file/*) would work very nicely. (I should also write that for Squid/Lusca at some point.)
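
A minimal sketch of what full-URL exact and glob-style matching could look like, using Python's fnmatch module for the glob part (the block lists and URLs are illustrative):

    from fnmatch import fnmatchcase

    exact_block = {"http://foo.com/banned/page.html"}
    glob_block = ["http://foo.com/path/to/file/*"]

    def url_blocked(url):
        # Exact matches are a single hash lookup; globs are scanned linearly
        # here, though a real implementation could index them by prefix.
        return url in exact_block or any(fnmatchcase(url, g) for g in glob_block)

    print(url_blocked("http://foo.com/path/to/file/archive.zip"))  # True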

A Google "SafeSearch" style methodology may be used to avoid regular expressions entirely. This normalises the URL, breaks it into parts, creates MD5 hashes for each part, and compares them in turn against a large database of MD5 hashes. This provides a way to distribute the filtering list without shipping the clear-text list of URLs, and it turns every lookup into a simple MD5 comparison. The downside is that the filtering is a lot less powerful than regular expressions.
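
As a rough illustration of that idea - this is only a sketch of the general scheme described above, not Google's actual protocol, and the normalisation and host-suffix/path-prefix splitting rules are assumptions:

    import hashlib
    from urllib.parse import urlsplit

    def md5_hex(s):
        return hashlib.md5(s.encode("utf-8")).hexdigest()

    def url_variants(url):
        # Normalise the URL, then generate host-suffix / path-prefix
        # combinations to look up.  The exact splitting rules are illustrative.
        parts = urlsplit(url.lower())
        labels = parts.hostname.split(".")
        hosts = [".".join(labels[i:]) for i in range(max(len(labels) - 1, 1))]
        segments = [s for s in parts.path.split("/") if s]
        paths = ["/"] + ["/" + "/".join(segments[:i]) + "/" for i in range(1, len(segments))]
        if parts.path:
            paths.append(parts.path)
        for h in hosts:
            for p in paths:
                yield h + p

    # The distributed block list holds only hashes, never the clear-text URLs.
    blocked_hashes = {md5_hex("badsite.example/banned/")}

    def blocked(url):
        return any(md5_hex(v) in blocked_hashes for v in url_variants(url))

    print(blocked("http://www.badsite.example/banned/page.html"))  # True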

To wrap up, I'm specifically not discussing the effectiveness of URL matching and these kinds of rules in building filters. That is a completely different subject - one which typically ends with "it's an arms race; we'll never really win it." The point is that it is quite possible to filter requests against a list of URLs and regular expressions much, much larger than a low, arbitrary limit such as 10,000 entries.
