Adrian Chadd's Ramblings: filtering

Saturday, December 19, 2009

Filtering via BGP, and why this doesn't always quite work out..

Another interesting thing to look at is the increasingly popular method of filtering by using a BGP "exception list". I've heard this anecdotally touted by various parties as "the solution" (but nothing I can quote publicly in any way, sorry) but really, is it?

This employs a little bit of routing trickery to redirect sites to be filtered via the proxy and passing the rest through untouched. This hopefully means that the amount of traffic and websites which need to pass through the filter is a lot less than "everything."

This is how the British Telecom "cleanfeed" solution worked. They redirected traffic to be filtered via a bunch of Squid proxies to do the actual filtering. This worked out great - until they filtered Wikipedia for a specific image on a specific article. (I'd appreciate links to the above please so I can reference them.) Then everything went pear-shaped:

From my understanding, all the requests to Wikipedia came from one IP, rather than properly pretending to be the client IP - this noticably upset Wikipedia, who use IP addresses to identify potential spammers; and
The sheer volume of requests to Wikipedia going through the filtering service caused it to slow right down.

So that is problem number 1 - it looks like it will work fine on a set of hardly viewed sites - but it may not work on a very busy site such as Wikipedia. Squid based filtering solutions certainly won't work on the scale of filtering Youtube (at least, not without using the magic Adrian-Squid version which isn't so performance-limited [/advertisement].)

The next problem is determining which IP addresses to redirect. Websites may change their IP addresses often - or have many IP addresses! - and so the list of IP addresses needs to be constantly updated. The number of IP addresses which need to be injected into BGP is based on all of the possible IP addresses returned for each site - this varies in the real world from one to hundreds. Again, they may change frequently - requiring constant updating to be correct. This leads to two main potential issues:

Hardware routing devices (ie, the top of the line ones which large ISPs use to route gigabits of traffic) have limited "slots" for IP addresses/networks. The router typically stops working correctly when those run out. If you're lucky, the number of IP addresses being filtered will fit inside the hardware routing table. If you're unlucky, they won't. The big problem - different equipment from different vendors has different limitations. Upgrading this equipment can costs tens or hundreds of thousands of dollars.
The only traffic being filtered is traffic being redirected to the filter. If the list of IP addresses for a nasty website is not kept 100% up to date, the website will not be properly filtered.

The third main problem is filtering websites which employ Content Delivery Networks. This is a combination of the above two problems. So I'm going to pose the question:

How do you filter a web page on Google?

No, the answer here isn't "Contact Google and ask them to kill the Web page." I'm specifically asking how one filters a particular web page on a very distributed infrastructure. You know; the kind of infrastructure which everyone is deploying these days. This may be something like Google/Yahoo; this may be being hosted on a very large set of end user machines on an illegally run botnet. The problem space is still the same.

The IP addresses/networks involved in potentially serving that website is dynamic - you don't get an "easy" list of IPs when you resolve the hostname! For example - there's at least hundreds of potential hostnames serving Youtube streaming media content. It just isn't a case of filtering "www.youtube.com".
There are a number of services running on the same infrastructure as Youtube. You may get lucky and only have one website - but you also may get unlucky and end up having to intercept all of the Google services just to filter one particular website.

All of a sudden you will end up potentially redirecting a significant portion of your web traffic to your filtering infrastructure. It may be happy filtering a handful of never-visited websites; but then you start feeding it a large part of the internet.

In summary, BGP based selective filtering doesn't work anywhere near as well as indicated in the ACMA report.

You can't guarantee that you'll enumerate all IP addresses involved for a specific website;
The ACMA may list something on a large website/CDN which will result in your filtering proxies melting; you may as well have paid the upfront cost in filtering everything in the first place;
The ACMA may list something with so many IP addresses that your network infrastructure either stops working; or the filter itself stops working.

Personally - I don't like the idea of the ACMA being able to crash ISPs because they list something which ISPs are just unable to economically filter. Thus, the only logical solution here is to specify a filtering infrastructure to filter -everything-.

Wednesday, December 16, 2009

Why would more than 10,000 URLs be a problem?

I'm going to preface this (and all other censorship/filtering related posts) with a disclaimer:

I believe that mandatory censorship and filtering is wrong, inappropriate and risky.

That said, I'd like others to better understand the various technical issues behind implementing a filter. My hope is that people begin to understand the proper technical issues rather than simply re-stating others' potentially misguided opinions.

The "10,000 URL" limit is an interesting one. Since the report doesn't mention the specifics behind this view, and I can't find anything about it in my simple web searching, I'm going to make a stab in the dark.

Many people who implement filters using open source methods such as Squid will typically implement them as a check against a list of URLs. This searching can be implemented via two main methods:

Building a list of matches (regular expressions, exact-match strings, etc) which is compared against; and
Building a tree/hash/etc to match against in one pass.

Squid implements the former for regular expression matching and the latter for dstdomain/IP address matching.

What this unfortunately means is that full URL matching with regular expressions depends not only on the complexity of the regular expression, but the number of entries. It checks each entry in the list in turn.

So when Squid (and similar) software is used to filter a large set of URLs, and regular expressions are used to match against, it is quite possible that there will be a limitation on how many URLs can be included before performance degrades.

So, how would one work around it?

It is possible to combine regular expression matches into one larger rule, versus checking against many smaller ones. Technical details - instead of /a/, /b/, /c/; one may use /(a|b|c)/. But unfortunately not all regular expression libraries handle very long regular expressions so for portability reasons this is not always done.

Squid at least doesn't make it easy to match on the full URL without using regular expressions. Exact-match and glob-style match (eg, http://foo.com/path/to/file/*) will work very nicely. (I also should write that for Squid/Lusca at some point.)

A google "SafeSearch" type methodology may be used to avoid the use of regular expressions. This normalises the URL, breaks it up into parts, creates MD5 hashes for each part and compares them in turn to a large database of MD5 hashes. This provides a method of distributing the filtering list without specifically providing the clear-text list of URLs and it turns all of the lookups into simple MD5 comparisons. The downside is the filtering is a lot less powerful than regular expressions.

To wrap up, I'm specifically not discussing the effectiveness of URL matching and these kinds of rules in building filters. That is a completely different subject - one which will typically end with "it's an arms race; we'll never really win it." The point is that it is possible to filter requests against a list of URLs and regular expressions much, much greater than a low arbitrary limit.