This employs a little bit of routing trickery: traffic for sites to be filtered is redirected via the proxy, and the rest passes through untouched. This hopefully means the amount of traffic and the number of websites which need to pass through the filter is a lot less than "everything."
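The core of the trick can be sketched in a few lines. This is a minimal, hypothetical sketch - the hostnames, resolver answers, and the "filtering-proxy" next-hop are all made up - but it shows the shape of the mechanism: resolve the blacklist to IP addresses, and announce host routes for just those addresses towards the proxy.

```python
# Sketch: turn a blacklist of hostnames into /32 host routes whose
# next-hop is the filtering proxy; all other traffic is routed normally.
# The "DNS answers" below are hard-coded stand-ins for live lookups.

BLACKLIST = ["nasty.example.com", "bad.example.net"]

# Hypothetical resolver results; a real system would query DNS repeatedly.
RESOLVED = {
    "nasty.example.com": ["192.0.2.10"],
    "bad.example.net": ["198.51.100.5", "198.51.100.6"],
}

def routes_to_inject(blacklist, resolved):
    """Return the set of /32 prefixes to announce towards the proxy."""
    prefixes = set()
    for host in blacklist:
        for ip in resolved.get(host, []):
            prefixes.add(ip + "/32")
    return prefixes

if __name__ == "__main__":
    for prefix in sorted(routes_to_inject(BLACKLIST, RESOLVED)):
        # In practice this would be a BGP announcement, not a print.
        print(prefix, "-> next-hop filtering-proxy")
```

Everything not matching one of those host routes never touches the proxy at all - which is exactly why the approach looks so cheap on paper.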
This is how the British Telecom "cleanfeed" solution worked. They redirected traffic to be filtered via a bunch of Squid proxies to do the actual filtering. This worked out great - until they filtered Wikipedia for a specific image on a specific article. (I'd appreciate links to the above please so I can reference them.) Then everything went pear-shaped:
- From my understanding, all the requests to Wikipedia came from one IP, rather than properly pretending to be the client IP - this noticeably upset Wikipedia, who use IP addresses to identify potential spammers; and
- The sheer volume of requests to Wikipedia going through the filtering service caused it to slow right down.
So that is problem number 1 - it looks like it will work fine on a set of hardly-viewed sites, but it may not work on a very busy site such as Wikipedia. Squid-based filtering solutions certainly won't work on the scale of filtering Youtube (at least, not without using the magic Adrian-Squid version which isn't so performance-limited [/advertisement].)
The next problem is determining which IP addresses to redirect. Websites may change their IP addresses often - or have many IP addresses! - so the list needs to be constantly updated. The number of IP addresses which need to be injected into BGP is based on all of the possible IP addresses returned for each site - in the real world this varies from one to hundreds, and they may change frequently, requiring constant updating to stay correct. This leads to two main potential issues:
- Hardware routing devices (i.e., the top-of-the-line ones which large ISPs use to route gigabits of traffic) have limited "slots" for IP addresses/networks. The router typically stops working correctly when those run out. If you're lucky, the number of IP addresses being filtered will fit inside the hardware routing table. If you're unlucky, it won't. The big problem: different equipment from different vendors has different limitations, and upgrading this equipment can cost tens or hundreds of thousands of dollars.
- The only traffic being filtered is traffic being redirected to the filter. If the list of IP addresses for a nasty website is not kept 100% up to date, the website will not be properly filtered.
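The second issue is really a synchronisation problem: every time a site's DNS answers change, the set of injected routes has to change with it. A small sketch, using made-up TEST-NET addresses, of the update a filter operator has to compute on every poll:

```python
# Sketch: compute the BGP updates needed when a filtered site's DNS
# answers change between two polls. All addresses are hypothetical.

def route_updates(old_ips, new_ips):
    """Return (announce, withdraw) sets of /32 prefixes."""
    old = {ip + "/32" for ip in old_ips}
    new = {ip + "/32" for ip in new_ips}
    return new - old, old - new

yesterday = {"192.0.2.10", "192.0.2.11"}
today = {"192.0.2.11", "203.0.113.7"}  # the site moved one address

announce, withdraw = route_updates(yesterday, today)
print("announce:", sorted(announce))  # new address must start being redirected
print("withdraw:", sorted(withdraw))  # old address no longer needs filtering
```

Any request sent to the new address in the window between the DNS change and the route announcement reaches the site completely unfiltered - which is the gap described in the second bullet above.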
The third main problem is filtering websites which employ Content Delivery Networks. This is a combination of the above two problems. So I'm going to pose the question:
How do you filter a web page on Google?
No, the answer here isn't "Contact Google and ask them to kill the Web page." I'm specifically asking how one filters a particular web page on a very distributed infrastructure. You know; the kind of infrastructure which everyone is deploying these days. This may be something like Google/Yahoo; it may be hosted on a very large set of end-user machines in an illegally-run botnet. The problem space is still the same.
- The IP addresses/networks potentially involved in serving that website are dynamic - you don't get an "easy" list of IPs when you resolve the hostname! For example, there are at least hundreds of potential hostnames serving Youtube streaming media content. It isn't just a case of filtering "www.youtube.com".
- There are a number of services running on the same infrastructure as Youtube. You may get lucky and only have to intercept one website - but you may also get unlucky and end up having to intercept all of the Google services just to filter one particular website.
All of a sudden you could end up redirecting a significant portion of your web traffic to your filtering infrastructure. It may be happy filtering a handful of never-visited websites; it's another matter entirely when you start feeding it a large part of the internet.
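The CDN problem above can be made concrete with a toy model. The prefixes and service names here are entirely hypothetical; the point is that on shared infrastructure, the prefixes serving the target site also serve unrelated services, so redirecting those prefixes drags all of that traffic through the filter too.

```python
# Sketch: which unrelated services get dragged through the filter when
# we redirect every prefix that serves the target site?
# Prefixes and service names are hypothetical.

CDN_PREFIXES = {
    "203.0.113.0/24": {"video-site", "mail", "search"},
    "198.51.100.0/24": {"video-site", "maps"},
    "192.0.2.0/24": {"search"},
}

def collateral_services(target):
    """Services whose traffic is redirected alongside the target's."""
    dragged = set()
    for prefix, services in CDN_PREFIXES.items():
        if target in services:
            dragged |= services
    return dragged - {target}

print(sorted(collateral_services("video-site")))
```

In this toy model, filtering one video site forces mail, search and maps traffic through the proxy as well - the "intercept all of the Google services just to filter one website" scenario from the bullet above.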
In summary, BGP based selective filtering doesn't work anywhere near as well as indicated in the ACMA report.
- You can't guarantee that you'll enumerate all IP addresses involved for a specific website;
- The ACMA may list something on a large website/CDN which will result in your filtering proxies melting; you may as well have paid the upfront cost of filtering everything in the first place;
- The ACMA may list something with so many IP addresses that your network infrastructure either stops working; or the filter itself stops working.
Personally - I don't like the idea of the ACMA being able to crash ISPs by listing something which ISPs are simply unable to economically filter. Thus, the only logical solution here is to specify a filtering infrastructure to filter -everything-.