Monday, December 21, 2009

The ACMA blacklist, and can it be distributed securely?

One sore point is that the ACMA Blacklist for online restricted content is "closed" - that is, we the public currently have no way of viewing what is on it. The pro-filtering advocates quite validly state that opening up the ACMA Blacklist would basically be publishing URLs for naughty people to view - an "illegal content directory", if you will.

So what? If they want to find it, they'll find it - whether it is public or not.

The current downside though is that the blacklist can't be easily used by third-party filter software producers without what I understand to be an elaborate and expensive process.

So not only is it currently impossible for the public to vet the list and make sure only illegal content ends up on it, but the list also can't be widely used unless you're a company with a lot of money to burn.

It seems like a bit of a silly situation to be in, doesn't it?

So, is it feasible to distribute the list in some encrypted way? How hard would it be to recover what is on the list itself? This is a good question. The honest answer is "no, it isn't feasible to keep it completely hidden." Like everything technological, the real question is how much effort you're willing to spend hiding the list versus how much effort others are willing to spend uncovering it.

The ACMA blacklist is already integrated into a few commercially available products. The problem is hiding the URLs from the user. Software hackers are a clever bunch: if your computer runs the software, then it is very possible to determine how to decrypt the URL list and extract it. So simply shipping the ACMA blacklist inside a product - encrypted or not - is never going to be secure. I believe this is how the ACMA blacklist was leaked to Wikileaks earlier in 2009.

There already exists a perfectly good way to distribute this sort of URL blacklist - eg Google SafeSearch. The ACMA could take the list of URLs, convert them to sets of MD5 strings to match against, and distribute that. They could distribute this openly - so anyone who wished to filter content based on the list could do so without having to pay the ACMA some stupid amount of money. Finally, it means that web site owners could compare their own URLs against the contents of the blacklist to see if any of their pages are on it. It may not be that feasible for very large sites with dynamic URLs - but it certainly is more feasible than what can be done today.
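
To make that concrete, here's a minimal sketch of what the publishing side could look like. The normalisation rules and file layout are my assumptions, not anything the ACMA or Google has specified; the point is simply that only hex digests ever leave the building.

    # Sketch: turning a cleartext URL blacklist into a distributable list of MD5
    # hashes. The normalisation rules (lowercase host, default path of "/", keep
    # the query string, drop the scheme and fragment) are assumptions - a real
    # scheme would have to nail these down so everyone hashes the same string.
    import hashlib
    from urllib.parse import urlsplit

    def normalise(url):
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        path = parts.path or "/"
        query = "?" + parts.query if parts.query else ""
        return host + path + query

    def hash_blacklist(urls):
        # One MD5 hex digest per normalised URL; this is all that gets published.
        return sorted(hashlib.md5(normalise(u).encode("utf-8")).hexdigest()
                      for u in urls)

    secret_list = ["http://example.com/nasty/page.html",
                   "http://example.org/other?id=42"]
    for digest in hash_blacklist(secret_list):
        print(digest)

A site owner could run their own URLs through the same normalise-and-hash step and check for collisions with the published digests - no cleartext list required.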

If the ACMA did this then I'd even write up a Squid plugin to filter against said ACMA blacklist. Small companies and schools could then use it for free. That would get the ACMA blacklist more exposure - which benefits the ACMA as much as it benefits anti-censorship advocates. More use would translate to a larger cross-section of visited web sites - so people would be more likely to discover if something which shouldn't be blocked suddenly appears on the blacklist.
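
As a rough illustration of how small that plugin could be: Squid already supports external ACL helpers, so a hashed list could be consumed by something like the sketch below. The file name, the squid.conf lines in the comment and the lack of URL normalisation are all simplifications on my part - a real helper would have to normalise URLs exactly the way the list publisher did, and handle Squid's concurrent helper protocol.

    #!/usr/bin/env python3
    # Sketch of a Squid external ACL helper: reads one URL per line on stdin and
    # answers OK (on the list) or ERR (not on the list) against a file of MD5
    # hex digests. Hypothetical wiring in squid.conf would look something like:
    #   external_acl_type acma_check %URI /usr/local/bin/acma_helper.py
    #   acl acma_blacklist external acma_check
    #   http_access deny acma_blacklist
    import hashlib
    import sys

    def load_hashes(path):
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    def main():
        hashes = load_hashes("blacklist.md5")   # assumed file of published digests
        for line in sys.stdin:
            url = line.strip()
            digest = hashlib.md5(url.encode("utf-8")).hexdigest()
            sys.stdout.write("OK\n" if digest in hashes else "ERR\n")
            sys.stdout.flush()

    if __name__ == "__main__":
        main()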

But is it truly secure? There's currently no practical way to take an MD5 string and turn it back into a URL. You could theoretically generate a set of URLs which hash to that MD5 string, but it would take a damned long time. So, for all practical purposes, the list can't be reverse engineered directly.

But what can be done is to log the URLs which match the filter and slowly build up a list of sites that way. Naughty people could then publish the set of URLs which match the blacklist rules. There's no technological method of avoiding that. If people discover a URL has been filtered, they may just share the link online.

The only real way the government has to counter sharing the cleartext URLs from the blacklist would be to make it illegal and enforce that law very strictly. This means enforcing it when naughty stuff is shared - but it also means that anyone who publishes URLs for content which should not be on the list may also get punished. That is a whole other debate.

So in summary - yes, the ACMA could publish the blacklist in a way that is more secure than the way it is distributed today. They could publish it - like Google does - to the public, so it can be integrated into arbitrary pieces of software. This may help it be more widely adopted and tested. But they will never be able to publish the list in a way that makes it impossible to identify and publish the cleartext URLs.

Let me be clear here - there is no technological method for restricting what information people can share between each other, and this includes URLs identified to be on the ACMA blacklist.

Sunday, December 20, 2009

On filtering proxy/anonymizing servers..

I'd like to briefly talk about anonymizing/proxy servers. These services act as gateways between the user (and their web browser, for example) and the general internet. They typically hide the user's real origin from the web site and the ISPs in question, so access cannot be easily traced. They are also useful diagnostic tools (eg to see whether web sites work from far-away networks.) Others use them to circumvent country-level filters which block access to "free-speech" and social networking web sites (eg in China, Iran, etc.)

I'm not going to talk about the legitimate and illegitimate uses of these. Plenty of other technologies are used and abused in nefarious ways, but we don't see the postal system implement mandatory filtering of letters; nor do we see (legal!) mandatory monitoring and filtering of the telephone/cellular network.

One common way of working around URL filters in the workplace, schools and libraries is to use an anonymizer/proxy service on the internet. This is how many schoolchildren log onto facebook and myspace. Their use is dangerous (as you're typically giving the service your facebook/myspace/hotmail/gmail/etc credentials!) but again, there are plenty of legitimate and safe uses for them.

The problem is constructing filters which block access through these anonymizer/proxy services. Some of them include the original URL in the request - those are relatively easy to block. Others encrypt or obfuscate the URL, so a normal filter won't work. There are plenty of tricks pulled here; describing them all would take a long time.
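
As a hypothetical example of the "URL embedded in the request" case: some web proxies carry the target URL in a query parameter, sometimes base64-encoded. The parameter name and encoding below are assumptions for illustration, not a description of any particular product.

    # Sketch: pulling the real target URL out of a hypothetical anonymizer request
    # that carries it base64-encoded in a "u=" query parameter. A filter could then
    # check the decoded URL against its blacklist instead of the proxy's URL.
    import base64
    from urllib.parse import urlsplit, parse_qs

    def embedded_target(request_url):
        qs = parse_qs(urlsplit(request_url).query)
        if "u" not in qs:
            return None
        try:
            return base64.b64decode(qs["u"][0]).decode("utf-8", "replace")
        except Exception:
            return None

    print(embedded_target(
        "http://proxy.example/browse.php?u=aHR0cDovL2V4YW1wbGUuY29tLw=="))
    # -> http://example.com/

Every proxy does this slightly differently (if it exposes the target at all), which is exactly why the per-service rules in option 1 below are such a maintenance burden.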

A growing number of these anonymizer/proxy services use SSL encryption to totally hide what is going on (ie, hiding not only the URL, but the content itself.) This is just not possible to break without some intrusive additions to the user's computer. Let's not go there.

So, there are really only a few ways to combat this:
  1. You create complicated rules for each anonymizer/proxy service which attempt to track and decode the URL, and filter on that; or
  2. You create complicated fingerprints to identify the types of traffic which indicate the use of an anonymizer/proxy service, and filter on that; or
  3. You just block any and all anonymizer/proxy sites.
The problems!
  • 1 is difficult and longwinded. A lot of effort has to be spent continuously updating the set of rules as new proxy services appear - many designed specifically to thwart this kind of filtering.
  • 2 is just as difficult and longwinded - and there's the risk that these fingerprints will identify legitimate sites as proxy services and filter their traffic incorrectly.
  • 3 is what the majority of current content filters do. They don't bother trying to filter what people are doing with anonymizer/proxy services; they just blanket filter all of them.
Now, as I've mentioned, plenty of new anonymizer/proxy services pop up every day. I'd hazard a guess and suggest that the majority of them are run by shady, nefarious people who see the value in logging your access credentials to popular webmail/social networking sites and selling them to third parties.

The real concern - I've seen more than one user log onto their internet banking and work sites using these anonymizer/proxy services because they're so used to using them, they forget not to. Imagine, for a moment, that gambling sites are blocked and users turn to anonymizer/proxy services to gamble online. They use their credit card details. Ruh roh.

This is another example of the arms race which filtering companies deal with every day. New anonymizer/proxy services are created constantly - many specifically to allow users to bypass country-level filtering - and many of them may be logging and selling your authentication credentials to third parties. Users will simply move to new anonymizer/proxy services as they crop up to work around whatever filtering is put in place. Keeping track of all of these sites takes a non-trivial amount of effort, and no one will ever be 100% effective.

A large amount of effort will be needed to filter these services and perfectly legitimate uses will be blocked.

You don't want to push users to begin using anonymizing/proxy services - that is a battle that you won't win.

Saturday, December 19, 2009

Filtering via BGP, and why this doesn't always quite work out..

Another interesting thing to look at is the increasingly popular method of filtering by using a BGP "exception list". I've heard this anecdotally touted by various parties as "the solution" (but nothing I can quote publicly in any way, sorry) but really, is it?

This employs a little bit of routing trickery to redirect traffic for the sites to be filtered via the proxy while passing the rest through untouched. This hopefully means that the amount of traffic and the number of websites which need to pass through the filter is a lot less than "everything."

This is how the British Telecom "cleanfeed" solution worked. They redirected traffic to be filtered via a bunch of Squid proxies to do the actual filtering. This worked out great - until they filtered Wikipedia for a specific image on a specific article. (I'd appreciate links to the above please so I can reference them.) Then everything went pear-shaped:
  • From my understanding, all the filtered requests to Wikipedia came from a single proxy IP, rather than properly pretending to be the client IP - this noticeably upset Wikipedia, who use IP addresses to identify potential spammers; and
  • The sheer volume of requests to Wikipedia going through the filtering service caused it to slow right down.
So that is problem number 1 - the approach looks like it will work fine on a set of rarely visited sites, but it may not work on a very busy site such as Wikipedia. Squid-based filtering solutions certainly won't work on the scale of filtering Youtube (at least, not without using the magic Adrian-Squid version which isn't so performance-limited [/advertisement].)

The next problem is determining which IP addresses to redirect. Websites may change their IP addresses often - or have many IP addresses! - so the list of addresses to redirect needs to be constantly updated. The number of routes which need to be injected into BGP is based on all of the possible IP addresses returned for each site - in the real world this varies from one to hundreds, and it may change frequently (the sketch after the following list illustrates the enumeration problem). This leads to two main potential issues:
  • Hardware routing devices (ie, the top-of-the-line ones which large ISPs use to route gigabits of traffic) have a limited number of "slots" for IP addresses/networks. The router typically stops working correctly when those run out. If you're lucky, the number of IP addresses being filtered will fit inside the hardware routing table. If you're unlucky, they won't. The big problem - different equipment from different vendors has different limitations, and upgrading this equipment can cost tens or hundreds of thousands of dollars.
  • The only traffic being filtered is traffic being redirected to the filter. If the list of IP addresses for a nasty website is not kept 100% up to date, the website will not be properly filtered.
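
Here's a small sketch of that enumeration problem. The hostnames are just examples; the point is that the answer you get depends on which resolver you ask and when you ask it, so the injected routes are always chasing a moving target.

    # Sketch: enumerating the addresses you'd have to inject into BGP in order to
    # redirect a single listed site through the filter. Each address becomes a
    # route pointing at the filtering proxies - and this is only what *this*
    # resolver returns *right now*.
    import socket

    def current_addresses(hostname):
        try:
            infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            return set()
        return set(info[4][0] for info in infos)

    for host in ("www.youtube.com", "upload.wikimedia.org"):   # example hostnames
        addrs = current_addresses(host)
        print(host, len(addrs), sorted(addrs))
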
The third main problem is filtering websites which employ Content Delivery Networks. This is a combination of the above two problems. So I'm going to pose the question:

How do you filter a web page on Google?

No, the answer here isn't "Contact Google and ask them to kill the web page." I'm specifically asking how one filters a particular web page on a very distributed infrastructure. You know: the kind of infrastructure which everyone is deploying these days. It may be something like Google/Yahoo; it may be hosted on a very large set of end-user machines in an illegally run botnet. The problem space is the same.
  • The IP addresses/networks involved in potentially serving that website are dynamic - you don't get an "easy" list of IPs when you resolve the hostname! For example, there are at least hundreds of potential hostnames serving Youtube streaming media content. It just isn't a case of filtering "www.youtube.com".
  • There are a number of services running on the same infrastructure as Youtube. You may get lucky and only have one website - but you also may get unlucky and end up having to intercept all of the Google services just to filter one particular website.
All of a sudden you will end up potentially redirecting a significant portion of your web traffic to your filtering infrastructure. It may be happy filtering a handful of never-visited websites; but then you start feeding it a large part of the internet.

In summary, BGP-based selective filtering doesn't work anywhere near as well as indicated in the ACMA report.
  • You can't guarantee that you'll enumerate all IP addresses involved for a specific website;
  • The ACMA may list something on a large website/CDN which will result in your filtering proxies melting; you may as well have paid the upfront cost in filtering everything in the first place;
  • The ACMA may list something with so many IP addresses that either your network infrastructure stops working, or the filter itself stops working.
Personally - I don't like the idea of the ACMA being able to crash ISPs because they list something which ISPs are just unable to economically filter. Thus, the only logical solution here is to specify a filtering infrastructure to filter -everything-.

How does a proxy interfere with throughput?

Another big question with the filtering debate is figuring out how much of an impact on performance an inline proxy filter will have.

Well, it's quite easy to estimate how much of an impact an inline server running Windows/UNIX will have on traffic. And this is important, because inline proxies are among the filtering mechanisms tested by the government and will be implemented by more than one of the ISPs.

Inline proxies are very popular today. They're used in a variety of ISP and corporate environments. Traffic is either forced there via configuration on users' machines, or redirected there transparently by some part of the network (eg a router, or transparent bridge.)

The inline proxy will "hijack" the TCP sessions from the client and terminate them locally. The client believes it is talking to the web server but it actually talking to the proxy server.

Then the inline proxy will issue outbound TCP sessions to the web server as requested by the user - and in some configurations, the web server will think it is talking directly to the client.
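
A toy sketch of that "terminate locally, re-originate outbound" model is below. It's deliberately naive - a single hard-coded origin, no HTTP parsing, no client-IP spoofing - and in a real deployment the network (WCCP, policy routing, iptables-style redirection) would be what steers traffic to the proxy, not this hard-coded wiring.

    # Toy relay: the client talks TCP to us, and we open our own TCP session to
    # the origin on the client's behalf - two separate connections, each with its
    # own buffers. ORIGIN_HOST and the listen port are placeholders.
    import socket
    import threading

    ORIGIN_HOST, ORIGIN_PORT = "www.example.com", 80
    LISTEN_ADDR = ("0.0.0.0", 3128)

    def pump(src, dst):
        # Copy bytes one way until the sender closes, then signal end-of-stream.
        try:
            while True:
                data = src.recv(65536)
                if not data:
                    break
                dst.sendall(data)
        except OSError:
            pass
        finally:
            try:
                dst.shutdown(socket.SHUT_WR)
            except OSError:
                pass

    def handle(client):
        upstream = socket.create_connection((ORIGIN_HOST, ORIGIN_PORT))
        threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
        pump(upstream, client)

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(LISTEN_ADDR)
    listener.listen(128)
    while True:
        conn, _ = listener.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()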

This is all relatively well understood stuff. It's been going on for 10-15 years; I was involved in some of the early implementations of this in Australia back in the mid-1990s. What isn't always well understood is how it impacts performance and throughput. Sometimes this doesn't matter, for a variety of reasons - the users may not have a large internet connection in the first place, or the proxy is specifically in place to limit how much bandwidth each user can consume. But these proxies are going to be used for everyone, some of whom will have multi-megabit internet connections. Performance and throughput suddenly matter.

I'll cover one specific example today - how inline proxies affect data throughput. There are ways that inline proxies affect perceived request times (ie, how long it takes to begin and complete a web request) which will take a lot more space to write about.

Each connection on the inline proxy - client-facing and server-facing - has a chunk of memory reserved to buffer data being sent and received. The throughput of a connection is, roughly speaking, limited by how big this buffer is. If the buffer is small, then you'll only get fast speeds when speaking to web sites that are next door to you. If the buffer is large, you'll get fast speeds when speaking to web sites that are overseas - but only if they too have large buffers on their servers.

These buffers take memory, and memory is a finite commodity in a server. Just to give you an idea - if you have 1GB of RAM assigned for "network buffers", and you're using 64 kilobyte buffers for each session, then you can only hold (1 gigabyte / 64 kilobytes) sessions - ie, 16,384 sessions. This may sound like a lot of sessions! But how fast can you download with a 64 kilobyte buffer?

If you're 1 millisecond away from the webserver (ie, it's on the same LAN as you), then that 64 kilobyte buffer will give you (64 / 0.001) kilobytes a second - or ~64 megabytes a second. That's 512 megabits. Quite quick, no?

But if you're on DSL, your latency will be at least 10 milliseconds on average. That's 6.4 megabytes a second, or 51.2 megabits. Hm, it's still faster than ADSL2, but suddenly it's slower than the bandwidth the NBN is going to give you.

Say you're streaming from Google. My Perth ISP routes traffic to/from Google in Sydney for a few services. That's 53 milliseconds. With 64 kilobyte buffers, that's (64 / 0.053), or 1207 kilobytes/second. Or, around a megabyte a second. Or, say, 8-10 megabits a second. That isn't even ADSL2 speed (24 megabits), let alone NBN speeds (100 megabits.)

So the operative question here is - how do you get such fast speeds when talking to websites in different cities, states or countries to you? The answer is quite simple. Your machine has plenty of RAM for you - so your buffers can be huge. Those streaming websites you're speaking to build servers which are optimised for handling large buffered streams - they'll buy servers which -just- stream flash/video/music, which handle a few thousand clients per server, and have gigabytes of RAM. They're making enough money from the service (I hope!) to just buy more streaming servers where needed - or they'll put the streaming servers all around the world, closer to end-users, so they don't need such big buffers when talking to end users.
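
The arithmetic above boils down to one rule of thumb: single-stream throughput is roughly the window (buffer) size divided by the round-trip time. Here it is as a few lines you can play with; the US round-trip time of 200 ms is my assumption, the other figures are the ones used above.

    # Throughput of a single TCP session, approximated as buffer size / RTT.
    def throughput_mbit(buffer_bytes, rtt_seconds):
        return buffer_bytes / rtt_seconds * 8 / 1_000_000

    BUFFER = 64 * 1024   # 64 kilobyte buffer, as in the examples above
    for label, rtt in (("same LAN", 0.001), ("DSL", 0.010),
                       ("Perth -> Sydney", 0.053), ("Perth -> US (assumed)", 0.200)):
        print("%-22s %6.1f Mbit/s" % (label, throughput_mbit(BUFFER, rtt)))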

What does this all mean for the performance through a filtering proxy?

Well, firstly, the ISP filtering proxy is going to be filtering all requests to a website. So, it'll have to filter all requests (say) to Youtube, or Wikipedia. This means that all streaming content is potentially passing through it. It's going to handle streaming and non-streaming requests for the websites in question.

So say you've got a filtering proxy with 16GB of RAM, and you've got 64 kilobyte buffers. You have:
  • In the worst case, 262,144 concurrent sessions (16 gigabytes / 64 kilobytes) going through the proxy before you run out of network buffers. You may fit more sessions if there aren't many streaming/downloading connections using their full buffers, but that's the guaranteed floor you need to plan around.
  • Actually, it's half of that - as you have a 64 kilobyte buffer for transmit and a 64 kilobyte buffer for receive. So that's 131,072 concurrent sessions.
  • If you're lucky and the streaming site is on a LAN, and you're on a LAN to the proxy - you'll get ~100 Mbit.
  • If you're on ADSL (10 milliseconds) from the proxy - you'll get 6.4 megabytes/second, or 51 megabits/sec from the proxy.
  • If you're on NBN (1 millisecond, say) from the proxy - you'll get 64 megabytes/second, or 512 megabits from the proxy.
  • BUT - if the proxy is 50 milliseconds from the web server - then no matter how fast your connection is, you're only going to get maximum (65536 / 0.050) bytes/sec, or 1.2 megabytes/second, or 12 megabits/second.
  • And woe betide you if you're talking to a US site (call it 200 milliseconds away). No matter how fast your connection is, the proxy will only achieve speeds of 320 kilobytes/sec, or 2.5 megabits. Not even ADSL1 speed.
The only way to increase the throughput your proxy has is to use larger buffers - which means either packing much more RAM into a server, or limiting the number of connections you can handle, or buying more servers. Or, if you're very unlucky, all of the above.
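
Pulling those numbers together, a rough capacity-planning sketch looks like the following. Fixed 64 kilobyte buffers per direction are assumed, as in the bullet points above; an auto-tuning stack will usually do better, but this is the floor you budget for.

    # Rough proxy sizing: how many sessions fit into a fixed pool of buffer RAM,
    # and what each session can carry on its client-side and server-side legs.
    def sessions(ram_bytes, buffer_bytes, directions=2):
        return ram_bytes // (buffer_bytes * directions)

    def leg_mbit(buffer_bytes, rtt_seconds):
        return buffer_bytes / rtt_seconds * 8 / 1_000_000

    RAM = 16 * 2**30    # 16 GB of network-buffer RAM
    BUF = 64 * 2**10    # 64 KB per direction

    print("concurrent sessions:", sessions(RAM, BUF))          # 131072
    client_leg = leg_mbit(BUF, 0.010)    # ADSL user, ~10 ms to the proxy
    server_leg = leg_mbit(BUF, 0.050)    # origin server ~50 ms from the proxy
    # The end-to-end rate of a session is capped by the slower of its two legs.
    print("end-to-end cap: %.1f Mbit/s" % min(client_leg, server_leg))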

Now, the technical people in the know will say that modern operating systems have auto-tuning buffers - you only need big buffers for distant connections, rather than for all connections. And sure, they're right. This means the proxy will handle more connections and obtain higher throughput. But the question now is how you design a proxy service which meets certain goals. Sure, you can design for the best case - and things like auto-tuning buffers are certainly good for raising best-case performance. But the worst-case performance doesn't change. If lots of international streaming sessions suddenly need to be filtered (because, say, some very popular US-centric website gets listed over one particular video), performance will suddenly drop to the worst-case scenario, and everyone suffers.

Now, to blow my own trumpet a bit - when I design Squid/Lusca web proxy solutions, I always design for the worst case, and my proxies work better for it. Why? Because I make clear to the customer that the worst case is what we should be designing and budgeting for, and that new equipment should be purchased based on that. The best-case performance is just extra headroom during peak periods. That way clients are never surprised by poorly performing and unstable proxies, and the customer knows exactly when to buy hardware. (They can then choose not to buy new proxies and save money - but then, when you're saving $100k a month on a $6k server, buying that second $6k server to save another $100k a month suddenly makes a lot of sense. Skimping on $6k and risking the wrath of your clients isn't appealing.)

Wednesday, December 16, 2009

People who don't understand an arms race aren't doomed to repeat it...

This article is amusing. Apparently geeks can build Napster to circumvent "stuff" so geeks should be able to build a better RC filter.

Here's some history for you.

"Stuff" initially was "we're already sharing files via DCC on the Internet Relay Chat system (IRC); let's make an indexed, shiny, graphical, automated version of that!" It wasn't to circumvent any kind of censorship or filtering, and it wasn't a great leap of imagination. It was a small, incremental improvement over what existed. The only reason you think it was a big leap for a lone teenager is that Napster popularised file sharing. It made it easy for the average teenager to do.

Secondly, there are most likely individuals and companies profiting off the construction and use of non-web-based distribution of RC materials. Filtering web traffic won't stop this distribution - it will simply stop the web distribution of RC materials. The filtering technology will quickly grow to try to cover those channels too, and then new tools will appear to circumvent the filter. This is a classic arms race, pure and simple.

The only people who profiteer from an arms race are the arms dealers. In this case, the arms dealers are the companies developing tools to distribute the material, and companies developing tools to filter the material.

The astute reader should draw a parallel between what I've described and malware/viruses versus anti-virus software. Why is it we can't filter viruses 100%? Because there's money to be made in both writing the nasty software and filtering the nasty software. The end-users end up paying the price.

This censorship nonsense will suffer the same fate.

Why would more than 10,000 URLs be a problem?

I'm going to preface this (and all other censorship/filtering related posts) with a disclaimer:

I believe that mandatory censorship and filtering is wrong, inappropriate and risky.

That said, I'd like others to better understand the various technical issues behind implementing a filter. My hope is that people come to grips with the real technical issues rather than simply restating others' potentially misguided opinions.

The "10,000 URL" limit is an interesting one. Since the report doesn't mention the specifics behind this view, and I can't find anything about it in my simple web searching, I'm going to make a stab in the dark.

Many people who implement filters using open source tools such as Squid will typically implement them as a check against a list of URLs. This searching can be implemented via two main methods:
  1. Building a list of matches (regular expressions, exact-match strings, etc) which is compared against; and
  2. Building a tree/hash/etc to match against in one pass.
Squid implements the former for regular expression matching and the latter for dstdomain/IP address matching.

What this unfortunately means is that full URL matching with regular expressions depends not only on the complexity of the regular expression, but the number of entries. It checks each entry in the list in turn.

So when Squid (and similar) software is used to filter a large set of URLs, and regular expressions are used to match against them, it is quite possible that there will be a practical limit on how many URLs can be included before performance degrades.
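
A quick sketch of the difference, using an arbitrary 10,000-entry list (the actual thresholds depend entirely on the patterns and the hardware): a regex list is executed entry by entry per request, while a hashed domain lookup is a single probe regardless of list size.

    # Compare scanning a list of regexes (worst case: every entry executed) with
    # a single set lookup on the hostname. 10,000 entries is just for illustration.
    import re
    import time

    patterns = [re.compile(r"^http://site%d\.example/bad/" % i) for i in range(10000)]
    domains = set("site%d.example" % i for i in range(10000))

    url = "http://site9999.example/bad/page.html"
    host = "site9999.example"

    start = time.perf_counter()
    any(p.search(url) for p in patterns)     # linear scan over ~10,000 regexes
    linear = time.perf_counter() - start

    start = time.perf_counter()
    host in domains                          # one hash-table probe
    hashed = time.perf_counter() - start

    print("regex list: %.1f us, set lookup: %.3f us" % (linear * 1e6, hashed * 1e6))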

So, how would one work around it?

It is possible to combine regular expression matches into one larger rule, versus checking against many smaller ones. Technical details - instead of /a/, /b/, /c/; one may use /(a|b|c)/. But unfortunately not all regular expression libraries handle very long regular expressions so for portability reasons this is not always done.
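
In miniature, the combining trick looks like this; whether it actually helps, and how long a pattern a given library will tolerate, is exactly the portability question raised above.

    # Fold many small patterns into one alternation so the engine makes one pass.
    import re

    small = [r"http://a\.example/x", r"http://b\.example/y", r"http://c\.example/z"]
    combined = re.compile("|".join("(?:%s)" % p for p in small))

    print(bool(combined.search("http://b.example/y?id=1")))   # True, in one pass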

Squid at least doesn't make it easy to match on the full URL without using regular expressions. Exact-match and glob-style matching (eg, http://foo.com/path/to/file/*) would work very nicely here. (I should write that for Squid/Lusca at some point.)

A google "SafeSearch" type methodology may be used to avoid the use of regular expressions. This normalises the URL, breaks it up into parts, creates MD5 hashes for each part and compares them in turn to a large database of MD5 hashes. This provides a method of distributing the filtering list without specifically providing the clear-text list of URLs and it turns all of the lookups into simple MD5 comparisons. The downside is the filtering is a lot less powerful than regular expressions.

To wrap up, I'm specifically not discussing the effectiveness of URL matching and these kinds of rules in building filters. That is a completely different subject - one which will typically end with "it's an arms race; we'll never really win it." The point is that it is possible to filter requests against a list of URLs and regular expressions far, far larger than a low arbitrary limit like 10,000 entries.

.. summary from Retro Night, take #2

Gah, I deleted the wrong post. Typical me.

Three of us got together on Tuesday night to resurrect some Amiga hardware. In summary, we have some working machines, one sort-of-working machine, and a few bad floppy drives. The prize thus far is a working Amiga 2000 with a few megabytes of RAM expansion, a not-yet-working SCSI-and-drive expansion card (ghetto indeed!), a video genlock device to overlay graphics on a PAL/NTSC signal, and a functional Amiga 1200 with extra goodies.

The aim now is to get an environment working enough to write Amiga floppy images out so we can start playing some more games. I'm hoping the Amiga 1200, when paired with a floppy drive and some form of MS-DOS readable flash device, will fit that bill reasonably nicely.

More to come next week.