How is Aircell blocking VoIP phone calls from systems like Skype, SightSpeed and Gizmo? (And how did Andy get through with Phweet?)
Ever since last week’s announcement of the “Gogo Inflight Internet Service” provided to American Airlines by a company called Aircell – and the ensuing coverage in the blogosphere – I’ve been getting asked about how exactly Aircell is blocking VoIP calls. Especially after Andy Abramson was able to make a call using Phweet. Aircell is very clear in the Gogo terms of service that no voice calls are allowed:
No Voice Applications. You will not use any type of voice application (including, without limitation, voice over Internet protocol) without written permission from Aircell.
And early users of the system who tried VoIP calls reported that indeed after about 5 seconds or so, their VoIP conversation was terminated. Repeatedly. They could use Skype, for instance, for IM, but not for voice.
So how is Aircell blocking VoIP? (And how did Andy get through with Phweet?)
Unfortunately, I was a wee bit busy last week when all this was breaking, so it’s taken me until now to come up for air enough to write about how Aircell could be blocking VoIP calls on the planes.
THE BIG CAVEAT
First, I should state up front – I have absolutely NO connection to Aircell, Gogo, American Airlines, etc.. I don’t know exactly how they are blocking VoIP calls but am laying out here how they could be blocking VoIP calls.
[I also feel compelled to say that I personally think it's silly of Aircell to block VoIP because there will inevitably be people who figure out ways to route around the blocking. I think Aircell's excuse that they want to block people from talking loudly is also rather lame. I've been on long flights where I was trying to sleep and I've had two people talking very loudly in the seat behind me or next to me. (And yes, sometimes I've asked them to please speak quieter.) Other times I've had babies screaming and crying for most of the flight... or other "energetic" children carrying on. No, I don't really want my neighbor to be in an involved VoIP call but my point is that there are disruptions already. Part of me wonders if there aren't really more technological issues with doing streaming audio/video, but anyway... that's their policy and if you want to use their service, you have to agree. You don't (yet) have a choice in services to use, and you probably won't.]
LET’S PLAY “SPOT THE VOIP PHONE CALL”
Given that Skype conversations in particular are encrypted, I’ve had several people ask me if this means that Aircell can decrypt Skype calls. How else, they ask, can Aircell differentiate between Skype text chat and Skype voice chat? It’s simple really:
A VoIP call in progress has a distinct profile from a network packet point-of-view. In general, the audio streams of VoIP calls (as compared to the call control/signalling channels) have the following characteristics:
- there are a zillion small packets
- the packets are sent over UDP versus TCP
- the packets are sent using the Real-time Transport Protocol (RTP) over UDP
Now with Skype’s encryption, network software can’t know about the contents of the audio stream, so the software can’t know about my #3 here, but the software can recognize the pattern based on the first two. For other non-encrypted services, the software can very simply look for RTP streams.
Now, why do VoIP systems use a zillion small packets? Typically, a VoIP system will sample the speaker’s voice at an interval of every 10, 20 or 30 milliseconds. Most I’m familiar with seem to go for a 20ms rate. So 20ms of audio is captured, encoded digitally and sent off in an IP packet. The exact size of the IP packet will vary depending upon what codec is used to encode the audio. The standard G.711 packets will be at one size and G.729 will be much smaller. Many of the VoIP streams I’ve captured in network traces have a total packet length of somewhere between 35-70 bytes.
To put that in perspective, understand that an Ethernet packet can have typically a max size of 1500 bytes. And packets sent by various protocols can be even larger and will be “fragmented” into smaller pieces (for instance, 1500-byte pieces) to be moved across the network. [Network geeks: Please give me some poetic license here... I realize I could be more precise, but I'm trying not to completely bore the readers.]
The point is that voice packets are typically tiny in comparison to other packets – and there are a lot of them.
How many? Well, if you take a 20ms sampling rate, that means that you are sampling the audio voice 50 times each second… so that’s 50 packets per second for one audio stream. Almost every voice conversation involves two audio streams (one from the caller, one from the recipient) and so you are looking at 100 packets per second for a typical two-way VoIP conversation.
The reason for this is relatively simple. In a file transfer, you are looking to move a file across the network as fast as possible – but you aren’t necessarily in “real-time”. So you generally will stuff the packets with as much info as you can and push them across the network. With voice, we are making use of the fact that the human ear will deal with some lost audio, and so we are chopping up the audio up into a zillion tiny pieces, tossing them in unreliable UDP packets and hoping that enough get there so that the listener can make sense of the conversation.
Put another way, if I wanted to send you this blog post via snail-mail, I could print it out, stick it in an envelope and mail it to you. That’s file transfer. On the other hand, if I were to chop this blog post up and write each word on a post card and stick them in the snail-mail to you, odds are that enough would arrive at the other end that you could assemble them into something like this post. (With luck, maybe you could even make the post shorter!)
So once you know the pattern, it’s fairly easy to spot the calls. For instance, where’s the VoIP call in this network trace?
Do you see it? How about this trace with a filter turned on to show UDP in red?
Ta da… there’s your VoIP call!
Let’s look at this one in a bit more detail:
There it is… all in UDP… and coming in at about 100 packets per second. And if I look at the actual Wireshark traces, I can see that these 100 packets per second are all very tiny sizes. Many of them are between 37 and 50 bytes.
And this is an encrypted Skype call!
No need to decrypt it. Just see that it’s a steady stream of 100 very small packets per second (50 packets per second each way) all over UDP.
Kill the stream. Block it. Conversation dead. No more VoIP on the plane.
It’s basically the network security version of Whack-A-Mole. See a VoIP stream start up… block it. See another one… block it. See yet another… block it. Whenever anything pops up that meets the profile, stomp on it.
This explains, too, why people could talk for a few seconds and then had their conversations terminated. The pattern has to appear in the network monitoring software. The software has to be sure it’s a VoIP stream and not something else… and then the software can block it.
Now I don’t know for a fact that this is how Aircell is blocking VoIP, but it would be easy enough to do it this way.
GETTING MORE SOPHISTICATED
There are, of course, easier ways to kill the conversations. If the VoIP calls use unencrypted audio streams then it’s incredibly trivial to block. Just block the RTP protocol. Period. End-of-story. Now this does involve a hair more packet inspection in that the software has to look farther into the packet headers to see the protocol type, but again this is easy to do. All RTP is typically used for is streaming audio or video… block it and the “problem” goes away.
They could of course go even further into packets and see if they are Session Initiation Protocol (SIP) packets and if so, what the packets are asking to do. If one endpoint sends a SIP INVITE packet to another endpoint and indicates that an audio conversation is to start, the software could again simply block the impending audio stream. (Of course this couldn’t be done if the SIP was encrypted…)
The software could also simply block ports. Block any usage of port 5060 or 5061 and you would probably kill off most “regular” SIP conversations. (Yes, SIP endpoints can make connections on non-standard ports, but the majority of clients probably wouldn’t.) The challenge here is that some SIP endpoints also would use SIP to set up non-audio communication channels like text chat, so blocking all SIP would also block SIP-based text chat which is probably not desirable.
The software could also block on a service level… if it knew, for instance, the host names or IP addresses for the media servers and consumer services (through which the audio would be sent), the software could block all connections to those media servers.
There’s a whole range of additional layers the network monitoring software could use. Any good system will have a “defense in depth” strategy and make use of many of these different algorithms.
Of course, adding on these layers does require more computing power and will undoubtedly add some latency (even on a microscopic level). It may be that for right now they can simply do the pattern recognition approach and shut down VoIP calls.
SO HOW DID ANDY DO IT?
Okay, so how did Andy make a call using Phweet? Given that this post has already gone on this long, I’ll publish my guess in a subsequent post. The text above should give you enough clues, though… any pattern recognition system is inherently fragile because it depends upon recognizing patterns. So what if the audio packets you are sending don’t match any known patterns? What if (hint) the folks at Aircell forgot to watch all protocols?
Stay tuned for some more network charts in my next post…
P.S. This is, by the way, why I think that these type of systems trying to block VoIP calls are inherently doomed… someone will inevitably find a way to “cloak” their VoIP calls so that they are unrecognizable or indistinguishable from other data traffic… it’s a cat and mouse game and inevitably people will find ways to get around the watchers…