Very interesting news: Apple’s new iPhone 4 application, FaceTime, is a VoIP and IP video application using SIP signaling and RTP media. Security researcher and SANS Instructor Josh Wright has posted a detailed analysis of the FaceTime application on Packetstan, a new blog developed by his SANS and InGuardians colleagues. A few quick summary points from Josh’s analysis and from our own quick look at the phone so far:
- The iPhone FaceTime client doesn’t use SIP REGISTER for authentication
- Uses STUN for NAT traversal and for resolving the IP addresses of the calling and called iPhones
- After the remote party’s IP address has been resolved, SIP INVITE and MESSAGE packets are exchanged directly between iPhone devices
- Cleartext SIP and RTP
- RTP video appears to use the H.264 video codec, and audio appears to use the AAC-LD codec (the same audio codec used in Cisco TelePresence).
- FaceTime uses XMPP to authenticate each iPhone to an Apple Jabber server, using TLS and mutual certificate authentication.
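To make the "cleartext RTP" point concrete, here is a minimal Python sketch of reading the standard 12-byte fixed RTP header (as defined in RFC 3550) from a captured packet. The names `RtpHeader` and `parse_rtp_header` are my own, and the sample bytes are made up for illustration; they are not taken from a FaceTime capture.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (RFC 3550) from a UDP payload."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Two flag bytes, 16-bit sequence number, 32-bit timestamp, 32-bit SSRC,
    # all in network byte order.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Hypothetical packet: version 2, marker set, dynamic payload type 96.
sample = bytes([0x80, 0xE0, 0x00, 0x01]) + struct.pack("!II", 1000, 0xDEADBEEF)
header = parse_rtp_header(sample)
print(header)
```

Because the media is not encrypted, anyone who can capture the UDP stream can read these fields directly and group packets by SSRC and payload type.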
Josh and I were discussing this the other day because he was trying to use the ‘VideoSnarf’ tool to reconstruct the H.264 encoded media packets. The codec does appear to be H.264, but with some slightly modified reserved fields. Right now this isn’t working, but we hope to have an updated version of VideoSnarf that works with FaceTime traffic in the near future.
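For readers who want to poke at the video payloads themselves, the sketch below shows how the first byte of a standard H.264 RTP payload is laid out according to RFC 6184. It only illustrates the standard layout; we have not yet pinned down which fields FaceTime modifies, so nothing here describes FaceTime’s actual behavior. The function names are my own.

```python
def classify_nal_type(nal_type: int) -> str:
    """Map an RFC 6184 NAL unit type value to its packetization category."""
    if 1 <= nal_type <= 23:
        return "single NAL unit"
    if 24 <= nal_type <= 27:
        return "aggregation packet"
    if nal_type in (28, 29):
        return "fragmentation unit"
    return "reserved"  # 0, 30 and 31 are reserved in RFC 6184


def parse_nal_header(payload: bytes) -> dict:
    """Split the first payload byte into its F, NRI and type fields."""
    if not payload:
        raise ValueError("empty RTP payload")
    first = payload[0]
    nal_type = first & 0x1F
    return {
        "forbidden_zero_bit": first >> 7,      # must be 0 in a valid stream
        "nal_ref_idc": (first >> 5) & 0x03,    # reference importance
        "nal_unit_type": nal_type,
        "kind": classify_nal_type(nal_type),
    }


# 0x67 is a typical first byte for a sequence parameter set (type 7).
sps = parse_nal_header(bytes([0x67]))
# 0x7C carries type 28, a fragmentation unit (FU-A).
fua = parse_nal_header(bytes([0x7C]))
print(sps, fua)
```

A payload whose first byte lands in the "reserved" category, or that sets the forbidden bit, would be one quick sign that a stream departs from the standard packetization.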
This is so interesting because it is the first SIP client on a widely deployed consumer smartphone that is developed and supported by a vendor such as Apple. I think it signals that we are going to see more of these applications, and these are exciting times. It will be interesting to see how other vendors follow Apple’s lead with two-camera video clients on smartphones using VoIP protocols. I am sure many others will be taking a closer look at FaceTime. The attack surface for potential exposures is very interesting, as are the security measures that could be applied to protect FaceTime traffic.
It will be interesting to see the complete NAT traversal algorithm. For example, what does it do when both phones are behind the same firewall, or when both phones are behind symmetric NATs? In other words, are they using the ICE framework?
So, what are they doing with the reserved fields? Is it really standard H.264, or are those reserved bits an indication that they’re running some proprietary extension?
Interesting they promoted it as being SRTP-encrypted at the launch, and it obviously isn’t… I wonder how they are bypassing CALEA, since they’re connected to the PSTN (and thus covered by CALEA even if it’s an on-net (IP) call), and point-to-point media is not compliant with CALEA (unfortunately).
Thanks for the comment. At this point I’m not entirely clear on the H.264 reserved fields and what they signify, but I can let you know as soon as we have had time to research this. We will definitely release a new version of VideoSnarf that can output the media files if we can decode them.