When WLM got its new voice call feature, I remember being quite excited. Which protocols would they be using this time, which codecs, and so on? I was quite surprised to learn that they were using SIP, RTP and ICE this time around, instead of all-proprietary protocols for both signaling and transport like most earlier attempts at similar features. There was one disappointment, though: if you’re not running Windows, and thus not able to use WLM, you won’t get the same experience as those using the official client. The reason is simple: they get to use an adaptive wideband codec that works well over lossy networks, and you don’t. If you want to be a first-class citizen in the Messenger world, you need to be running Windows.
As much fun as it would be to reverse-engineer this codec, I sadly don’t have enough spare time for that these days; I’m no longer a student with virtually infinite amounts of spare time like I was back in the libmimic days. I still have this passion for reverse-engineering though, and I find it a great recreational activity in the life part of the work/life balance. So I sat down for a couple of hours one late evening with a fresh cup of Yerba Mate tea, poked around with IDA, and worked out the internal API of this codec inside the appropriate binary. I also learned that all the supported audio codecs share the same interfaces. Of course there aren’t any exported functions exposing them, but that’s just esthetics anyway. Just load the DLL, scan through its memory for the signatures of what you need, declare some function pointers, wrap any structures you might need and write some assembly glue, and you’re all set. Use the wineloader approach used by various media players, and you can even run the code on a non-Windows OS, given that it’s x86.
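To give an idea of what "scan through its memory for the signatures" can look like, here is a minimal sketch of a byte-pattern search with wildcards. It is not the code from the plugin: the function name find_signature and the 'x'/'?' mask convention are assumptions made for illustration, and a real scanner would run over the mapped sections of the loaded DLL rather than a plain buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the plugin.
 * Returns the offset of the first match of `sig` inside `image`, or -1 if
 * there is none. A mask character of 'x' means "this byte must match";
 * any other character (here '?') is a wildcard, which is handy for bytes
 * that hold absolute addresses and therefore differ between builds. */
static ptrdiff_t
find_signature (const uint8_t * image, size_t image_size,
    const uint8_t * sig, const char * mask, size_t sig_size)
{
  size_t i, j;

  if (sig_size == 0 || sig_size > image_size)
    return -1;

  for (i = 0; i + sig_size <= image_size; i++) {
    for (j = 0; j < sig_size; j++) {
      if (mask[j] == 'x' && image[i + j] != sig[j])
        break;
    }
    /* the inner loop ran to completion, so every non-wildcard byte matched */
    if (j == sig_size)
      return (ptrdiff_t) i;
  }

  return -1;
}
```

The returned offset, added to the module's base address, is what you would then cast to one of the declared function pointers.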
So what I did was put together a GStreamer plugin that would wrap all the encoders and decoders dynamically, just like gst-ffmpeg does with the codecs provided by FFmpeg. I first got this plugin working on Windows, and then ported it to Linux by using the wineloader code of gst-pitfdll as a starting point.
Plain encoding and decoding isn’t enough though; you also need error concealment to work well over lossy networks. Reverse-engineering this interface was more work, as I had to figure out some internals of their RTP stack in order to know more about the data structures and what the different fields meant. I also had to inherit from an internal C++ base class, which was quite a fun experience. It’s all very simple though: it’s just a matter of figuring out the size of the class, inheriting from it the same way as you would with GObject/C, and writing some wrapper functions to emulate thiscall using stdcall, meaning that you put the this pointer into ecx and pass the rest of the arguments on the stack.
I implemented it by having a header file with, for example:
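As a rough illustration of the "inherit the same way as you would with GObject/C" part, here is a portable C sketch. Everything in it (Base, Derived, the two virtual methods) is made up for the example and has nothing to do with the real class layout. It also passes the this pointer as an ordinary first argument, so it only models the object layout and the vtable dispatch, not the thiscall register convention.

```c
#include <stddef.h>

typedef struct _Base Base;
typedef struct _BaseVTable BaseVTable;

/* Hypothetical vtable: one function pointer per virtual method, in
 * declaration order. */
struct _BaseVTable
{
  int (*get_id) (Base * self);
  int (*process) (Base * self, int value);
};

/* An object with virtual methods starts with a pointer to its vtable.
 * In the real case the remaining members would be opaque padding of the
 * size worked out from the binary. */
struct _Base
{
  const BaseVTable * vtable;    /* offset 0 */
  int id;
};

/* GObject-style inheritance: the parent is the first member, so a pointer
 * to Derived is also a valid pointer to Base. */
typedef struct
{
  Base parent;
  int gain;
} Derived;

static int
derived_get_id (Base * self)
{
  return self->id;
}

static int
derived_process (Base * self, int value)
{
  Derived * derived = (Derived *) self;
  return value * derived->gain;
}

static const BaseVTable derived_vtable = { derived_get_id, derived_process };

static void
derived_init (Derived * derived, int id, int gain)
{
  derived->parent.vtable = &derived_vtable;
  derived->parent.id = id;
  derived->gain = gain;
}

/* Flat C wrapper that dispatches through the vtable. */
static int
base_process (Base * self, int value)
{
  return self->vtable->process (self, value);
}
```

Code that only knows about Base can call base_process() on a Derived instance and ends up in derived_process(), which is all that is needed for the foreign code to call back into your subclass.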
typedef struct _MSEncoder MSEncoder;
…
HRESULT WINAPI ms_encoder_set_bitrate (MSEncoder * encoder, guint bitrate);
Here WINAPI is defined to the compiler-specific attribute that specifies the stdcall calling convention. Now comes the interesting part: the C wrapper, which for MSVC looks like:
__declspec(naked) HRESULT WINAPI
ms_encoder_set_bitrate (MSEncoder * encoder, guint bitrate)
{
  INVOKE_VFUNC (16);
}
The naked attribute tells the compiler not to generate any prolog and epilog, meaning that you’re responsible for setting up the stack frame if you want one. In my case I don’t want one, as I simply want to put the first argument in ecx and shift the return address one level up on the stack.
So the macro INVOKE_VFUNC is simply:
#define INVOKE_VFUNC(func_offset) \
  __asm { \
    /* fill in ecx with 'this' pointer */ \
    __asm mov ecx, [esp + 4] \
    \
    /* put return address where 'this' pointer was */ \
    __asm pop edx \
    __asm mov [esp], edx \
    \
    /* call the function in the vtable */ \
    __asm mov edx, [ecx + 0] \
    __asm mov edx, [edx + func_offset] \
    __asm jmp edx \
  }
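If the stack juggling in the first three instructions is hard to picture, the following plain C model may help. It is only a toy: the Cpu struct and shuffle_for_thiscall are invented for this illustration, the stack is an array of 4-byte slots, and sp stands in for esp.

```c
#include <stdint.h>

/* Toy model of the relevant machine state (hypothetical, for illustration). */
typedef struct
{
  uint32_t stack[8];            /* stack[sp] is the top of stack, i.e. [esp] */
  unsigned sp;                  /* stand-in for esp, counted in 4-byte slots */
  uint32_t ecx;
  uint32_t edx;
} Cpu;

/* On entry (stdcall):  [esp] = return address
 *                      [esp + 4] = this
 *                      [esp + 8] = first real argument
 * On exit (thiscall):  ecx = this
 *                      [esp] = return address
 *                      [esp + 4] = first real argument */
static void
shuffle_for_thiscall (Cpu * cpu)
{
  cpu->ecx = cpu->stack[cpu->sp + 1];   /* mov ecx, [esp + 4] */

  cpu->edx = cpu->stack[cpu->sp];       /* pop edx */
  cpu->sp += 1;

  cpu->stack[cpu->sp] = cpu->edx;       /* mov [esp], edx */
}
```

After the shuffle the stack looks exactly as if the caller had made a thiscall with one stack argument, so the final jmp lands in the virtual method with everything where it expects it, and the method's own return goes straight back to the original caller.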
As gcc doesn’t support naked functions on x86, I wrote some fairly self-explanatory assembler code for the GNU Assembler. Naked functions are actually quite beautiful; I really hope that gcc will support them on x86 one day.
So, if you want to play with the code you can browse it here (the plugin is in the “ext/mscodecs” subdirectory):
http://bazaar.launchpad.net/~oleavr/oabuild/gst-plugins-farsight/files
Or better:
bzr branch lp:~oleavr/oabuild/gst-plugins-farsight
To build just the plugin and play with it without installing it, simply do:
./configure
cd ext/mscodecs/
make
If you have a Windows installation with Messenger handy, copy RTMPLTFM.dll from the installation directory into /usr/local/lib/win32. However, if you don’t, make sure you accept their EULA and follow the instructions in the helper/README file.
Now to play with it:
export GST_PLUGIN_PATH=./.libs
gst-inspect-0.10 mscodecs
sender:
gst-launch-0.10 audiotestsrc samplesperbuffer=320 is-live=true ! msenc_rta16 bitrate=29000 ! rtprtaudiopay pt=114 ! udpsink
receiver without error concealment:
gst-launch-0.10 udpsrc ! application/x-rtp, payload=114, clock-rate=16000 ! rtprtaudiodepay ! msdec_rta16 ! alsasink
receiver with error concealment (experimental, very alpha):
gst-launch-0.10 udpsrc ! application/x-rtp, payload=114, clock-rate=16000 ! msrtahealer ! alsasink
(Add lossrate=15.0 to the encoder on the sender launch line to enable FEC in case you wanna test it on a lossy network or with “identity drop-probability=x” before the udpsink. Any value greater than 10.0 makes the encoder enable maximum FEC.)
It should now be really easy to set up an RTAudio voice call with a Messenger client using Farsight 2 — any takers? 🙂