Hand Cannot Erase - Realtime sound spatialization using hand gestures

This monsoon semester I picked up a really interesting course, Digital Audio (DES514 @ IIITD). Although it had been ages since I'd done anything related to music (seriously), the realization that it's my last semester made me have the whole "screw grades let's pick up something fun" talk in my head.

P.S. This also led me to pick up a philosophy course.
P.P.S. The semester didn't have a happy ending.

Anyway, coming back to the topic. Probably the most intense part of the course was the project, which Anant and I did together. We wanted to do something related to audio modulation (obviously), but not just stick to computational manipulations.

Before I delve further,

  • TL;DR We created an interface that allows a user to move a sound source around a ring of speakers using their hand
  • A quick demo video can be found here
  • All code can be found in this repo

Although it is really tough to capture this project on camera (since we are experimenting with 8 channels, whereas a standard camera records at most 2), you may be able to observe some changes in sound amplitude (mostly unwarranted, thanks to the reverb in the small room hahaha).

Design Philosophy

We decided that we'd like to build an interface that would meet three broad goals:

  • It should be capable of modifying an input audio signal in real time.
  • It should give the user a high degree of control over how the input signal is modified.
  • The interface design should be minimal and it should feel natural (no one likes wires hanging from their body).

The first goal was easy to meet. We had been using SuperCollider (an open-source platform for audio synthesis and algorithmic composition) throughout the semester, which can modify an input audio signal in a few lines of code. An example would be passing the input signal through a low-pass filter (LPF). And the best part: we can change parameters (like the cutoff frequency of the LPF) and the change is reflected in the output signal in real time.
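SuperCollider expresses such a filter in a single line of UGen code; as a language-agnostic illustration of what it actually does to the samples, here is a sketch of a one-pole low-pass filter in Python (the function name and coefficient formula are my own, not the project's code):

```python
import math

def one_pole_lpf(samples, cutoff_hz, sample_rate=48000):
    """Apply a one-pole low-pass filter; cutoff_hz can be changed
    between calls, mirroring a real-time parameter change."""
    # Smoothing coefficient derived from the desired cutoff frequency.
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += a * (x - y)  # y chases x; high frequencies get averaged away
        out.append(y)
    return out

# A Nyquist-rate "hiss" (alternating +-1) is heavily attenuated by a
# 500 Hz cutoff, while a constant (DC) signal passes almost untouched.
hiss = [(-1.0) ** n for n in range(1000)]
quiet = one_pole_lpf(hiss, cutoff_hz=500)
```

Raising `cutoff_hz` lets more of the signal through, which is exactly the kind of parameter we later wanted to drive from a gesture.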

The real challenge was merging this with our last two goals. We definitely did not want a wearable device, and at the same time we wanted to allow the user to use their entire body to modify the input signal. As you might have guessed, we decided to move forward with a Kinect sensor. Although that meant coding in Visual Studio (and using Windows), it was perfect for our use case, especially since it came with its own gesture-detection APIs.

Designing a prototype

For our project (since we had just a week to finish), we decided to move forward with just spatialization of the input sound signal in a circle using the user's hand movements. Once this was possible, extending it to include other effects (such as an LPF/HPF) would be just grunt work.

Our interface would consist of a server running the Kinect SDK, which we use to detect the direction in which the user's hand is pointing. This information is sent to another server running SuperCollider. The angle is used to pan a given input signal across the speakers in such a way that the sound source seems to be in the direction the user is pointing.

We had access to a brand new 8.1 surround system (big shout out to Prof. Timothy!), so panning the input signal across speakers was something we could actually test. Here is a rough schema of what we had in mind:


It was a straightforward design with three components, each explained in detail below. Getting these three components to work together was the tricky part.

Detecting hand motion using Microsoft Kinect

The Kinect sensor uses depth sensing and other computer-vision approaches to estimate the skeleton of the person standing in front of it. The Kinect developer SDK exposes this information captured by the hardware, letting us read the 3-D coordinates of certain important points detected on the body (joints, fingers, etc.).

Since we can detect the coordinates of both the body centre and the fingers, we used these to calculate the x and y coordinates of the hand in the plane perpendicular to the body of the user. This tells us two things:

  • The direction in which the user is pointing.
  • How far the user's hand is from the centre of their body.
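In code, these two quantities reduce to an arctangent and a Euclidean distance. A minimal Python sketch of the idea (the function name and the angle convention are assumptions for illustration, not the actual C# code):

```python
import math

def hand_direction(center, hand):
    """Given (x, y) positions of the body centre and the hand in the
    frontal plane, return the pointing angle in degrees (0 = straight
    up from the centre, increasing clockwise) and the hand's distance
    from the centre."""
    dx, dy = hand[0] - center[0], hand[1] - center[1]
    angle = math.degrees(math.atan2(dx, dy)) % 360.0
    distance = math.hypot(dx, dy)
    return angle, distance
```

The angle can then drive the panner's pointer, and the distance a parameter such as spread.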


We send this information to the SuperCollider server running on a different machine using OSC, a protocol that runs over UDP.

Panning sound in superCollider

To compute the output signal streams for the eight different channels (one per speaker), we used the VBAP plugin for SuperCollider 3. It takes a signal and an angle from the median plane and redistributes the signal across a number of channels, treating that angle as the direction of the sound source. The plugin is based on vector base amplitude panning, more information on which can be found here.
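For a flat ring of speakers, this kind of panning reduces to a pairwise scheme: only the two speakers adjacent to the source direction get non-zero gain. A simplified Python sketch of that idea (the real plugin solves for gains from speaker direction vectors; the equal-power weighting here is my stand-in):

```python
import math

SPEAKER_ANGLES = [i * 45.0 for i in range(8)]  # ring of 8 speakers, 45 deg apart

def pan_gains(source_angle):
    """Equal-power panning between the two speakers adjacent to the
    source angle; every other channel gets zero gain."""
    source_angle %= 360.0
    lo = int(source_angle // 45.0) % 8          # speaker just before the source
    hi = (lo + 1) % 8                           # speaker just after it
    frac = (source_angle - 45.0 * lo) / 45.0    # position between the pair
    gains = [0.0] * 8
    gains[lo] = math.cos(frac * math.pi / 2)    # cos/sin keeps total power = 1
    gains[hi] = math.sin(frac * math.pi / 2)
    return gains
```

Pointing straight at a speaker sends everything to that one channel; pointing between two speakers splits the energy between them.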

For ease of testing, we created a GUI which displays the entire ring of speakers. There is a movable pointer indicating the source of the sound, which can be moved around the circle as shown below.

If we move the pointer onto speaker C, for example, it sounds as if speaker C is the source of the sound.

We also introduced a parameter, 'spread', which widens the area from which the sound comes. This lets us change the perceived distance of the sound source (how far away the user perceives it to be).
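One simple way to picture a spread control (VBAP's actual implementation spreads energy using multiple virtual sources; this blend is a simplified stand-in) is to mix the focused panning gains with an equal-energy contribution from every speaker, renormalising so the total power stays constant:

```python
import math

def spread_gains(gains, spread):
    """Blend focused panning gains with an omnidirectional field.
    spread = 0 keeps the point source; spread = 1 plays equally from
    every speaker. Assumes `gains` is already power-normalised."""
    n = len(gains)
    omni = 1.0 / math.sqrt(n)                      # equal energy everywhere
    mixed = [(1.0 - spread) * g + spread * omni for g in gains]
    norm = math.sqrt(sum(g * g for g in mixed))    # renormalise total power
    return [g / norm for g in mixed]
```

As spread grows, the source stops feeling like a point on the ring and starts to surround the listener, which is what makes it read as "further away".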

We were able to take the code for a 5.1 surround panner (from the SuperCollider examples) and adapt it to our 8.1 surround setup.

The pointer can be moved around with the mouse, and the spread can be changed with a slider. However, our end goal was to link these to the user's movements, which we do using the information captured by the Kinect sensor.

Coordinate update in real time

We use the Open Sound Control (OSC) protocol to transfer the detected coordinates from the server running the Kinect to the server running SuperCollider. OSC runs over UDP, so it works between two machines on the same local network. On the C# side, we imported the SharpOSC library into Visual Studio to create a SharpOSC sender and push information over the network that the two laptops were connected to. In SuperCollider, we simply had to open a port and start listening to whatever the Kinect server transmitted to the SuperCollider server's IP (on the same port).
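The OSC wire format is simple enough to sketch by hand: a null-padded address, a type-tag string, then big-endian arguments. Below is a minimal Python version of the sender side (the address, message layout, and use of port 57120 are illustrative; the project itself used SharpOSC on C#):

```python
import socket
import struct

def osc_message(address, *floats):
    """Encode a minimal OSC message: the address and the type-tag
    string are null-terminated and padded to 4-byte boundaries,
    followed by big-endian float32 arguments."""
    def pad(b):
        return b + b"\x00" * (4 - len(b) % 4)
    msg = pad(address.encode()) + pad(("," + "f" * len(floats)).encode())
    for f in floats:
        msg += struct.pack(">f", f)
    return msg

# Send a (hypothetical) hand angle and spread value over plain UDP.
packet = osc_message("/kinect/hand", 90.0, 0.5)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ("127.0.0.1", 57120))  # 57120 is sclang's default port
sock.close()
```

On the receiving end, SuperCollider just registers a responder for the same address and unpacks the two floats.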

After receiving the user's coordinates (at the SuperCollider server) from the Kinect stream, we simply updated the coordinates of the pointer, and the spread value, based on the distance between the user's centre and their hand. This update was scheduled every 10 milliseconds (using the AppClock API), allowing us to make updates in near real time.
