Replacing 'Wave to Engage' on Xbox One by Combining Body Pose, Gaze and Motion to Determine Intention to Interact

Schwarz, J., Marais, C., Leyvand, T., Hudson, S., Mankoff, J. Combining Body Pose, Gaze and Motion to Determine Intention to Interact in Vision-Based Interfaces. In Proceedings of the 32nd Annual SIGCHI Conference on Human Factors in Computing Systems (Toronto, Canada, April 26 - May 1, 2014). CHI '14. ACM, New York, NY.

This was one of the projects I worked on as a researcher at Microsoft Xbox, on the team responsible for shipping skeletal tracking and other APIs leveraging the Kinect. It was 11 months before the release of the Xbox One, and our aim was to develop a simple but robust method to engage and disengage with the Kinect. The previous 'wave to engage' gesture on Xbox 360 was robust but difficult to trigger. We wanted something simpler that maintained the same low false positive rate.

A colleague (Claude Marais) had developed a prototype that combined facial features, body pose and motion to generate an 'intention to interact' score, which approximated a user’s intention to interact with the system. The idea was that this score could then be combined with a simple gesture to determine engagement, and could be used with a threshold to determine disengagement.
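
To make the engage/disengage idea concrete, here is a minimal sketch in Python. It is not the shipped algorithm or the one in the paper: the `Frame` fields, the `EngagementTracker` class and both threshold values are hypothetical, and the score is simply assumed to arrive already computed and normalized to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """One tracked frame for a single user; all fields are illustrative."""
    intent_score: float      # assumed to be normalized to [0, 1]
    open_hand_raised: bool   # output of a separate, simple gesture detector


class EngagementTracker:
    """Engage on (high score AND gesture); disengage when the score drops."""

    def __init__(self, engage_threshold: float = 0.7,
                 disengage_threshold: float = 0.3) -> None:
        # Hypothetical values. Using a lower disengage threshold is an
        # assumption that adds hysteresis so engagement does not flicker.
        self.engage_threshold = engage_threshold
        self.disengage_threshold = disengage_threshold
        self.engaged = False

    def update(self, frame: Frame) -> bool:
        if not self.engaged:
            # Engagement needs both signals: the score and the gesture.
            if (frame.intent_score >= self.engage_threshold
                    and frame.open_hand_raised):
                self.engaged = True
        elif frame.intent_score < self.disengage_threshold:
            # Disengagement depends on the score alone.
            self.engaged = False
        return self.engaged


# Made-up frames, purely to exercise the state machine.
frames = [
    Frame(0.2, True),   # hand up but low score: ignored
    Frame(0.8, False),  # high score but no gesture: ignored
    Frame(0.9, True),   # high score and gesture: engage
    Frame(0.5, False),  # still above the disengage threshold: stay engaged
    Frame(0.1, False),  # score has dropped: disengage
]
tracker = EngagementTracker()
states = [tracker.update(f) for f in frames]
print(states)  # [False, False, True, True, False]
```

The point of the sketch is the asymmetry: a gesture alone or a high score alone does nothing, while a falling score is enough on its own to end the session.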

I refined the initial algorithm and then ran a 30-person lab study to compare four engagement algorithms in single and multi-user scenarios. I found that combining intention to interact with a “raise an open hand in front of you” gesture yielded the best results. This combination offered a 12% improvement in accuracy and a 20% reduction in time to engage over the baseline “wave to engage” gesture used on the Xbox 360. The results of this work helped to replace the ‘wave to engage’ gesture on Xbox One.

Fig 1. Project Overview

Demonstrating Improvement over 'Wave to Engage'

The most difficult part of the project was running a user study to demonstrate the algorithm's improvement over 'wave to engage' on Xbox 360. To do this, I did the following:

  • Implemented a port of the 'wave to engage' gesture from Xbox 360. One of the many perks of working at Microsoft was having direct access to the code.
  • Implemented the intention to interact computation, along with all 4 methods of engagement.
  • Hooked the engagement algorithms into a simple 'bop-the-mole' game used for the study, complete with logging (I recorded everything so that I could replay runs later, which came in handy). This was done using DirectX and an early version of the Kinect API for Xbox One. The study ran on an Xbox One.
  • Ran a 30-person lab study with tasks aiming to be as comprehensive as possible while remaining within the confines of a lab.
  • Analyzed the massive amount of data, which included learning some new statistical tests and developing a hypothetical engagement method (raise hand up without the score), which I then ran against all my pre-recorded data.
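
The replay step in the last bullet can be sketched roughly as follows. This is an illustration only, not the study's analysis code: the log format, the function names, the 0.7 threshold and the sample values are all made up for the example. The idea is that once every run is logged, a new engagement method can be evaluated offline by feeding it the recorded samples.

```python
from typing import Callable, Iterable, Optional, Tuple

# A recorded sample: (seconds since task start, intent score, hand raised).
# This log layout is an assumption made for the example.
Sample = Tuple[float, float, bool]


def time_to_engage(log: Iterable[Sample],
                   detector: Callable[[float, bool], bool]) -> Optional[float]:
    """Return the timestamp of the first sample the detector accepts, or None."""
    for t, score, hand_raised in log:
        if detector(score, hand_raised):
            return t
    return None


def score_plus_gesture(score: float, hand_raised: bool) -> bool:
    # 0.7 is a made-up threshold, not a value from the study.
    return score >= 0.7 and hand_raised


def gesture_only(score: float, hand_raised: bool) -> bool:
    # Stand-in for the hypothetical "raise hand up without the score" method.
    return hand_raised


# Fabricated samples for illustration; real logs came from the study sessions.
recorded_run = [(0.0, 0.1, True), (0.5, 0.4, False), (1.0, 0.8, True)]

results = {
    "score_plus_gesture": time_to_engage(recorded_run, score_plus_gesture),
    "gesture_only": time_to_engage(recorded_run, gesture_only),
}
print(results)  # {'score_plus_gesture': 1.0, 'gesture_only': 0.0}
```

Because the detectors are plain functions over recorded samples, adding another condition after the study is over only requires writing one more function and re-running the logs.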

A detailed description of the study, along with the results, is in the paper. The short summary is that using an engagement score does lead to faster engagement than 'wave to engage', with engagement score + simple gesture yielding the best results.


Thank you to my manager at Microsoft, Tommer Leyvand, for letting me pursue and publish the work.

Thank you to Tyler Murphy for his illustration of Figure 1 in the paper.

Header image credit: Harry Potter for Kinect