Research & Development

Posted by BBC Research and Development

Xia Cui details her work with Internet Research & Future Services in evaluating speech/music discriminators

My name is Xia and I’m doing a master’s degree in Web Science and Big Data Analytics at University College London. For the last few months I’ve been working with BBC R&D on a research project looking at automated speech/music discrimination. This is a method for detecting and highlighting where music appears in a piece of audio. This area of research is of particular interest to the BBC as it has such a large archive of radio programmes, many containing music.

My interest in this area sprang from a hobby of mine: fansubbing. This is where volunteer fans of a TV show get together online to subtitle and translate programmes for a foreign or hearing-impaired audience. I volunteer in a group that helps translate Japanese animation into other languages. Although the work is fun, video files are time-consuming and fiddly to navigate, so I started to think about automated methods that could help a subtitler quickly jump to a specific bit of footage. Three useful technologies sprang to mind: automated speech recognition (which transcribes the spoken-word content), automated alignment (which can help create clickable time-stamped transcripts) and music detection (which could, for example, help you quickly jump to the start of the programme). As the BBC were already working on the speech recognition and alignment, they suggested I take a look at automated music detection.

The aim of my project was to evaluate the performance of two types of speech/music discriminator. They were chosen because of the simplicity of their classification approach (unlike more complex machine learning methodologies, which would take longer to implement). One was a standard BBC speech/music segmenter based on the classic Zero Crossing Rate (ZCR) approach. The other goes by the acronym CBA-YAAFE and uses a more up-to-date method called Continuous Frequency Activation (CFA). With the help of BBC R&D staff I implemented an evaluation framework in Ruby and processed almost 1,000 hours of radio split into three types: ‘speech radio’, ‘music radio’ and ‘mixed’. We chose programmes where the BBC already had a ground truth of the exact in-points and out-points of the music, so we could accurately verify the output.
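To make the verification step a little more concrete, here is a minimal sketch (in Ruby, like the original framework, though with hypothetical method names and made-up example timings) of the kind of block-level check involved: both the detector output and the ground truth are treated as lists of music in/out points in seconds, and accuracy is simply the fraction of fixed-size blocks on which the two labellings agree.

```ruby
# Minimal block-level accuracy sketch; not the actual R&D framework.
# Regions are [start, end] pairs of music in/out points in seconds.

BLOCK = 2.3 # block size in seconds, matching the test described below

# Returns true if time t (seconds) falls inside any [start, end] region.
def music_at?(regions, t)
  regions.any? { |s, e| t >= s && t < e }
end

# Fraction of blocks where the predicted label matches the ground truth.
def block_accuracy(predicted, truth, duration)
  blocks = (duration / BLOCK).floor
  return 0.0 if blocks.zero?

  agree = (0...blocks).count do |i|
    t = i * BLOCK + BLOCK / 2.0 # sample each block at its midpoint
    music_at?(predicted, t) == music_at?(truth, t)
  end
  agree.to_f / blocks
end

# Hypothetical example: an hour-long show with two songs in the ground truth.
truth     = [[120.0, 310.5], [1800.0, 1995.2]]
predicted = [[118.0, 312.0], [1803.5, 1990.0], [2500.0, 2502.3]]
puts format('block accuracy: %.2f%%', 100 * block_accuracy(predicted, truth, 3600.0))
```

In the real evaluation the predicted regions came from the segmenters’ output rather than being typed in by hand, and we also measured how far the detected boundaries fell from the ground truth, as discussed below.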

The accuracy of the BBC speech/music segmenter (ZCR) turned out to be 87.61% for the crucial category of ‘mixed’ shows. The zero crossing rate acts as a measure of the weighted average of the spectral energy, and the algorithm discriminates speech from music on the basis of how skewed the distribution of zero crossing rates in the time-domain waveform is. However, this approach is almost 20 years old now, so we were hoping for better results from the more recently developed CFA, which focuses on structural differences at various signal levels. The accuracy for this method was 94.97% on the same block-size test of 2.3 seconds. For comparison, we also measured the distance of each detected boundary from the ground-truth in and out points of each song. The CFA process produced far fewer wrongly inserted boundaries (on average fewer than one per hour-long show, versus 21 for ZCR), so it proved itself both more accurate at detecting songs and more precise at identifying their start and end points. This was a very encouraging result and has led to discussions about possible implementation of CFA-based discriminators in production systems.
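As a rough illustration of the ZCR idea described above, the sketch below (an assumption-laden example rather than the BBC segmenter; the frame length and skewness threshold are placeholders, not tuned values) computes the zero crossing rate for each short frame of a block and then uses the skewness of those rates to decide between speech and music, since speech’s alternation of voiced and unvoiced sounds tends to produce a much more skewed ZCR distribution than music does.

```ruby
# Illustrative ZCR-and-skewness classifier; not the BBC implementation.

# Zero crossing rate of one frame of samples (fraction of sign changes).
def zcr(frame)
  crossings = frame.each_cons(2).count { |a, b| (a >= 0) != (b >= 0) }
  crossings.to_f / (frame.length - 1)
end

# Standardised third moment (skewness) of a list of values.
def skewness(values)
  n    = values.length.to_f
  mean = values.sum / n
  sd   = Math.sqrt(values.sum { |v| (v - mean)**2 } / n)
  return 0.0 if sd.zero?
  values.sum { |v| ((v - mean) / sd)**3 } / n
end

# Classify one block of audio samples as :speech or :music.
# frame_len (samples) and threshold are placeholder values for illustration.
def classify_block(samples, frame_len: 256, threshold: 1.0)
  rates = samples.each_slice(frame_len)
                 .select { |f| f.length == frame_len }
                 .map { |f| zcr(f) }
  return :music if rates.empty? # too little audio to judge
  skewness(rates) > threshold ? :speech : :music
end
```

In practice the frame size and threshold would need tuning on labelled audio; CFA takes a different route, looking for frequency bands that stay continuously active over time (hence the name), which is what gives it the cleaner boundaries reported above.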

On a personal level, I was delighted that BBC R&D gave me the chance to work with them and grateful for the time and help their staff gave me. Near the end of the project, I gave a presentation of my work to the team at Internet Research & Future Services and got a lot of positive feedback about what I’d done and the usefulness of my evaluation approach. The experience has certainly given me a lot more confidence for future work in this area. My thanks to all at BBC R&D, especially my supervisors Matt & Jana and adviser Rob.