
Posted by Sebastian Ward and Robert Dawes

Over the last few editions of the Watches (Springwatch, Autumnwatch and Winterwatch), BBC Research & Development has been collaborating with the BBC’s Natural History Unit (NHU) to investigate the application of artificial intelligence technologies such as machine learning and computer vision to live natural history production. For this year’s Autumnwatch, we’ve expanded the work to investigate how we can apply similar techniques to the audio from the show’s wildlife cameras.

Throughout the week of Autumnwatch the audience could watch live streams of a selection of the wildlife cameras for 12 hours each day. The Live Stream team manage this and always have an operator watching and listening to ensure that the video and audio are of acceptable quality and comply with the BBC’s editorial guidelines. They try to ensure that the audio remains in keeping with the natural setting of the production and that man-made noises such as vehicle noise or speech are avoided.

One particular challenge for an operator is detecting the presence of unsafe audio and then working out which stream it is appearing on. Our system sets out to assist the team with this task. A single member of the production team will often have to monitor the audio from up to eight feeds at once. They are normally listening to a mix of several of the audio sources, so after hearing some unsafe audio on the mix, they may then have to go through all the sources one by one to try and locate the problem sound. This can take several minutes if the problem sound is intermittent and therefore difficult to track down. Additionally, if the operator is listening to a single source, they can miss problem sounds on all the other sources, becoming reliant on other members of the team discovering the problem and passing that information on.

Our tools can detect unsafe audio and alert the production team to its presence on a particular stream within a few seconds. Fundamentally, it is hard for a person to listen to eight different audio streams at once. However, it is relatively easy to watch eight different videos at once. So our system translates the problem audio into a visual warning on the operator’s screen. This warning also remains on the screen for several seconds, so it is easy to spot even if the problem sound was only brief.

Screenshot of the multi-camera view with speech icons overlaid on two of the feeds.

Our monitoring system warns of speech detected on two of the cameras

Audio tagging

Before we can warn the production team about the audio, we need to determine what we’re hearing. For this, we use a machine learning-based audio classifier. We chose this classifier because of its high accuracy in detecting a wide range of sounds, hierarchically described by the Google AudioSet ontology. The classifier achieves state-of-the-art performance in AudioSet tagging, with a mean average precision (mAP) of 0.439.

The AudioSet ontology contains a large selection of different sounds with a great deal of variety. This allows our audio monitoring system to be used for any sort of audio content, opening up the possibility for a range of applications in all sorts of productions and programmes.
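
To give a sense of what such a tagger produces, here is a minimal sketch using the publicly available panns_inference package as a stand-in (the package, file name and parameters are illustrative assumptions, not a description of our production setup): a short clip is scored against every AudioSet label and the highest-scoring labels are printed.

```python
# Illustrative sketch only: panns_inference is one publicly available
# pre-trained AudioSet tagger, used here as a stand-in for our classifier.
import numpy as np
import librosa
from panns_inference import AudioTagging, labels

# Load a short clip at the 32 kHz sample rate the pre-trained model expects.
audio, _ = librosa.load("camera_feed_clip.wav", sr=32000, mono=True)
audio = audio[np.newaxis, :]  # shape (batch, samples)

tagger = AudioTagging(checkpoint_path=None, device="cpu")
clipwise_scores, _embedding = tagger.inference(audio)

# Print the five highest-scoring AudioSet ontology labels for the clip.
for i in np.argsort(clipwise_scores[0])[::-1][:5]:
    print(f"{labels[i]}: {clipwise_scores[0][i]:.3f}")
```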

Our system takes in streams from the cameras and puts them into our cloud-based media management system. We then take a live stream of the audio from this recording into our processing tool. This tool chunks the audio up into short clips of around a second before passing them on to the tagging system. When we receive the results from the tagger, we examine the scores for different sounds and see if any of the problematic audio types have scored highly. If they have, we warn the monitoring team.
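
As a rough sketch of that loop (the class names, thresholds, hold time and tag_clip() helper below are illustrative assumptions rather than our exact implementation), the per-clip logic looks something like this:

```python
import time

def tag_clip(clip):
    """Placeholder: call the audio tagger here (e.g. the sketch above) and
    return a mapping of AudioSet label name -> score for the ~1 second clip."""
    return {}

# Illustrative per-class thresholds: speech is a compliance risk, so it is
# flagged at a much lower score than, say, vehicle noise.
THRESHOLDS = {"Speech": 0.2, "Vehicle": 0.6}
WARNING_HOLD_SECONDS = 5  # keep the warning on screen after a brief sound

active_warnings = {}  # (camera_id, label) -> time at which the warning expires

def process_clip(camera_id, clip):
    """Tag one clip and raise or refresh warnings for any unsafe sounds."""
    scores = tag_clip(clip)
    now = time.monotonic()
    for label, threshold in THRESHOLDS.items():
        if scores.get(label, 0.0) >= threshold:
            active_warnings[(camera_id, label)] = now + WARNING_HOLD_SECONDS

def current_warnings():
    """Warnings still live, for display on the operator's multiviewer."""
    now = time.monotonic()
    return [key for key, expires in active_warnings.items() if expires > now]
```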

Before using our system live at Autumnwatch, we wanted to put it through its paces. A notable issue that we encountered here was a lack of suitable audio to test it with. To address this, we ran a set of experiments in conjunction with R&D’s audio team, in which we generated 4,400 hours of unsafe audio by mixing clean Springwatch audio with a selection of relevant sound effects from Freesound. We used these to test the tagging system and to determine the sensitivities at which the various unsafe sounds would be picked up. Understanding these sensitivities, and so being able to set detection thresholds, is important because speech, for example, is far more likely to cause compliance issues and so should be picked up with far greater sensitivity than, say, vehicle noise.
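
The mixing itself can be very simple. A minimal sketch, assuming mono WAV files at the same sample rate and a target signal-to-noise ratio (the file names and snr_db parameter are purely illustrative), might look like this:

```python
# Minimal sketch of generating 'unsafe' test audio by mixing a clean
# wildlife-camera recording with a sound effect at a chosen level.
import numpy as np
import soundfile as sf

def mix_at_snr(clean_path, effect_path, out_path, snr_db=10.0):
    clean, sr = sf.read(clean_path)
    effect, sr_fx = sf.read(effect_path)
    assert sr == sr_fx, "resample the effect first if the sample rates differ"

    # Loop or trim the effect to match the length of the clean recording.
    if len(effect) < len(clean):
        effect = np.tile(effect, int(np.ceil(len(clean) / len(effect))))
    effect = effect[: len(clean)]

    # Scale the effect so the clean/effect power ratio matches the target SNR.
    clean_power = np.mean(clean ** 2)
    effect_power = np.mean(effect ** 2) + 1e-12
    gain = np.sqrt(clean_power / (effect_power * 10 ** (snr_db / 10)))

    mixed = clean + gain * effect
    mixed /= max(1.0, np.abs(mixed).max())  # avoid clipping
    sf.write(out_path, mixed, sr)

mix_at_snr("springwatch_clean.wav", "distant_speech.wav", "unsafe_test.wav")
```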

Remote Production Cluster integration

A key task for this tool was to have it integrate as naturally as possible into the existing monitoring workflow. Much of the remote production work makes use of a collection of cloud-based tools put together by colleagues in BBC News called the Remote Production Cluster. This includes facilities to ingest and route video, and then to generate a multiviewer displaying those camera feeds.

We initially created a web interface that closely mirrors the layout of camera sources that the monitoring team see in their multiviewer: a box for each camera feed with either a green speaker symbol for clean audio or a red speaker when there is problem audio. When there is a problem, we also display a larger red icon denoting what type of audio has been detected. These pictograms should be simple to understand at a glance so the operator can quickly tell what kind of audio is being detected. Examples include a speech bubble for speech and a car for vehicle noise.
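
The state behind each cell of that grid is deliberately simple. As an illustrative sketch (the field names and icon mapping are assumptions, not the actual interface code), it boils down to something like this:

```python
from dataclasses import dataclass, field

# Illustrative mapping from detected sound type to the pictogram shown.
ICONS = {"Speech": "speech-bubble", "Vehicle": "car", "Footsteps": "footsteps"}

@dataclass
class FeedStatus:
    """State behind one cell of the monitoring grid."""
    camera_id: str
    unsafe_labels: list = field(default_factory=list)  # currently active warnings

    @property
    def audio_ok(self) -> bool:
        # Green speaker when clean, red speaker when any warning is active.
        return not self.unsafe_labels

    def icon(self):
        """Pictogram for the most recent problem sound, or None if clean."""
        return None if self.audio_ok else ICONS.get(self.unsafe_labels[-1])
```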

A screenshot of the web interface showing a grid of eight spaces with speech icons over two of the cells on the grid.

Web interface displaying audio warnings

The web interface would add an additional window that the monitoring team would need to keep track of. So to better integrate the monitoring with the existing workflow, colleagues in BBC News were able to take our web interface, turn it into a video source and overlay it on the existing multiviewer. This is a far easier monitoring tool to use: only one screen needs to be visually monitored to see the full set of available data.

Screenshot of multi-camera view with icons overlaid on the video feeds showing what sounds are present.

Results

We tested the system live during Autumnwatch 2021. The latency between detecting a sound and displaying a warning was about 2 seconds: well below the previous average monitoring delay. The operators found it extremely useful for gauging the health of the audio from a particular stream, which meant they could ensure unhealthy streams were kept at low volume in the outgoing audio mix. A notable example came during several hours of streaming when children playing could be heard in the distance on several streams. The audio monitoring software reliably picked this up and highlighted which of the eight cameras had this problematic audio.

One problem we encountered was wind. During the first day of streaming, the wind was so strong that several camera feeds were determined to contain loud vehicle noise. The classifier construed the wind noise as that of a nearby lorry or car due to its volume. Having an additional metric to evaluate the difference between wind and vehicles - a task that is sometimes difficult even for a human listener when the sound is picked up through a microphone, compressed and sent over the internet - would prove useful for future Watches.

Our tests at Autumnwatch showed how potentially useful the tool could be and proved a great opportunity to get feedback from the streaming team on how we can make it more helpful for them. The tests were also a catalyst to develop ways to integrate our tools into the existing infrastructure and workflow at Autumnwatch. We intend to use what we have learnt to develop our tools further and then apply improved versions at Winterwatch in the new year.

Screenshot of the multi-camera view with a footstep icon overlaid on one of the feeds.

Warnings of footsteps on the nighttime cameras


BBC R&D - Intelligent Video Production Tools

BBC Winterwatch - Where birdwatching and artificial intelligence collide

BBC World Service - Digital Planet: Springwatch machine learning systems

TVB Europe - Artificial intelligence and Springwatch

BBC Springwatch | Autumnwatch | Winterwatch

BBC R&D - Cloud-Fit Production Update: Ingesting Video 'as a Service'

BBC R&D - Tooling Up: How to Build a Software-Defined Production Centre

BBC R&D - IP Studio Update: Partners and Video Production in the Cloud

IBC 365 - Production and post prepare for next phase of cloud-fit technology
