Research & Development

Posted by Tristan Ferne

This sprint, R&D's IRFS team worked on analysing Casualty, sentiment analysis, the web and TV, immersive video and atomising stories.

According to the subtitles we have access to, these are the most popular phrases spoken in Casualty, the BBC medical drama. The number before each phrase is its count of occurrences; there's a sketch of the counting approach after the lists.

Most popular 6-word phrases

215 what are you going to do
174 can you tell me your name
160 i'll see what i can do
142 give you something for the pain
136 what do you want me to

Most popular 5-word phrases

648 going to be all right
623 do you want me to
514 what are you doing here
465 can i have a word
367 i'm going to have to

Most popular 4-word phrases

1872 what are you doing
1809 i don't want to
1298 do you want to
1218 can you hear me
1216 are you all right
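
Counts like these can be produced with a few lines of word n-gram counting over the subtitle text. Here's a minimal sketch; the subtitle filename and the tokenisation are illustrative rather than our exact pipeline:

```python
import re
from collections import Counter

def top_phrases(text, n, k=5):
    """Count every run of n consecutive words and return the k most common."""
    words = re.findall(r"[a-z']+", text.lower())
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(ngrams).most_common(k)

subtitles = open("casualty_subtitles.txt").read()  # hypothetical subtitle dump
for n in (6, 5, 4):
    for phrase, count in top_phrases(subtitles, n):
        print(count, phrase)
```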

These phrases reminded us of robot-written scripts and similar earlier experiments.

Update: Andrew has extrapolated these. Using a simple n-gram ‘letter’ model (i.e. just looking at the probability of one letter given the n letters before it), he can generate random Casualty-ness like this (there's a sketch of such a model after the samples)...

Happy birthday!  Oh, Charlie!
Who is it?  It's Duffy!

CAN we win the Night Shift.
Let's shock him.
As the most senior person on duty, Gareth Davies will be supervising all the doctors in the department?  That depends on us.  
Is he in his office?  I don't know, Kelly. I've given up asking.  


Hi, I'm Helen.  I want your complexion.
It's nothing wrong. I think I would have been going on?  Put her over here, quick! Hurry up! ..What are we goin'? We'll see what they're doing him a favour. 
OK.  Sats 98%, Pulse 100, BP 140/85, resps 25, sats 96, resps are 30. Pulse irregular. Like my barber.

Ad infinitum.
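
Here's a minimal sketch of that kind of character-level model, in illustrative Python rather than Andrew's actual code; the context length and corpus file are assumptions:

```python
import random
from collections import defaultdict

def train_char_model(text, n=6):
    """Map each n-letter context to the list of letters observed after it."""
    model = defaultdict(list)
    for i in range(len(text) - n):
        model[text[i:i + n]].append(text[i + n])
    return model

def generate(model, n=6, length=400):
    """Start from a random context, then repeatedly sample a next letter
    according to how often it followed the last n letters in the corpus."""
    out = random.choice(list(model))
    while len(out) < length:
        followers = model.get(out[-n:])
        if not followers:
            break
        out += random.choice(followers)
    return out

subtitles = open("casualty_subtitles.txt").read()  # hypothetical subtitle dump
print(generate(train_char_model(subtitles)))
```

Because the model only ever looks a few letters back, the output stays locally plausible while drifting globally, which is exactly where the Casualty-ness comes from.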


Analysing the web

This sprint the Discovery team has been looking at ways to analyse the sentiment of relatively long articles, as basic methods better suited to short sentences or social media posts tend to yield useless sentiment values. Our initial experiments with splitting articles into sentences and assessing the distribution of non-neutral sentences are proving promising, and we are reviewing similar approaches published in the past, such as this paper from Cornell (PDF). We also updated our seriousness analyser, sorted out some infrastructural stuff and sketched some new tools.
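
As an illustration of the sentence-splitting idea - not our actual tooling - here's a minimal sketch using NLTK's off-the-shelf VADER analyser, with an assumed cut-off for what counts as "neutral":

```python
# pip install nltk, then nltk.download("punkt") and nltk.download("vader_lexicon")
from nltk.tokenize import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

def article_sentiment(text, neutral_band=0.2):
    """Score each sentence, discard near-neutral ones, and summarise the rest,
    rather than feeding the whole article to the analyser in one go."""
    sia = SentimentIntensityAnalyzer()
    scores = [sia.polarity_scores(s)["compound"] for s in sent_tokenize(text)]
    non_neutral = [s for s in scores if abs(s) > neutral_band]
    if not non_neutral:
        return 0.0  # nothing opinionated: the article reads as neutral
    return sum(non_neutral) / len(non_neutral)
```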

We have been working on ways to extract text from web articles using the DOM rendered by a headless browser (the open-source PhantomJS) rather than the raw HTML source, with some success - and with the added benefit of being able to generate screenshots at various resolutions. We still, however, face issues with questionable JavaScript-based redirects. Then Tim pointed us to a possible solution from Google.
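
For flavour, here's roughly what that looks like when driven from Python through Selenium's PhantomJS driver (a Selenium 2/3-era API); the URL, viewport sizes and filenames are placeholders:

```python
from selenium import webdriver

driver = webdriver.PhantomJS()  # requires the phantomjs binary on PATH
driver.set_window_size(1280, 800)
driver.get("https://example.com/article")
# Extract text from the rendered DOM rather than the raw HTML source.
text = driver.find_element_by_tag_name("body").text
driver.save_screenshot("article-desktop.png")
driver.set_window_size(375, 667)  # phone-sized viewport
driver.save_screenshot("article-mobile.png")
driver.quit()
```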

"All curation grows until it requires search. All search grows until it requires curation." (Benedict Evans on Twitter)

Analysing media

Jana's speaker identification work has given some very good results, with significant improvements over previous attempts. And Matt has been debugging an error we found in our Kaldi training on the new GPU machine. We've now managed to complete training of a new model with three times as much training data, and it yields a measurable improvement in the system's performance.

Connecting TVs and radios

Chris has been documenting the MediaScape project as it wraps up - all the work done on device discovery, pairing and authentication, as well as the overall architecture. And he joined a W3C Web and TV Interest Group conference call to kick off the “Cloud Browser” Task Force: "The Cloud Browser Task Force is a subset of the Web and TV Interest Group, whose goal is to discuss support for web browser technology within devices such as HDMI dongles and lightweight STBs (set-top boxes)." Libby is currently going around boring everyone she knows with set-top box FACTS.


VR and 360 video

We've been improving the HTML5/VR music visualiser - working on procedural terrain generation, investigating new types of visualisation and adding a new scene for performance comparison. Andrew helped facilitate some user testing in the North Lab with Middlesex University for a study looking at the experience of viewing 360 films on three different devices: laptop, phone and headset. And Zillah has been very busy running several VR pilots.
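
The visualiser itself is HTML5/WebGL, but as a generic illustration of what procedural terrain generation involves, here's a small value-noise heightmap sketch in Python; the octave count and persistence are arbitrary choices:

```python
import numpy as np

def fractal_heightmap(size=257, octaves=6, persistence=0.5, seed=0):
    """Sum several octaves of bilinearly-smoothed random noise into one
    heightmap, shrinking the amplitude at each finer octave."""
    rng = np.random.default_rng(seed)
    height = np.zeros((size, size))
    amplitude = 1.0
    for octave in range(octaves):
        cells = 2 ** octave + 1                 # coarse grid for this octave
        coarse = rng.random((cells, cells))
        idx = np.linspace(0, cells - 1, size)   # where to sample the grid
        x0 = np.floor(idx).astype(int).clip(0, cells - 2)
        t = idx - x0                            # interpolation weights
        rows = coarse[x0] * (1 - t)[:, None] + coarse[x0 + 1] * t[:, None]
        layer = rows[:, x0] * (1 - t)[None, :] + rows[:, x0 + 1] * t[None, :]
        height += amplitude * layer
        amplitude *= persistence
    return height / height.max()                # normalise to [0, 1]
```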

Atomising stories

Chrissy has been setting things up for the atomised news trial and working with Lei of BBC News Labs to investigate what data we can get out of BBC systems, while Lara has been tweaking the front-end of the prototype. Thomas joined the UX team and has been refreshing the design.

Andrew has been refining the design templates for our TV Story Explorer prototype and getting assets for other dramas. He has also been thinking about an X-Ray-type service that could incorporate storylines and key moments. Alan also joined the team and is starting to think about how to parse scripts to extract story data.
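
To give a feel for the script-parsing problem, here's a toy sketch that splits a screenplay-style text into scenes and per-character dialogue. The conventions it assumes (INT./EXT. scene headings, ALL-CAPS character cues) are illustrative; real scripts are messier:

```python
import re
from collections import defaultdict

def parse_script(text):
    """Very rough parse: lines starting INT./EXT. open a new scene, an
    ALL-CAPS line names a speaker, and following lines are their speech."""
    scenes = []
    current = {"heading": None, "dialogue": defaultdict(list)}
    speaker = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(("INT.", "EXT.")):
            scenes.append(current)
            current = {"heading": line, "dialogue": defaultdict(list)}
            speaker = None
        elif re.fullmatch(r"[A-Z][A-Z .'-]+", line):
            speaker = line
        elif line and speaker:
            current["dialogue"][speaker].append(line)
        elif not line:
            speaker = None  # a blank line ends the current speech
    scenes.append(current)
    return [s for s in scenes if s["heading"] or s["dialogue"]]
```

Run over a set of episode scripts, the per-scene dialogue map would give raw material for extracting storylines and key moments.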


Also

Tristan and Libby presented at the BBC Data Day; the BBC College of Journalism wrote it up here.

We're hiring a lead engineer to run our software engineering team (1 year contract, based in central London).

Links

We’ve been discussing a few nice local (to us) exhibitions about data and the interwebs, such as this one ending soon (with this artwork in particular), and this upcoming one.

Building an automated “sarcasm detector” remains one of our somewhat-jokey goals, and it looks as though we’re not the only ones.

A machine learning cheat sheet for Python and R.

Two nice posts on antisocial and solipsistic application design (via Ian Forrester)

Mobile Web vs. Native Apps or Why You Want Both