
BBC World Cup 2010 dynamic semantic publishing

Jem Rayfield | 10:00 UK time, Monday, 12 July 2010

The World Cup 2010 website is a significant step change in the way that content is published. From first using the site, the most striking changes are the horizontal navigation and the larger-format, high-quality video. As you navigate through the site it becomes apparent that this is a far deeper and richer use of content than can be achieved through traditional CMS-driven publishing solutions.

The site features 700-plus team, group and player pages, which are powered by a high-performance dynamic semantic publishing framework. This framework facilitates the publication of automated metadata-driven web pages that are light-touch, requiring minimal journalistic management, as they automatically aggregate and render links to relevant stories.

[Image: eng_595.jpg]

Dynamic aggregation examples include:

The underlying publishing framework does not author content directly; rather it publishes data about the content - metadata. The published metadata describes the World Cup content at a fairly low level of granularity, providing rich content relationships and semantic navigation. By querying this published metadata we are able to create dynamic page aggregations for teams, groups and players.

The foundation of these dynamic aggregations is a rich ontological domain model. The ontology describes entity existence, groups and relationships between the things/concepts that describe the World Cup. For example, "Frank Lampard" is part of the "England Squad" and the "England Squad" competes in "Group C" of the "FIFA World Cup 2010".

The ontology also describes journalist-authored assets (stories, blogs, profiles, images, video and statistics) and enables them to be associated with concepts within the domain model. Thus a story with an "England Squad" concept relationship provides the basis for a dynamic query aggregation for the England Squad page: "All stories tagged with England Squad".
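As a rough illustration of this kind of aggregation, the following Python sketch stores tags as subject-predicate-object tuples and selects every story tagged with a given concept. The story identifiers and the tagged_with predicate are hypothetical; the real ontology terms are not given in this post, and the production system runs such queries in SPARQL against the triple store rather than in Python.

```python
# Hypothetical story ids and predicate name, for illustration only.
TRIPLES = {
    ("story:goal-row", "tagged_with", "Frank Lampard"),
    ("story:squad-news", "tagged_with", "England Squad"),
    ("story:usa-preview", "tagged_with", "USA Squad"),
}


def stories_tagged_with(triples, concept):
    """Return the sorted ids of stories explicitly tagged with the concept."""
    return sorted(
        subject
        for (subject, predicate, obj) in triples
        if predicate == "tagged_with" and obj == concept
    )


england_stories = stories_tagged_with(TRIPLES, "England Squad")
```

Note that this sketch only matches explicit tags, so the story tagged "Frank Lampard" is not returned for "England Squad" here.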

This diagram gives a high-level overview of the main architectural components of this domain-driven, dynamic rendering framework.

[Image: diagram_595.png]

The journalists use a web tool, called 'Graffiti', for the selective association - or tagging - of concepts to content. For example, a journalist may associate the concept "Frank Lampard" with the story "Goal re-ignites technology row".

In addition to the manual selective tagging process, journalist-authored content is automatically analysed against the World Cup ontology. A natural language and ontological determiner process automatically extracts World Cup concepts embedded within a textual representation of a story. The concepts are moderated and, again, selectively applied before publication. Moderated, automated concept analysis improves the depth, breadth and quality of metadata publishing.

Journalist-published metadata is captured and made persistent for querying using the Resource Description Framework (RDF) metadata representation and triple store technology. An RDF triple store and SPARQL approach was chosen over traditional relational database technologies because the metadata must be interpreted with respect to an ontological domain model. The high-level goal is that the domain ontology allows for intelligent mapping of journalist assets to concepts and queries. The chosen triple store provides reasoning following the forward-chaining model, so inferred statements are automatically derived from the explicitly applied journalist metadata concepts. For example, if a journalist selects and applies the single concept "Frank Lampard", then the framework infers and applies concepts such as "England Squad", "Group C" and "FIFA World Cup 2010" (as generated triples within the triple store).
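The forward-chaining idea can be sketched in a few lines of Python. This is a simplified stand-in for what the triple store's reasoner does, not its actual rule set: the predicate names (about, part_of, competes_in) and the single rule are assumptions made for illustration. The rule says that if an asset is about X, and X is part of or competes in Y, then the asset is also about Y; it is applied repeatedly until no new triples appear.

```python
# Hypothetical predicates; the real ontology and OWL rules are not shown in the post.
FACTS = {
    ("Frank Lampard", "part_of", "England Squad"),
    ("England Squad", "competes_in", "Group C"),
    ("Group C", "part_of", "FIFA World Cup 2010"),
    ("story:goal-row", "about", "Frank Lampard"),  # the only journalist-applied tag
}

LINKING_PREDICATES = ("part_of", "competes_in")


def forward_chain(triples):
    """Materialise inferred 'about' triples until a fixed point is reached."""
    known = set(triples)
    while True:
        derived = set()
        for (asset, predicate, concept) in known:
            if predicate != "about":
                continue
            for (subject, link, parent) in known:
                if subject == concept and link in LINKING_PREDICATES:
                    derived.add((asset, "about", parent))
        if derived <= known:  # nothing new was inferred, so stop
            return known
        known |= derived


closure = forward_chain(FACTS)
inferred_concepts = sorted(
    obj for (subj, pred, obj) in closure
    if subj == "story:goal-row" and pred == "about"
)
```

Because the inferred triples are stored alongside the explicit ones, a query for "England Squad" can stay a simple lookup rather than a multi-step join.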

This inference capability makes both the journalist tagging and the triple store powered SPARQL queries simpler and indeed quicker than a traditional SQL approach. Dynamic aggregations based on inferred statements increase the quality and breadth of content across the site. The RDF triple approach also facilitates agile modeling, whereas traditional relational schema modeling is less flexible and also increases query complexity.


Our triple store is deployed across multiple data centers in a resilient, clustered, performant and horizontally scalable fashion, allowing future expansion for additional ontologies and indeed linked open data (LOD) sets.

The triple store is abstracted via a Java/Spring/CXF JSR 311 compliant REST service. The REST API is accessible via HTTPS with an appropriate certificate. The API is designed as a generic façade onto the triple store, allowing RDF data to be re-purposed and re-used across the BBC. This service orchestrates SPARQL queries and ensures that results are dynamically cached across data centers using memcached, with a low 'time-to-live' (TTL) expiry of 1 minute.
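The caching behaviour described here can be approximated with a small in-process sketch. It is not the memcached client the service uses; the TTLCache class, its method names and the injected clock are all assumptions made so the example is self-contained and can be exercised without waiting a real minute.

```python
import time


class TTLCache:
    """Minimal time-to-live cache: entries are recomputed once they expire."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive query
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


# Demonstration with a fake clock standing in for real time.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
calls = []


def run_query():
    """Stand-in for a SPARQL query; records how often it is actually run."""
    calls.append(1)
    return "results-%d" % len(calls)


first = cache.get_or_compute("england-squad", run_query)
second = cache.get_or_compute("england-squad", run_query)  # served from cache
fake_now[0] = 61.0
third = cache.get_or_compute("england-squad", run_query)   # expired, recomputed
```

A short TTL like this trades a bounded amount of staleness (at most one minute) for a large reduction in queries reaching the triple store.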

All RDF metadata transactions sent to the API for CRUD operations are validated against associated ontologies before any persistence operations are invoked. This validation process ensures that RDF conforms to underlying ontologies and ensures data consistency. The validation libraries used include Jena Eyeball. The API also performs content transformations between the various flavors of RDF such as N3 or XML RDF. Example RDF views on the data include:

Automated XML sports stats feeds from various sources are delivered and processed by the BBC. These feeds are now also transformed into an RDF representation. The transformation process maps feed supplier IDs onto corresponding ontology concepts and thus aligns external provider data with the RDF ontology representation within the triple store. Sports stats for matches, teams and players are aggregated inline and served dynamically from the persistent triple store.
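A minimal sketch of this mapping step, in Python, might look like the following. The feed layout, the supplier IDs, the statistic names and the numbers are all invented for illustration; the real supplier formats and ontology terms are not described in this post.

```python
import xml.etree.ElementTree as ET

# Invented feed format and values, for illustration only.
FEED = """<stats supplier="example-supplier">
  <player id="p123" goals="0" shots="5"/>
  <player id="p999" goals="1" shots="2"/>
</stats>"""

# Hypothetical lookup from supplier ids to ontology concepts.
SUPPLIER_TO_CONCEPT = {"p123": "Frank Lampard"}


def feed_to_triples(xml_text, id_map):
    """Convert player stats to (concept, stat, value) triples.

    Players whose supplier id has no known concept are reported
    separately instead of being silently dropped.
    """
    triples, unmapped = [], []
    for player in ET.fromstring(xml_text).iter("player"):
        supplier_id = player.get("id")
        concept = id_map.get(supplier_id)
        if concept is None:
            unmapped.append(supplier_id)
            continue
        for stat in ("goals", "shots"):
            triples.append((concept, stat, int(player.get(stat))))
    return triples, unmapped


stat_triples, unmapped_ids = feed_to_triples(FEED, SUPPLIER_TO_CONCEPT)
```

Keeping the unmapped IDs visible is a design choice of this sketch: it makes gaps in the supplier-to-concept mapping easy to spot.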

The following "Frank Lampard" player page includes dynamic sports stats data served via SPARQL queries from the persistent triple store:

[Image: frank_595.jpg]


The dynamic aggregation and publishing page-rendering layer is built using a Zend PHP and memcached stack. The PHP layer requests an RDF representation of a particular concept or concepts from the REST service layer based on the audience's URL request. If an "England Squad" page request is received by the PHP code, several RDF queries will be invoked over HTTPS to the REST service layer below.

The render layer will then dynamically aggregate several asset types (stories, blogs, feeds, images, profiles and statistics) for a particular concept such as "England Squad". The resultant view and RDF are cached with a low TTL (1 minute) at the render layer for subsequent requests from the audience. The PHP layer dynamically renders views based on HTTP headers, providing content-negotiated HTML and/or RDF for each and every page.
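Content negotiation of this kind can be sketched as a function that reads the HTTP Accept header and picks a representation. The render layer itself is PHP; this Python version is only an illustration, and the choice of HTML as the default, the supported list and the simplified header parsing (no wildcard handling) are assumptions.

```python
# Representations this sketch can serve; the first entry is the default.
SUPPORTED = ("text/html", "application/rdf+xml")


def negotiate(accept_header, supported=SUPPORTED):
    """Return the supported media type with the highest q-value.

    Falls back to the first supported type when nothing matches.
    Wildcards such as */* are not interpreted in this simplified version.
    """
    best, best_q = supported[0], 0.0
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type, quality = fields[0], 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not wanted"
        if media_type in supported and quality > best_q:
            best, best_q = media_type, quality
    return best


rdf_choice = negotiate("application/rdf+xml")
browser_choice = negotiate("text/html,application/rdf+xml;q=0.9")
fallback_choice = negotiate("*/*")
```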

To make use of the significant amount of existing static news kit and architecture (Apache servers, HTTP load balancers and gateway architecture), all HTTP responses are annotated with appropriate low (1 minute) cache expires headers. This HTTP caching increases the scalability of the platform and also allows content delivery network (CDN) caching if demand requires.
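As a sketch of what such response headers could look like, the Python function below builds a one-minute Expires header plus a matching Cache-Control header. The post only mentions expires headers; the Cache-Control line, the "public" directive and the function name are assumptions added for illustration.

```python
from email.utils import formatdate


def cache_headers(now_epoch, max_age=60):
    """Build HTTP caching headers that let intermediaries cache for max_age seconds."""
    return {
        # Assumed directive; the post does not say which Cache-Control values are sent.
        "Cache-Control": "public, max-age=%d" % max_age,
        # HTTP dates are always expressed in GMT.
        "Expires": formatdate(now_epoch + max_age, usegmt=True),
    }


headers = cache_headers(0)
```

With headers like these, Apache front ends, load balancers and a CDN can all answer repeat requests for up to a minute without touching the render layer.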

This dynamic semantic publishing architecture has been serving millions of page requests a day throughout the World Cup with continually changing OWL-reasoned semantic RDF data. The platform currently serves an average of a million SPARQL queries a day, with a peak RDF transaction rate of hundreds of player statistics per minute. Cache expiry at all layers within the framework is 1 minute, providing a dynamic, rapidly changing, domain- and statistic-driven user experience.

The development of this new high-performance dynamic semantic publishing stack is a great innovation for the BBC as we are the first to use this technology on such a high-profile site. It also puts us at the cutting edge of development for the next phase of the Internet, Web 3.0.

So what's next for the platform after the World Cup? There are many engaging expansion possibilities, such as extending the World Cup approach throughout the sport site, making BBC assets geographically 'aware', and aligning news stories to BBC programs. This is all still to be decided, but one thing we are certain of is that this technological approach will play a key role in the creation, navigation and management of over 12,000 athlete and index pages for the London 2012 Olympics.


Jem Rayfield is Senior Technical Architect, BBC News and Knowledge. Read the previous post on the Internet blog that covers the BBC World Cup website, The World Cup and a call to action around Linked Data.


Metadata is data about data - it describes other data. In this instance, it provides information about the content of a digital asset. For example, a World Cup story may include metadata that describes which football players are mentioned within the text of a story. The metadata may also describe the team, group or organization associated with the story.

IBM LanguageWare: language and ontological linguistic platform.

RDF is based upon the idea of making statements about concepts/resources in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, to represent the notion "Frank Lampard plays for England" as an RDF triple, the subject is "Frank Lampard", the predicate is "plays for" and the object is "England Squad".

SPARQL (pronounced "sparkle") is an RDF query language. Its name is a recursive acronym (i.e. an acronym that refers to itself) that stands for SPARQL Protocol and RDF Query Language.

BigOWLIM: a high-performance, scalable, resilient triple store with robust OWL reasoning support.

LOD: Linked Open Data, a term used to describe a method of exposing, sharing and connecting data via dereferenceable URIs on the Web.

Java: object-orientated programming language developed by Sun Microsystems.

Spring: rich Java framework for managing POJOs, providing facilities such as inversion of control (IoC) and aspect-orientated programming.

Apache CXF: Java web services framework for JAX-WS and JAX-RS.

JSR 311: Java standard specification API for RESTful web services.

Memcached: distributed memory caching system (deployed across multiple data centers).

Jena Eyeball: Java RDF validation library for checking ontological issues with RDF.

N3: shorthand textual representation of RDF designed with human readability in mind.

XML RDF: XML representation of an RDF graph.

XML: Extensible Markup Language, a set of rules for encoding documents and data in machine-readable form.

Zend: open source scripting virtual machine for PHP, facilitating common programming patterns such as model view controller.

PHP: Hypertext Preprocessor, a general-purpose scripting language used to create dynamic web pages.

CDN: a content delivery network, or content distribution network, is a collection of computers usually hosted within Internet Service Provider hosting facilities. The CDN servers cache local copies of content to maximize bandwidth and reduce requests to origin servers.

OWL: Web Ontology Language, a family of knowledge representation languages for authoring ontologies.

Comments

  • Comment number 1.

    FYI = The RDF views are not live yet. But should be usable by the end of the week.

  • Comment number 2.

    Have you thought about applying to speak at TED about the work you are doing? I know quite a few would find this stuff really interesting, and there have been previous speakers on how the web will change toward the semantic web.

  • Comment number 3.

    Jem Rayfield, this is a very cool posting. I'm working at NOS ([Unsuitable/Broken URL removed by Moderator] public broadcaster in the Netherlands). We are using almost the same architecture and technique on our website(s).

  • Comment number 4.

    This was a very interesting article. I'm currently working on pretty much exactly the same thing using BigOWLIM, SPARQL and JSR 311, though we decided to go with custom asset extraction software based around gate.ac.uk

    I was wondering if your plans for this included allowing third parties to link into your knowledge base (like with DBPedia) and what the process for that might be?

  • Comment number 5.

    Matt, we do plan on exposing a public-facing SPARQL endpoint. However, this will be a separate cluster/instance with a replicated snapshot exported from the live triple store. Without this separation, queries may have a negative impact on audience-facing products (such as the World Cup). We will be publishing our ontologies in tandem.

    Cheers
    Jem

  • Comment number 6.

    wizard710, perhaps you can tell us more about TED.

  • Comment number 7.

    @jemray that's great news, looking forward to it.

    Is there a place to look for the announcement or will it just be posted in this blog?

  • Comment number 8.

    @marie_wallace It's really fantastic to see organizations like the BBC building exciting semantic applications, built on linked open data, which demonstrate quantifiable business value. Great stuff!

    And if you happen to be interested in some of the underlying text analysis technology, then check out IBM LanguageWare which presents an interesting new paradigm around ontologically driven text analysis, that allows you to merge human domain understanding (model what's in our heads) with the algorithmic execution power of a text analytics engine. Check it out, it rocks :-)

    LanguageWare is part of IBM Content Analytics, formerly IBM Cognos Content Analytics.

  • Comment number 9.

    Hi Jem, thanks a lot for sharing this with us. Very, very interesting stuff.

    I wonder why BBC didn't go for Apache Clerezza as the semantic web application platform. Would have saved you a lot of time as it comes with pretty much everything you added individually. Clerezza is even based on OSGI, so should be even easier to maintain.

    Were there any issues with Apache Clerezza per se or haven't you simply checked the platform yet?

  • Comment number 10.

    Hi Marco
    We did look at Clerezza
    There was a combination of reasons that we have not used it for the World Cup
    1. The nature of our delivery platform and caching architecture means we had to think very carefully about caching within the REST/API layer, such that API query responses (of any content-negotiated RDF flavour) would be highly performant. We were more comfortable with a bespoke (and traditional) approach that would give us the level of control we needed without any surprises.
    2. Our software design/architecture for our services layer is simple, clean, and extremely performant (given the request volume we need to handle) - we were not convinced OSGi / Clerezza would have improved on this.
    3. Also to be fair, there was not a huge amount of OSGi experience on the development team, and with an immutable delivery deadline (The Kick-Off of the first match!), we wanted to avoid any learning curves that would add to project risk (this argument applies to Clerezza as well as OSGi in this instance).
    4. Clerezza’s incubator status. We have typically tried to avoid (if possible) 3rd party projects that are still in the incubator stage (this certainly was not a deciding factor though)

    We were very conscious from day 1 that the World Cup website would be a very high profile foray into semantic web publishing, and ultimately we knew we would be able to deliver a top class solution in the time available using components that we were very familiar with, without introducing a new and unproven (to us) technology.
    This is not to say we won’t use it in the future though :) - I am keeping a close eye on Clerezza

  • Comment number 11.

    @jemray How do you get to see the RDF?
    Is the OWL ontology published?

    Thanks - RI

  • Comment number 12.

    Hi - just to follow up - has anyone seen the actual RDF produced by the BBC system?

  • Comment number 13.

    We have a legal delay with regard to RDF publishing and content negotiation. The sports statistics data is currently under review.

    Apologies for the delays....

  • Comment number 14.

    This comment was removed because the moderators found it broke the house rules.

  • Comment number 15.

    This comment was removed because the moderators found it broke the house rules.
