Oxford engineers may not seem the most likely people to mix with the likes of the film and television industries, but a team of researchers in the Department of Engineering Science is doing just that to develop ways of extracting huge quantities of information from moving images.
‘When I started using Google, about a decade ago, I was blown away by what it could do for the web,’ says Professor Andrew Zisserman. ‘I went away, understood how it worked, and decided I wanted to engineer something like that for vision. That way, I could effortlessly search images or videos for – well – anything, and get the results immediately. Just imagine: you could see someone on TV, click on their face, and instantly find out what else they’ve appeared in,’ he continues. ‘So that’s what I set out to do.’
His dream – to search huge quantities of footage and instantly find specific clips containing particular objects or people – sounds ambitious even now. But unperturbed by the magnitude of the problem or its possible impact for the film and television industry, Zisserman’s team of researchers has been working on the task by advancing a field known as computer vision: the science of making machines that can ‘see’ by recognising patterns and shapes.
The whole process starts with training computers to recognise specific objects amongst millions of images. ‘Basically, you measure visual features in the images,’ explains Zisserman. Based on those features – which might be sharp edges, shapes or textures – software can be taught to pick out images which illustrate an object, irrespective of the viewpoint, lighting or even partial obstruction. Then, when the software is shown a new set of images, it scores and ranks them depending on the presence of the key features, much like a Google search.
The problem is that searching video in this way places huge demands on computing resources – a problem that Zisserman’s team has been trying to overcome. ‘You can represent an object by a jumble of iconic bits, but it turns out that it doesn’t matter where they are. For a motorbike, you might have a wheel, a seat...and just that you have them somewhere in an image is enough to recognise an object,’ explains Zisserman. He and his student Josef Sivic dubbed this concept ‘visual words’ and it lies at the heart of making searches much more efficient. So efficient, in fact, that even Google now uses the technology in its image search system Google Goggles.
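For readers who want to see the ‘visual words’ idea in concrete terms, here is a minimal sketch in Python. It is an illustration only, not the team’s software: the function names, the tiny two-number ‘descriptors’ and the three-word vocabulary are all assumptions made for the example, and a real system would use far richer features and a much larger vocabulary. Each measured feature is matched to its nearest visual word, the image is summarised by how often each word occurs (with positions thrown away), and stored images are ranked by how closely their word counts match the query.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to a descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Count the visual words in one image, ignoring where each one occurred."""
    return Counter(nearest_word(d, vocabulary) for d in descriptors)


def cosine_similarity(a, b):
    """Score two word-count histograms between 0 (no overlap) and 1 (same mix)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(query_bag, database):
    """Return image names ordered from best to worst match for the query."""
    return sorted(database,
                  key=lambda name: cosine_similarity(query_bag, database[name]),
                  reverse=True)


# Hypothetical vocabulary: word 0 ~ "wheel", word 1 ~ "seat", word 2 ~ "leaf".
VOCABULARY = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# Hypothetical stored images, each described by a few made-up descriptors.
DATABASE = {
    "motorbike": bag_of_words([(0.5, 0.2), (9.8, 0.1), (0.1, -0.3)], VOCABULARY),
    "tree": bag_of_words([(0.2, 9.9), (0.1, 10.2)], VOCABULARY),
}

# A query with the same kinds of parts as the motorbike, in a different order.
QUERY = bag_of_words([(9.9, 0.3), (0.2, 0.1), (-0.1, 0.2)], VOCABULARY)
RANKING = rank_images(QUERY, DATABASE)
```

Because only the counts are compared, the order and position of the wheel and seat features make no difference to the score, which is what keeps the comparison cheap enough to run over very large collections.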
The team has used the technique to analyse a common gripe of Hollywood movie makers: continuity errors. These lapses in consistency, where two shots of the same scene don’t quite match, pop up all too often. So Zisserman, together with Dr Lynsey Pickup, has been playing what is effectively a giant game of spot-the-difference: developing software that automates the process of spotting the mistakes. By scanning frames of a movie that should theoretically contain objects in the same physical locations, the team can detect subtle differences – a job previously left for over-enthusiastic film fans.
While that is proof that finding objects within footage is possible, video search also needs to be able to identify humans – an altogether tougher task. Changing facial expression, differing hair styles and constant movement make actors extremely difficult for computers to identify. By using cutting-edge facial recognition, though, Dr Mark Everingham, working in Zisserman’s team, can identify the eyes, nose and mouth, using distinctive features around these areas to reliably spot faces time and again. Indeed, by following the detected face between frames, it is possible to track actors as they move around a set, and even automatically assign the correct character name to each face through space and time.
Unsurprisingly, this is making a big impression commercially as it allows video content to be labelled and searched automatically. ‘At VideoSurf, we run a scaleable video search engine. We do for video what Google does for text,’ explains Eitan Sharon, Chief Technology Officer and co-founder of VideoSurf. ‘We’ve developed a smartphone app that lets you point your phone at any screen – even live TV – and in a few seconds it can tell you what the video is, who’s in it, even recommend other videos you might like.’ All of this takes its cue from University of Oxford research. ‘Andrew Zisserman has really left his mark on computer vision over the last decade,’ he adds. ‘He’s changed the way we think about and tackle video, and shaped what we do.’