Why Recent Claims About AI Lip Reading Don't Tell the Whole Story
Artificial intelligence (AI) technologies are again making a big splash in the news — this time with AI sinking its virtual teeth into lip reading. Researchers at the University of Oxford in the U.K., in collaboration with Google’s DeepMind, have developed a system they claim can lip read more accurately than humans.
Machine lip reading has been a prominent area of research for the past decade, with the main focus on overcoming the shortcomings of audio speech recognition in noisy environments. More recently, however, the focus has shifted to how such systems could help people who are deaf or hard of hearing gain better access to television through accurate, real-time subtitling.
I won’t lie to you — when I first saw the headline “AI has beaten humans at lip reading,” I had to choke back a laugh. I believe it would be impossible for machine lip readers to be more accurate than professional lip readers, given the huge variation in dialects, accents and human characteristics.
Just last week I received a call from BBC Two’s Daily Politics show, asking me to lip read David Cameron. Clearly, television still requires “real humans” to assist with accurate lip reading, especially during live broadcasts.
Putting my personal opinions aside for the moment, I thought it only fair to investigate the AI lip reading phenomenon before coming to a conclusion.
An artificial intelligence system that can lip read
First, there was LipNet, trained in lip reading by watching thousands of hours of video and matching text to the speakers’ mouth movements. However, learning from specially selected videos has its limits: every speaker’s face was well-lit and facing forward, and the speech followed a standardized sentence structure. Obviously, this is not an accurate representation of TV programs or of how people speak in real-world situations.
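To give a rough idea of how such a system turns mouth movements into text, here is a minimal sketch, assuming PyTorch. The class name, layer sizes and character set are my own illustrative placeholders, not the researchers’ code; the broad shape follows what the LipNet paper describes: spatiotemporal convolutions over video frames, a recurrent layer, and per-frame character predictions trained with a CTC-style loss.

```python
# A minimal sketch of a LipNet-style pipeline, assuming PyTorch.
# Names and layer sizes are illustrative, not the paper's values.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_chars: int):
        super().__init__()
        # Spatiotemporal convolution: learns features across
        # (time, height, width), i.e. how the mouth moves between frames.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Recurrent layer: carries context along the frame sequence.
        self.gru = nn.GRU(32, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_chars + 1)  # +1 for the CTC "blank"

    def forward(self, video):                     # (batch, 3, frames, H, W)
        feats = self.conv(video)                  # (batch, 32, frames, h, w)
        feats = feats.mean(dim=(3, 4))            # pool spatial dims away
        out, _ = self.gru(feats.transpose(1, 2))  # (batch, frames, 256)
        return self.fc(out)                       # per-frame character logits

# Training pairs each clip with its transcript under a CTC loss, which
# lets the model align characters to frames on its own.
model = LipReader(num_chars=27)                   # e.g. a-z plus space
logits = model(torch.randn(2, 3, 75, 50, 100))    # two 75-frame lip crops
```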
Then another team, at Oxford’s Department of Engineering Science, collaborated with Google’s DeepMind and took it a step further. They had the AI interpret over 100,000 video clips from BBC television, representing a much wider variation in lighting and head positions. The highly praised AI system, affectionately referred to as WAS (Watch, Attend and Spell), was trained using 5,000 hours of TV programs containing 118,000 sentences and a vocabulary of 17,500 words.
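The name describes the architecture: the network “watches” the lip movements by encoding the frames, “attends” by weighting which frames matter for the next character, and “spells” the sentence out one character at a time. Below is a minimal sketch of that encode-attend-decode loop, again assuming PyTorch; the names and dimensions are placeholders of mine, not the published model.

```python
# A sketch of the encode-attend-decode idea behind WAS, assuming PyTorch.
# Illustrative only: names and dimensions are placeholders.
import torch
import torch.nn as nn

class AttendAndSpell(nn.Module):
    def __init__(self, feat_dim: int, num_chars: int, hidden: int = 256):
        super().__init__()
        self.watch = nn.LSTM(feat_dim, hidden, batch_first=True)  # encode frames
        self.embed = nn.Embedding(num_chars, hidden)
        self.attend = nn.Linear(hidden, hidden)                   # score frames
        self.spell = nn.LSTMCell(hidden * 2, hidden)              # decode characters
        self.out = nn.Linear(hidden, num_chars)

    def forward(self, frame_feats, prev_chars):
        enc, _ = self.watch(frame_feats)       # (batch, frames, hidden)
        h = enc.new_zeros(enc.size(0), enc.size(2))
        c = torch.zeros_like(h)
        logits = []
        for t in range(prev_chars.size(1)):
            # Score every encoded frame against the decoder state, then
            # blend the frames into one context vector by those weights.
            scores = torch.bmm(enc, self.attend(h).unsqueeze(2)).squeeze(2)
            context = torch.bmm(scores.softmax(1).unsqueeze(1), enc).squeeze(1)
            step = torch.cat([self.embed(prev_chars[:, t]), context], dim=1)
            h, c = self.spell(step, (h, c))
            logits.append(self.out(h))         # predict the next character
        return torch.stack(logits, dim=1)      # (batch, steps, num_chars)

model = AttendAndSpell(feat_dim=64, num_chars=28)
preds = model(torch.randn(2, 75, 64),                # 75 frames of lip features
              torch.zeros(2, 10, dtype=torch.long))  # previous characters
```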
Right now, the system can only operate on full sentences in recorded video. Joon Son Chung, a doctoral student at Oxford, explains that there is still a lot more work to be done to improve the system’s accuracy before WAS can one day function in real time.
What professionals in the hard of hearing world say about AI
Jesal Vishnuram, technology manager for the charity Action on Hearing Loss, said:
“AI lip reading would be able to enhance the accuracy and speed of speech-to-text especially in noisy environments, and we encourage further research in this area and look forward to new advances being made.”
What do I think?
I would like to see more evidence of AI accurately lip reading people from various cultural backgrounds, and of how the researchers might go about teaching the machine the meaning behind words when it has no emotional reference. Because the AI was trained predominantly on BBC video clips, I doubt it can accurately interpret the variety of ways people of different backgrounds pronounce words.
That being said, I do believe that in the future, AI lip reading technology could help support and even improve professional lip readers’ accuracy. However, these lip reading AIs are a long way from being able to think and understand as humans do.
“In principle, these ‘deep mind’ neural net approaches – where the machine does all the learning by using feedback from whether its own ‘guesses’ are correct – are powerful. As long as they keep learning and get plenty of input from connected real speech, and good input (auditory, visual – or written, as seems to be the case in this demo), they should be able to lip read – possibly better than many people can.
“Voice recognition software has come a long way using these processes, and adding lip movements to the acoustic signal should work as well – if not better – than auditory alone. However, so far, these models are all ‘demo models’ … not really suitable for a test drive in open country, more to marvel at in the showroom of the laboratory. How long it will be before everyday viable machine lip reading comes about will, I think, depend on the financial prospects to develop the model. That may come from consumer products, but I think it’s more likely to come from military and surveillance institutions where there may be a need to get information from noisy signals… and where it looks like funding (in the USA at least) will be more reliable. Where human lip reading experts will continue to be of use is where they know more about the speech patterns, language and culture of the people they see speaking than may be available to the machine. Humans also pick up on other information ‘in the frame’ (for instance, what is going on around the speaker/s) – which the machine will not be looking for.”
– Ruth Campbell, Emeritus Professor, University College London
A lip reader doesn’t just read lips
While AI lip readers can somewhat accurately predict what people are saying by extracting information from their lip movements, there’s one key element they are not capable of learning: empathy. To truly understand what’s being said and convey an accurate message, a person needs to focus on how something is being said, not just on what is being said.
Lip reading is a challenging skill that takes dedicated time to learn and an in-depth understanding of human emotions and body language. Only about 30 percent of speech can actually be seen on the lips; the rest is inferred from context, movements of the jaw, cheeks and neck, and the expression of the eyes.
The best lip readers I know use lip reading as their primary means of communication and will only lip read languages they have a fluent command of.
Lip reading is a skill that requires mental agility and a good knowledge of the language being spoken. While AI lip reading certainly has a place in the world of television, it still has a way to go.