How will Google Video Search Improve? A Short History from Marissa Mayer
Let’s face it: neither Google Video search nor hunting for good YouTube videos is satisfying at the moment. Say you are interested in ‘digital ghost mask’ (what else?) and all you get on YouTube is a single Robota trailer. You would find an example if you searched for ‘logitech orbit cam’, but why would you do that? You want virtual ghost masks or bunny faces, so that’s what you type in. No wonder Google is trying to gather lots and lots of speech data so that it can improve its video search engine through speech-to-text conversion. So what is the idea?
Getting people to dial 1-800-GOOG-411 on purpose, to get their pizza delivered and so on.
Back in October, Marissa gave an interview to InfoWorld, talking to Juan Carlos Perez (found through Garett Rogers & Blogoscoped).
IDGNS: There are different technology approaches to video search. Blinkx, for example, maintains it does it better than Google because it indexes the text of what is said in videos with speech recognition technology. Where is Google with video search today?
Mayer: Google Video has had an interesting evolution. When we first launched it, it was based on closed captions, so literally a transcription of the program, but interestingly, you couldn’t play video. So we changed it so that you could play video, and now we’re searching the meta content. That said, one of the future elements of what’s likely to happen in search is around speech recognition.
You may have heard about our [directory assistance] 1-800-GOOG-411 service. Whether or not free-411 is a profitable business unto itself is yet to be seen. I myself am somewhat skeptical. The reason we really did it is because we need to build a great speech-to-text model … that we can use for all kinds of different things, including video search.
The speech recognition experts that we have say: If you want us to build a really robust speech model, we need a lot of phonemes, which is a syllable as spoken by a particular voice with a particular intonation [my note: sorry, phonemes are not syllables; in plain English they are, roughly, the sounds that distinguish one word from another. But Marissa is not a linguist, and what their speech is needed for is a peripheral question for potential GOOG-411 users]. So we need a lot of people talking, saying things so that we can ultimately train off of that. … So 1-800-GOOG-411 is about that: Getting a bunch of different speech samples so that when you call up or we’re trying to get the voice out of video, we can do it with high accuracy.
IDGNS: What about non-speech content in videos — the action in the clip?
Mayer: That’s going to be particularly hard, given that most of Google’s approaches are based on text right now. So we really do need the text, which is why our inclination is to build a great speech-to-text model and pull the text out…. That said, there are a lot of instances of humor, context, things that happen in frame that don’t necessarily have words, but for that we’re going to have to rely on the community to do things like tagging.
There is some very early research happening around recognizing faces in videos, recognizing particular objects, understanding that hey, there’s a ball in the frame right now, but it’s very early and not at all ready to be deployed in a commercial application.
Basically, what Marissa tells us here is that Blinkx is claiming something that it is too early to take seriously.
Let’s see what Blinkx says about its technology:
blinkx takes a holistic approach to video search: the power of its solution lies in using every characteristic of the video itself to understand the content. For example, blinkx listens to the sound track using speech-to-text technology, looks at the images on screen using advanced video analytics, and reads other information embedded into the file by using media-analysis plug-ins to extract, for example, closed captioning. In this way, blinkx is processing as much information as possible to enable both extremely accurate search, and more advanced operations such as automatic hyperlinking of related content or implicit query, which understands the content a user is producing and viewing.
Blinkx was featured on Reuters in March 2007 for adding speech to its search algorithm, and Kevin Heiser from Jupiter Research listed a few things you can do with Blinkx: the video search engine supposedly recognizes lyrics and dialogs in clips at a “sophisticated enough level.” Kevin says “the challenge for Blinkx is that they are working with different sounds, voices, lyrics, singing styles across the world, so there’s no technology that has the capability of doing a very accurate job on the fly, translating those lyrics or that actual content into a kind of a relevant capsule that describes what the video is.”
Suranga Chandratillake talks about scene capture, with the ambitious aim of letting users jump to exactly the segment of the video they are interested in, rather than having to sit through half an hour of irrelevant footage. Blinkx is also analyzing text that appears on the screen (e.g. results, players and red cards in sports matches). What a pity I cannot do one simple thing with the videos: link to this interview on Blinkx… But here’s an interview with Suranga Chandratillake made by Beet.TV: not only linkable but embeddable too… The irony, though, is that I couldn’t find the Reuters bit on Blinkx through a YouTube search. But it was right on the first page of a Google Video search…
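For the curious, here is a rough sketch of the idea both companies are circling: once speech has been turned into text, searching a video and jumping to the right segment is ordinary text indexing. This is only an illustration, not how Google or Blinkx actually do it. The transcript segments, their timestamps and the function names (`build_index`, `search`) are all made up for the example, and the speech-to-text step itself is assumed to have happened already.

```python
import re
from collections import defaultdict

# Hypothetical transcript segments: (start time in seconds, recognized text).
# In a real system these would come out of a speech-to-text engine.
SEGMENTS = [
    (0, "welcome to the match report"),
    (95, "a red card for the defender"),
    (610, "the final result is two to one"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(segments):
    """Map each word to the set of segment start times where it was spoken."""
    index = defaultdict(set)
    for start, text in segments:
        for word in tokenize(text):
            index[word].add(start)
    return index


def search(index, query):
    """Return sorted start times of segments containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


index = build_index(SEGMENTS)
print(search(index, "red card"))  # start times a player could seek to
```

A player could then seek straight to the returned start times instead of making you watch from the beginning. The hard part, of course, is getting an accurate transcript in the first place, which is exactly what the speech samples are for.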