Image search on the internet relies mostly on text. In other words, when you run a query in Google Image Search, the engine does not know what an image actually shows; it relies on captions, human tagging (manual assignment of metadata) and keywords that appear close to the image.
Up to now this is the best we have, and it is reasonably good, but we must admit that it is not perfect. As I have a keen interest in Computer Vision, the question I’m asking is: how can we improve on that?
The answer seems to lie in the realm of Artificial Intelligence and Machine Learning. For example, how do you know that the top image of my blog theme shows a bridge, fog and trees, and that the season is most likely autumn? Can a newborn know that? Well, no! You can recognise a tree because you have been told that this is a tree.
So what we need is an engine that is trained to recognise the objects around it. Educating a child is fun but teaching a machine is boring, so how can we automate the process of machine learning? In most Computer Vision systems we would be stuck, but with image search engines we might be luckier.
How? By using Internet users to train our engine.
An image engine could be made smarter so that it learns from users’ clicks. For example, suppose I run a Google image search on swimming. I will get images of people swimming, swimming pools, and so on, but I’m not interested in swimming pools, so I will not click on those; I will click on the images of people swimming. That behaviour could be used to train an image engine automatically: the more clicks an image gets for a keyword, the more relevant it is to that keyword, and eventually it is tagged as swimming. Yet the system is not perfect: if there’s a beautiful girl in a swimsuit, what are the odds that the image will be clicked on even though it is unrelated? 😉
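To make the click idea a bit more concrete, here is a minimal sketch in Python. It is not how Google or any real engine works; the class name `ClickTagger`, the thresholds `min_impressions` and `min_ctr`, and the image file names are all made up for illustration. It simply counts how often an image is shown and clicked for a query, and tags the image once the click rate is high enough.

```python
from collections import defaultdict


class ClickTagger:
    """Hypothetical click-based tagger: counts impressions and clicks per (query, image)."""

    def __init__(self, min_impressions=5, min_ctr=0.3):
        # Both thresholds are arbitrary assumptions, not tuned values.
        self.min_impressions = min_impressions
        self.min_ctr = min_ctr
        self.shown = defaultdict(int)    # (query, image_id) -> times displayed
        self.clicked = defaultdict(int)  # (query, image_id) -> times clicked

    def record(self, query, image_id, was_clicked):
        """Log that image_id was displayed for query, and whether the user clicked it."""
        key = (query, image_id)
        self.shown[key] += 1
        if was_clicked:
            self.clicked[key] += 1

    def relevance(self, query, image_id):
        """Click-through rate of the image for this query (0.0 if never shown)."""
        key = (query, image_id)
        shown = self.shown.get(key, 0)
        return self.clicked.get(key, 0) / shown if shown else 0.0

    def tags(self, image_id):
        """Queries for which the image was shown often enough and clicked often enough."""
        return sorted(
            query
            for (query, img), count in self.shown.items()
            if img == image_id
            and count >= self.min_impressions
            and self.relevance(query, img) >= self.min_ctr
        )


# Simulated searches for "swimming": the swimmer gets clicked, the pool does not.
tagger = ClickTagger()
for _ in range(8):
    tagger.record("swimming", "swimmer.jpg", True)
for _ in range(2):
    tagger.record("swimming", "swimmer.jpg", False)
for _ in range(10):
    tagger.record("swimming", "pool.jpg", False)

print(tagger.tags("swimmer.jpg"))  # the swimmer image earns the tag
print(tagger.tags("pool.jpg"))     # the pool image does not
```

Note that a plain click rate like this has exactly the weakness mentioned above: an eye-catching but unrelated image would still collect clicks and end up tagged.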
The next step would be for the system to recognise other, similar images and tag them as swimming automatically. This is more complex, as a typical swimming image is likely to contain water plus a swimmer, so the engine should eventually recognise water and swimmer as two separate components of the image. Again, this can be done through user-click tagging, but the click weight associated with water will end up lower than the one associated with swimming. The engine will thus know that the image is more related to swimming than to water, while still being linked to water. Through segmentation techniques, the engine can learn that there are only 2 components in the image (water + swimmer), and through a system of synonyms water can be made (roughly) equal to sea. In the end, we might have an intelligent engine that knows about the things of the world.
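The weighting and synonym part can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `SYNONYMS` table, the function names and the click counts are invented, and the segmentation step is not covered at all. The sketch folds synonyms into one tag and turns click counts into relative weights for a single image.

```python
from collections import Counter

# Hypothetical synonym table: each term maps to the tag it is (roughly) equal to.
SYNONYMS = {"sea": "water", "ocean": "water"}


def canonical(term):
    """Lower-case a search term and fold known synonyms into one tag."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def tag_weights(clicks_by_query):
    """Turn per-query click counts for one image into normalised tag weights."""
    merged = Counter()
    for query, clicks in clicks_by_query.items():
        merged[canonical(query)] += clicks
    total = sum(merged.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in merged.items()}


# Invented click counts for one image of a person swimming in the sea.
clicks = {"swimming": 60, "water": 25, "sea": 15}
weights = tag_weights(clicks)

# Print tags from strongest to weakest association.
for tag, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{tag}: {weight:.2f}")
```

With these numbers the image stays linked to water (the clicks for "sea" are counted there too), but its strongest tag is swimming, which is the behaviour described above.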