In This Issue
Spring Issue of The Bridge on Frontiers of Engineering
March 15, 2012 Volume 42 Issue 1

Large-Scale Visual Semantic Extraction

Thursday, March 15, 2012

Author: Samy Bengio

Machine-learning algorithms for image annotation can enable the classification of huge amounts of different types of data.1

The emergence of the web as a tool for sharing information has led to a massive increase in the size of potential datasets. Millions of images on web pages have tens of thousands of possible meanings that can be associated with labels, or annotations, in the form of HTML tags. These tags are used by webpage authors to describe the content of sections of web pages and by search engines or other web pages to link to them. Thus, HTML tags can be conveniently collected by querying search engines, such as (Torralba et al., 2008), or human-curated labels, such as (Deng et al., 2009).

Clearly, we need machine-learning algorithms for image annotation that can learn from and annotate data on this large scale. The algorithms must include (1) scalable training and testing times and (2) scalable memory usage. The ideal algorithm would be fast and would fit on a laptop, at least at annotation time. Although many models have been proposed and tested on small datasets, it is not clear if any of them satisfies these requirements.

In the first part of this article, we describe models that can learn to represent images and annotations jointly in a low-dimensional embedding space. The low dimension simultaneously implies fast computations for ranking annotations and small memory usage. To ensure good performance for such a model, we propose estimating its parameters by learning to rank annotations and optimize the process so that at least the top ranked annotations are relevant for a given image (this process is often called optimizing precision-at-top-k).

In the second part of the article, we propose a novel algorithm for shortening testing time in multi-class classification tasks in which the number of classes (or labels) is so large that even a linear algorithm can become computationally infeasible. Our proposal is for an algorithm for learning a tree structure of the labels in the joint embedding space. We believe that by optimizing the overall tree loss (i.e., the loss one would incur by using the tree to retrieve the top-k labels for any given image), such an algorithm would be more accurate than existing tree-labeling methods.

Joint Embedding of Images and Labels

Imagine that one could transform different types of objects - images, videos, music tracks, or texts - into a common representation (such as a vector of real numbers) that could be used by machines to compare them. One could then look for the nearest image similar to a given text label or videos similar to a given image and so on. Each type of object (image, video, music, text) could have a different transformation, or mapping function, and these could be learned using appropriate training data.

  Figure 1

Figure 1 illustrates our idea for a simple case that involves two types of objects—images and text labels. We propose to learn two mappings (or transformations) jointly into a feature space (which we call embedding space) where the images and labels are both represented. Even though the mapping functions for images and text labels are different, they can be learned jointly to optimize the objective of our final task—annotating images with appropriate text labels.

The first step consists in transforming images, which are often available in jpg, gif, and similar file formats, into more appropriate representations. Years of research in computer vision have yielded several alternatives.

Most recent approaches start by detecting so-called interest points, which are positions in the image considered important to analyze (e.g., Dalal and Triggs, 2005; Lowe, 1999). Features are extracted from these interest points using well-known methods, such as scale-invariant feature transforms (SIFT) or histograms of oriented gradients (HOG). Finally, the features are aggregated into a single fixed-length vector representing the image.

One common approach includes (1) learning a dictionary of common features from a large repository of images; (2) using an unsupervised clustering approach, such as K-Means; and (3) representing the image as a histogram of the number of times each feature of the dictionary is present in the image.

Assuming the image representation is a d-dimensional real-valued vector, we denote it by x Ï ℜd. Furthermore, let us assume our annotation task can consider up to Y possible text labels (or annotations). We can then represent each annotation as a unique integer i Ï ϒ = {1,...,Y} index into the dictionary of all possible annotations.

Finally, consider an arbitrary D-dimensional embedding (or target) space into which we want to map both images and labels. Our goal is then to learn to transform images, represented in their original image-feature space ℜd into a new representation in the joint space ℜd : ΦI(x) ℜd → ℜD while jointly learning a mapping for annotations (i.e., a position in the new joint space for each annotation):

ΦW(i) : {1,...,Y} → ℜD

In the simplest case, we assume that both mappings are linear, that is:

ΦI(x) = xTV


ΦW(i) = Wi

where V is a matrix of size d ´ D mapping images into the embedding space, W is a matrix such that line i, Wi is a D-dimensional vector that represents the label in position i in the dictionary, and T means “transpose” and applies to the matrix that precedes it. For example, xT means that the lines and columns in matrix x are transposed.

The goal for a given image is to rank all possible annotations so that the highest ranked annotations best describe the semantic content of the image. Consider the following model:

fi(x) = ΦI(xW(i)T = xTVWiT

where the possible annotations indexed by i are ranked according to the magnitude of fi(x), largest first.

Image Labeling as a Learning-to-Rank Task

Labeling an image can be viewed as a ranking task. Given an image, one must order labels so that the ones at the top of the list best correspond to the image and the ones at the end of the list are unrelated to it. Various learning-to-rank methods have been proposed in the machine-learning literature over the years, some of which can scale to large datasets and some, such as SVM-Rank, that cannot (e.g., Joachims, 2003).

Here is the simplest scalable approach. One can decompose the ranking task into several smaller tasks:

loss = Math formula

where the notation |a|+ is a if a>0, 0 otherwise, and for each training image x, we want the score of each appropriate label y+ for that image to be higher than the score of any inappropriate label y for that image by a margin of at least one. Otherwise we will incur the corresponding loss.

This loss can be minimized very efficiently on very large datasets using stochastic gradient descent. In this case, one randomly selects a triplet (x, y+, y) from the training set and computes its loss. If the loss is 0, then nothing changes. Otherwise, one can compute the gradient of the loss with respect to the parameters (in this case, V and W) and use it to modify the parameters in the direction opposite the gradient with some fixed but small learning rate, which amounts to a slight improvement in the model with respect to that triplet. By doing so repetitively, this procedure can be shown to converge to the nearest local optimum of the global loss.

However, note that most of the chosen triplets will yield a 0 loss (e.g., it very quickly becomes easy to say that a given image is more like a dolphin than a truck), hence not improving the model, which can lead to very slow progress. A faster alternative would be a loss that concentrates on the top of the ranking instead of considering all triplets (x, y+, y) uniformly.

In Weston et al., 2010, we proposed the weighted approximate rank loss (WARP), which can weigh each triplet according to the estimated rank of appropriate labels and still yield an efficient implementation. In order to estimate the rank of appropriate labels efficiently, we used the following trick: for a given image x and its correct label y+, we randomly sampled several incorrect labels, from y up, until we found one that yielded a non-negative loss.

The number of samples that had to be drawn from the set of possible labels to find one that yielded a non-negative loss is a good indicator of how well-ranked the correct label is. If it is very well ranked, one will have to sample many incorrect labels before finding one that is ranked higher. If it is poorly ranked, one will quickly find an incorrect label that is ranked higher.

In fact, a good estimate of the rank of the correct label is simply the number of incorrect labels in the dictionary divided by the number of samples needed in the proposed approach. One can then use this estimate to weigh the gradient update accordingly. The resulting model can be trained much faster and will perform at a much higher level of precision at the top of the ranking.

Large-Scale Learning

We trained an embedding model with the WARP loss on a very large dataset of more than 10,000,000 training images and more than 100,000 labels. The labels correspond to queries people have tried on Google Image Search, and images attributed to these labels were the ones they often clicked on. This made for a very noisy dataset in which queries were written in several languages with many spelling mistakes and many apparently similar queries (although there are often subtle differences between apparently similar queries, like dolphin, which usually means the sea animal, and dolphins, which often also means the football team in the United States).

Table 1

An interesting side effect of training such a model is that looking at the nearest labels to a given label in the embedding space provides a natural way of organizing labels. Examples in Table 1, which show the nearest labels to a given label, include several misspellings, translations, and semantically similar labels. Table 2 provides examples of images from our test set of 3,000,000 along with the nearest 10 labels in the embedding space. In both tables, one can see that the model is not perfect (e.g., the image of the dolphin was labeled “orca” in position 2, before “dolphin” in position 3). However, given the difficulty of the task, the model works quite well.

Table 2

Learning Label Trees

Real-time applications are not effective for labeling images when the number of labels is large (in our testing dataset we had about 100,000 labels). A naive approach would be to compute a score for each potential label, then sort them, and finally show the top few. However, as the number of potential labels increases, this procedure becomes infeasible in real time. Ideally, the labeling time should be, at most, proportional to the log of the number of labels.

Existing tree-based approaches (Beygelzimer et al., 2009a,b, Hsu et al., 2009) are typically less accurate than naive linear-time approaches. We have proposed a novel approach for learning a label tree, in which each node divides the set of potential labels into multiple subsets and a local decision is made to choose which subset of labels the current image is most probably associated with (Bengio, 2010). Figure 2 shows an example of a label tree. If the subsets are balanced (i.e., approximately the same size), this procedure decreases the number of labels to consider at a logarithmic rate until a prediction is reached. 

Figure 2

In our model, we apply the following two steps: (1) learning a label tree, and (2) learning predictors for each node of the tree. These steps are described in the following subsections.

Learning the Label-Tree Structure

To learn a label tree such as the one in Figure 2, we proceed as follows. Given a set of labels in a node, we look for a partition of that set into subsets. Within the subset, labels are difficult to distinguish with classifiers trained on their corresponding images. However, it is easier to distinguish images belonging to labels of one subset from images belonging to labels of another subset.

This basic rule enables us to postpone difficult decisions (e.g., is this an ipod nano or an ipod touch) and make easy decisions first (e.g., is this about either [dolphin, whale, shark] or about [ipod touch, ipod nano, iphone]). We can do this by computing the confusion matrix between all labels to count the number of times our classifiers confuse class i with class j on a test set of images not used to train the classifiers.

Table 3

We then use the resulting matrix to apply spectral clustering (Ng et al., 2002), a technique that can cluster objects (in this case labels) using a given similarity (or distance) matrix between them, instead of using, say, the Euclidean distance between two labels. This procedure can then be used recursively to create a complete label tree. Table 3 shows an example of labels that were clustered together using this technique.

Learning a Label-Embedding Tree

Once labels are organized into a tree, one can retrain a new embedding model (using the algorithm described above) in which each image can be labeled either with its original set of labels or with reference to the nodes of the tree that contain them. Thus, the embedding space contains representations of all labels, as well as all nodes of the tree.

At each training iteration, one selects an image at random and then selects one of its correct labels at random, including the nodes of the tree where the labels appear. Whenever an internal node is selected as a positive label for a given image, we select the competing negative label as one of the sibling nodes in the label tree. This corresponds to how the tree would be used at test time.

This repetitive procedure results in a structure of labels based on both semantic and visual similarities. In addition, using this method, the performance of a label-embedding tree, on average, performs both faster and slightly better at test time (Bengio et al., 2010).


The problem of image annotation for very large numbers of possible labels is very challenging. In recent years, significant progress has been achieved by proposing progressively better feature-extraction techniques from images to solve problems like rotation and translation invariances. In this article, we focus on the modeling aspect of the problem, particularly for very large, varied datasets of images and labels.

The proposed approach (learning a joint embedding space) achieves a performance level that is at least as high as performance levels on a similar task at that scale. In addition, the approach described above yields a novel visual semantic space in which images and text labels coexist and their relative distances in that space reflect both semantic and visual differences.

We also propose a method, an additional tree-based structure over the labels that learns from data, so that a given image can be annotated quickly in a time frame proportional to the log of the number of potential labels. With its improved performance and quick reactions, this approach may prove to be very efficient.


Bengio, S., J. Weston, and D. Grangier. 2010. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems, NIPS, 2010.

Beygelzimer, A., J. Langfordm, and P. Ravikumar. 2009a. Error-correcting tournaments. Pp. 247–262 in International Conference on Algorithmic Learning Theory, ALT.

Beygelzimer, A., J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl. 2009b. Conditional Probability Tree Estimation Analysis and Algorithm. In Conference in Uncertainty in Artificial Intelligence, UAI, 2009.

Dalal, N., and B. Triggs. 2005. Histograms of Oriented Gradients for Human Detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2005.

Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition.

Hsu, D., S. Kakade, J. Langford, and T. Zhang. 2009. Multi-Label Prediction via Compressed Sensing. In Advances in Neural Information Processing Systems, NIPS, 2009.

Joachims, T. 2003. Optimizing Search Engines using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, 2003.

Lowe, D. 1999. Object recognition from local scale-invariant features. Pp. 1150–1157 in Proceedings of the International Conference on Computer Vision, Vol. 2.

Ng, A.Y., and M.I. Jordan, and Y. Weiss. 2002. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, NIPS, 2002.

Torralba, A., R. Fergus, and W.T. Freeman. 2008. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11): 1958–1970.

Weston, J., S. Bengio, and N. Usunier. 2010. Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD, 2010.


1 This article is based on Weston et al., 2010, and Bengio et al., 2010.

About the Author:Samy Bengio is a research scientist in machine learning, Google Research.