Image-to-word transformation based on dividing and vector quantizing images with words
We propose a method for establishing a relationship between images and words. The method consists of two processes: one uniformly divides each image, together with its key words, into sub-images; the other carries out vector quantization of the sub-images. These processes lead to results showing that each sub-image can be correlated with a set of words, each selected from the words assigned to whole images. The original aspects of the method are: (1) all words assigned to a whole image are inherited by each divided sub-image; (2) the voting probability of each word for a set of divided images is estimated from the result of vector quantization of the sub-images' feature vectors. Experiments show the effectiveness of the proposed method.
To permit complete access to the information available through the WWW, media-independent access methods must be developed. For instance, a method enabling the use of an image as a possible query to retrieve images [1] and texts is needed.
So far, various approaches to image-to-word transformation have been studied, but they are very limited in terms of vocabulary or the domain of images. In real data, it is not possible to segment objects in advance, to assume the number of categories, or to avoid the presence of noise that is hard to remove.
In this paper, a method of image-to-word transformation is proposed based on statistical learning from images to which words are attached. The key concept of the method is as follows: (1) each image is divided into many parts, and at the same time all the words attached to each image are inherited by each part; (2) the parts of all images are vector quantized to form clusters; (3) the likelihood of each word in each cluster is estimated statistically.
2 Procedure of the proposed method
2.1 Motivation and outline
To find the detailed correlation between text and image (not simply discriminating an image into a few categories), each portion of the image should be correlated with words, rather than correlating the whole image with words.
Assigning keywords to images portion by portion would be an ideal way to prepare learning data. However, except for a very small vocabulary, we can neither find such learning data nor prepare them ourselves. The more the size of the data increases, the more difficult assigning keywords portion by portion becomes. So we have to develop another method that avoids this fundamental problem.
To avoid this problem, we propose a simple method that correlates each portion of an image with key words using only the key words for the whole image.
The procedure of the proposed method is as follows:
1. Use many images with key words as learning data.
2. Divide each image into parts and extract features from each part.
3. Each divided part inherits all the words of its original image.
4. Make clusters from all the divided parts using vector quantization.
5. Accumulate the frequencies of the words of all partial images in each cluster, and calculate the likelihood of every word.
6. For an unknown image, divide it into parts, extract their features, and find the nearest cluster for each part. Combine the likelihoods of these clusters and determine which words are most plausible.
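Step 6 above can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the function name `annotate` is hypothetical, and the paper does not specify how per-part likelihoods are combined, so summing them over parts is an assumed choice.

```python
import numpy as np

def annotate(part_features, centroids, word_likelihoods, top_n=3):
    """Assign plausible words to an unknown image (step 6, sketched).

    part_features    : (P, D) array, one feature vector per divided part
    centroids        : (C, D) array of cluster centroids
    word_likelihoods : (C, W) array, word_likelihoods[j, i] = P(w_i | c_j)
    Returns the indices of the top_n most plausible words.
    """
    scores = np.zeros(word_likelihoods.shape[1])
    for f in part_features:
        # nearest centroid for this part (Euclidean distance)
        j = int(np.argmin(np.linalg.norm(centroids - f, axis=1)))
        # combine likelihoods by summing votes over parts (assumed rule)
        scores += word_likelihoods[j]
    return np.argsort(scores)[::-1][:top_n]
```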
The main point of this method is to reduce noise (i.e., unsuitable correlations) by accumulating similar partial patterns from many images with key words.
For example, suppose an image has two words, 'sky' and 'mountain'. After dividing the image, the part containing only the sky pattern also carries both 'sky' and 'mountain', because all words are inherited. The word 'mountain' is inappropriate for that part. However, if another image has the two words 'sky' and 'river', then after accumulating these two images the sky pattern has two votes for 'sky', one for 'mountain', and one for 'river'. In this way, we can hope that the rate of inappropriate words gradually decreases as similar patterns are accumulated.
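The sky/mountain/river example can be reproduced with a few lines of Python; the images and cluster assignment here are the toy data from the text, not real experimental data.

```python
from collections import Counter

# Two learning images; each sky-only part inherits all words of its image.
image_words = [['sky', 'mountain'],   # image 1
               ['sky', 'river']]      # image 2

# Suppose the sky-patterned parts of both images fall into the same cluster.
cluster_votes = Counter()
for words in image_words:
    cluster_votes.update(words)

print(cluster_votes)  # Counter({'sky': 2, 'mountain': 1, 'river': 1})
# 'sky' now outvotes each inappropriate word, and its share keeps growing
# as more images containing sky patterns are accumulated.
```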
Figure 1 shows the concept of estimating likeli¬hoods of data.
2.2 Dividing image, feature extrac¬tion, and inheriting key words
Each image is divided equally into rectangular parts, because this is the simplest and fastest way to divide images. The number of divisions ranges from 3x3 to 7x7. Division methods driven by the contents of images, such as region extraction, are not tried in this paper.
In parallel with the dividing, all the words given for an image are inherited by each of the divided parts. This is a straightforward way to give words to each part, because there is no information with which to select words at this stage.
The features extracted from the divided images are (1) a 4x4x4 cubic RGB color histogram and (2) an 8-direction x 4-resolution histogram of intensity after Sobel filtering, both of which can be calculated by fast and common operations.
Feature (1) is calculated as follows:
1. Divide the RGB color space into 4x4x4 bins.
2. Count the number of pixels that fall in each bin.
As a result, 64 features are calculated.
Feature (2) is calculated as follows, for each of the 4 resolutions (1, 1/2, 1/4, and 1/8):
1. Filter with the vertical ($S_y$) and horizontal ($S_x$) Sobel filters.
2. For each pixel, calculate the argument $\arctan(S_y/S_x)$.
3. Divide the arguments $[-\pi, \pi)$ into 8 directions.
4. Sum the intensity $\sqrt{S_x^2 + S_y^2}$ of the pixels in each direction.
As a result, 32 features are calculated. In total, therefore, 96 features are calculated from each divided part.
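The two features described above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function names, the shift-based Sobel computation, and the crude `[::2, ::2]` subsampling between resolutions are illustrative choices, not the authors' implementation.

```python
import numpy as np

def color_histogram(rgb):
    """Feature (1): 4x4x4 cubic RGB histogram -> 64 values.
    rgb is an (H, W, 3) uint8 array."""
    bins = rgb.astype(int) // 64                 # 256 levels -> 4 bins per channel
    idx = bins[..., 0] * 16 + bins[..., 1] * 4 + bins[..., 2]
    return np.bincount(idx.ravel(), minlength=64).astype(float)

def sobel(img):
    """Sobel responses Sx, Sy on the interior pixels of a 2-D float array."""
    p = img
    sx = ((p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) -
          (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2]))
    sy = ((p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) -
          (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:]))
    return sx, sy

def edge_histogram(gray):
    """Feature (2): 8 directions x 4 resolutions (1, 1/2, 1/4, 1/8) -> 32 values."""
    feats, img = [], gray.astype(float)
    for _ in range(4):
        sx, sy = sobel(img)
        theta = np.arctan2(sy, sx)                   # arguments in [-pi, pi]
        d = np.minimum(((theta + np.pi) / (np.pi / 4)).astype(int), 7)
        mag = np.sqrt(sx ** 2 + sy ** 2)             # edge intensity per pixel
        feats.append(np.bincount(d.ravel(), weights=mag.ravel(), minlength=8))
        img = img[::2, ::2]                          # crude half-resolution step
    return np.concatenate(feats)

def extract_features(rgb):
    """96-dim feature vector for one divided part."""
    gray = rgb.mean(axis=2)
    return np.concatenate([color_histogram(rgb), edge_histogram(gray)])
```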
2.3 Vector quantization
The feature vectors extracted from the divided parts of all the learning images are clustered by vector quantization in a 96-dimensional space. In this paper, data-incremental vector quantization is used: centroids (representative vectors of the clusters) are created incrementally as data are input. Each cluster has one centroid, and each data point belongs to exactly one cluster.
There is only one control parameter in this method: the threshold of the quantization error (referred to below as the scale). The smaller the scale, the more centroids are created.
The procedure for vector quantization is as follows:
1. Set the scale d.
2. Select a feature vector as the first centroid.
3. For the i-th feature vector (2 <= i <= total number of feature vectors):
   if there is a centroid whose distance from the i-th feature vector is less than d, the i-th feature vector is assigned to the nearest centroid;
   otherwise, the i-th feature vector becomes a new centroid.
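The procedure above can be sketched directly in NumPy; the function name `incremental_vq` is a hypothetical label, and, as the text describes, centroids are founding vectors that are never moved afterwards.

```python
import numpy as np

def incremental_vq(features, scale):
    """Data-incremental vector quantization, sketching the procedure above.

    features : (N, D) array of feature vectors, processed in input order
    scale    : quantization error threshold d
    Returns (centroids, labels); each vector belongs to exactly one cluster.
    """
    centroids = [features[0]]                    # step 2: first vector is a centroid
    labels = [0]
    for f in features[1:]:                       # step 3: remaining vectors in order
        dists = np.linalg.norm(np.array(centroids) - f, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < scale:                     # close enough: join nearest cluster
            labels.append(j)
        else:                                    # too far: found a new cluster
            centroids.append(f)
            labels.append(len(centroids) - 1)
    return np.array(centroids), np.array(labels)
```

Because centroids are fixed once created, the result depends on the input order; a smaller scale d produces more centroids, as the text notes.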
2.4 Probability estimation of key words for each cluster
After the centroids $c_j$ ($j = 1, 2, \ldots, C$) are created by the vector quantization, the likelihood (conditional probability) $P(w_i \mid c_j)$ of each word $w_i$ ($i = 1, 2, \ldots, W$) in each cluster $c_j$ is estimated by accumulating frequencies:

$$P(w_i \mid c_j) = \frac{(m_{ji}/n_i)\,(n_i/N)}{\sum_{k=1}^{W} (m_{jk}/n_k)\,(n_k/N)} = \frac{m_{ji}}{M_j}$$

where $m_{ji}$ is the count of word $w_i$ in centroid $c_j$, $M_j\ (= \sum_{k=1}^{W} m_{jk})$ is the total count of all words in centroid $c_j$, $n_i$ is the count of word $w_i$ in all the data, and $N\ (= \sum_{k=1}^{W} n_k)$ is the total count of words over all the data (each word is counted repeatedly each time it appears).
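Since the weighted form reduces algebraically to the ratio of the word count in a cluster to the total word count of that cluster, the estimation can be sketched as below; the function name `word_likelihoods` and the count-matrix layout are illustrative assumptions.

```python
import numpy as np

def word_likelihoods(counts):
    """Estimate P(w_i | c_j) from a (C, W) matrix of word counts, where
    counts[j, i] = m_ji, the number of times word w_i was inherited by
    parts falling in cluster c_j.  The weighted formula in the text
    reduces to m_ji / M_j with M_j = sum_k m_jk."""
    M = counts.sum(axis=1, keepdims=True)        # M_j for each cluster
    return counts / np.where(M == 0, 1, M)       # guard empty clusters
```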