Automatic Semantic Annotation of Real-World Web Images

Automatic Semantic Annotation of Real-World Web Images
A Seminar Report
Sarath S Nair
Department of Computer Science & Engineering
College of Engineering Trivandrum
Kerala - 695016

As the number of Web images is increasing at a rapid rate, searching them semantically
presents a significant challenge. Many raw images are constantly uploaded with few meaningful
direct annotations of semantic content, limiting their search and discovery. In this paper, a
semantic annotation technique based on the use of image parametric dimensions and metadata
is presented. Using decision trees and rule induction, a rule-based approach is developed to
formulate explicit annotations for images fully automatically, so that by the use of this method,
a semantic query such as “sunset by the sea in autumn in New York” can be answered and
indexed purely by machine. The system is evaluated quantitatively using more than 100,000
Web images. Experimental results indicate that this approach is able to deliver highly competent
performance, attaining recall and precision rates that sometimes exceed 80 percent.
This approach enables a new degree of semantic richness to be automatically associated with
images, something that previously could only be done manually.


1 Introduction
Image archives on the Internet are growing at a phenomenal rate. With digital cameras becoming
increasingly affordable and the widespread use of home computers possessing hundreds
of gigabytes of storage, individuals nowadays can easily build sizable personal digital photo
collections. Photo sharing through the Internet has become a common practice. According to
reports released in 2007, one Internet photo-sharing startup has 40 million monthly
visitors and hosts two billion photos, with new photos on the order of millions being added on a
daily basis. Image search provided by major search engines, such as Google, MSN, and Yahoo!,
relies on textual descriptions of images found on the Web pages containing the images and the
file names of the images. These search engines do not analyze the pixel content of images and,
hence, cannot be used to search for unannotated image collections.
Research in image annotation has reflected the dichotomy inherent in the semantic gap and
is divided between two main categories: concept-based image retrieval and content-based image
retrieval. The former focuses on retrieval by image objects and high-level concepts, while the
latter focuses on the low-level visual features of the image. In order to determine image objects,
the image often has to be segmented into parts. Common approaches to image segmentation
include segmentation by region and segmentation by image objects. Segmentation by region
aims to separate image parts into different regions sharing common properties. These methods
compute a general similarity between images based on statistical image properties; common
examples of such properties are shape, texture, and color, and these methods are found to be
robust and efficient. Segmentation by object, on the other hand, is widely regarded as a hard
problem, which if successful, will be able to replicate and perform the object recognition function
of the human vision system. However, an advantage of using low-level features is that, unlike
high-level concepts, they do not incur any indexing cost as they can be extracted by automatic
algorithms. In contrast, direct extraction of high-level semantic content automatically is beyond
the capability of current technology.
This system is based on systematic analysis of image capture metadata in conjunction with
image processing algorithms. The result is the ability to automatically formulate annotations
to large numbers of Web images which endow them with a new level of semantic richness. In
doing so, we can answer semantic queries such as “Find images of sunset by the sea in New
York in autumn” purely automatically without any form of manual involvement.
2 Related Works
A real-time annotation demonstration system, ALIPR (Automatic Linguistic Indexing of
Pictures - Real Time), is provided online. ALIPR, launched officially on
November 1, 2006, is an advanced machine-assisted automatic image annotation modeling and
optimization technique for online images and is an important benchmark by which the competence
of other systems may be measured. The system annotates any online image specified by
its URL. This work is the first to achieve real-time performance with a level of accuracy useful
in certain real applications. The annotation is based only on the pixel information stored in the
image. The ALIPR automatic image annotation engine has a vocabulary of 332 English words
at the moment. However, many more English words can be used to search for pictures. The
system identifies annotation words for each picture in an average of about 1.4 seconds on a
3.0 GHz Intel processor.
3 System Structure
Images are annotated with predefined semantic concepts in conjunction with methods and
techniques of image processing and visual feature extraction. The system structure is depicted
schematically in Figure 1. The first step, metadata extraction, is followed by automatic semantic
annotation, which is a novel approach using decision trees and rule induction to develop a
rule-based mechanism to formulate annotations and search for specific images fully automatically.
The next step is similarity extraction of image content, such as color, shape, and texture
measures. The last step is semantic concept manipulation, which works in conjunction with
commonsense knowledge in order to build semantic resources from a noisy corpus and apply
those to the image retrieval task. All images in the database are metadata-embedded and
stored in JPEG format.
Figure 1: System Structure
• Metadata extraction: Extracts all image acquisition parameters, including aperture (f),
exposure time (t), subject distance (d), focal length (L), and fire activation (h), from the
metadata of images. Exposure value (EV), timeslot, and location can also be annotated
based on the metadata of images.
• Color feature extraction: Colour features in different colour spaces, such as RGB, have
specific advantages. We use color histograms, which describe the global color distribution
in an image; they are easy to compute and insensitive to small changes in viewing position.
• Shape feature extraction: Extracts the shape context correlogram, which takes into account
the statistics of the distribution of points in shape matching and retrieval. These
descriptors are good for incorporating the spatial distribution of the colors or the spatial
distribution of binary shapes, but do not permit the incorporation of the context of different shapes.
• Texture feature extraction: Texture distribution methods represent a complex texture
by measuring the frequency with which a set of atomic textures appear in the image.
• Automatic semantic annotation of images: A rule-based approach to formulate
annotations and search for specific images fully automatically, enabling a new
degree of semantic richness to be automatically associated with images, which previously
could only be done manually.
• Semantic manipulation: Forms a corpus of commonsense knowledge in order to build a
semantic resource from a noisy corpus and applies the resource appropriately to the
image retrieval tasks.
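To make the metadata-extraction step concrete, the following sketch pulls the acquisition dimensions listed above from an EXIF tag dictionary and derives the exposure value from the standard relation EV = log2(N²/t). The tag names follow common EXIF conventions, and the sample values are invented for illustration; this is not the paper's implementation.

```python
import math

def exposure_value(f_number, exposure_time):
    """EV = log2(N^2 / t), for aperture N and exposure time t in seconds."""
    return math.log2(f_number ** 2 / exposure_time)

def extract_params(exif):
    """Pull the acquisition dimensions used by the annotator from an
    EXIF tag dict (tag-name -> value); missing tags yield None."""
    return {
        "f": exif.get("FNumber"),          # aperture
        "t": exif.get("ExposureTime"),     # exposure time, seconds
        "d": exif.get("SubjectDistance"),  # metres
        "L": exif.get("FocalLength"),      # mm
        "h": exif.get("Flash"),            # fire (flash) activation
    }

# Invented sample: a bright outdoor shot.
sample = {"FNumber": 8.0, "ExposureTime": 0.004, "SubjectDistance": 60.0,
          "FocalLength": 55.0, "Flash": 0}
params = extract_params(sample)
ev = exposure_value(params["f"], params["t"])
print(round(ev, 2))  # log2(64 / 0.004) = log2(16000) ≈ 13.97
```

In practice these tags would be read from the JPEG file itself (for example with an EXIF library) before being fed to the annotation rules.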
3.1 Correlating Scene Characteristics With Image Dimensions
3.1.1 Scenes of Images
In relation to image acquisition, many images may be broken down to a few basic scenes such
as nature and wildlife, portrait, landscape, and sports. A landscape scene comprises the visible
features of an area of land, including physical elements such as landform, living elements of flora
and fauna, and abstract elements such as lighting and weather conditions. The goal of portrait
photography is to capture the likeness of a person or a small group of people. Like other types
of portraiture, the focus of acquisition is the person’s face, although the entire body and the
background may be included. Sports photography corresponds to the genre of photography
that covers all types of sports. The equipment used by a professional photographer usually
includes a fast telephoto lens and a camera that has an extremely fast exposure time that can
rapidly take pictures. Some typical scene categories and scene images are given in Table 1 and
Figure 2, respectively, where scene images are subsets of the corresponding image categories.
Categories       Scenes
Landscape (Sl)   Day Scenes (Sd), Night Scenes (Sn), Sunrises and Sunsets (Sss)
Portraits (Sp)   Indoor Events (Sie), Indoor Portraits (Sip), Outdoor Events (Soe), Outdoor Portraits (Sop), Sports (Ss)
Nature (Sna)     Macro (Sm), Wildlife (Sw)
Table 1: Scenes of Images
Figure 2: Each column shows the top matches to semantic queries of (a) “night scenes,”
(b) “outdoor portraits,” (c) “day scenes,” (d) “wildlife,” and (e) “sports”
The image file format standard embedded in the images, established by the Japan Electronic
Industries Development Association, makes use of the Exchangeable Image File Format
(EXIF) and contains the metadata specification for the image file format used in digital cameras.
The metadata tags defined in the metadata standard cover a broad spectrum of data, including
date and time information, acquisition parameters and descriptions, and copyright information,
which have been shown to be useful for managing digital libraries where limited tags and
comments may be entered manually. Some other common records of acquisition parameters
include aperture (f), exposure time (t), subject distance (d), focal length (L), and fire activation
(h). Location information can be included in the metadata, which could come from a GPS
receiver connected to the image acquisition device. In addition, GPS data can be added to
any digital photograph, either by correlating the time stamps of the photographs with a
time-dependent record from a hand-held GPS receiver or manually using a map or mapping software.
An image Ii may be characterized by a number of dimensions di1, ..., dik which correspond to
the image acquisition parameters, i.e., Ii = (di1, ..., dik). Each image corresponds to a point in
k-dimensional space. Figure 3a shows the image points of the images from Figure 2 and Figure
3b shows the clustering of images from the training set. As seen from Figure 3, each particular
type of image scene tends to cluster in a group, which forms the basis of the rule-induction algorithm
in the next section. Each of these dimension values may be a scalar or vector. An example of
scalar dimension is the exposure value (EV). An example of a vector-valued dimension is the
GPS coordinates.
Figure 3: Image distribution in 3D space. (a) Images from Figure 2. (b) Images from the training set.
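The clustering behaviour illustrated in Figure 3 can be sketched with a toy nearest-centroid assignment in a (t, d, EV) subspace of the dimension space. The coordinates below are invented for illustration and are not the paper's training data; the full system uses all k acquisition dimensions.

```python
import math

# Each image as a point (t, d, EV) in 3-D dimension space (invented values).
night = [(0.5, 100, 5), (0.8, 80, 6), (0.25, 200, 7)]
day   = [(0.004, 60, 12), (0.002, 90, 13), (0.008, 40, 11)]

def centroid(points):
    """Component-wise mean of a cluster of image points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def nearest(point, centroids):
    """Assign a new image to the scene whose cluster centroid is closest."""
    return min(centroids, key=lambda s: math.dist(point, centroids[s]))

cents = {"night": centroid(night), "day": centroid(day)}
print(nearest((0.6, 120, 6), cents))   # falls in the night-scene cluster
```

Because same-scene images group together in this space, threshold rules on the individual dimensions (next section) can carve out each cluster.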
3.1.2 Rule Induction and Annotation Algorithm
Here, the image dimensions are analyzed to annotate and classify images. The images are
annotated with predefined semantic concepts in conjunction with methods and techniques of
image processing and visual feature extraction. The algorithm first constructs a decision tree
starting from a training set, where each case specifies values for a collection of attributes and
may include discrete or continuous values. In supervised classification learning, one is given a
training set of labeled instances, and each training instance is described by a vector of attribute
(feature measurement variable) values x and a categorical class label y (output-dependent
variable). The task here is to induce rules φ which approximate the mapping X → Y,
and a common choice for the representation of φ is a decision tree. C4.5 is a well-known
decision tree classifier which has the ability to induce annotation rules in the form of decision
trees from a set of given examples. The information gain of an attribute a for a set of cases
T may be calculated. If a is discrete, and T1, ..., Ts are the subsets of T consisting of cases with
distinct values for attribute a, then
info(S) = − Σj [freq(Cj, S) / |S|] × log2[freq(Cj, S) / |S|]   (1)

gain = info(T) − Σi (|Ti| / |T|) × info(Ti)   (2)

where info(S) is the entropy function.

If a is a continuous attribute, cases in T are ordered with respect to the value of a. Assume
that the ordered values are v1, ..., vm. For i ∈ [1, m − 1], the value v = (vi + vi+1)/2 induces the splitting

T1^v = {vj | vj ≤ v},  T2^v = {vj | vj > v}   (3)

For each value v, the information gain gainv is computed by considering the splitting above,
and the information gain for a is defined as the maximum gainv.
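The computations in Eqs. (1)-(3) can be sketched in a few lines of Python: the entropy of a label set, and the search over midpoint splits of a continuous attribute for the maximum information gain. The toy exposure-time data below is invented for illustration.

```python
import math
from collections import Counter

def info(labels):
    """Entropy: -sum over classes of freq(Cj,S)/|S| * log2(freq(Cj,S)/|S|) (Eq. 1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_continuous(values, labels):
    """Best information gain over midpoint splits of a continuous attribute."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base = info(ys)
    best = 0.0
    for i in range(len(vs) - 1):
        v = (vs[i] + vs[i + 1]) / 2               # candidate threshold (Eq. 3)
        left = [y for x, y in pairs if x <= v]
        right = [y for x, y in pairs if x > v]
        g = base - (len(left) / len(ys)) * info(left) \
                 - (len(right) / len(ys)) * info(right)   # Eq. 2
        best = max(best, g)
    return best

# Exposure time separates night (long t) from day (short t) perfectly here:
t = [0.5, 0.25, 0.8, 0.004, 0.002, 0.008]
y = ["night", "night", "night", "day", "day", "day"]
print(gain_continuous(t, y))  # 1.0, a perfect binary split
```

C4.5 applies exactly this gain criterion at every node when it chooses the attribute and threshold to split on.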
An important consideration in the use of these algorithms is how to set the parameters of the
algorithm. There is a minobj or m parameter in C4.5 which governs algorithm termination.
When a node contains fewer than m instances, it is not split further but rather made into
a leaf labeled with the majority class. It has been suggested that the user should manually
experiment with different values of the m parameter. After training C4.5 with m = 14 on the
entire training set, the following set of rules is obtained. Letting the set of images be I, with
Sl, Sp, Sna ⊂ I; Sn, Sd, Sss ⊂ Sl; Sop, Soe, Sip, Sie, Ss ⊂ Sp; and Sm, Sw ⊂ Sna, we have:

∀i ∈ I, (ti > 0.125) ∧ (di > 30) ∧ (EVi ≤ 8) ⇒ i ∈ Sn,   (4)

∀i ∈ I, (di > 30) ∧ (EVi > 8) ∧ (ti ≤ 0.125) ∧ (10 < Li ≤ 100) ⇒ i ∈ Sd,   (5)

∀i ∈ I, (di > 50) ∧ (EVi > 12) ∧ (Li ≤ 200) ∧ (ti ≤ 0.002) ⇒ i ∈ Sss,   (6)

∀i ∈ I, [(fi ≤ 3) ∧ (5 < di ≤ 8)] ∧ {[(ti ≤ 0.00625) ∧ (Li ≤ 30)] ∨ [(30 < Li ≤ 182) ∧ (ISOi ≤ 250)] ∨ (Li > 182) ∨ (ti ≤ 0.003125)} ⇒ i ∈ Sop,   (7)

∀i ∈ I, (fi > 5.6) ∧ (Li ≤ 25) ∧ (5 < di ≤ 8) ∧ (ti > 0.003125) ⇒ i ∈ Soe,   (8)

∀i ∈ I, [(5 < di ≤ 8) ∧ (fi > 5.6)] ∧ {[(ti > 0.003125) ∧ (Li ≤ 25)] ∨ [(ti > 0.011111) ∧ (Li > 25) ∧ (hi = 0)]} ⇒ i ∈ Sip,   (9)

∀i ∈ I, (5 < di ≤ 8) ∧ {(fi ≤ 5.6) ∧ {[(Li > 182) ∧ (ti > 0.00625)] ∨ [(ISOi > 250) ∧ (25 < Li ≤ 182)]}} ∨ [(hi = 1) ∨ (fi > 5.6) ∧ (Li > 25) ∧ (ti > 0.011111)] ⇒ i ∈ Sie,   (10)

∀i ∈ I, (10 < di ≤ 40) ∧ (150 < Li ≤ 380) ∧ (ti ≤ 0.005) ⇒ i ∈ Ss,   (11)

∀i ∈ I, (di ≤ 1) ∧ (EVi > 11) ∧ (100 < Li ≤ 200) ⇒ i ∈ Sm,   (12)

∀i ∈ I, (Li > 450) ∧ (di > 20) ⇒ i ∈ Sw.   (13)
There are two types of scenes whose rules consist of multiple disjunctions: outdoor portraits and
indoor events. For example, four conditions were generated for the outdoor portrait scene.
The first condition of that scene alone is able to deliver a precision of 78.18 percent,
and it attains 86.68 percent when all conditions are used concurrently. This incorporation
of multiple conditions demonstrates the gain from combining multiple annotation rules.
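To make the rule format concrete, here is a minimal sketch applying the first three induced rules (the night, day, and sunrise/sunset rules of Eqs. (4)-(6)) to a dictionary of extracted dimensions. The dictionary keys and the sample images are invented; the thresholds are the ones quoted in the rules above.

```python
def annotate(img):
    """Apply the induced rules of Eqs. (4)-(6) (night/day/sunset only;
    the full system also uses rules 7-13). img maps dimension -> value:
    t = exposure time, d = subject distance, EV = exposure value, L = focal length."""
    t, d, EV, L = img["t"], img["d"], img["EV"], img["L"]
    labels = []
    if t > 0.125 and d > 30 and EV <= 8:
        labels.append("night scene")                       # Eq. 4
    if d > 30 and EV > 8 and t <= 0.125 and 10 < L <= 100:
        labels.append("day scene")                         # Eq. 5
    if d > 50 and EV > 12 and L <= 200 and t <= 0.002:
        labels.append("sunrise/sunset")                    # Eq. 6
    return labels

print(annotate({"t": 0.5, "d": 120, "EV": 6, "L": 35}))     # ['night scene']
print(annotate({"t": 0.001, "d": 80, "EV": 14, "L": 150}))  # ['sunrise/sunset']
```

Because every test is a simple threshold on a metadata dimension, annotation reduces to cheap comparisons and runs fully automatically at scale.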
3.2 Structured Annotation Description Scheme
MPEG-7 Visual Description Tools included in the standard consist of basic structures and
Descriptors that cover the following basic visual features: Color, Texture, Shape, Motion,
Localization, and Face recognition. Each category consists of elementary and sophisticated Descriptors.
There are seven Color Descriptors: Color Space, Color Quantization, Dominant Colors, Scalable
Color, Color Layout, Color-Structure, and GoF/GoP Color. The Scalable Color Descriptor
is a Color Histogram in HSV Color Space, which is encoded by a Haar transform. Its binary
representation is scalable in terms of bin numbers and bit representation accuracy over a broad
range of data rates. The Scalable Color Descriptor is useful for image-to-image matching and
retrieval based on color feature.
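The scalability of the descriptor comes from the Haar transform: each transform level halves the number of histogram bins into sums (a coarser histogram) while the difference coefficients retain the detail needed to reconstruct the finer one. A minimal one-level sketch, on an invented toy 8-bin histogram:

```python
def haar_step(bins):
    """One level of the 1-D Haar transform used by the Scalable Color
    Descriptor: pairwise averages give a coarser (scalable) histogram,
    pairwise half-differences carry the detail coefficients."""
    sums  = [(bins[i] + bins[i + 1]) / 2 for i in range(0, len(bins), 2)]
    diffs = [(bins[i] - bins[i + 1]) / 2 for i in range(0, len(bins), 2)]
    return sums, diffs

hist8 = [4, 4, 6, 2, 0, 0, 8, 8]        # toy 8-bin HSV histogram (invented)
coarse, detail = haar_step(hist8)
print(coarse)   # [4.0, 4.0, 0.0, 8.0], a 4-bin version of the same histogram
```

Transmitting only the sums (and progressively more difference coefficients) is what makes the binary representation scalable in bin numbers and accuracy.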
There are three shape Descriptors: Region Shape, Contour Shape, and Shape 3D. The shape
of an object may consist of either a single region or a set of regions as well as some holes in
the object. Since the Region Shape descriptor makes use of all pixels constituting the shape
within a frame, it can describe any shapes, i.e. not only a simple shape with a single connected
region but also a complex shape that consists of holes in the object or several disjoint regions.
The Contour Shape descriptor captures characteristic shape features of an object or region
based on its contour. It uses the so-called Curvature Scale-Space representation, which captures
perceptually meaningful features of the shape, enabling similarity-based retrieval. It reflects
properties of the perception of the human visual system, offers good generalization, and is
robust to non-rigid motion and to partial occlusion of the shape.
There are three texture Descriptors: Homogeneous Texture, Edge Histogram, and Texture
Browsing. Homogeneous texture has emerged as an important visual primitive for searching and
browsing through large collections of similar looking patterns. For instance, a user browsing an
aerial image database may want to identify all parking lots in the image collection. A parking
lot with cars parked at regular intervals is an excellent example of a homogeneous textured
pattern when viewed from a distance. The edge histogram descriptor represents the spatial
distribution of five types of edges, namely four directional edges and one non-directional edge.
Since edges play an important role for image perception, it can retrieve images with similar
semantic meaning. Thus, it primarily targets image-to-image matching (by example or by
sketch), especially for natural images with non-uniform edge distribution.
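The five edge types can be detected by correlating small luminance blocks with directional filters. The 2x2 coefficients below follow the commonly cited MPEG-7 edge-histogram filters; the threshold value and the sample blocks are invented for illustration.

```python
# 2x2 filter coefficients (row-major) for the five edge types.
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag_45":         (2 ** 0.5, 0, 0, -2 ** 0.5),
    "diag_135":        (0, 2 ** 0.5, -2 ** 0.5, 0),
    "non_directional": (2, -2, -2, 2),
}

def edge_type(block, threshold=10):
    """Classify a 2x2 luminance block (a, b, c, d row-major) by the filter
    with the strongest absolute response; None if below threshold (no edge)."""
    responses = {name: abs(sum(f * p for f, p in zip(coeffs, block)))
                 for name, coeffs in FILTERS.items()}
    best = max(responses, key=responses.get)
    return best if responses[best] >= threshold else None

print(edge_type((200, 50, 200, 50)))   # a bright/dark column pair: 'vertical'
```

Counting the winning edge type per block over sub-regions of the image yields the spatial edge histogram used for matching.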
3.3 Feature and Similarity Measure
In addition to the above annotation rules, the annotation system can be further enriched
through color histogram matching. These descriptors are good for taking into account the
spatial distribution of the colors and the spatial distribution of binary shapes.
First, all images were converted from RGB color space to an indexed image value, and then
the global color feature was extracted based on the color histogram. Histogram search is sensitive to
intensity variation, color distortion, and cropping. The RGB index value ranges from (0,0,0)
(black) to (1,1,1) (white). Sunrises and sunsets are particularly suited to evaluating this as they
offer a rich semantic meaning. Here, we select one image i ∈ Sss from the training set with
indexed image values (0.433333, 0.372549, 0.266665). Covering all test sets of Sss, indexed
image values between (0.517647, 0.305882, 0.105882) and (0.560784, 0.427451,
0.227451) were evaluated. A total of 697 images were annotated, which achieves a recall rate
of 100 percent, while the precision rate is only 1.3 percent. This indicates that, by color
alone, the annotation performance is not satisfactory. Although the first dimension
alone, namely global color distribution, delivers poor precision, using it in
conjunction with the three other dimensions raises the precision rate to 85.71 percent. Clearly,
without the global color feature enabled, the performance is only around
58.33 to 60 percent. In addition to color, through the use of other measures, the
shape and texture content may be similarly determined. In combination, therefore, a good level
of semantic annotation accuracy can be achieved.
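The channel-wise range test used in the sunset experiment above can be sketched as follows. The bounds are the indexed values quoted in the text; the test points are invented.

```python
def in_color_range(idx, low, high):
    """True if an image's indexed colour value (r, g, b) lies channel-wise
    within the inclusive range [low, high]."""
    return all(lo <= v <= hi for v, lo, hi in zip(idx, low, high))

# Range swept over the sunrise/sunset test set (values from the experiment).
low  = (0.517647, 0.305882, 0.105882)
high = (0.560784, 0.427451, 0.227451)

print(in_color_range((0.53, 0.40, 0.20), low, high))   # True: sunset-like colours
print(in_color_range((0.43, 0.37, 0.27), low, high))   # False: outside the range
```

A match on this test alone fires for many non-sunset images too, which is exactly why color recall is 100 percent but precision is only 1.3 percent until the metadata dimensions are added.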
4 System Evaluation
Evaluations are conducted on Web images using three annotation approaches:
the proposed approach, traditional textual annotations, and ALIPR. The annotation system
has been developed using an image database which consists of a collection of 3,231 unlabeled
images obtained at random from photograph albums over the Web. All images in the database
are metadata-embedded and stored in JPEG format, with sizes ranging from 200 × 72
to 32,770 × 43,521 pixels and file sizes between 22,397 and 768,918 bytes. All
images were manually labeled with a semantic concept (the scene of the image) before arbitrarily
dividing this image database into a training set and a test set; of these, 2,000 images were
kept in the test set. More than 100,000 further Web images were also evaluated.
These experimental results indicate that traditional human tagging without title and description
is good and relatively stable. However, the results of human tags linked up with titles and
descriptions are somewhat disappointing because one would expect that titles and descriptions
would enrich semantic concepts for effective image searching. The experimental results indicate
that titles and descriptions can sometimes harm the performance of textual image annotations.
For example, images with the title “The iPod Shuffle Sport Case” are largely unrelated
to sports scenes, and images with the title “Scenes from a Great Day” do not refer to day
scenes. These experiments indicate that about half of the results of the proposed approach
yield better performance than the competing methods. For ALIPR, except for the “indoor” and
“festival” concepts (named “indoor events” in this system), the results of the proposed approach fare
better in terms of accuracy and stability.
The combined approach sometimes yields better performance than the individual approaches;
however, the gain in precision through the inclusion of human tags tends to be rather marginal and, in
some cases, actually gives lower precision.
Summing up the experimental results, the proposed approach not only offers automated semantic
annotation; its annotation performance is also as good as, and sometimes better than,
tagging by human beings. By the use of this method, a semantic query such as “sunset by the sea
in autumn in New York” can be answered and indexed purely by machine.
Figure 4: Precision rate comparison between the proposed approach, tags by humans, and
ALIPR grouped by scenes of images and image categories
5 Conclusion
By the systematic analysis of embedded image metadata and parametric dimensions, it is
possible to determine the semantic content of images. Through the use of decision trees and
rule induction, a set of rules are established that allows the semantic contents of images to
be identified. When jointly applied with feature extraction techniques, this produces a new
level of meaningful image annotation. Using this image annotation method, it is possible to
provide semantic annotations for any unlabeled images in a fully automated manner. The
system is evaluated quantitatively using more than 100,000 Web images outside the training
database. Experimental results indicate that this approach is able to deliver highly competent
performance, attaining recall and precision rates that sometimes exceed 80 percent. This
approach enables an advanced degree of semantic richness to be automatically associated with
images, something that previously could only be done manually.
References
[1] R.C.F. Wong and C.H.C. Leung, “Automatic Semantic Annotation of Real-World Web Images,”
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 11, Nov 2008.
[2] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust Object Recognition
with Cortex-Like Mechanisms,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29,
no. 3, pp. 411–426, Mar 2007.
[3] J. Li and J.Z. Wang, “Real-Time Computerized Annotation of Pictures,” IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 30, no. 6, pp. 985–1002, June 2008.
[4] S. Ruggieri, “Efficient C4.5,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 2, pp. 438–444,
Mar/Apr 2002.
[5] P. van Beek et al., “Multimedia Content Description Interface, Part 5: Multimedia Description
Schemes,” ISO/IEC JTC1/SC29/WG11 MPEG01/N3966, Mar 2001.
[6] MPEG-7.

