BottomUp/TopDown Image Parsing with Attribute Grammer
summer project pal Active In SP Posts: 308 Joined: Jan 2011 
29012011, 09:55 AM
BottomUp/TopDown Image Parsing with Attribute Grammer A Seminar Report by Remya M.R. M105112 Department of Computer Science & Engineering College of Engineering Trivandrum 201011 Abstract The paper, presents an attribute graph grammar for image parsing on scenes with manmade objects, such as buildings, hallways, kitchens, and living rooms. For this choose one class of primitives 3D planar rectangles project and implimentationed on images, and six graph grammar production rules. Each production rule not only expands a node into its components but also includes a number of equations that constrain the attributes of a parent node and those of its children. Thus proposed graph grammar is context sensitive. The grammar rules are used recursively to produce a large number of objects and patterns in images and thus is a type of generative model.The inference algorithm integrates bottomup rectangle detection which activates topdown prediction using the grammar rules. The output of the inference is a hierarchical parsing graph with objects, surfaces, rectangles, and their spatial relations. In the inference, the acceptance of a grammar rule means a recognition of an object, and actions are taken to pass the attributes between a node and its parent through the constraint equations associated with this production rule. When an attribute is passed from a child node to a parent node, it is called bottomup, and the opposite is called topdown. BottomUpTopDown Image Parsing.pdf (Size: 5.3 MB / Downloads: 59) 1 Introduction In real world images, especially manmade scenes, such as buildings, oces, and living spaces, a large number of complex patterns (objects) are composed of a small set of primitives using few operational relations. This is very similar to language where a huge set of sentences can be generated with relatively small vocabulary and grammar rules in a hierarchy from words, phrases, clauses, and to sentences. The objective of image parsing is to decompose an image into its constituent components (objects, parts, surfaces, primitives) in a hierarchical structure represented by a parsing graph.Fig.1 shows a parsing graph for a kitchen scene using the rectangle primitives. Here only shows a subset of the rectangles for clarity. Its vertical edges show the decomposition of scene and objects, and its horizontal links specify the spatial relations between objects and components. Thus a parsing graph is more general than a parsing tree in language. In the literature the study of syntactic pattern recognition was pioneered by Fu et al in the 197080s. Unfortunately its applicability was very much limited by two diculties.(i) The primitive patterns(terminators in their visual languages)could not be computed reliably from real images. Thus their parsing algorithms achieved little success in real images.(ii) The string grammar(nite state automation)and stochastic context free grammars(SCFG)in early work are not strong enough to express complex visual patterns. 5 Figure 1: The hierarchic parse graph constructed by an iterative topdown/bottomup algo rithm. 6 2 Overview of the TopDown/BottomUp Inference Al gorithm The paper mainly focuses on the topdown and bottomup inference. The algorithm proceeds in three phases. Phase I is bottomup computation.Phase I computes a number of edge segments from the input image, and estimate a number of vanish points(usually three). Thus the line segments converging to the same vanishing point are put in a line set. Then generate a number of rectangle hypotheses. By drawing two pairs of line segments from two out of the three line sets, and then evaluate it by the goodness of t to a rectangle in the edge map. Thus an excessive number of rectangles are proposed as bottomup particles which may overlap or con ict with each other. The bottomup particles are tested and pursued one by one in a procedure similar to the matching pursuit algorithm. Phase 2 initializes the terminal nodes of the parse graph in a greedy way. In each step, the algorithm picks the most promising bottomup rectangle hypothesis with the heaviest weight among all of the candidates and accepts it if it reduces the description length. Then, the weights of all of the candidates that overlap or con ict with this accepted rectangle are reduced as in the matching pursuit algorithm. Phase 3 integrates topdown/bottomup inference. Each rectangle in the cur rent parse graph matches (often partially) to a production rule with attributes passed to the nonterminal node. These nonterminal nodes are, in turn, matched to other production rules, which then generate topdown proposals for predictions (see the downward arrows in Fig.1). In phase 3,each of the ve grammar rules (omitting the scene rule) maintains a data structure that stores all of its weighted candidates. Each step of the algorithm picks the most promising proposal (with the heaviest weight) among all ve candidate sets. Thus, a new nonterminal node is added to the parse graph. This corresponds to recognizing a new subconguration and activates the following actions: 1)creating potentially new "`topdown" proposals and inserting them into the lists.2) reweighting some proposals in the candidate sets.3) passing attributes between a node and its parent through the constraint equations associated with this production rule. 7 3 Related Work on Image Parsing In the literature, the study of syntactic pattern recognition was pioneered by Fu and You,Hanson and Riseman and Ohta et al. and other image understanding systems with top down/bottomup inference in the 19701980s. In recent years, attribute grammars and context sensitive graph grammars have been developed in visual diagrams parsing. In the vision lit erature, grammars are mostly studied in binary shape recognition, such as the grammars for medial axis and shock graphs. Most recently, there has been a resurgence of compositional computing for segmentation and object recognition. Detecting rectangular structures in images has been well studied in the vision litera ture, especially for detecting building roofs in aerial images. One class of methods detects edge segments and line primitives and then groups them into rectangles. The other types of methods use Hough Transforms on edge maps to detect rectangles globally. A Markov chain Monte Carlo method was developed in rectangular scene construction in which also uses com positional structures. Putting detecting rectangle structures in the broader topic of modeling structural variability, this work is also closely related to a variety of representations, including shape grammars with algebraic constraints. This work is also related to some previous work on object recognition and image parsing by datadriven Markov chain Monte Carlo(DDMCMC). 8 4 Attribute Graph Grammer for Scene Representation 4.1 Attribute Graph Grammer An attribute graph grammar is augmented from the SCFG by including attributes and constraints on the nodes. Denition 1.An attribute graph grammar is specied by a 5tuple: G = (VN; VT ; S;R; P) where VNand VT are the sets of nonterminal and terminal nodes, respectively, and S is the initial node for the scene. R is a set of production rules for spatial relationships. P is the probability for the grammar. A nonterminal node is denoted by capital letters, A;A1;A2 2 VN: and a terminal node is denoted by lowercase letters,a; b; c; a1; a2 2 VT . Both nonterminal and terminal nodes have a vector of attributes denoted by X (A) and x (a), respectively. R = (r1; r2; :::; rk) is a set of production rules expanding a nonterminal node into a number of nodes in VN [ VT . Each rule is associated with a number of constraint equations. For example, the following is a rule that expands one node A into two nodesA1;A2 2 VN : r : A ! (A1;A2) The associated equations are constraints on the attributes with constraints gi(X(A)) = fi(X(A1);X(A2)); i = 1; 2; :::; n®: gi ()and fi () are usually project and implimentationion functions that take some elements from the attribute vec tors. For instance, letX(A) = (X1;X2;X3) and let X(A) = (X11;X12): Then, an equation could simply be an equivalence constraint (or assignment) for passing the information between nodes A and A1 in either direction, X1 = X11: In the parsing process, the attributes of a child node X11 pass it to X1 in rule r. This is called "bottomup message passing". Then,X1 may be passed to another child node's attribute X2, with X21 = X1: This is called "topdown message passing." A production rule may instantiate a nonterminal node to a terminal node: r : A ! a with constraints gi(X(A)) = fi(x(a)); i = 1; 2; :::; n®: Denition 2.A parse graph G is a treestructured representation expanded from a root node S by a sequence of production rules (r1; r2; :::; rk) and augmented with spatial relations and constraints. Denition 3.A conguration C is a set of terminal nodes (rectangles in this paper): C = (ai; x(ai)) : ai 2 VT ; i = 1; 2; :::;K: Denition 4.A language of grammar G is the set of all valid congurations that can be derived by the production rules starting from a root node S. It is denoted by (G) = n (C;X ©) : S 1;:::; k ! C; i 2 R; i = 1; 2; ; :::; k o : 4.2 A Class of Primitives:Rectangles The simple grammar uses only one class of primitives, i.e.,the project and implimentationion of planar rectangles in 3space into the image plane. As illustrated in Fig. 2, it has two pairs of parallel line segments in 3D, which intersect at two vanishing points v1; v2 in the image plane. There fore,the set of terminal nodes is denoted by VT = (a; x(a)) : x(a) 2 a: There are many equivalent ways of dening the attributes x(a) for a rectangle.The 9 Figure 2: A planar rectangle (shaded) is described by eight variables: the two vanishing points v1 = (x1; y1) and v2 = (x2; y2) and the four directions 1; 2; 3 and 4 at the two vanishing points. method chooses the variables to simplify the constraint equations and thus denote x(a) by eight variables: two vanishing points v1 = (x1; y1) and v2 = (x2; y2), two orientations 1 and 2 for the two boundaries converging at v1,and two orientations 3 and 4 for the two boundaries converging at v2: x(a) = (x1; y1; x2; y2; 1; 2; 3; 4): 4.3 Six Production Rules As a generic grammar for image interpretation, the proposed representation has the root node S for the scene and one nonterminal node A for objects and surfaces: VN = (S;X(S)); (A;X(A)) : X(S) = n;X(A) A The scene node S generates n independent objects.The object node A can be instantiated(assigned)to a rectangle (rule r5) or can be used recursively by the other four production rules: r2the line production rule, r3the mesh production rule, r4the nesting production rule, and r6the cube production rule.The six production rules are summarized in Fig.3. For a line object of n = n(A) rectangles, X0(A) = fv1; v2; 1; 2; 3; 4; 1; :::; n 


