Beyond the Sum of Parts: Shape-based Object Detection and its Applications

Yarlagadda, Pradeep Krishna


Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN or persistent URL below, as we can guarantee their long-term accessibility.

DOI: 10.11588/heidok.00014965
URN: urn:nbn:de:bsz:16-heidok-149656
URL: http://www.ub.uni-heidelberg.de/archiv/14965

Abstract

The grand goal of Computer Vision is to generate an automatic description of an image based on its visual content. Such a description would enable many exciting capabilities, for example searching through images based on their visual content rather than the textual tags attached to them. Images and videos take an ever-increasing share of the total information content in archives and on the internet. Hence, such automatic descriptions would provide powerful tools for organizing and indexing by means of visual content. Category-level object detection is an important step in generating such automatic image descriptions.

The major part of this thesis addresses the problems encountered in three popular lines of approach that utilize shape in various ways for object detection, namely i) Hough Voting, ii) Contour-based Object Detection and iii) Chamfer Matching. The problems are tackled using the principle of emergence, which states that the whole is more than the sum of its parts.

Hough Voting methods are popular because they efficiently handle the high complexity of multi-scale, category-level object detection in cluttered scenes. However, the primary weakness of this approach is that mutually dependent local observations vote independently for intrinsically global object properties such as object scale. All the votes are added up to obtain object hypotheses. The assumption is thus that object hypotheses are a sum of independent part votes. Popular representation schemes are, however, based on an overlapping sampling of semi-local image features with large spatial support (e.g. SIFT or geometric blur). Features are thus mutually dependent. The question arises as to how to incorporate the feature dependencies into the Hough Voting framework. In this thesis, the feature dependencies are modelled by an objective function that combines three intimately related problems: i) grouping of mutually dependent parts, ii) solving the correspondence problem conjointly for dependent parts, and iii) finding concerted object hypotheses using extended groups rather than local observations alone.
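To make the independence assumption concrete, the following is a minimal sketch of standard Hough Voting for the object centre, i.e. the baseline the paragraph above criticises, not the dependent-group method developed in the thesis. The function names, the codebook format (codeword id mapped to centre offsets with weights) and the bin size are all assumptions made for illustration.

```python
from collections import defaultdict


def hough_vote(features, codebook, bin_size=10):
    """Accumulate independent votes for the object centre.

    features: list of (x, y, codeword_id) observed in a query image.
    codebook: dict codeword_id -> list of (dx, dy, weight) offsets to the
              object centre (hypothetical format, assumed for this sketch).
    """
    accumulator = defaultdict(float)
    for x, y, word in features:
        # Each feature votes on its own; no other feature is consulted.
        for dx, dy, weight in codebook.get(word, []):
            cx, cy = x + dx, y + dy
            cell = (int(cx // bin_size), int(cy // bin_size))
            # Hypotheses are scored as a plain sum of part votes.
            accumulator[cell] += weight
    return accumulator


def best_hypothesis(accumulator):
    """Return the (cell, score) pair with the highest summed vote."""
    return max(accumulator.items(), key=lambda kv: kv[1])


# Toy example: three features agree on one centre, one is clutter.
toy_codebook = {0: [(10, 0, 1.0)], 1: [(-10, 0, 1.0)], 2: [(0, 10, 0.5)]}
toy_features = [(40, 50, 0), (60, 50, 1), (50, 40, 2), (5, 5, 0)]
toy_votes = hough_vote(toy_features, toy_codebook)
toy_best = best_hypothesis(toy_votes)
```

In this toy example the three consistent features land in the same accumulator cell, while the clutter feature casts an equally confident vote elsewhere. Nothing in the accumulation step can tell that overlapping features are not independent evidence.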

While voting with dependent groups brings a significant improvement over standard Hough Voting, the interest points are still grouped in a query image during the detection stage. The grouping process can be made more robust by grouping densely sampled interest points in training images, which yields contours, and by evaluating the utility of these contours over the full ensemble of training images. However, contour-based object detection poses significant challenges for category-level object detection in cluttered scenes: object form is an emergent property that cannot be perceived locally but only becomes available once the whole object has been detected and segregated from the background. To tackle this challenge, this thesis addresses the detection of objects and the assembling of their shape simultaneously, while avoiding fragile bottom-up grouping in query images altogether. Rather, the challenging problems of finding meaningful contours and discovering their spatially consistent placement are both shifted into the training stage. These challenges can be better handled using an ensemble of training samples rather than just a single query image. A dictionary of meaningful contours is then discovered using grouping based on co-activation patterns in all training images. Spatially consistent compositions of all contours are learned using maximum margin multiple instance learning. During recognition, objects are detected and their shape is explained simultaneously by optimizing a single cost function.

For finding the placement of an object template or one of its parts in an edge map, Chamfer matching is a widely used technique because of its simplicity and speed. However, it treats objects as a mere sum of the distance transform values of all their contour pixels, thus leading to spurious matches. This thesis takes account of the fact that boundary pixels are not all equally important by applying a discriminative approach to chamfer distance computation, thereby increasing its robustness. While this improves the behaviour in the foreground, chamfer matching is still prone to accidental responses in spurious background clutter. To estimate the accidentalness of a match, a small dictionary of simple background contours is utilized. These background elements are trained to focus on locations where, relative to the foreground, accidental matches typically occur. Finally, a max-margin classifier is employed to learn the co-placement of all background contours and the foreground template. Both contributions bring significant improvements over state-of-the-art chamfer matching on standard benchmark datasets.
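As an illustration of the baseline being improved upon, here is a minimal sketch of plain chamfer matching, in which every template pixel contributes equally to the score. It is not the discriminative variant proposed in the thesis. For brevity the nearest-edge distance is computed by brute force rather than with a precomputed distance transform, and the function names and toy data are assumptions.

```python
import math


def chamfer_score(template, edges, offset):
    """Mean distance from each shifted template point to its nearest edge pixel.

    template: list of (x, y) contour points of the object template.
    edges:    list of (x, y) edge pixels found in the query image.
    offset:   (ox, oy) translation applied to the template.
    """
    ox, oy = offset
    total = 0.0
    for tx, ty in template:
        px, py = tx + ox, ty + oy
        # Brute-force nearest edge; a distance transform would cache this.
        total += min(math.hypot(px - ex, py - ey) for ex, ey in edges)
    # Unweighted average: all contour pixels count the same.
    return total / len(template)


def best_placement(template, edges, offsets):
    """Return the candidate offset with the lowest chamfer score."""
    return min(offsets, key=lambda off: chamfer_score(template, edges, off))


# Toy example: a 3-pixel horizontal segment and one stray clutter pixel.
toy_template = [(0, 0), (1, 0), (2, 0)]
toy_edges = [(5, 3), (6, 3), (7, 3), (0, 9)]
toy_offsets = [(ox, oy) for ox in range(8) for oy in range(8)]
toy_match = best_placement(toy_template, toy_edges, toy_offsets)
```

Because the score is an unweighted average, dense background clutter can produce low scores anywhere an edge happens to lie near each template point. This is the kind of accidental match the paragraph above describes.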

The final part of the thesis presents a case study in which shape-based object representations provided art historians with a semantic understanding of medieval manuscripts. To carry out the case study, a novel image dataset has been assembled from illuminations of 15th century manuscripts, with ground-truth information about various objects of artistic interest such as crowns and swords. An approach has been developed for automatically extracting potential objects (e.g. crowns) from the large image collection and then analysing the intra-class variability of objects by means of a low-dimensional embedding. With the help of the resulting plot, the art historians were able to confirm different artistic workshops within the manuscript and could verify the variations of art within a particular school. Obtaining such insights manually is a tedious task, since one has to go through and analyse all the object types on all the pages of the manuscript. In addition, a semi-supervised approach has been developed for analysing the variations within an artistic workshop, and extended further to understand the transitions across artistic styles by means of a 1-D ordering of objects.

Document type:	Dissertation
Supervisor:	Ommer, Prof. Dr. Björn
Place of Publication:	Heidelberg, Germany
Date of thesis defense:	15 May 2013
Date Deposited:	23 May 2013 08:47
Date:	2013
Faculties / Institutes:	Philosophische Fakultät > Kunsthistorisches Institut; The Faculty of Mathematics and Computer Science > Department of Computer Science; Service facilities > Interdisciplinary Center for Scientific Computing
DDC-classification:	004 Data processing Computer science
Controlled Keywords:	Computer Vision, Machine Learning, Pattern Recognition, Visual Recognition, Compositionality