June 2005, Issue #11
Active 3-D Object Recognition through Next View Planning
by Sumantra Dutta Roy

Abstract

1 Introduction

3-D object recognition is a difficult task, primarily because of the loss of information in the basic 3-D to 2-D imaging process. Most model-based 3-D object recognition systems consider features from a single image, using properties invariant to an object and, preferably, invariant to the viewpoint. We often need to recognize 3-D objects which, because of their inherent asymmetry (in any set of features: geometric, photometric, or colour-based, for example), cannot be completely characterized by an invariant computed from a single view. In order to use multiple views for an object recognition task, one needs to maintain the relationship between different views of an object. In single-view recognition, systems often use complex feature sets, which are not easy to extract from images. In many cases, it may be possible to achieve unambiguous recognition using simple features and suitably planned multiple views [6, 1].

A single view of a 3-D object often does not contain sufficient features to recognize it unambiguously. Objects which have two or more views in common with respect to a feature set may be distinguished through a sequence of views. As a simple example [6, 1], let us consider the set of features to be the number of horizontal and vertical lines, and a model base of polyhedral objects. Fig. 1(a) shows a given view of an object. All objects in Fig. 1(b) have at least one view which corresponds to two horizontal and two vertical lines.
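The ambiguity in this example, and the way a sequence of views resolves it, can be made concrete with a small sketch. The model base below is hypothetical: the object names and line counts are invented for illustration and are not taken from Fig. 1. Each view is reduced to the pair (number of horizontal lines, number of vertical lines), and the views of an object are assumed to be listed in order around a single rotation axis.

```python
from typing import Dict, List, Tuple

# A view is summarized by (horizontal line count, vertical line count).
View = Tuple[int, int]

# Hypothetical model base: each object lists its views in order around
# a single rotation axis. Names and counts are illustrative only.
MODEL_BASE: Dict[str, List[View]] = {
    "obj_a": [(2, 2), (3, 2), (2, 4)],
    "obj_b": [(2, 2), (3, 2), (4, 4)],
    "obj_c": [(2, 2), (2, 3), (2, 4)],
}


def consistent_objects(observed: List[View]) -> List[str]:
    """Return the objects that contain the observed views as a
    consecutive run, starting from any of their views."""
    matches = []
    for name, views in MODEL_BASE.items():
        n = len(views)
        # Indices wrap around because the object rotates about one axis.
        for start in range(n):
            if all(views[(start + i) % n] == obs
                   for i, obs in enumerate(observed)):
                matches.append(name)
                break
    return matches


# One view is ambiguous; each further view narrows the candidates.
print(consistent_objects([(2, 2)]))
print(consistent_objects([(2, 2), (3, 2)]))
print(consistent_objects([(2, 2), (3, 2), (4, 4)]))
```

With these invented counts, the first view matches all three objects, the second view leaves two candidates, and the third leaves one.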
A further complication arises if the given 3-D object does not fit inside the camera's field of view. Fig. 2 shows an example of such a case. The view in Fig. 3(a) could have come from any of the objects in Fig. 3(b), (c) and (d). Even if the identity of the object were known, one may often like to know what part of the object the camera is looking at, that is, the pose of the camera with respect to the object. Single-view recognition systems often use complex feature sets, which are associated with high feature extraction costs, and the extraction itself may be noisy. A simple feature set is applicable to a larger set of objects. In many cases, it may be possible to achieve recognition using a simpler feature set and suitably planned multiple observations.
An active sensor is one whose parameters can be varied in a purposive manner. For a camera, this implies purposive control over the external parameters (the parameters R and t describing the 3-D Euclidean transformation between the camera coordinate system and a world coordinate system) and the internal parameters (given by the internal camera parameter matrix A, composed of the focal lengths in the x- and y-image directions, the position of the principal point, and the skew factor). Our work in active 3-D object recognition has primarily considered two areas, namely aspect graph-based modeling and recognition using noisy sensors, and recognition of large 3-D objects through next view planning using inner camera invariants.
The following sections give an overview of our work in this area.

2 Aspect Graph-based Modeling and Recognition using Noisy Sensors

We consider the problem of recognizing an isolated 3-D object using simple features and suitably planned multiple views. We assume a single rotational degree of freedom (hereafter, DOF) between the object and the (uncalibrated) camera. We propose a novel hierarchical aspect graph-based knowledge representation scheme which encodes domain knowledge about different views of the objects in the model base. This plays an important role in calculating the probabilities of different entities as evidence comes in from a view, as well as in planning an optimal next view, subject to memory and processing constraints. The planning process is reactive - the system uses both the past history and the current observation to plan the next view. This feature, coupled with the explicit modeling of uncertainty, allows a high degree of robustness to feature detection errors, and does not incur the high computational cost that would be associated with an off-line system. Information from each observation prunes the search for a view that disambiguates between the possible view interpretation hypotheses at any stage. To serve as a benchmark, we use a simple deterministic case to show that the number of views required to disambiguate between a set of n competing aspects corresponding to the first view is O(log n). An important feature of our system is that it does not incur the overhead of tracking the region of interest across views. We present details of this system in [2].

We have experimented with databases of reasonably complex shapes, which have a large degree of interpretation ambiguity corresponding to a view. Fig. 4 shows the objects in the aircraft model base, and results of some experiments with the model base. For the experiments in Fig. 4, we use a very simple set of features: the number of horizontal and vertical lines in an image of the object, and the number of circles. In the bottom part of this figure, the initial view in each of these cases has the same features: three horizontal and vertical lines each, and [...]

In [5], we present a characterization of errors in aspect graphs, as well as an algorithm for estimating aspect graphs given noisy sensor data. The algorithm has low-order polynomial time complexity in the size of the tessellated viewpoint space. We also propose a function to evaluate the output of aspect graph construction algorithms. We have examined both a single rotational DOF case and a 3-DOF case for rotations. For the experiments in Fig. 4, the set at the right of the top row shows an example of using our aspect graph construction strategy, which explicitly models feature detection errors. Due to the shadow of the left wing on the fuselage of the aircraft, the feature detector detects four vertical lines instead of the correct number, three. The error modeling scheme in our aspect graph construction, and the error handler in the object recognition system, enable it to recover from this feature detection error. Reference [3] presents a complete overview of the entire system, covering both the aspect graph construction and the object recognition parts. The papers also enumerate the three cases in which our algorithm is not guaranteed to succeed.
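How evidence from successive views can reweight competing hypotheses, and absorb an occasional feature detection error, can be illustrated with a minimal Bayesian sketch. This is not the hierarchical scheme of [2, 3]: the hypothesis names, the predicted feature tuples, and the error probability `P_ERR` are all assumptions made for illustration. The first observation below contains a spurious fourth vertical line, loosely mirroring the shadow example above.

```python
from typing import Dict, List, Tuple

# A view is summarized as (horizontal lines, vertical lines, circles).
Features = Tuple[int, int, int]

# Assumed probability that the detector reports features that differ
# from what a hypothesis predicts (illustrative value, not from [2, 3]).
P_ERR = 0.1

# Hypothetical hypotheses: the features each one predicts for the first
# view and for the views reached by two further camera moves.
PREDICTED: Dict[str, List[Features]] = {
    "model_a": [(3, 3, 2), (1, 3, 0), (2, 1, 2)],
    "model_b": [(3, 3, 2), (1, 3, 0), (4, 2, 0)],
    "model_c": [(3, 3, 2), (2, 2, 1), (4, 2, 0)],
}


def update(belief: Dict[str, float], step: int,
           observed: Features) -> Dict[str, float]:
    """One Bayes update: weight each hypothesis by how well it explains
    the observation at this step, then renormalize."""
    weighted = {}
    for name, prior in belief.items():
        match = PREDICTED[name][step] == observed
        weighted[name] = prior * ((1.0 - P_ERR) if match else P_ERR)
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


def recognize(observations: List[Features]) -> Dict[str, float]:
    """Start from a uniform belief and fold in each view in turn."""
    belief = {name: 1.0 / len(PREDICTED) for name in PREDICTED}
    for step, observed in enumerate(observations):
        belief = update(belief, step, observed)
    return belief


# The first view reports four vertical lines instead of three.
observations = [(3, 4, 2), (1, 3, 0), (2, 1, 2)]
belief = recognize(observations)
best = max(belief, key=belief.get)
print(best, belief)
```

Because a mismatch only down-weights a hypothesis instead of eliminating it, the erroneous first view leaves the belief uniform, and the later views still single out `model_a`. A hard pruning rule would have discarded every hypothesis after the first view.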
In [8], we propose the idea of an appearance-based aspect graph. This is an alternative to feature-based methods, since eigenspace information captures all information about the view of an object that can be obtained from an image. In this case as well, we propose a hierarchical knowledge representation scheme, and a probabilistic reasoning framework which helps both in generating hypotheses about a given view of an object and in planning the next view, if required. The scheme is robust to problems which are usually associated with appearance-based methods, namely background clutter, and size and position changes in a view of an object. The paper [8] represents work in progress, and presents some very preliminary work in this area.

3 Recognition of Large 3-D objects through Next View Planning using Inner Camera Invariants

Our second scheme proposes a new on-line method for recognizing large 3-D objects, which may not fit in a camera's field of view. Unlike the previous case, we consider a projective camera model, and consider the case when the internal camera parameters may vary - either accidentally, or on purpose (e.g., a zoom-in operation to get details of a particular portion of the object, or a zoom-out operation to get a wider field of view). In [9, 10], we propose a new class of invariants for complete 3-D Euclidean pose estimation using an uncalibrated camera - Inner Camera Invariants. These are image-computable functions, independent of the internal parameters of the camera. We use these to advantage in our recognition scheme for large 3-D objects (details in [4, 7]). We propose a new part-based recognition knowledge representation scheme. We consider a very general definition of the word `part': in our formulation, an object is composed of parts, but is not partitioned into a set of parts. A view of an object contains 2-D or 3-D parts (which are detectable using 2-D or 3-D projective invariants, for example), and other `blank' or `featureless' regions (which do not have features detectable using [...]

We have experimented with a set of architectural models (Fig. 3), and a set of buildings in the I.I.T. Bombay academic area. While our formulation is for a general 6-DOF case, we have experimented with a 4-DOF setup, similar to the one depicted in Fig. 2. For the architectural models, we show results of successful recognition and pose estimation even in cases of a high degree of interpretation ambiguity associated with a view. Fig. 3 shows such an example: the view could have come from any of the three models, different views of which are shown in Fig. 3(b), (c) and (d), respectively. Fig. 5 shows an example of the system's resilience to changes in the internal parameters of the camera. For the same two initial views, we progressively zoom the camera out at the third view. In each case, the system correctly recognizes the object and estimates the pose accurately (<9.425°, -22.000mm, -9.999mm, 150.00mm>, <9.888°, -22.000mm, -9.999mm, 150.00mm>, and <9.896°, -22.000mm, -9.999mm, 150.00mm>, respectively). While [4] presents a preliminary description of our system, [7] describes the system in detail.
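Independence from the internal camera parameters can be illustrated numerically. The sketch below is not the invariant construction of [9, 10]. It assumes a zero-skew pinhole camera and uses one simple ratio of image-coordinate differences, chosen here for illustration, in which the focal length and principal point cancel. The point coordinates, the pose, and both sets of internal parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(points: Sequence[Point3], angle_deg: float, t: Point3,
            fx: float, fy: float, u0: float, v0: float) -> List[Pixel]:
    """Project world points with a zero-skew pinhole camera.

    External parameters: rotation by angle_deg about the y-axis (R)
    and translation t. Internal parameters: focal lengths fx, fy and
    principal point (u0, v0); the skew factor is assumed to be zero.
    """
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    pixels = []
    for X, Y, Z in points:
        # Camera coordinates: R * (X, Y, Z) + t.
        xc = c * X + s * Z + t[0]
        yc = Y + t[1]
        zc = -s * X + c * Z + t[2]
        # Perspective division, then the internal parameters.
        pixels.append((fx * xc / zc + u0, fy * yc / zc + v0))
    return pixels


def ratio_u(pixels: Sequence[Pixel], i: int, j: int, k: int) -> float:
    """Ratio (u_i - u_j) / (u_i - u_k): fx and u0 cancel out."""
    return (pixels[i][0] - pixels[j][0]) / (pixels[i][0] - pixels[k][0])


# Hypothetical scene points (mm) and camera pose.
points = [(0.0, 0.0, 0.0), (20.0, 0.0, 10.0), (-15.0, 10.0, 5.0)]
pose = dict(angle_deg=10.0, t=(5.0, -3.0, 150.0))

# Same pose, two different zoom settings (internal parameters).
pix_a = project(points, fx=800.0, fy=820.0, u0=320.0, v0=240.0, **pose)
pix_b = project(points, fx=1600.0, fy=1640.0, u0=330.0, v0=250.0, **pose)

ratio_a = ratio_u(pix_a, 0, 1, 2)
ratio_b = ratio_u(pix_b, 0, 1, 2)
print(pix_a[1], pix_b[1])   # pixel coordinates change with zoom
print(ratio_a, ratio_b)     # the ratio agrees up to rounding error
```

Since u = fx * (xc / zc) + u0 under the zero-skew assumption, every difference of u-coordinates is fx times a quantity that depends only on the pose and the scene points, so the ratio depends only on R, t and the points. A nonzero skew factor would break this particular ratio.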
We have also experimented with an extremely difficult operating environment - buildings in the I.I.T. Bombay academic area. There are numerous trees and other unmodelled objects. Additionally, occlusions and improper lighting conditions adversely affect the performance of the system. Fig. 6 and Fig. 7 show examples of experiments with the real buildings. In [7], we describe these in detail, examine robustness issues, and state the limitations of the proposed method.

Contact
References