Robotics Research Technical Report

Three-Dimensional Object Recognition from Single Two-Dimensional Images

by David G. Lowe

Technical Report No. 202
Robotics Report No. 62
February, 1986

Courant Institute of Mathematical Sciences
New York University
251 Mercer Street
New York, NY 10012
lowe@nyu.arpa

Work on this paper has been supported by National Science Foundation grant DCR-8502009.

THREE-DIMENSIONAL OBJECT RECOGNITION FROM SINGLE TWO-DIMENSIONAL IMAGES

David G. Lowe
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., New York, NY 10012
lowe@nyu.arpa

Abstract

A computer vision system has been implemented that can recognize three-dimensional objects from unknown viewpoints in single gray-scale images. Unlike most other approaches, the recognition is accomplished without any attempt to reconstruct depth information bottom-up from the visual input. Instead, three other mechanisms are used that can bridge the gap between the two-dimensional image and knowledge of three-dimensional objects. First, a process of perceptual organization is used to form groupings and structures in the image that are likely to be invariant over a wide range of viewpoints. Second, a probabilistic ranking method is used to reduce the size of the search space during model-based matching. Finally, a process of spatial correspondence brings the projections of three-dimensional models into direct correspondence with the image by solving for unknown viewpoint and model parameters. A high level of robustness in the presence of occlusion and missing data can be achieved through full application of a viewpoint consistency constraint.
It is argued that similar mechanisms and constraints form the basis for recognition in human vision.

1. Introduction

Much recent research in computer vision has been aimed at the reconstruction of depth information from the two-dimensional visual input. An assumption underlying some of this research is that the recognition of three-dimensional objects can most easily be carried out by matching against reconstructed three-dimensional data. However, there is reason to believe that this is not the primary pathway used for recognition in human vision and that most practical applications of computer vision could similarly be performed without bottom-up depth reconstruction. Although depth measurement has an important role in certain visual problems, it is often unavailable or is expensive to obtain. General-purpose vision must be able to function effectively even in the absence of the extensive information required for bottom-up reconstruction of depth or other physical properties. In fact, human vision does function very well at recognizing images, such as simple line drawings, that lack any reliable clues for the reconstruction of depth prior to recognition. This capability also parallels many other areas in which human vision can make use of partial and locally ambiguous information to achieve reliable identifications.

This paper presents several methods for bridging the gap between two-dimensional images and knowledge of three-dimensional objects without any preliminary derivation of depth. Of equal importance, these methods address the critical problem of robustness, with the ability to function in spite of missing data, occlusion, and many forms of image degradation. How is it possible to recognize an object from its two-dimensional projection when we have no prior knowledge of the viewpoint from which we will be seeing it?
An important role is played by the process of perceptual organization, which detects groupings and structures in the image that are likely to be invariant over wide ranges of viewpoints. While it is true that the appearance of a three-dimensional object can change completely as it is viewed from different viewpoints, it is also true that many aspects of an object's projection remain invariant over large ranges of viewpoints (examples include instances of connectivity, collinearity, parallelism, texture properties, and certain symmetries). It is the role of perceptual organization to detect those image groupings that are unlikely to have arisen by accident of viewpoint or position. Once detected, these groupings can be matched to corresponding structures in the objects through a knowledge-based matching process. It is possible to use probabilistic reasoning to rank the potential matches in terms of their predicted reliability, thereby focusing the search on the most reliable evidence present in a particular image.

Unfortunately, the matches based on viewpoint-invariant aspects of each object are by their nature only partially reliable. Therefore, they are used simply as "trigger features" to initiate a search process and viewpoint-dependent analysis of each match. A quantitative method is used to simultaneously determine the best viewpoint and object parameter values for fitting the projection of a three-dimensional model to given two-dimensional features. This method allows a few initial hypothesized matches to be extended by making accurate quantitative predictions for the locations of other object features in the image. This provides a highly reliable method for verifying the presence of a particular object, since it can make use of the spatial information in the image to the full degree of available resolution.
The final judgement as to the presence of the object can be based on only a subset of the predicted features, since the problem is usually greatly overconstrained due to the large number of visual predictions from the model compared to the number of free parameters. Figure 1 shows a diagram of these components and contrasts them with methods based upon depth reconstruction.

These methods for achieving recognition have been combined in a functioning vision system named SCERPO (for Spatial Correspondence, Evidential Reasoning, and Perceptual Organization). This initial implementation uses simplified components at a number of levels, for example by performing matching only between straight line segments rather than arbitrary curves. However, even this initial system exhibits many forms of robustness, with the ability to identify objects from any viewpoint in the face of partial occlusion, missing features, and a complex background of unrelated image features. The current system has a significant level of performance relative to other model-based vision systems, including those that are based upon the accurate derivation of depth measurements prior to recognition. In addition, it provides a framework for incorporating numerous potential improvements in image description, perceptual grouping, knowledge indexing, object modeling, and parameter solving, with resulting potential improvements in performance. Many of the components of this system can be related to corresponding capabilities of human vision, as will be described at the relevant points in this paper. The following section examines the psychological validity of the central goal of the system, which is to perform recognition directly from single gray-scale images.

[Figure 1 appears here. Left panel, "Depth Reconstruction": image features feed several cooperating processes producing a 2½-D sketch, followed by recognition (matching in the 3-D domain) against 3-D object models. Right panel, "Knowledge-based Vision": image features feed perceptual organization, producing 2-D perceptual groupings; search and evidential reasoning, then spatial correspondence, connect these to 3-D object models.]

Figure 1: On the left is a diagram of a commonly accepted model for visual recognition based upon depth reconstruction. Many processes cooperate to produce a depth representation (also known as the 2½-D sketch), and recognition is then performed within the three-dimensional domain. This paper presents instead the model shown on the right. A process of perceptual organization produces groupings and structures in the image that are likely to be invariant over wide ranges of viewpoint. A probabilistic search process produces hypothesized matches between these groupings and components of object models. These models are brought into accurate spatial correspondence with the original image features by solving for viewpoint and object parameters, resulting in reliable verification.

2. The Role of Depth Reconstruction in Human Vision

There is a widespread assumption within the computer vision and psychology communities that the recognition of three-dimensional objects is based on an initial derivation of depth measurements from the two-dimensional image [9, 19]. However, this assumption seems to be based more upon the perceived difficulty of achieving recognition from a two-dimensional projection than upon any convincing psychological data. In this paper we will argue that human capabilities for recognition are much more general than this restricted model would suggest, and that in fact the need for depth reconstruction is the exception rather than the rule. It is true that human vision contains a number of capabilities for the bottom-up derivation of depth measurements, such as stereo and motion interpretation, and these presumably have important functions.
However, biological visual systems have many objectives, so it does not follow that these components are central to the specific problem of visual recognition. In fact, the available evidence would seem to strongly suggest the opposite.

One difficulty with these methods for depth reconstruction is that the required information is often unavailable or requires an unacceptably long interval of time to obtain. Stereo vision is only useful for objects within a restricted portion of the visual field and range of depths for any given degree of eye vergence, and is never useful for distant objects. At any moment, most parts of a scene will be outside of the limited fusional area. Motion information is available only when there is sufficient relative motion between observer and object, which in practice is also usually limited to nearby objects. Recognition times are usually so short that it seems unlikely that the appropriate eye vergence movements or elapsed-time measurements could be taken prior to recognition even for those cases in which they may be useful. Depth measurements from shading or texture are apparently restricted to special cases such as regions of approximately uniform reflectance or regular texture, and they lack the quantitative accuracy or completeness of stereo or motion.

Secondly, human vision exhibits an excellent level of performance in recognizing images, such as simple line drawings, in which there is very little potential for the bottom-up derivation of depth information. Whatever mechanisms are being used for line-drawing recognition have presumably developed from their use in recognizing three-dimensional scenes. The common assumption that line-drawing recognition is a learned or cultural phenomenon is not supported by the evidence.
In a seemingly definitive experiment, Hochberg and Brooks [13] describe the case of a 19-month-old human baby who had had no previous exposure to any kinds of two-dimensional images, yet was immediately able to recognize ordinary line drawings of known objects. It is true that there has been some research on the bottom-up interpretation of line drawings; however, these methods usually lead to sparse, under-constrained relationships in depth rather than to anything resembling a depth map. In addition, these depth inferences apply only to special cases, and it is often not possible to tell which particular inference applies to a particular case. For example, one often-discussed inference is the use of perspective convergence to derive the orientation of lines that are parallel in three dimensions; however, given a number of lines in the image that are converging to a common point, there is usually no effective way to distinguish convergence due to perspective effects from the equally common case of lines that are converging to a common point in three dimensions. In this paper we will make use of many of the same inferences that have previously been proposed for deriving depth, but they will instead be used to generate two-dimensional groupings in the image that are used directly to index into a knowledge base of three-dimensional objects.

Finally, and of most relevance for many applications of computer vision, there has been no clear demonstration of the value of depth information for performing recognition even when it is available. The recognition of objects from complete depth images, such as those produced by a laser scanner, has not been shown to be easier than for systems that begin only with the two-dimensional image.
This paper will describe methods for directly comparing the projection of three-dimensional representations to the two-dimensional image without the need for any prior depth information. Since the final verification of an interpretation can be performed by comparing the projected knowledge with each available image to the full accuracy of the data, there is nothing to be gained at this stage from any depth information that is derived from the original images. The one remaining issue is whether there is some way in which the depth information can significantly speed the search for the correct knowledge to compare to the image.

Of course, none of the above is meant to imply that depth recovery is an unimportant problem or lacks a significant role in human vision. Depth information may be crucial for the initial stages of visual learning or for acquiring certain types of knowledge about unfamiliar structures. It is also clearly useful for making precise measurements as an aid to manipulation or obstacle avoidance. Recognition may sometimes leave the precise position in depth undetermined if the absolute size of an object is unknown. Human stereo vision, with its narrow fusional range for a given degree of eye vergence, seems to be particularly suited to making these precise depth measurements for selected nearby objects as an aid to manipulation and bodily interaction. However, it seems likely that the role of depth recovery in common instances of recognition has been overstated.

3. Solving for Spatial Correspondence

Many areas of artificial intelligence are aimed at the interpretation of data by finding consistent correspondences between the data and prior knowledge of the domain. In this paper, we will begin by defining the consistency conditions for judging correspondence between image data and three-dimensional knowledge.
Unlike many other areas of artificial intelligence, an important component of this knowledge is quantitative spatial information that requires specific mathematical techniques for achieving correspondence. The particular constraint that we wish to apply can be stated as follows:

    The viewpoint consistency constraint: The locations of all projected model features in an image must be consistent with projection from a single viewpoint.

The effective application of this powerful constraint will allow a few initial matches to be bootstrapped into quantitative predictions for the locations of many other features, leading to the reliable verification or rejection of the initial matches. Later sections of this paper will deal with the remaining problems of recognition, which involve the generation of primitive image structures to provide initial matches to the knowledge base and the algorithms and control structures to actually perform the search process.

The physical world is three-dimensional, but an image contains only a two-dimensional projection of this reality. It is straightforward mathematically to describe the process of projection from a three-dimensional scene model to a two-dimensional image, but the inverse problem is considerably more difficult. It is common to remark that this inverse problem is underconstrained, but this is hardly the source of difficulty in the case of visual recognition. In the typical instance of recognition, the combination of image data and prior knowledge of a three-dimensional model results in a highly overconstrained solution for the unknown projection and model parameters. In fact, we must rely upon the overconstrained nature of the problem to make recognition robust in the presence of missing image data and measurement errors. So it is not the lack of constraints but rather their interdependent and non-linear nature that makes the problem of recovering viewpoint and scene parameters difficult.
The difficulty of this problem has been such that few vision systems have made use of quantitative spatial correspondence between three-dimensional models and the image. Instead it has been common to rely on qualitative topological forms of correspondence, or else to produce three-dimensional depth measurements that can be matched directly to the model without having to account for the projection process. Our goal, then, is to carry out a quantitative form of spatial reasoning to provide a two-way link between image measurements and the object model. Matches between the model and some image features can be used to constrain the three-dimensional position of the model and its components, which in turn leads to further predictions for the locations of model features in the image, leading to more matches and more constraints. The problem of generating the few initial matches to begin this process will be dealt with in later sections of this paper. Here we will describe the spatial reasoning process that relates the image measurements to three-dimensional constraints.

The precise problem we wish to solve is the following: given a set of known correspondences between three-dimensional points on a model and points in a two-dimensional image, what are the values of the unknown projection and model parameters that will result in the projection of the given model points into the corresponding image points? The unknown parameters might include the position and orientation of the object in three dimensions, the focal length of the camera, and various degrees of freedom in the model description, such as articulations or variable dimensions. We will later extend this problem description to allow for the least-squares solution of overdetermined constraints, and to allow for matches between corresponding lines (without concern for the position of endpoints) rather than just points.
There has been some previous work on solving for the position of a rigid three-dimensional object given matches between points on the model and points in the image. This problem must be solved in order to bring photographs into correspondence with mapping data in the field of photogrammetry. An analytic solution, known as the Church method, is presented in the photogrammetry literature [28], but this solution involves non-linear equations which must themselves be solved by iterative numerical methods. The current preferred technique in photogrammetry is to use the same linearization and iterative methods that will serve as our starting point in this paper. Fischler and Bolles [7] have presented another analytic solution to the problem, but one which also requires iterative numerical solution. In addition, they have presented useful information on the conditions under which multiple solutions can arise. A different approach was taken in the ACRONYM computer vision system [4], which used a general symbolic equation solver to place bounds on projection and model parameters from image measurements. However, the equations for the general projection problem were often much too difficult for exact solution, so sub-optimal bounds would be produced. The approach taken in this paper will be to linearize the projection equations and apply Newton's method for the necessary number of iterations. A reparameterization of these equations is used to simplify the calculation of the necessary derivatives. This allows us to efficiently solve not only the basic rigid-body problem studied in photogrammetry, but also to extend the method to variable model parameters and forms of correspondence other than the matching of points.

Application of Newton's method

We can describe the projection of a three-dimensional model point p into a two-dimensional image point (x', y') with the following equations:

    (x, y, z) = R(p − T)

    (x', y') = (fx/z, fy/z)

where T is a 3-D translation vector and R is a
rotation matrix which transforms p from the original model coordinates into a point (x, y, z) in camera-centered coordinates. These are combined in the second equation with a parameter f proportional to the camera focal length to perform perspective projection into an image point (x', y'). Our task is to solve for T, R, and possibly f, given a number of model points and their corresponding locations in an image.

In order to apply Newton's method, we must be able to calculate the partial derivatives of x' and y' with respect to each of the unknown parameters. However, it is difficult to calculate these partial derivatives for this form of the projection equations. In particular, this formulation does not describe how to represent the rotation R in terms of its three underlying parameters. Many previous attempts to solve the viewpoint determination problem have treated the rotation as consisting of more than three parameters, which leads to the requirement for more image data than is actually needed and to poor techniques for handling errors.

The partial derivatives with respect to the translation parameters can be most easily calculated by first reparameterizing the projection equations to express the translations in terms of the camera coordinate system rather than model coordinates. This can be described by the following equations:

    (x, y, z) = Rp

    (x', y') = (fx/(z + Dz) + Dx, fy/(z + Dz) + Dy)

Here the variables R and f remain the same as in the previous transform, but the vector T has been replaced by the parameters (Dx, Dy, Dz), where the two transforms are equivalent when

    −RT = (Dx(z + Dz)/f, Dy(z + Dz)/f, Dz)

In this new parameterization, Dx and Dy simply specify the location of the object on the image plane and Dz specifies the distance of the object from the camera. As will be shown below, this formulation makes the calculation of partial derivatives with respect to the translation parameters almost trivial. We are still left with the problem of representing the rotation R in terms of its three underlying parameters.
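As a concrete illustration, the reparameterized projection can be written in a few lines of code. This is only a sketch in Python with NumPy; the function name and argument layout are our own and not part of SCERPO:

```python
import numpy as np

def project(R, p, D, f):
    """Project a model point p using (x, y, z) = R p followed by
    (x', y') = (f*x/(z + Dz) + Dx, f*y/(z + Dz) + Dy).
    (Dx, Dy) positions the object on the image plane; Dz sets its
    distance from the camera."""
    x, y, z = R @ np.asarray(p, dtype=float)
    Dx, Dy, Dz = D
    c = 1.0 / (z + Dz)  # the substitution c = (z + Dz)^-1 used below
    return np.array([f * x * c + Dx, f * y * c + Dy])
```

For example, with R the identity, p = (1, 2, 3), D = (0, 0, 2) and f = 5, the point projects to (1, 2).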
Our solution to this second problem is based on the realization that the Newton method does not in fact require an explicit representation of the individual parameters. All that is needed is some way to modify the original rotation in mutually orthogonal directions and a way to calculate partial derivatives of image features with respect to these correction parameters. Therefore, we have chosen to take the initial specification of R as given and add to it incremental rotations φx, φy and φz about the x, y and z axes of the current camera coordinate system. In other words, we maintain R in the form of a 3×3 matrix rather than in terms of 3 explicit parameters, and the corrections are performed by matrix multiplication with correction rotation matrices rather than by adding corrections to some underlying parameters. It is fast and easy to compose rotations, and these incremental rotations are fairly independent of one another if they are small. The Newton method is now carried out by correcting errors in x' and y' by calculating the optimum correction rotations φx, φy and φz. The derivatives of x, y and z with respect to these incremental rotations are simple; for example, ∂x/∂φy = z and ∂x/∂φz = −y. The first table in Figure 2 gives these derivatives for all combinations of variables.

Given this parameterization it is now straightforward to accomplish our original objective of calculating the partial derivatives of x' and y' with respect to each of the original camera parameters. For example, our transform tells us that:

    x' = fx/(z + Dz) + Dx

so

    ∂x'/∂Dx = 1

Also,

    ∂x'/∂φy = (f/(z + Dz)) ∂x/∂φy − (fx/(z + Dz)²) ∂z/∂φy

but, from the table in Figure 2, we know that

    ∂x/∂φy = z   and   ∂z/∂φy = −x

and, for simplicity, we will substitute

    c = 1/(z + Dz)

giving

    ∂x'/∂φy = fc(z + cx²)

All the other derivatives can be calculated in a similar way. The table in Figure 2 gives the derivatives of x' and y' with respect to each of the seven parameters of our camera model, again substituting c = (z + Dz)⁻¹ for simplicity. Given these partial derivatives of x' and y', the application of Newton's method is straightforward.
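Closed-form derivatives of this kind are easy to mistype, so it is worth checking them numerically against a perturbed projection. The sketch below codes the derivative rows in the parameter order (Dx, Dy, Dz, φx, φy, φz, f); the function name and argument layout are our own assumptions:

```python
import numpy as np

def jacobian_rows(R, p, Dz, f):
    """Partial derivatives of x' and y' with respect to
    (Dx, Dy, Dz, phi_x, phi_y, phi_z, f), with c = (z + Dz)^-1."""
    x, y, z = R @ np.asarray(p, dtype=float)
    c = 1.0 / (z + Dz)
    Jx = np.array([1, 0, -f*c*c*x, -f*c*c*x*y,      f*c*(z + c*x*x), -f*c*y, c*x])
    Jy = np.array([0, 1, -f*c*c*y, -f*c*(z + c*y*y), f*c*c*x*y,       f*c*x, c*y])
    return Jx, Jy
```

Each entry can be verified by perturbing the corresponding parameter (for φz, composing a small rotation about the camera z axis) and differencing the projections.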
For each point in the model which should match against some corresponding point in the image, we first calculate the projection of the model point and measure the error in its position compared to the given image point. The x and y components of the error can be used independently to create separate linearized constraints. For example, making use of the x component of the error, Ex, we create an equation which expresses this error as the sum of the products of its partial derivatives times the unknown correction values, the Δ's:

    Ex = (∂x'/∂Dx)ΔDx + (∂x'/∂Dy)ΔDy + (∂x'/∂Dz)ΔDz + (∂x'/∂φx)Δφx + (∂x'/∂φy)Δφy + (∂x'/∂φz)Δφz

Using the same point we create a similar equation for its y component, so for each point correspondence we derive two equations. From three point correspondences we can derive six equations and produce a complete linear system which can be solved for all six camera-model corrections (we are assuming in this example that the camera parameter f is either given, or can be approximated by a large value). After each iteration the Δ terms should shrink by about one order of magnitude, and no more than a few iterations should be needed even for high accuracy.

              x     y     z
    φx        0    −z     y
    φy        z     0    −x
    φz       −y     x     0

              x'               y'
    Dx        1                0
    Dy        0                1
    Dz       −fc²x            −fc²y
    φx       −fc²xy           −fc(z + cy²)
    φy        fc(z + cx²)      fc²xy
    φz       −fcy              fcx
    f         cx               cy

Figure 2: The table at the top gives the partial derivatives of x, y and z with respect to counterclockwise rotations φ (in radians) about the coordinate axes. The table below gives partial derivatives of x' and y' with respect to each of the camera transform parameters.

Unknown model parameters, such as variable lengths or angles, can also be incorporated. In the worst case, we can always calculate the partial derivatives with respect to these parameters by using standard numerical techniques that slightly perturb the parameters and measure the resulting change in projected locations of model points.
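To make the iteration concrete, the following sketch solves the reduced problem in which only the translation parameters (Dx, Dy, Dz) are unknown, with the rotation and focal length held fixed; each point match contributes the two linearized error equations described above. The function and variable names are our own, not SCERPO's:

```python
import numpy as np

def solve_translation(model_pts, image_pts, R, f, D0, iters=12):
    """Newton's method for (Dx, Dy, Dz) alone.  Each matched point
    contributes an x and a y error equation built from the Figure 2
    derivatives; the stacked 2n x 3 system is solved each iteration."""
    D = np.array(D0, dtype=float)
    for _ in range(iters):
        rows, errs = [], []
        for p, q in zip(model_pts, image_pts):
            x, y, z = R @ np.asarray(p, dtype=float)
            c = 1.0 / (z + D[2])
            rows.append([1.0, 0.0, -f*c*c*x])   # dx'/d(Dx, Dy, Dz)
            rows.append([0.0, 1.0, -f*c*c*y])   # dy'/d(Dx, Dy, Dz)
            proj = np.array([f*x*c + D[0], f*y*c + D[1]])
            errs.extend(np.asarray(q) - proj)
        delta = np.linalg.lstsq(np.array(rows), np.array(errs), rcond=None)[0]
        D += delta
    return D
```

With synthetic image points generated from a known translation, the iteration recovers it from a rough starting value, and the correction terms shrink rapidly as described above.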
However, in most cases it is possible to specify the three-dimensional directional derivative of model points with respect to the parameters, and these can be translated into image derivatives through projection. Examples of the solution for variable model parameters simultaneously with solving for viewpoint have been presented in previous work [15].

In most applications of this method we will be given more correspondences between model and image than are strictly necessary, and we will want to perform some kind of best fit. In this case the Gauss least-squares method can easily be applied. The matrix equations described above can be expressed as

    [J][Δ] = [E]

where [J] is the Jacobian matrix of partial derivatives, [Δ] is the vector of unknown corrections, and [E] is the vector of error terms. When this system is overdetermined, we can perform a least-squares fit of the errors simply by solving the corresponding normal equations:

    [J]^T[J][Δ] = [J]^T[E]

where [J]^T[J] is square and has the correct dimensions for the vector [Δ].

Another important extension to the basic algorithm is to allow it to use line-to-line correspondences in addition to point-to-point ones. This is important in practice because low-level vision routines are relatively good at finding the transverse locations of lines but are much less certain about exactly where the lines terminate. What we need to do is express our errors in terms of the distance of one line from another, rather than in terms of the error in the locations of points. The solution is to measure as our errors the perpendicular distance of each endpoint of the model line from the corresponding line in the image, and to then take the derivatives in terms of this distance rather than in terms of x' or y'.
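The normal-equation solution can be stated in two lines. This is a sketch; in numerical practice a QR- or SVD-based least-squares routine is usually preferable to forming [J]^T[J] explicitly:

```python
import numpy as np

def correction(J, E):
    """Least-squares solution of the overdetermined linearized system
    J @ delta = E via the normal equations (J^T J) delta = J^T E."""
    return np.linalg.solve(J.T @ J, J.T @ E)
```

For a full-rank Jacobian this agrees with a standard least-squares solver.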
In order to express the perpendicular distance of a point from a line it is useful to first express the line as an equation of the following form, in which m is the slope:

    (−m/√(m² + 1)) x + (1/√(m² + 1)) y = d

In this equation d is the perpendicular distance of the line from the origin. If we substitute some point (x', y') into the left-hand side of the equation and calculate the new value of d for this point (call it d'), then the perpendicular distance of this point from the line is simply d − d'. What is more, it is easy to calculate the derivatives of d' for use in the convergence, since the derivatives of d' are just a linear combination of the derivatives of x' and y' as given in the above equation, and we already know how to calculate the x' and y' derivatives from the solution given for using point correspondences. The result is that each line-to-line correspondence we are given between model and image gives us two equations for our linear system, the same amount of information that is conveyed by a point-to-point correspondence.

The same basic method could be used to extend the matching to arbitrary curves rather than just straight line segments. In this case, we would assume that the curves are locally linear and would minimize the perpendicular separation at selected points as in the straight line case, using iteration to compensate for non-linearities. For curves that are curving tightly with respect to the current error measurements, it would be necessary to match points on the basis of orientation and curvature in addition to simple perpendicular projection. However, the current implementation of SCERPO is limited to the matching of straight line segments, so these extensions to include the matching of arbitrary curves remain to be implemented.
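The line-to-line error measure can be sketched directly from the equation above; the names are ours, and the line is given by its slope m and perpendicular distance d from the origin:

```python
import numpy as np

def perp_error(m, d, point):
    """Signed perpendicular error d - d' of an image point from the
    line (-m*x + y)/sqrt(m^2 + 1) = d.  The derivatives of d' are the
    same linear combination of the derivatives of x' and y'."""
    x, y = point
    n = np.array([-m, 1.0]) / np.sqrt(m*m + 1.0)
    d_prime = n[0]*x + n[1]*y
    return d - d_prime
```

For the horizontal line y = 0 (m = 0, d = 0), the point (3, 2) lies at signed error −2; a point on the line gives error zero.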
The use of parameter determination for matching

The mathematical methods for parameter determination presented above need to be integrated into a matching algorithm that can extend a few initial matches and return a reliable answer as to the presence or absence of the object at the hypothesized location. Some of the implementation details that must be worked out include the choice of a representation for object models, the calculation of a starting viewpoint to initiate Newton iteration, and the methods for making use of the initial parameter determination to generate new matches between image and object features.

The object models used in the current implementation of SCERPO consist simply of a set of 3-D line segments. A primitive form of hidden-line elimination is performed by attaching a visibility specification to each line segment. The visibility specification contains a boolean combination of hemispheres from which the segment is visible. Each hemisphere of directions is represented by a unit vector, so that the visibility of the segment can be very efficiently computed by simply checking the sign of the dot product of this vector with the vector pointing to the camera position. It is important that the visibility calculation be fast, as it is performed in the inner loop of the matching process. However, this form of hidden-line elimination is only an approximation, since it does not allow for partial occlusion of a line segment or for calculating occlusion between objects. It would also need to be extended to allow for models with variable internal parameters. As with other parts of the matching process, we rely upon the overall robustness of the system in the face of missing or extra features to compensate for occasional errors in hidden-line determination.

As with all applications of Newton's method, convergence of the viewpoint-solving algorithm is guaranteed only for starting positions that are "sufficiently close" to the final solution.
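The hemisphere visibility test can be coded directly. This sketch covers only the simple conjunctive case; the general boolean combinations that the visibility specification allows are omitted, and the names are our own:

```python
import numpy as np

def segment_visible(hemisphere_normals, to_camera):
    """True if the segment is visible: each hemisphere of directions,
    represented by a unit vector, must contain the direction from the
    object toward the camera (positive dot product)."""
    return all(np.dot(h, to_camera) > 0.0 for h in hemisphere_normals)
```

Because the test is a handful of dot products and sign checks, it is cheap enough for the inner loop of matching, as the text requires.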
Fortunately, in this problem several of the parameters (scaling and translation in the image plane) are exactly linear, while the other parameters (rotation and perspective effects) are approximately linear over wide ranges of values. In practice, we have found that as long as the orientation parameters are within 60 degrees of their correct values, almost any values can be chosen for the other parameters. Reasonable estimates of the viewpoint parameters can be easily computed from the same matches that will be used for convergence. Since most parts of the model are only visible from a limited range of viewpoints, we can select an orientation in depth that is consistent with this range for each object feature being matched. Orientation in the image plane can be estimated by causing any line on the model to project to a line with the orientation of the matching line in the image. Similarly, translation can be estimated by bringing one model point into correspondence with its matching image point. Scale (i.e., distance from the camera) can be estimated by examining the ratios of lengths of model lines to their corresponding image lines. The lines with the minimum value of this ratio are those that are most parallel to the image plane and have been most completely detected, so this ratio can be used to roughly solve for scale under this assumption. These estimates for initial values of the viewpoint parameters are all fast to compute and produce values that are typically much more accurate than needed to assure convergence. Further improvements in the range of convergence could be achieved through the application of standard numerical techniques, such as damping [6]. A more difficult problem is the possible existence of multiple solutions. Fischler and Bolles [7] have shown that there may be as many as four solutions to the problem of matching three image points to three model points.
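The scale estimate, for instance, can be sketched as follows (hypothetical names; the lengths are assumed to come from the already-hypothesized line matches, and the scale is expressed as image units per model unit):

```python
def initial_scale_estimate(matches):
    """Rough starting value for the scale parameter.  `matches` pairs a
    model line length with its matched image line length.  The pair with
    the minimum model/image ratio is assumed to be nearly parallel to the
    image plane and fully detected, so its ratio gives the scale."""
    best = min(model_len / image_len for model_len, image_len in matches)
    return 1.0 / best   # image units per model unit
```

The estimate is deliberately crude: it only needs to land within the wide basin of convergence of the Newton iteration, not to be accurate in itself.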
Many of these ambiguities will not occur in practice, given the visibility constraints attached to the model, since they involve ranges of viewpoints from which the image features would not be visible. Also, in most cases the method will be working with far more than the minimum amount of data, so the overconstrained problem is unlikely to suffer from false local minima during the least-squares process. However, when working with very small numbers of matches, it may be necessary to run the process from several starting positions in an attempt to find all of the possible solutions. Given several starting positions, only a couple of iterations of Newton's method on each solution are necessary to determine whether they are converging to the same solution and can therefore be combined for subsequent processing. Finally, it should be remembered that the overall system is designed to be robust against instances of missing data or occlusion, so an occasional failure of the viewpoint determination should lead only to an incorrect rejection of a single match and an increase in the amount of search rather than a final failure in recognition.

Extending the initial set of matches

Once the initial set of hypothesized matches has been used to solve for the viewpoint parameters, this estimate of viewpoint can be used to predict the locations of other model features in the image and thereby extend the match. Predicting these image locations involves simply projecting the model features using the current set of viewpoint parameters. The more difficult problem is to select matches in the image that are consistent with these predictions. Rather than setting arbitrary thresholds on the error range within which image segments must fall to be considered in agreement with a predicted model segment, an iterative approach is taken. First, each segment is projected, and the image data structure is searched to select potential matches.
All line segments in the image are indexed according to location and orientation, so this search can be carried out efficiently. The potential matches are then ranked according to the degree to which they agree with the prediction in terms of length, orientation, transverse position, and endpoint position. Endpoint positions are allowed to be anywhere within the predicted segment (to allow for partial detection of a line), but are penalized for extending beyond the predicted endpoints. After evaluating each potential match for a given prediction, the top-ranked match is strongly penalized if the second-ranked match has a similar evaluation. The purpose of this penalty is to avoid committing ourselves to either choice of an ambiguous match if there is some less ambiguous alternative from some other model prediction. At each stage of the iteration we select only non-ambiguous matches or else the single least-ambiguous match from all of the model predictions to extend the set of correspondences. These are then added to the least-squares solution to update the estimate of viewpoint. By the time a number of the most reliable matches have been found, the viewpoint estimation will be based on a substantial amount of data and should be accurate enough to choose correctly between the more ambiguous alternatives. The set of matches is repeatedly extended in this way until no more can be found. This iterative matching procedure has the appealing property of using the easy cases to provide better viewpoint estimates to disambiguate the more difficult situations. In our examples, it was common for a number of closely spaced parallel lines to be detected near the edge of a curved object due to specular reflection. These extra edges provide many false candidates for matching at the approximate position and orientation of the true edge.
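The ambiguity-penalty rule can be sketched as follows (the evaluation scoring itself is omitted; the margin between the top two candidates and the resulting selection order are the point of the sketch, and all names are hypothetical):

```python
def select_unambiguous(predictions):
    """Sketch of the ambiguity rule.  For each predicted model segment we
    are given candidate image-match scores (higher = better agreement).
    A prediction whose runner-up scores almost as well as its top match
    is ambiguous, so predictions are accepted in order of decreasing
    margin between the top two candidates."""
    chosen = []
    for pred_id, scores in predictions:
        ranked = sorted(scores, reverse=True)
        if not ranked:
            continue
        margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
        chosen.append((margin, pred_id))
    chosen.sort(reverse=True)          # least ambiguous first
    return [pred_id for margin, pred_id in chosen]
```

In use, only the least-ambiguous predictions would be committed at each iteration, the viewpoint re-solved, and the remaining predictions re-scored against the tightened estimate.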
But by using less ambiguous matches to solve as precisely as possible for the viewpoint, the correct match could be chosen on the basis of accurate position estimates. The viewpoint determination procedure could be improved by attempting to discard the points that deviated the most from the least-squares solution and reconverging on the remaining data. This would allow the system to converge on a consensus viewpoint estimate that was not influenced by small errors in modeling or feature detection. The final judgement as to the presence of the object is based simply on the degree to which the final viewpoint estimate is overconstrained. Since only three edges are needed to solve for viewpoint, each further match adds to the verification of the presence of the object. In addition, the least-squares solution provides an estimate of the standard deviation of the error. Given sufficiently detailed models, correct instances of recognition should be greatly overconstrained even in the face of partial occlusion and other missing features, so the precise threshold for rejection should be unimportant.

4. Perceptual Organization

The methods for achieving spatial correspondence presented in the previous section enforce the powerful constraint that all parts of an object's projection must be consistent with a single viewpoint. This constraint allows us to bootstrap just a few initial matches into a complete set of quantitative relationships between model features and their image counterparts, and therefore results in a reliable decision as to the correctness of the original match. The problem of recognition has therefore been reduced to that of providing tentative matches between a few image features and an object model. The relative efficiency of the viewpoint solution means that only a small percentage of the proposed matches need to be correct for acceptable system performance.
In fact, when matching to a single, rigid object, one could imagine simply taking all triplets of nearby line segments in the image and matching them to a few sets of nearby segments on the model. However, this would clearly result in unacceptable amounts of search as the number of possible objects increases, and when we consider the capabilities of human vision for making use of a vast database of visual knowledge it is obvious that simple search is not the answer. This initial stage of matching must be based upon the detection of structures in the image that can be formed bottom-up in the absence of domain knowledge, yet must be of sufficient specificity to serve as indexing terms into a database of objects. Given that we often have no prior knowledge of viewpoint for the objects in our database, the indexing features that are detected in the image must reflect properties of the objects that are at least partially invariant with respect to viewpoint. This means that it is useless to look for features of particular sizes or angles or other properties that are highly dependent upon viewpoint. A second constraint on these indexing features is that there must be some way to distinguish the relevant features from the dense background of other image features which could potentially give rise to false instances of the structure. Through an accident of viewpoint or position, three-dimensional elements that are unrelated in the scene may give rise to seemingly significant structures in the image. Therefore, an important function of this early stage of visual grouping is to distinguish as accurately as possible between these accidental and significant structures. We can summarize the conditions that must be satisfied by perceptual grouping operations as follows:

The viewpoint invariance condition: Perceptual features must remain stable over a wide range of viewpoints of some corresponding three-dimensional structure.
The detection condition: Perceptual features must be sufficiently constrained so that accidental instances are unlikely to arise.

Although little studied by the computational vision community, the perceptual organization capabilities of human vision seem to exhibit exactly these properties of detecting viewpoint-invariant structures and calculating varying degrees of significance for individual instances. These groupings are formed spontaneously and can be detected immediately from among large numbers of individual elements. For example, people will immediately detect certain instances of clustering, connectivity, collinearity, parallelism, and repetitive textures when shown a large set of otherwise randomly distributed image elements (see Figure 3). This grouping capability of human vision was studied by the early Gestalt psychologists [26] and is also related to research in texture description [18, 29]. Unfortunately, this important component of human vision has been missing from almost all computer vision systems, presumably because there has been no clear computational theory for the role of perceptual organization in the overall functioning of vision. A basic goal underlying research on perceptual organization has been to discover some principle that could unify the various grouping phenomena of human vision. The Gestaltists thought that this underlying principle was some basic ability of the human mind to proceed from the whole to the part, but this lacked a computational or predictive formulation. Later research summarized many of the Gestaltists' results with the observation that people seem to perceive the simplest possible interpretation for any given data [12]. However, any definition of simplicity has depended entirely on the language that is used for description, and no single language has been found to encompass the range of grouping phenomena.
Greater success has been achieved by basing the analysis of perceptual organization on a functional theory which assumes that the purpose of perceptual organization is to detect stable image groupings that reflect actual structure of the scene rather than image-dependent properties. This parallels other areas of early vision in which a major goal is to identify image features that are stable under changes in imaging conditions.

Derivation of grouping operations

Given these functional goals, the computational specification for perceptual organization is to differentiate groupings that arise from the structure of a scene from those that arise due to accidents of viewpoint or positioning [27]. This does not lead to a single metric for evaluating the significance of every image grouping, since there are many factors that contribute to estimating the probability that a particular grouping could have arisen by accident.

Figure 3: This figure illustrates the human ability to spontaneously detect certain groupings from among a background of similar elements. This figure contains three groupings resulting from parallelism, collinearity, and endpoint proximity (connectivity).

However, by combining these various factors and making use of estimates of prior probabilities for various classes of groupings, it is possible to derive a computational account for the various classes of grouping phenomena. An extensive discussion of these issues has been presented by the author in previous work [16], but here we will examine the more practical question of applying these methods to the development of a particular vision system. We will simplify the problem by looking only at groupings of straight line segments detected in an image and by considering only those groupings that are based upon the properties of proximity, parallelism, and collinearity.
A strong constraint on perceptual organization is provided by the viewpoint invariance condition, since there are only relatively few types of two-dimensional image relations that are even partially invariant with respect to changes in viewpoint of a three-dimensional scene. For example, it would be pointless to look for lines that form a right angle in the image, since even if it is common to find lines at right angles in the three-dimensional scene they will project to right angles in the image only from highly restricted viewpoints. Therefore, even if an approximate right angle were detected in the image, there would be little basis to expect that it came from a right angle in the scene as opposed to lines at any other three-dimensional angle. Compare this to finding lines at a 180 degree angle to one another (i.e., that are collinear). Since collinear lines in the scene will project to collinear lines in the image from virtually all viewpoints, we can expect many instances of collinearity in the image to be due to collinearity in three dimensions. Likewise, proximity and parallelism are both preserved over wide ranges of viewpoint. It is true that parallel lines in the scene may converge in the image due to perspective, but many instances of parallelism occupy small visual angles so that the incidence of approximate parallelism in the image can be expected to be much higher than simply those instances that arise accidentally. In summary, the requirement of partial invariance with respect to changes in viewpoint greatly restricts the classes of image relations that can be used as a basis for perceptual organization. If we were to detect a perfectly precise instance of, say, collinearity in the image, we could immediately infer that it arose from an instance of collinearity in the scene. That is because the chance of perfect collinearity arising due to an accident of viewpoint would be vanishingly small.
However, real image measurements include many sources of uncertainty, so our estimate of significance must be based on the degree to which the ideal relation is achieved. The quantitative goal of perceptual organization is to calculate the probability that an image relation is due to actual structure in the scene. We can estimate this by calculating the probability of the relation arising to within the given degree of accuracy due to an accident of viewpoint or random positioning, and assuming that otherwise the relation is due to structure in the scene. This could be made more accurate by taking into account the prior probability of the relation occurring in the scene through the use of Bayesian statistics, but this prior probability is seldom known with any precision.

Grouping on the basis of proximity

We will begin the analysis of perceptual organization by looking at the fundamental image relation of proximity. If two points are close together in the scene, then they will project to points that are close together in the image from all viewpoints. However, it is also possible that points that are widely separated in the scene will project to points arbitrarily close together in the image due to an accident of viewpoint. Therefore, as with all of the cases in which we attempt to judge the significance of perceptual groupings, we will consider a grouping to be significant only to the extent that it is unlikely to have arisen by accident. An important example of the need to evaluate proximity is when attempting to form connectivity relations between line segments detected in an image. The proximity of the endpoints of two line segments may be due to the fact that they are connected or close together in the three-dimensional scene, or it may be due to a simple accident of viewpoint. We must calculate for each instance of proximity between two endpoints the probability that it could have arisen from unrelated lines through an accident of viewpoint.
Since we often have no prior knowledge regarding the scene and since the viewpoint is typically unrelated to the structure of the three-dimensional objects, there is little basis for picking a biased background distribution of image features against which to judge significance. Therefore, this calculation will be based upon the assumption of a background of line segments that is uniformly distributed in the image with respect to orientation, position, and scale.

Figure 4: These diagrams show some of the variables that are used for calculating the probabilities that the given relations (proximity of endpoints, parallelism, and collinearity) would arise by accident between randomly distributed line segments.

Given these assumptions, the expected number of endpoints, N, within a radius r of a given endpoint is equal to the average density of endpoints per unit area, d, multiplied by the area of a circle with radius r (see Figure 4):

    N = dπr²

For values of N much less than 1, the expected number is approximately equal to the probability of the relation arising accidentally. Therefore, significance varies inversely with N. It also follows that significance is inversely proportional to the square of the separation between the two endpoints. However, the density of endpoints, d, is not independent of the length of the line segments that are being considered. Assuming that the image is uniform with respect to scale, changing the size of the image by some arbitrary scale factor should have no influence on our evaluation of the density of line segments of a given length. This scale independence requires that the density of lines of a given length vary inversely according to the square of their length, since halving the size of an image will decrease its area by a factor of 4 and decrease the lengths of each segment by a factor of 2.
The same result can be achieved by simply measuring the proximity r between two endpoints as proportional to the length of the line segments which participate in the relation. If the two line segments are of different lengths, the higher expected density of the shorter segment will dominate that of the longer segment, so we will base the calculation on the minimum of the two lengths. The combination of these results leads to the following evaluation metric. Given a separation r between two endpoints belonging to line segments of minimum length l:

    E = 2πDr² / l²

We are still left with a unitless constant, D, specifying the scale-independent density of line segments (the factor 2 accounts for the fact that there are 2 endpoints for each line segment). Since the measures of significance will be used mostly to rank groupings during the search process, the value chosen for a constant factor is of little importance because it will have no influence on the rankings. However, for our experiments we have somewhat arbitrarily assigned D the value 1. In fact, given a fairly dense set of segments with independent orientations and positions, and given the constraint that they do not cross one another, this scale-independent measure will have a value close to 1. This formula does a reasonable job of selecting instances of endpoint proximity that seem perceptually significant. Our concern with uniformity across changes in scale has had an important practical impact on the algorithm. It means that the algorithm will correctly pick out large-scale instances of connectivity between long segments, even if there are many short segments nearby which would otherwise mask the instance of endpoint proximity. This capability for detecting groupings at multiple scales is an important aspect of perceptual organization in human vision.
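The resulting proximity measure can be sketched as follows, with significance taken as inversely proportional to the expected number E (a minimal sketch assuming D = 1 by default):

```python
import math

def proximity_expectation(r, l1, l2, D=1.0):
    """E = 2*pi*D*r**2 / l**2 for endpoint separation r, where l is the
    shorter of the two segment lengths.  Smaller E means a less probable
    accident and hence a more significant grouping."""
    l = min(l1, l2)
    return 2.0 * math.pi * D * r * r / (l * l)
```

Doubling the image scale doubles both r and the segment lengths, leaving E unchanged, which is exactly the scale-independence property argued for above.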
Grouping on the basis of parallelism

A similar measure can be used to decide whether an approximate instance of parallelism between two lines in the image is likely to be non-accidental in origin. Let l1 be the length of the shorter line and l2 be the length of the longer line. In order to measure the average separation s between the two lines, we calculate the perpendicular distance from the longer line to the midpoint of the shorter line. As in the case for evaluating proximity, we assume that the density of line segments of length greater than l1 is d = D/l1², for a scale-independent constant D. Then, the expected number of lines within the given separation of the longer line will be the area of a rectangle of length l2 and width 2s multiplied by the density of lines of length at least l1. Let θ be the magnitude of the angular difference in radians between the orientations of the two lines. Assuming a uniform distribution of orientations, only 2θ/π of a set of lines will be within orientation θ of a given line. Therefore, the expected number of lines within the given separation and angular difference will be:

    E = (2sl2D / l1²)(2θ/π) = 4Dθsl2 / (πl1²)

As in the previous case, we assign D the value 1 and assume that significance is inversely proportional to E.

Grouping on the basis of collinearity

Measuring the probability that an instance of collinearity has arisen by accident shares many features in common with the case of parallelism. In both cases, the ideal relation would involve two line segments with the same orientation and with zero separation perpendicular to the shared orientation. However, in the case of parallelism the line segments are presumed to overlap in the direction parallel to their orientation, whereas in collinearity the segments are expected to be separated along the direction of their orientation with an intervening gap. Let g be the size of this gap (the separation of the endpoints).
As in the case of parallelism, let s be the perpendicular distance from the midpoint of the shorter line segment, l1, to the extension of the longer line segment, l2. These bounds determine a rectangular region of length g + l1 and width 2s within which other lines would have the same degree of proximity. Therefore, by analogy in other respects with the case of parallelism, we get

    E = 4Dθs(g + l1) / (πl1²)

Notice that this measure is independent of the length of the longer line segment, which seems intuitively correct when dealing with collinearity.

Implementation of the grouping operations

The subsections above have presented methods for calculating the significance of selected relationships between given pairs of straight line segments. The most obvious way to use these to detect all significant groupings in the image would be to test every pair of line segments and retain only those pairs which have high levels of significance. However, the complexity of this process would be O(n²) for n line segments, which is too high for practical use in complex scenes. One method for limiting the complexity of this process is to realize that proximity is an important variable in all of the significance measures. Since significance decreases with the square of separation, two small segments that are widely separated in the image are unlikely to produce significant groupings regardless of their other characteristics (constraints on measurement accuracy limit the contribution that orientation or other measurements can make to judging significance). Therefore, complexity can be limited by searching only a relatively small region surrounding each segment for candidates for grouping. Since proximity is judged relative to the size of the component features, the size of the region that must be searched is proportional to the length of the line segment from which we are initiating the search.
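The parallelism and collinearity measures can be sketched directly from the formulas above (a minimal sketch; D defaults to 1 and θ is in radians):

```python
import math

def parallelism_expectation(s, theta, l1, l2, D=1.0):
    """E = 4*D*theta*s*l2 / (pi * l1**2): expected number of accidental
    parallels within perpendicular separation s and angular difference
    theta, for shorter length l1 and longer length l2."""
    return 4.0 * D * theta * s * l2 / (math.pi * l1 * l1)

def collinearity_expectation(s, theta, g, l1, D=1.0):
    """E = 4*D*theta*s*(g + l1) / (pi * l1**2): the analogous measure for
    collinearity with endpoint gap g.  Note that it does not depend on
    the longer segment's length."""
    return 4.0 * D * theta * s * (g + l1) / (math.pi * l1 * l1)
```

As in the proximity case, significance is taken as inversely proportional to E, and scaling the whole configuration leaves both values unchanged.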
In order to make efficient use of these restrictions, all segments in the image should be indexed in a grid-like data structure according to the position of each endpoint. For further efficiency, the segments in each element of this position matrix can be further indexed according to orientation and length. The use of this index allows all groupings with interesting levels of significance to be detected in time that is essentially linear in the number of features. It is interesting to note that human vision also seems to limit the complexity of grouping in a similar way, although human vision apparently uses a more sophisticated method that also takes account of the local density of features [16].

Segmentation of linked points into straight line segments

The examples above have dealt with the grouping of line segments. However, the derivation of the line segments themselves is a very important segmentation problem that is based upon detecting significant instances of collinearity among edge points. Most common edge detection methods produce linked lists of points as their output (e.g., points which lie on the zero-crossing of an image convolved with a second-derivative operator). In order to carry out the higher forms of perceptual organization described above, these linked points must be grouped into line or curve descriptions that make explicit the significant curvilinear structures at all scales. The author has previously described a method for finding straight-line and constant-curvature segmentations at multiple scales and for measuring their significance [16, Chap. 4]. However, here we will use a simplified method that selects only the single highest-significance line representation at each point along the curve.
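This simplified method, whose details are given next, can be sketched as a recursion over a chain of linked points (a sketch only: the point-index representation and names are ours, not the SCERPO code, which is written in Lisp):

```python
def _chord_stats(points, i, j):
    """Length of the chord from points[i] to points[j], its significance
    (length / maximum deviation, with the deviation clamped to at least
    2 pixels), and the index of the most-deviating interior point."""
    (x1, y1), (x2, y2) = points[i], points[j]
    length = (((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5) or 1e-9
    k_max, d_max = None, 0.0
    for k in range(i + 1, j):
        x, y = points[k]
        # perpendicular distance of points[k] from the chord
        d = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / length
        if d > d_max:
            k_max, d_max = k, d
    return length / max(d_max, 2.0), k_max, length

def segment_curve(points, i=0, j=None):
    """Recursively subdivide at the most-deviating point; unwinding, keep
    the subdivision only if some piece beats the whole chord's
    significance.  Returns (start, end, significance) index triples."""
    if j is None:
        j = len(points) - 1
    sig, k_max, length = _chord_stats(points, i, j)
    if k_max is None or length <= 4.0:
        return [(i, j, sig)]
    subs = segment_curve(points, i, k_max) + segment_curve(points, k_max, j)
    return subs if max(s for _, _, s in subs) > sig else [(i, j, sig)]
```

A final pass would then discard any returned segment whose length-to-deviation ratio falls below 4, as described in the text.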
The significance of a straight line fit to a list of points can be estimated by calculating the ratio of the length of the line segment divided by the maximum deviation of any point from the line (the maximum deviation is always assumed to be at least two pixels in size to account for limitations on measurement accuracy). This measure will remain constant as the scale of the image is changed, and it therefore provides a scale-independent measure of significance that places no prior expectations on the allowable deviations. This significance measure is then used in a modified version of the recursive endpoint subdivision method (see Figure 5). A segment is recursively subdivided at the point with maximum deviation from a line connecting its endpoints (Figure 5(b,c)). This process is repeated until each segment is no more than 4 pixels in length, producing a binary tree of possible subdivisions. Then, unwinding the recursion back up the tree, a decision is made at each junction as to whether to replace the current lower-level description with the single higher-level segment. The significance of every subsegment is calculated by the length-to-deviation ratio mentioned above (Figure 5(d)). If the maximum significance of any of the subsegments is greater than the significance of the complete segment, then the subsegments are returned. Otherwise the single segment is returned. The procedure will return a segment covering every point along the curve (Figure 5(e)). Finally, any segments with a length-to-deviation ratio less than 4 are discarded. This algorithm is implemented in only 40 lines of Lisp code, yet it does a reasonable job of detecting the most perceptually significant straight line groupings in the linked point data. An advantage compared to the methods traditionally used in computer vision (which usually set some prior threshold for the amount of "noise" to be removed from a curve) is that it will tend to find the same structures regardless of the size at which an object appears in an image. In addition, it will avoid breaking up long lines if its shorter constituents do not appear to have a stronger perceptual basis.

Figure 5: This illustrates the steps of a scale-independent algorithm for subdividing a curve into its most perceptually significant straight line segments. The input curve is shown in (a) and the final segmentation is given in (f).

5. The SCERPO Vision System

The methods of spatial correspondence and perceptual organization described above have been combined to produce a functioning system for recognizing known three-dimensional objects in single gray-scale images. In order to produce a complete system, other components must also be included to perform low-level edge detection, object modeling, matching, and control functions. Figure 6 illustrates the various components and the sequence of information flow. Figure 7 shows an example of the different stages of processing for an image containing two desk staplers shown from arbitrary viewpoints. The following paragraphs will describe how these stages are implemented and integrated. In order to provide image features for input to the perceptual grouping process, the first few levels of image analysis in SCERPO use established methods of edge detection, as shown in Figure 7(a)-(c). The 512x512 pixel image shown in Figure 7(a) was convolved with a Laplacian of Gaussian function (σ = 1.8 pixels) as suggested by the Marr-Hildreth [20] theory of edge detection. The zero-crossings of this function are shown in Figure 7(b).

Figure 6: The components of the SCERPO vision system and the sequence of computation: edge detection (∇²G and gradient threshold), line segmentation (scale invariant), perceptual grouping (collinearity, proximity, parallelism), matching and search (image groupings against the object model), and verification (determination of viewpoint).
Of course, many of these zero-crossings do not correspond to significant edges in the image. We remove those corresponding to insignificant intensity changes by applying the Sobel gradient operator to the ∇²G convolution. Only those points that are above a chosen gradient threshold and lie on a zero-crossing are retained in Figure 7(c). These initial steps of processing were performed on a VICOM image processor under the Vsh software facility developed by Robert Hummel and Dayton Clark [5]. The VICOM can perform a 3x3 convolution against the entire image in a single video frame time. The Vsh software facility allowed the 18x18 convolution kernel required for our purposes to be automatically decomposed into 36 of the 3x3 primitive convolutions along with the appropriate image translations and additions. More efficient implementations which apply a smaller Laplacian convolution kernel to the image followed by iterated Gaussian blur were rejected due to their numerical imprecision. The steps up to Figure 7(c) are performed on the VICOM, after which the zero-crossing image is transferred to a VAX 11/750 running UNIX 4.2 for subsequent processing. A program written in C reads the original image and produces a file of linked edge points. All other components are written in Franz Lisp. The next step of processing is to break the linked lists of zero-crossings into perceptually significant line segments, as described in the previous section. The results of applying this algorithm are shown in Figure 7(d). The recognition problem now consists of searching among this set of about 250 straight line segments for subsets that are each spatially consistent with model edges projected from a single viewpoint. The straight line segments are indexed according to endpoint locations and orientation. Then a sequence of procedures is executed to detect instances of collinearity, endpoint proximity (connectivity), and parallelism.
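On a small scale, the zero-crossing-plus-gradient-threshold idea can be sketched in toy form (a plain 3x3 Laplacian and central-difference gradients stand in for the 18x18 ∇²G kernel and Sobel operator used by SCERPO; all names are ours):

```python
def edge_points(img, grad_thresh):
    """Toy sketch of the edge pipeline: apply a discrete Laplacian, then
    keep pixels where the Laplacian changes sign (a zero-crossing) and
    the local gradient magnitude exceeds grad_thresh."""
    h, w = len(img), len(img[0])
    lap = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            lap[i][j] = (img[i - 1][j] + img[i + 1][j] + img[i][j - 1]
                         + img[i][j + 1] - 4 * img[i][j])
    edges = set()
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            # sign change toward the right or downward neighbor
            zero_cross = (lap[i][j] * lap[i][j + 1] < 0
                          or lap[i][j] * lap[i + 1][j] < 0)
            gx = (img[i][j + 1] - img[i][j - 1]) / 2.0
            gy = (img[i + 1][j] - img[i - 1][j]) / 2.0
            if zero_cross and (gx * gx + gy * gy) ** 0.5 > grad_thresh:
                edges.add((i, j))
    return edges
```

On a vertical step edge this retains a single column of edge pixels, while flat regions (where the Laplacian hovers around zero but the gradient is small) are suppressed by the threshold.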
Each instance of these relations is assigned a level of significance using the formulas given above in the section on perceptual organization. These primitive relations could be matched directly against corresponding structures on the three-dimensional object models, but the search space for this matching can be reduced by first combining the primitive relations into larger, more complex structures. The larger structures are found by searching among the initial relations for specific combinations of relations. For example, more than two segments may have endpoints connected at the same junction, or a number of lines may be parallel to one another. One structure that played a role in the example of Figure 7 is the case of two parallel lines which are also mutually connected to other segments at their endpoints, producing trapezoid-shaped structures. The significance of each of these larger structures is calculated by simply multiplying together the probabilities of non-accidentalness for each component. These values are used to rank all of the groupings in order of decreasing significance, so that the search can begin with those groupings that are most perceptually significant and are least likely to have arisen through some accident. Figure 7(e) shows two of the highest-ranked perceptual groupings that were found in this example, but of course a large number of other groupings were detected.

The SCERPO system makes use of much simpler groupings at this level than would be needed by a system that contained large numbers of object models. When only a few models are being considered for matching, it is possible to use very simple groupings, because even with the resulting ambiguity there are only a relatively small number of potential matches to examine. However, with large numbers of object models, it would be necessary to find more complex viewpoint-invariant structures that could be used to index into the database of models.
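The ranking step described above can be sketched as follows, assuming each primitive relation carries an estimated probability of having arisen by accident (computed by the formulas of the perceptual organization section). The grouping names and probability values below are purely illustrative:

```python
from math import prod

def accidental_probability(component_probs):
    """Probability that an entire composite grouping is accidental:
    the product of its components' accidental-occurrence probabilities.
    A smaller value means a more perceptually significant grouping."""
    return prod(component_probs)

# candidate groupings with per-component accidental probabilities
groupings = [
    ("trapezoid (two parallels + two connections)", [0.01, 0.01, 0.03, 0.03]),
    ("parallel pair", [0.03]),
    ("collinear pair", [0.10]),
]

# rank in order of decreasing significance: least accidental first
ranked = sorted(groupings, key=lambda g: accidental_probability(g[1]))
```

With these numbers the trapezoid is ranked first, matching the behaviour described above: composite structures built from several individually unlikely relations are the least likely to be accidental.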
The best approach would be to make use of some form of evidential reasoning to combine probabilistic information from multiple sources to limit the size of the search space. This approach has been outlined by the author in earlier work [16, Chap. 6].

Figure 7(a,b): The original image of some desk staplers, taken by a CCD camera at a resolution of 512x512 pixels, is shown in (a). This image was convolved with a ∇²G function (σ = 1.8 pixels). The zero-crossings of this function are displayed in (b).

Figure 7(c,d): Many zero-crossings correspond to insignificant changes in intensity. Therefore, the Sobel gradient operator is applied to the ∇²G image, and only those zero-crossings with a gradient above a particular threshold are retained, as shown in (c). These remaining zero-crossings are linked on the basis of connectivity. A scale-independent segmentation algorithm selects the most perceptually significant straight line segments, as shown in (d).

Figure 7(e,f): The two highly-ranked perceptual groupings that were actually used to initiate successful recognition for each instance of the object are shown in (e). A large number of other groupings were also formed. (f) After solving for model viewpoint, selecting new segments most consistent with model predictions, and iterating, these segments were selected as being consistent with one projection of the model.

Figure 7(g,h): (g) The model is shown superimposed on the image from the final calculated viewpoint. After finding the first instance of the object, the identified segments are removed from consideration and the search continues among the remaining segments. The same model is shown from the second calculated viewpoint in (h).

Model matching

The matching process consists of individually comparing each of the perceptual groupings in the image against each of the structures of the object model which is likely to give rise to that form of grouping.
For each of these matches, the verification procedure is executed to solve for viewpoint, extend the match, and return an answer as to whether the original match was correct. Given the large number of potential matches and their varied potential for success, it is important to make use of a ranking method to select the most promising matches first. For example, every straight line detected in the image is a form of grouping that could be matched against every straight edge of the model, but this would involve a large amount of search. In general, the more complex a grouping is, the fewer potential matches it will have against the model. Therefore, the only matches that are considered in the current implementation are those that involve image groupings containing at least 3 line segments. This also has the important result that such a grouping will generally contain enough information to solve exactly for viewpoint from the initial match, which provides tight constraints to speed up the operation of the verification procedure.

A more precise specification of the optimal ordering for the search process could be stated as follows. In order to minimize the search time, we would like to order our consideration of hypotheses according to decreasing values of Pk/Wk, where Pk is the probability that a particular hypothesis for the presence of object k is correct, and Wk is the amount of work required to verify or refute it. In general, increased complexity of a grouping will lead to fewer potential matches against different object features, and therefore will increase the probability Pk that any particular match is correct. However, this probability is also very dependent upon the particular set of objects that are being considered, since some features will function more effectively to discriminate among one particular set of objects than in another set.
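A minimal sketch of this ordering rule follows, with hypothetical hypothesis records; the probability and work values are invented for illustration and would in practice come from the grouping statistics discussed above:

```python
def order_hypotheses(hypotheses):
    """Order candidate matches by decreasing Pk/Wk, where Pk is the
    probability that the hypothesis is correct and Wk is the expected
    amount of work required to verify or refute it."""
    return sorted(hypotheses, key=lambda h: h["p"] / h["w"], reverse=True)

hypotheses = [
    {"match": "single line vs. model edge",    "p": 0.005, "w": 1.0},
    {"match": "trapezoid vs. stapler base",    "p": 0.30,  "w": 2.0},
    {"match": "3-segment junction vs. corner", "p": 0.10,  "w": 1.5},
]
```

Here `order_hypotheses` puts the trapezoid match first (0.30/2.0 = 0.15), reflecting the observation above that more complex groupings have fewer, more reliable potential matches even if each one costs slightly more to verify.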
The most effective way to determine the optimal values of Pk for each potential match would be through the use of a learning procedure in which the actual probability values are refined through experience in performing the recognition task. These probability adjustments would in essence be strengthening or weakening associations between particular image features and particular objects. These object-specific values would be multiplied by the probability that a grouping is non-accidental to determine the final estimate of Pk. The Wk could be similarly learned through experience. The current implementation of SCERPO simply uses the complexity of a grouping as a crude estimate for these desired ranking parameters.

In order to speed the runtime performance of the matching process, the viewpoint-invariant groupings that each model can produce in the image are precomputed off-line. The model is simply checked for three-dimensional instances of the three primitive image relations that are detected during the perceptual grouping process: i.e., connectivity, collinearity, and parallelism. No attempt is made to find approximate instances of these relations, so in essence the relations are implicitly specified by the user during model input. These relations are then grouped into the same types of larger structures that are created during the perceptual grouping process, and are stored in separate lists according to the type of the grouping. Any rotational symmetries or other ambiguities are used to create new elements in the lists of possible matches. The runtime matching process therefore consists only of matching each image grouping against each element of a precomputed list of model groupings of the same type.

One important feature of this matching process is that it is opportunistic in its ability to find and use the most useful groupings in any particular image.
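The off-line indexing and run-time lookup just described can be sketched as a table keyed on grouping type. The model data below is a made-up fragment for illustration, not the actual SCERPO stapler model:

```python
from collections import defaultdict

def index_model_groupings(model_groupings):
    """Off-line step: store each precomputed model grouping under its
    type. Rotational symmetries or other ambiguities would contribute
    additional entries to these lists."""
    table = defaultdict(list)
    for grouping in model_groupings:
        table[grouping["type"]].append(grouping)
    return table

def candidate_matches(image_groupings, table):
    """Run-time step: pair each image grouping only with model groupings
    of the same type, rather than with every model feature."""
    return [(img, mdl)
            for img in image_groupings
            for mdl in table.get(img["type"], [])]

model = index_model_groupings([
    {"type": "trapezoid", "edges": ("e1", "e2", "e3", "e4")},
    {"type": "trapezoid", "edges": ("e5", "e6", "e7", "e8")},
    {"type": "parallel pair", "edges": ("e1", "e2")},
])
image = [{"type": "trapezoid", "segments": ("s3", "s7", "s9", "s12")}]
```

A trapezoid found in the image generates only the two trapezoid candidates rather than a match against every model edge, which is the entire point of precomputing the lists by type.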
Given that there is no prior knowledge of specific occlusions, viewpoint, amount of noise, or failures in the detection process, it is important to use a run-time ranking process that selects among the actual groupings found in any particular image, in order to make use of those that are most likely to be non-accidental. Since each view of an object is likely to give rise to many characteristic perceptual groupings in an image, it is usually possible to find features to initiate the search process even when a substantial portion of the object is occluded.

The current implementation of SCERPO is at a very simple level in terms of the sophistication of perceptual grouping and matching, and as should be clear from the above discussion there are many opportunities for making these processes faster and more complete. As more types of groupings and other viewpoint-invariant features are added, the number of possible matches would increase, but the expected amount of search required for a successful match would decrease due to the increased specificity of the matches with the highest rankings.

The implementation of the viewpoint-solving and verification process has already been described in detail in an earlier section. This is the most robust and reliable component of the system, and its high level of performance can compensate for many weaknesses at the earlier stages. The low probability of false positives in this component means that failure at the earlier levels tends to result simply in an increased search space rather than incorrect matches. Figure 7(f) shows a set of segments that have been matched against the model after the verification process has completed its cycle of viewpoint-solving and image matching. Figure 7(g) shows the model projected onto the image from its final calculated viewpoint.
Once one instance of an object has been identified, its matched image segments are removed from consideration and the system continues to search for other objects. A second instance of the stapler was identified, as shown in Figure 7(h).

The current implementation of SCERPO is designed as a research and demonstration project, and much further work would be required to develop the speed and generality needed for many applications. Relatively little effort has been devoted to minimizing computation time. The image processing components require only a few seconds of computation on the VICOM image processor, but then the image must be transferred to a VAX 11/750 running UNIX 4.2 for further processing. It requires about 30 seconds for a program written in C to read the zero-crossing image and output a file of linked edge points. This file is read by a Franz Lisp routine, and all subsequent processing is performed within the Franz Lisp virtual memory environment. Segmentation into straight lines requires 40 seconds, indexing and grouping operations require about 1 minute, and the later stages of matching and verification took 40 seconds for this example. There are numerous ways in which the code could be improved to reduce the required amount of computation time if this were a major goal. Each iteration of the crucial viewpoint-solving process requires at most several hundred floating point operations, so there is reason to believe that a carefully coded version of the basic search loop could run at high rates of speed.

6. Directions for Future Research

The most obvious direction in which to extend the current system is to generalize the object models to include many new types of visual knowledge. These extensions could include the modeling of moveable articulations, optional components, and other variable parameters in the models.
The section on solving for spatial correspondence described methods for incorporating these extensions during the viewpoint-solving and matching process. However, further research would be required to provide control for choosing which of many parameters to solve for first. Imagine, for example, that we had a generic model of the human face. The model would include small ranges of variation for the size and position of every feature, as well as many optional components such as a beard or glasses. However, given some tentative correspondences for, say, the eyes and nose, we could use the expectation of bilateral symmetry and the most tightly constrained dimensions of our model to solve for approximate viewpoint. This would then suggest quite tightly constrained regions in which to search for other features, such as ears, chin, eyebrows, etc., each of which could be used to derive better estimates of viewpoint and the other parameters. The resulting values of these parameters could then be mapped into a feature space and used to identify particular individuals, which in turn may lead to further detailed constraints and expectations. Some mechanisms for ordering these constraints were incorporated into the ACRONYM system [4].

The current implementation of SCERPO has used only an edge-based description of the image because that is a comparatively reliable and well-researched form of image analysis. But the same framework could incorporate many other dimensions of comparison between model and image, including areas such as surface modeling, texture, color, and shading properties. Further research would be required to detect viewpoint-invariant aspects of these properties during bottom-up image analysis. Many of the modeling and predictive aspects of these problems have been developed for use in computer graphics, but it may be necessary to find faster ways to perform these computations for use in a computer vision system.
Once a number of different sources of information are used to achieve the optimal ordering of the search process, it is necessary to make use of general methods for combining multiple sources of evidence. The use of evidential reasoning for this problem has been discussed elsewhere by the author in some detail [16, Chap. 6]. These methods make use of prior estimates for the probability of the presence of each object, and then update these estimates as each new source of evidence is uncovered. Sources of evidence might include particular perceptual groupings, colors, textures, and contextual information. Context plays an important role in general vision, since most scenes will contain some easily-identified objects which then provide a great deal of information regarding size, location, and environment, which can greatly ease the recognition of more difficult components of the scene. Evidential reasoning also provides an opportunity to incorporate a significant form of learning, since the conditional probability estimates can be continuously adjusted towards their optimal values as a system gains experience with its visual environment. In this way associations could be automatically created between particular objects and the viewpoint-invariant features to which they are likely to give rise in the image.

The psychological implications of this research are also deserving of further study. Presumably human vision does not perform a serial search of the type used in the SCERPO system. Instead, the brief time required for typical instances of recognition indicates that any search over a range of possible objects and parameters must be occurring in parallel. Yet even the human brain does not contain enough computational power to search over every possible object at every viewpoint and position in the image. This can be demonstrated by the fact that even vague non-visual contextual clues can decrease the length of time required to recognize degraded images [14].
Presumably, if a complete search were being performed in every instance, any top-down clues that narrowed the search would have little effect. Given that the search is proceeding in parallel, the mechanisms used for ranking the search in SCERPO would instead be used to select a number of possibilities to explore in parallel, limited according to the available computational resources. This model for the recognition process suggests many psychophysical experiments in which average recognition times could be measured for different combinations of image data and contextual information.

7. Related Research on Model-Based Vision

The methods used in the SCERPO system are based on a considerable body of previous research in model-based vision. The pathbreaking early work of Roberts [21] demonstrated the recognition of certain polyhedral objects by exactly solving for viewpoint and object parameters. Matching was performed by searching for correspondences between junctions found in the scene and junctions of model edges. Verification was then based upon exact solution of viewpoint and model parameters using a method that required seven point-to-point correspondences. Unfortunately, this work was poorly incorporated into later vision research, which instead tended to emphasize non-quantitative and much less robust methods such as line-labeling.

The ACRONYM system of Brooks [4] used a general symbolic constraint solver to calculate bounds on viewpoint and model parameters from image measurements. Matching was performed by looking for particular sizes of elongated structures in the image (known as ribbons) and matching them to potentially corresponding parts of the model. The bounds given by the constraint solver were then used to check the consistency of all potential matches of ribbons to object components.
While providing an influential and very general framework, the actual calculation of bounds for such general constraints was mathematically difficult, and approximations had to be used that did not lead to exact solutions for viewpoint. In practice, prior bounds on viewpoint were required, which prevented application of the system to full three-dimensional ranges of viewpoints.

Goad [10] has described the use of automatic programming methods to precompute a highly efficient search path and viewpoint-solving technique for each object to be recognized. Recognition is performed largely through exhaustive search, but precomputation of selected parameter ranges allows each match to place tight viewpoint constraints on the possible locations of further matches. Although the search tree is broad at the highest levels, after about 3 levels of matching the viewpoint is essentially constrained to a single position and little further search is required. The precomputation not only allows the fast computation of the viewpoint constraints at runtime, but it also can be used at the lowest levels to perform edge detection only within the predicted bounds and at the minimum required resolution. This research has been incorporated in an industrial computer vision system by Silma Inc. which has the remarkable capability of performing all aspects of three-dimensional object recognition within as little as 1 second on a single microprocessor. Because of their extreme runtime efficiency, these precomputation techniques are likely to remain the method of choice for industrial systems dealing with small numbers of objects. Other closely related research on model-based vision has been performed by Shirai [23] and Walter & Tropf [25].

There has also been a substantial amount of research on the interpretation of range data and matching within the three-dimensional domain.
While we have argued here that most instances of recognition can be performed without the preliminary reconstruction of depth, there may be industrial applications in which the measurement of many precise three-dimensional coordinates is of sufficient importance to require the use of a scanning depth sensor. Grimson & Lozano-Pérez [11] have described the use of three-dimensional search techniques to recognize objects from range data, and describe how these same methods could be used with tactile data, which naturally occurs in three-dimensional form. Further significant research on recognition from range data has been carried out by Bolles et al. [3] and Faugeras [8]. Schwartz & Sharir [22] have described a fast algorithm for finding an optimal least-squares match between arbitrary curve segments in two or three dimensions. This method has been combined with the efficient indexing of models to demonstrate the recognition of large numbers of two-dimensional models from their partially obscured silhouettes. This method also shows much promise for extension to the three-dimensional domain using range data.

8. Conclusions

One goal of this paper has been to describe the implementation of a particular computer vision system. However, a more important objective for the long-term development of this line of research has been to present a general framework for attacking the problem of visual recognition. This framework does not rely upon any attempt to derive depth measurements bottom-up from the image. Instead, the bottom-up description of an image is aimed at producing viewpoint-invariant groupings of image features that can be judged unlikely to be accidental in origin, even in the absence of specific information regarding which objects may be present. These groupings are not used for final identification of objects, but rather serve as "trigger features" to reduce the amount of search that would otherwise be required.
Actual identification is based upon the full use of the viewpoint consistency constraint, and maps the object-level data right back to the image level without any need for the intervening grouping constructs. This interplay between viewpoint-invariant analysis for bottom-up processing and viewpoint-dependent analysis for top-down processing provides the best of both worlds in terms of generality and accurate identification. Many other computer vision systems have experienced problems because they attempt to use viewpoint-specific features early in the recognition process, or because they attempt to identify an object simply on the basis of viewpoint-invariant characteristics. The many quantitative constraints generated by the viewpoint consistency analysis allow for robust performance even in the presence of only partial image data, which is one of the most basic hallmarks of human vision.

There has been a tendency in computer vision to concentrate on the low-level aspects of vision because it is presumed that good data at this level is a prerequisite to reasonable performance at the higher levels. However, without any widely accepted framework for the higher levels, the development of the low-level components is proceeding in a vacuum, without an explicit measure for what would constitute success. This situation encourages the idea that the purpose of low-level vision should be to recover explicit physical properties of the scene, since this goal can at least be judged in its own terms. But recognition does not depend on physical properties so much as on stable visual properties. This is necessary so that recognition can occur even in the absence of the extensive information that would be required for the bottom-up physical reconstruction of the scene. If a widely accepted framework could be developed for high-level visual recognition, then it would provide a whole new set of criteria for evaluating work at the lower levels.
We have suggested examples of such criteria in terms of viewpoint invariance and the ability to distinguish significant features from accidental instances. If such a framework were adopted, then rapid advances could be made in recognition capabilities by independent research efforts to incorporate many new forms of visual information.

Acknowledgments

This research was supported by NSF grant DCR-8502009. Implementation of the SCERPO system relied upon the extensive facilities and software of the NYU vision laboratory, which are due to the efforts of Robert Hummel, Jack Schwartz, and many others. Robert Hummel, in particular, provided many important kinds of technical and practical assistance during the implementation process. Mike Overton provided help with the numerical aspects of the design. Much of the theoretical basis for this research was developed while the author was at the Stanford Artificial Intelligence Laboratory, with the help of Tom Binford, Rod Brooks, Chris Goad, David Marimont, Andy Witkin, and many others.

References

[1] Barnard, Stephen T., "Interpreting perspective images," Artificial Intelligence, 21 (1983), 435-462.

[2] Barrow, H.G. and J.M. Tenenbaum, "Interpreting line drawings as three-dimensional surfaces," Artificial Intelligence, 17 (1981), 75-116.

[3] Bolles, R.C., P. Horaud, and M.J. Hannah, "3DPO: A three-dimensional part orientation system," Proc. of 8th International Joint Conf. on Artificial Intelligence (Karlsruhe, West Germany, 1983), 1116-1120.

[4] Brooks, Rodney A., "Symbolic reasoning among 3-D models and 2-D images," Artificial Intelligence, 17 (1981), 285-348.

[5] Clark, Dayton and Robert Hummel, "VSH user's manual: an image processing environment," Robotics Research Technical Report, Courant Institute, New York University (September 1984).

[6] Conte, S.D. and Carl de Boor, Elementary Numerical Analysis: An Algorithmic Approach, Third Edition (New York: McGraw-Hill, 1980).

[7] Fischler, Martin A.
and Robert C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, 24, 6 (1981), 381-395.

[8] Faugeras, O.D., "New steps toward a flexible 3-D vision system for robotics," Proc. of 7th International Conference on Pattern Recognition (Montreal, 1984), 796-805.

[9] Gibson, J.J., The Ecological Approach to Visual Perception (Boston: Houghton Mifflin, 1979).

[10] Goad, Chris, "Special purpose automatic programming for 3D model-based vision," Proceedings ARPA Image Understanding Workshop (1983).

[11] Grimson, Eric, and Tomás Lozano-Pérez, "Model-based recognition and localization from sparse range or tactile data," MIT AI Memo 738 (August 1983). To appear in Int. Journal of Robotics Research.

[12] Hochberg, Julian E., "Effects of the Gestalt revolution: The Cornell symposium on perception," Psychological Review, 64, 2 (1957), 73-84.

[13] Hochberg, Julian E. and Virginia Brooks, "Pictorial recognition as an unlearned ability: A study of one child's performance," American Journal of Psychology, 75 (1962), 624-628.

[14] Leeper, Robert, "A study of a neglected portion of the field of learning: the development of sensory organization," Journal of Genetic Psychology, 46 (1935), 41-75.

[15] Lowe, David G., "Solving for the parameters of object models from image descriptions," Proc. ARPA Image Understanding Workshop (College Park, MD, April 1980), 121-127.

[16] Lowe, David G., Perceptual Organization and Visual Recognition (Boston, Mass.: Kluwer Academic Publishers, 1985).

[17] Lowe, David G., and Thomas O. Binford, "The recovery of three-dimensional structure from image curves," IEEE Trans. on Pattern Analysis and Machine Intelligence, 7, 3 (May 1985), 320-326.

[18] Marr, David, "Early processing of visual information," Philosophical Transactions of the Royal Society of London, Series B, 275 (1976), 483-524.

[19] Marr, David, Vision (San Francisco: W.H. Freeman and Co., 1982).
[20] Marr, David, and Ellen Hildreth, "Theory of edge detection," Proc. Royal Society of London, B, 207 (1980), 187-217.

[21] Roberts, L.G., "Machine perception of three-dimensional objects," in Optical and Electro-optical Information Processing, Tippet et al., Eds. (Cambridge, Mass.: MIT Press, 1966), 159-197.

[22] Schwartz, J.T. and M. Sharir, "Identification of partially obscured objects in two dimensions by matching of noisy characteristic curves," Tech. Report 165, Courant Institute, New York University (June 1985).

[23] Shirai, Y., "Recognition of man-made objects using edge cues," in Computer Vision Systems, A. Hanson, E. Riseman, Eds. (New York: Academic Press, 1978).

[24] Stevens, Kent A., "The visual interpretation of surface contours," Artificial Intelligence, 17 (1981), 47-73.

[25] Walter, I. and H. Tropf, "3-D recognition of randomly oriented parts," Proceedings of the Third International Conf. on Robot Vision and Sensory Controls (November 1983, Cambridge, Mass.), 193-200.

[26] Wertheimer, Max, "Untersuchungen zur Lehre von der Gestalt II," Psychol. Forsch., 4 (1923). Translated as "Principles of perceptual organization" in Readings in Perception, David Beardslee and Michael Wertheimer, Eds. (Princeton, N.J.: 1958), 115-135.

[27] Witkin, Andrew P. and Jay M. Tenenbaum, "On the role of structure in vision," in Human and Machine Vision, Beck, Hope & Rosenfeld, Eds. (New York: Academic Press, 1983), 481-543.

[28] Wolf, Paul R., Elements of Photogrammetry (New York: McGraw-Hill, 1983).

[29] Zucker, Steven W., "Computational and psychophysical experiments in grouping: Early orientation selection," in Human and Machine Vision, Beck, Hope & Rosenfeld, Eds. (New York: Academic Press, 1983), 545-567.