- Edwin Abbott
The first in a long series of machine vision experiments began outside Harkers café in the centre of York in the autumn of 2000. York is quite an old city crammed with odd-looking buildings of various persuasions. My plan was to take photos of them from different angles, and then devise a program to register them together and produce 3D models in a completely automated way. Previously I'd downloaded a demonstration program from a web site which purported to do exactly this, but after spending hours manually registering equivalent points the results were not exactly impressive. They looked less like a building, and more like a demolition zone. There had to be a better way.
Getting two or more photos of a building from different angles was more problematic than I'd expected. It was very difficult to make sure that the photos were taken at exactly the same orientation, and there seems to be a natural psychological tendency to focus in upon some conspicuous feature, such as a doorway or window. Pictures which I thought I'd taken in parallel turned out not to be once I had got them home and downloaded them. What's more, I was beginning to get odd looks from passing tourists.
I decided to scale things down a bit. Using a pair of webcams I could do something similar, but on a miniature desktop scale and in a way which was far more reliable and repeatable and didn't involve standing around in the rain. I used Quickcam Express cameras, along with the SDK which came with them. I named the new project Sentience.
As it turned out, the SDK was pretty dreadful. With both cameras connected to a PC, at best it allowed me to switch between them with a delay of approximately five seconds between taking one picture and the next. Provided that the scene was completely static this was OK, but for most practical purposes it was totally unacceptable. I needed to be able to take images from both cameras at more or less the same time, or with a negligible delay.
Here the troubles began. I consulted many people about the issue, including the manufacturers, but they all said that it was impossible to run two cameras of an identical model at the same time on the same PC. This was because the SDK used an ancient system called Video for Windows (VFW), from a distant time when Windows 95 was still called "Chicago" and when Bill Gates thought the Internet was just a passing fad. Back then even connecting a single camera to a PC was something of an achievement. VFW in its simple-minded way expected there to be only one camera per software driver. If you tried to connect more it just gave up. It actually took me over a year to figure out how to overcome this problem. It's entirely possible that when people said it was impossible they were right at the time, since the technology was only just beginning to emerge.
The solution involved using something called DirectShow, which is part of DirectX, the system used to handle graphics in Windows. Compared to the equivalent system on the Linux operating system, unsurprisingly called "Video 4 Linux" or V4L, DirectShow is a nightmare of a system. Microsoft had obviously tried to build a system which was all things to all men, but ended up impressing nobody. The machinations of DirectShow leave even seasoned programmers looking blank-faced.
In desperation, during the course of that year I had tried developing under Linux (GCC compiler and KDevelop), using a dynamic programming algorithm originally developed by Stan Birchfield. Amazingly, the Linux system worked, but the algorithm itself left a lot to be desired and results were often significantly worse than advertised. This is a common theme with many vision algorithms, as I've frequently found to my cost. The only problem with Linux was that it didn't support the compression algorithms used by the webcams (presumably because of patent issues), so all images were transferred from the cameras to the PC as bitmaps, which was painfully slow. This issue probably no longer applies, especially if FireWire or USB 2 cameras are used which have no image compression.
The problem of matching two camera images together to give distance information is known as the stereo correspondence problem. For most of the points in one image there also exist corresponding points in the other image (unless they are outside the field of vision, or obscured by closer objects). To the casual observer this may seem like a trivially easy task, but this is only because as humans our brains are custom built to solve these kinds of everyday perception problems. We are able to perform these tasks with such rapidity and ease that we're not even consciously aware that we're doing it, although as it turns out a sizeable chunk of the back end of our brains is dedicated to doing just this.
The history of research in this area goes back as far as you want to trace it, but some of the earliest computer-based stereo vision work was done by the AI heavyweight Hans Moravec at the Stanford Artificial Intelligence Laboratory in the 1970s. Many others have also devised their own methods, but Moravec's system still stands today as the ne plus ultra.
Most people doing research on computer vision imagine that the world is like a flat sheet of paper. Data captured by digital imaging, or by the retina, does indeed have a two dimensional structure but the information contained within it does not. Biological vision systems are fundamentally designed to cope with three dimensional information, so it's hardly surprising that few artificial systems are yet able to achieve the levels of competence seen in the rest of the animal kingdom. Being able to escape from this 2D mind-set, from regions, lines and above all templates, is a critical skill needed in order to develop general purpose vision systems which can be used in realistic everyday situations, rather than the highly contrived environments of factory inspection lines.
From trying numerous different algorithms, including the simple one described by Moravec, it was clear that I was doing something wrong. Most of the time I was just getting noisy, fuzzy depth images which looked little better than a heavy snow shower. Probably many of the existing researchers were only getting good results because of the quality of their camera equipment and the controlled nature of the environments in which they operated. However, I knew that in the end there was more to it than precision optics and fancy frame grabbers. After all, biological vision systems use components which are inherently noisy and inaccurate, and yet we are all able to construct from this raw visual data a perception of our surroundings which appears to be completely clear and precise. Our brains must be using very simple yet powerful methods for coping with noise and variation, and whatever these are could surely be replicated artificially.
It wasn't until I started reading about the initial stages of visual processing in the brain that I began to make any real progress. The primary visual area, known as V1 (although nothing to do with flying bombs), consists of a set of neural groups which are orientation selective. This means that each group responds to a small area of the image on your retina in a way which gives a local indication of direction. Think of it as being like a local gradient, or the angle at which a spinning top comes to rest. This information is highly invariant to changes in lighting and contrast, which means that stereo correspondence can be performed between two images even without any form of colour correction or other adjustments. Much of the computational expense traditionally incurred by noise filtering algorithms can be dispensed with entirely, without loss of quality.
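To make the idea of matching on local orientation a little more concrete, here is a minimal sketch in Python with NumPy. It is not the Sentience algorithm, whose details are not given here; it only illustrates the general principle under simple assumptions. The function names (`gradient_orientation`, `match_disparity`), the synthetic random texture, the patch size and the gain and offset values are all hypothetical choices made for the illustration. One image is made darker and lower in contrast than the other, and a patch is matched along the same scanline using either raw intensities or gradient directions.

```python
import numpy as np

def gradient_orientation(img):
    """Local gradient direction (radians) at every pixel."""
    gy, gx = np.gradient(img.astype(float))
    return np.arctan2(gy, gx)

def sad_cost(a, b):
    """Sum of absolute differences between two patches of raw intensities."""
    return np.abs(a - b).sum()

def angular_cost(a, b):
    """Sum of (1 - cos) of the angle differences; zero when directions agree."""
    return (1.0 - np.cos(a - b)).sum()

def match_disparity(left, right, row, col, size, max_disp, cost):
    """Slide a left-image patch along the same rows of the right image
    and return the shift (disparity, in pixels) with the lowest cost."""
    patch = left[row:row + size, col:col + size]
    costs = [cost(patch, right[row:row + size, col - d:col - d + size])
             for d in range(max_disp + 1)]
    return int(np.argmin(costs))

# Synthetic scene (an assumption for the demo): random texture, with the
# left view equal to the right view shifted by a known disparity.
rng = np.random.default_rng(0)
right = rng.uniform(0, 255, size=(32, 64))
true_disp = 5
left = np.roll(right, true_disp, axis=1)

# Simulate a second camera with different exposure: lower gain plus an offset.
right_dim = 0.5 * right + 40.0

d_orient = match_disparity(gradient_orientation(left),
                           gradient_orientation(right_dim),
                           row=10, col=20, size=8, max_disp=10,
                           cost=angular_cost)
d_raw = match_disparity(left, right_dim,
                        row=10, col=20, size=8, max_disp=10, cost=sad_cost)
print("true:", true_disp, "orientation match:", d_orient, "raw match:", d_raw)
```

Scaling an image by a positive gain and adding an offset scales both gradient components equally, so the gradient direction is unchanged and the orientation cost is still essentially zero at the true shift. The raw-intensity cost is no longer zero there, so it can settle on the wrong shift unless the images are first corrected for brightness and contrast.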
From vision to imagination
Using these methods I was able to get performance comparable to, and in some cases considerably exceeding, that of algorithms devised by other people, using only very low cost off-the-shelf hardware. But this isn't the end of the Sentience story. Being able to tell how far away things are from the cameras is only the beginning.
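For readers wondering how a matched pair of points turns into a distance, the usual relationship for two parallel cameras is that depth is inversely proportional to disparity. The sketch below assumes an idealised pinhole camera model; the function name and the focal length and baseline figures are made-up values for illustration, not measurements from the webcams described above.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Distance (metres) to a point seen by two parallel pinhole cameras.

    disparity_px: horizontal shift of the point between the two images
    focal_px:     focal length expressed in pixels (assumed value)
    baseline_m:   distance between the two camera centres (assumed value)
    """
    if disparity_px <= 0:
        # Zero disparity means the point is effectively at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 500 pixel focal length, cameras 10 cm apart.
for d in (50, 10, 2):
    print(d, "px ->", depth_from_disparity(d, focal_px=500, baseline_m=0.1), "m")
```

Because depth varies as one over disparity, a one pixel matching error matters little for nearby objects but a great deal for distant ones, which is one reason noisy matches give such poor depth images.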
There is an important difference between systems which simply calculate a depth map (such as the ones supplied by Point Grey) and general purpose vision. In a general purpose vision system data supplied from the cameras must be continuously synchronised with an internal three dimensional model, and here synchronised is the key term. As you're reading this you may think that you're pretty well aware of what's going on in your immediate surroundings, but that isn't strictly true. At any point in time you are only actually seeing a small detailed area at the centre of your retina, surrounded by a rather blurry low quality periphery. Your eyes dart about unconsciously twice every second, continuously sampling and re-sampling your visual surroundings, and out of this sporadic and incomplete data your brain is able to synthesise an apparently clear and consistent visual scene. But before you start thinking that you're living in an entirely fictitious world, don't worry. The synthesis which your brain is creating had better be a very close approximation of what's really out there, otherwise there would be a lot of mayhem whenever people got behind the wheel of a motor car. Much of my working experience in industrial automation involved synchronising one system closely with another, and so this concept isn't at all alien to me.
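One common way of keeping an internal 3D model in step with a stream of depth measurements is an evidence (or occupancy) grid of the kind described in the Moravec and Murray references. The sketch below is only an illustration of that general idea, not a description of how Sentience does it. The class name `EvidenceGrid`, the grid dimensions, the cell size and the log-odds increment are all assumptions chosen for the example.

```python
import numpy as np

class EvidenceGrid:
    """A small 3D grid holding log-odds evidence that each cell is occupied."""

    def __init__(self, shape=(20, 20, 20), cell_m=0.1):
        self.log_odds = np.zeros(shape)  # 0 means "unknown" (probability 0.5)
        self.cell_m = cell_m             # edge length of one cell in metres

    def update(self, point_m, hit_log_odds=0.85):
        """Add evidence for the cell containing a measured 3D point (metres)."""
        idx = tuple(int(c // self.cell_m) for c in point_m)
        if all(0 <= i < n for i, n in zip(idx, self.log_odds.shape)):
            self.log_odds[idx] += hit_log_odds  # points outside the grid are ignored

    def decay(self, factor=0.9):
        """Let old evidence fade so the model tracks a changing scene."""
        self.log_odds *= factor

    def probability(self):
        """Convert log-odds back to occupancy probabilities."""
        return 1.0 / (1.0 + np.exp(-self.log_odds))

# Three successive stereo measurements of roughly the same point.
grid = EvidenceGrid()
for _ in range(3):
    grid.update((0.55, 1.05, 0.25))
print(grid.probability()[5, 10, 2])  # well above 0.5 after repeated hits
```

Repeated sightings of the same point strengthen the model's belief that something is there, while the decay step lets stale evidence fade, so the model is continually re-synchronised with what the cameras currently report.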
Over the next two years I plan to use this vision system for a variety of different applications. The first involves the development of spatial awareness capabilities for both my robots, with the ultimate aim of self-awareness and forms of general machine intelligence. I also hope to develop a stereo-based navigation and object recognition system for the small mobile robots produced by White Box Robotics. There are also potential applications in biometrics, human-computer interaction, security systems, entertainment and a collision avoidance safety system for motor vehicles.
References
Robot Spatial Perception by Stereoscopic Vision and 3D Evidence Grids, Hans Moravec, Robotics Institute
Depth Discontinuities by Pixel-to-Pixel Stereo, Stan Birchfield and Carlo Tomasi, Proceedings of the Sixth IEEE International Conference on Computer Vision, Mumbai, India, pages 1073-1080, January 1998
Stereo Vision and Occupancy Grids, Don Murray, May 1998
The Quest for Consciousness: A neurobiological approach, Christof Koch, 2003
Submitted: 26/06/2005 Article content copyright © Bob Mottram, 2005.
All content copyright © 1998-2007, Generation5 unless otherwise noted.