Instructor (Oussama Khatib):Okay. Let's get started. So today it's really a great opportunity for all of us to have a guest lecturer, one of the leaders in robotics vision, Gregory Hager from Johns Hopkins, who will be giving this guest lecture. On Monday, I wanted to mention that on Wednesday we have the mid-term in class. Tonight and tomorrow we have the review sessions, so I think everyone has signed on for those sessions. And next Wednesday the lecture will be given by a former Ph.D. student from Stanford University, Krasimir Kolarov, who will be giving the lecture on trajectories and inverse kinematics. So welcome back.
Guest Instructor (Gregory Hager):Thank you. So it is a pleasure to be here today and thank you, Oussama, for inviting me. So Oussama told me he'd like me to spend the lecture talking about vision and, as you might guess, that's a little bit of a challenge. At last count, there were a few over 1,000 papers in computer vision in peer-reviewed conferences and journals last year. So summarizing all those in one lecture is a bit more than I can manage to do. But what I thought I would do is try to focus in specifically on an area that I've been interested in for really quite a long time. Namely, what is the perception and sensing you need to really build a system that has both manipulation and mobility capabilities? And so really this whole lecture has been designed to give you a taste of what I think the main components are and also to give you a sense of what the current state of the art is. And, again, obviously, with the number of papers produced every year, defining the state of the art is difficult, but I can at least give you a sense of how to evaluate the work that's out there and how you might be able to use it in a robotic environment. And so really, I want to think of it as answering just a few questions, or looking at how perception could answer a few questions. So the simplest question you might imagine trying to answer is where am I relative to the things around me? You turn a robot on, it has to figure out where it is and, in particular, be able to move without running into things, be able to perform potentially some useful tasks that involve mobility. The next step up, once you've decided where things are, is you'd actually like to be able to identify where you are and what the things are in the environment. Clearly, the first step toward being able to do something useful in the environment is understanding the things around you and what you might be able to do with them. The third question is, once I know what the things are, how do I interact with them?
So there's a big difference between being able to walk around and not bump into things and being able to actually safely reach out and touch something and be able to manipulate it in some interesting way. And then really the last question, which I'm not gonna talk about today, is how do I actually think about solving new problems that in some sense were unforeseen by the original designer of the system? So it's one thing to build a materials handling robot where you've programmed it to deal with the five objects that you can imagine coming down the conveyor line. It's another thing to put a robot down in the middle of the kitchen and say here's a table, clear the table, including china, dinnerware, glasses, boxes, things that potentially it's never seen before. It needs to be able to manipulate safely. That, I think, is a problem I won't touch on today, but I'll at least give some suggestions as to where the problems lie. So I should say, again, I'm gonna really breeze through a lot of material quickly, but at the same time this is a class. Obviously, if you're interested in something, if you have a question, if I'm mumbling and you can't understand me, just stop me and we'll go back and talk in more depth about whatever I just covered. So with that, the topics I've chosen today are really in many ways from bottom, if you will, to top. From low-level capabilities to higher level. First, computational stereo, a way of getting the geometry of the environment around you; feature detection and matching, a way of starting to identify objects and identify where you are; and motion tracking and visual feedback, how do you actually use the information from vision to manipulate the world. And, again, I think the applications of those particular modules are fairly obvious in robotic mobility and manipulation. So, again, let me just dive right in. Actually, how many people here have taken or are taking the computer vision course that is being taught? Okay. So a few people.
For you this will be a review. You can, I guess, hopefully bone up for the mid-term whenever that's gonna happen or something. And hopefully I don't say anything that disagrees with anything that Yanna has taught you. But so what is computational stereo? Well, computational stereo quite simply is a phenomenon that you're all very familiar with. It's the fact that if you have two light-sensing devices, eyes, cameras, and you view the same physical point in space and there's a physical separation between those viewpoints, you can now solve a triangulation problem. I can determine how far something is from the viewing sensors by finding this point in both images and then simply solving a geometric triangulation problem. Simple. In fact, there's a lot of stereo going on in this room. Pretty much everybody has stereo, although oddly about 10 percent of the population is stereo blind for one reason or another. So it turns out that in this room there are probably three or four of you who actually don't do stereo, but you compensate in other ways. Even so, having stereo, particularly in a robotics system, would be a huge step forward. Sorry for the colors there. I didn't realize it had transposed them. So when you're solving a stereo problem in computer vision there are really three core problems. The first problem is one of calibration. In order to solve the triangulation problem I need to know where the sensors are in space relative to each other. Second, a matching problem. So remember, in stereo what I'm presented with is a pair of images. Now, those images can vary in many different ways. Hopefully they contain common content. What I need to do is to find the common content, even though there is variation between the images. And then finally, reconstruction. Once I've actually performed the matching, now I can reconstruct the three-dimensional space that I'm surrounded by. So I'll talk just briefly about all three of those. So, first, calibration. So, again, why do we calibrate?
Well, we calibrate for actually any number of reasons. Most important is that we have to characterize the sensing devices. So, in particular, if you think about an image, you're getting information in pixels. So if you say that there's some point in any image, it's got a pixel location. It's just a set of numbers at a particular location. What you're interested in ultimately is computing distance to something in the world. Well, pixels are one unit. Distances are a different unit. Clearly, you need to be able to convert between those two. So typically in a camera there are four important numbers that we use to convert between them: two scale factors that convert from pixels to millimeters and two numbers that characterize the center of projection in the image. So the good news for you is that there are a number of good toolkits out there that let you do this calibration. They let you characterize what's called the intrinsic or internal parameters of the camera. In addition, we need to know the relationship of the two cameras to each other. That's often called the extrinsic or external calibration. That's also something that you can get very good toolkits to solve. There's a toolkit for MATLAB, in fact, that solves it quite well. So calibration really is just getting the geometry of the system. So we're set up and we're ready to go. Now, for the current purposes let's assume that we happen to have a very special geometry. So our special geometry is going to be a pair of cameras that are parallel to each other, and the image planes are coplanar and, in fact, the scan lines are perfectly aligned with each other. So if I look at a point in the left image and I want to find the corresponding point in the right image for the same physical point in the world, it's gonna be on the same row and, in fact, that's gonna be true for all the rows of the camera. So it's a really convenient way to think about cameras for the moment.
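As a rough illustration of the four intrinsic numbers mentioned above, here is a minimal Python sketch of how two scale factors and a center of projection convert between a 3-D point in the camera frame and a pixel location. The names `fx`, `fy`, `cx`, `cy` and the numeric values are assumptions chosen for illustration, not values from the lecture, and lens distortion is ignored.

```python
def project(X, Y, Z, fx, fy, cx, cy):
    """Perspective projection of a 3-D point (camera frame) to pixel coordinates."""
    return fx * X / Z + cx, fy * Y / Z + cy


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Undo the intrinsics: pixel -> direction (x, y, 1) in the camera frame."""
    return (u - cx) / fx, (v - cy) / fy, 1.0


# Hypothetical intrinsics for a 640x480 camera (illustrative values only).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# A point 2 units in front of the camera, slightly right of and above the axis
# (Y points down in the image, so "above" is negative Y).
u, v = project(0.2, -0.1, 2.0, FX, FY, CX, CY)
ray = pixel_to_ray(u, v, FX, FY, CX, CY)
```

Scaling `ray` by the depth recovers the original 3-D point, which is why the depth is the only quantity stereo still has to supply.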
So for a camera system like that, solving the stereo problem really is, from a geometric standpoint, simple. So what did I say? I've got a point on one camera line; I've got a point in the other camera on the same line. What I can do is effectively solve triangulation by computing the difference in the coordinates between those two points. Again, not to go into great detail, but I can write down the equations of perspective projection, which I've done here, for two cameras for what I call now the X coordinate. So, in fact, on this last slide I forgot to point it out, but I'm gonna use a coordinate system, in fact, through this talk, which is X going to the right, Y going down in the image, and, I guess, most of you should be able to figure out which direction Z goes once I've told you those two things, right? Where does Z go? Out the camera. So Z is heading straight out of the camera lens. So that's the coordinate system we're going to be dealing with. So the things that we can find fairly easily in some sense are the X and Y's of the point. The unknown is the Z. So whenever I say depth you can think of it as I'm trying to compute the Z's. So I can write down perspective projection for a left camera and a right camera, which I've done here. They are offset by some baseline, which I've called B. I've also got the Y projection, but it turns out to be the same for both cameras because I've scan-line aligned them. So I've got three numbers, XL, XR, and Y. I've got three unknowns, X, Y, and Z. A little algebra allows us to solve for the depth Z as a function of disparity, which, again, is the difference between the two coordinates, the baseline of the camera, and this internal scaling factor, which allows us to go from pixels to millimeters. Just a couple things to notice about this. Depth is inversely proportional to disparity. So the larger the disparity, the smaller the depth. Makes sense.
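The "little algebra" can be written out as follows. This is a sketch under stated assumptions rather than a copy of the slide: the left camera sits at the origin, the right camera is displaced by the baseline $B$ along $+X$, image coordinates are measured from the center of projection, and both cameras share a single scale factor $f$.

```latex
\begin{align}
x_L &= f\,\frac{X}{Z}, \qquad x_R = f\,\frac{X - B}{Z}, \qquad y = f\,\frac{Y}{Z}, \\
d &= x_L - x_R = \frac{f B}{Z}, \\
Z &= \frac{f B}{d}, \qquad X = \frac{x_L\, Z}{f}, \qquad Y = \frac{y\, Z}{f}.
\end{align}
```

The three measurements $x_L$, $x_R$, $y$ therefore determine the three unknowns $X$, $Y$, $Z$, and the depth $Z$ is inversely proportional to the disparity $d$.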
As I get closer, my eyes have to keep going like this more and more and more to be able to see something. It's proportional to baseline. If I could pull my eyes out of my head and spread them apart, I could get better accuracy, and it's also proportional to the resolution of the imaging system. So if you put all that together you can start to actually think about designing stereo systems from very small to very large that operate at different distances and with different accuracies. The other thing to point out here is that depth being inversely proportional to disparity means that close to the camera we actually get very good depth resolution. As we get further and further away our ability to resolve depth by disparity goes down drastically. You see, out here at a distance of ten meters, one disparity level is already tens, maybe even hundreds, of centimeters of distance. So stereo is actually good in here. It's very hard out there. And, in fact, does anybody happen to know, for the human stereo system, what its optimal operating point is? Has anybody had a class talk about that? It's right about here. It's about 18 inches from your nose. Right at this point, you have great stereo acuity. You get in here and your eyes start to hurt. You get out here past about an arm's length and it just turns off. You actually don't use stereo at long distances. You're really just using it in this workspace and, of course, it makes sense. We're trying to manipulate, right? Well, I made a strong assumption about the cameras. I said that they had this very special geometry. And back in the good old days when I was your age... those are the good days, by the way, just so you're all aware of that fact. You're in the good days right now. Enjoy them. Back in the good old days, we actually used to try to build camera systems that had this geometry. Because if you start to change that geometry you no longer get this nice scan line property.
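To make the point about depth resolution falling off with distance concrete, here is a small Python sketch that computes how much the depth changes when the disparity changes by one pixel. It assumes the usual rectified relation Z = f * B / d; the focal scale of 500 pixels and the 10 cm baseline are hypothetical values chosen only for illustration.

```python
def depth_from_disparity(d, f, B):
    """Depth for a disparity d (pixels), scale factor f (pixels), baseline B (meters)."""
    return f * B / d


def depth_step(Z, f, B):
    """Change in depth when the disparity at depth Z drops by one pixel."""
    d = f * B / Z  # disparity observed at depth Z
    return depth_from_disparity(d - 1.0, f, B) - Z


F_PIXELS = 500.0   # hypothetical scale factor
BASELINE_M = 0.1   # hypothetical 10 cm baseline

near = depth_step(0.5, F_PIXELS, BASELINE_M)   # about 5 mm per disparity level
far = depth_step(10.0, F_PIXELS, BASELINE_M)   # about 2.5 m per disparity level
```

For small steps the same behavior follows from differentiating Z = f * B / d, which gives a depth change of roughly Z squared over (f * B) per pixel of disparity, so the error grows quadratically with distance.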
So, in fact, if I rotate the cameras inward, if I start to look at the relationship between corresponding points, I start to get these rays coming up. So I pick a point here, the line is now some slanted line. Well, it turns out luckily one of the things that really has been nailed down in the last decade or two is that that doesn't matter. We can always take a stereo pair that looks like this and we can resample the images so it's a stereo pair that looks like that and, in fact, by doing this calibration process we get it. So the good news is I can almost always think about the cameras being these very special scan-line-aligned cameras. And so everything I'm gonna say from now on is going to pretty much rely on the fact that I've done this so-called rectification process. So, again, a very nice abstraction. I'll just point out, again I'm not gonna talk a lot about it, but the relationship that I just described: if I have a point in one camera, how do I know the line to look for it on in the second camera? Well, it's a function of the relationship, the rotation and the translation, between the two cameras. And it turns out, this is, again, something that's really been nailed down in the last couple of decades: I can estimate this from images. So, in fact, I could literally take a pair of cameras, put them in this room, do some work, and I could figure out their geometric relationship without having any special apparatus whatsoever. What it also means is instead of doing stereo I could literally just take a video camera and walk like this, process the video images and effectively do stereo from a single camera using the video. And that's because I can estimate this matrix E in that relationship up there, which, if we worked out what E was, turns out to contain the rotation and translation between the two camera systems. And once I know that, I can do my rectification and I can do stereo.
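The slide with the relationship involving E is not reproduced in the transcript. As a sketch of the standard form it usually takes, the Python below builds an essential matrix from a rotation R and translation t and checks the epipolar constraint numerically. The convention that a point maps between camera frames as X_right = R @ X_left + t, and all numeric values, are assumptions made for this example.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def essential(R, t):
    """Essential matrix for the convention X_right = R @ X_left + t."""
    return skew(t) @ R


# Hypothetical verged pair: 5 degree rotation about Y plus a translation.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-0.1, 0.0, 0.02])
E = essential(R, t)

# Any 3-D point satisfies x_right^T E x_left = 0 in normalized image coordinates.
X_left = np.array([0.3, -0.2, 2.0])
X_right = R @ X_left + t
x_left = X_left / X_left[2]
x_right = X_right / X_right[2]
residual = float(x_right @ E @ x_left)

# Rectified special case: no rotation, pure sideways translation.
E_rect = essential(np.eye(3), np.array([-0.1, 0.0, 0.0]))
line = E_rect @ np.array([0.4, 0.25, 1.0])   # epipolar line a*x + b*y + c = 0
row_of_match = -line[2] / line[1]            # the line is horizontal at this y
```

In the rectified case the epipolar line comes out horizontal at the same image row as the original point, which is the scan-line property the rest of the lecture relies on.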
So actually stereo is a special case in some sense of taking a video and processing it to get motion and structure at the same time. So, again, that's a whole lecture, we won't go in there, but suffice to say that from a geometric point of view, we can actually deal with cameras now in a very general way with relatively few operator assumptions about how they start. Okay. So geometry calibration is done. I now want to reconstruct. But to do reconstruction I need to do matching. I need to look at a pair of images and say, hey, there's a point here and that same point is over there, and I want to solve the triangulation problem. So there are two major approaches, feature-based and region-based matching. So feature-based, obviously, depends on features. I could run an edge detector in this room. I could find some nice edges, the chairs, the edges of people, the floor, and then I could try to match these features between images, and now for every feature I'm gonna get depth. So if you do that you end up with these kind of little stick-figure-type cartoons here. So here it's a set of bookshelves and this is a result of running a feature-based stereo algorithm on that. So you can see on the one hand it's actually giving you kind of the right representation, right? The major structures here are the shelves and it's finding the shelves. But you notice that you don't get any other structure. You're just getting those features that you happen to pull out of the image. So the other approach is to say forget about features. Let me try to find a match for every pixel in the image. The so-called dense depth map. Every pixel's got to have some matching pixel, or at least up to some occlusion relationships. So let's just look for them and try to find them. And so this is the so-called region matching method. I'm gonna actually pick a pixel plus a support region and try to find the matching region in another image.
This has been a cottage industry for a very long time and so people pick their favorite ways of matching, their favorite algorithms to apply to those matches, and so on and so forth. So, again, a huge literature in doing that. There are a few match metrics which have come to be used fairly widely. Probably the most common one is something called the sum of absolute differences. It's right up there. Or more generally I've got something called zero-mean SAD, or zero-mean sum of absolute differences. That's probably the most widely used algorithm. Not the least of which because it's very easy to implement in hardware and it's very fast to operate. So you take a region and you take the difference. Take a region, take a region, take their difference, take their absolute value, sum those, and assume if they match that that difference is gonna be small. If they don't match it's gonna be big. All the other metrics I have up here are really different variations on that theme. You take this region, take that region, compare them, and try to minimize or maximize some metric. So if I have a match measure, now I can look at correspondence. Again, I get to use the fact that things are scan-line aligned. So when I pick a point in the left image, or a pixel in the left image, I know I'm just gonna look along a single line in the right image, not the whole image. So, simple algorithm: for every row, for every column, for every disparity. So for every distance that I'm gonna consider, essentially. I'm now gonna compute my match metric in some window. I'll record it. If it happens to be better than the best match I've found for this pixel... I'm sorry. If it's better than the best match I've found for this pixel, I'll record it. If not, I just go around and try the next disparity.
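The row, column, disparity loop just described can be sketched in Python as follows. This is a minimal illustration rather than anyone's production code: the function names, the window half-width `half`, and the convention that a pixel at column c in the left image appears at column c - d in the right image are all assumptions made for the example. Border pixels are simply left at zero.

```python
import numpy as np


def sad(a, b):
    """Sum of absolute differences between two equally sized windows."""
    return np.abs(a.astype(float) - b.astype(float)).sum()


def match_naive(left, right, max_disp, half=2, metric=sad):
    """For every row, every column, every disparity: keep the best-scoring match."""
    rows, cols = left.shape
    disp = np.zeros((rows, cols), dtype=int)
    for r in range(half, rows - half):
        for c in range(half + max_disp, cols - half):
            best = np.inf
            window_left = left[r - half:r + half + 1, c - half:c + half + 1]
            for d in range(max_disp + 1):
                window_right = right[r - half:r + half + 1,
                                     c - d - half:c - d + half + 1]
                score = metric(window_left, window_right)
                if score < best:          # better than the best match so far
                    best = score
                    disp[r, c] = d
    return disp


# Synthetic rectified pair: the left image is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right_img = rng.random((20, 30))
left_img = np.roll(right_img, 3, axis=1)
disp_naive = match_naive(left_img, right_img, max_disp=6, half=2)
```

On this textured synthetic pair every interior pixel should come back with a disparity of 3, since the window difference is exactly zero there and positive everywhere else.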
So you work this out: how much computing are you doing? Well, it's every pixel, rows and columns, every disparity. So, for example, for my eyes, if I'm trying to compute on kind of canonical images over a good working range, you know, three feet to a foot, maybe 100 disparities, 120 disparities. So rows, columns, 120 disparities, size of the window, which might be, let's say, 11 by 11 pixels, which is a pretty common number. So there's a lot of computing in this algorithm. A lot, a lot of computing. And, in fact, up until maybe ten years ago even just running a stereo algorithm was a feat. I'll just point out that it turns out that the way I just described that algorithm, although intuitive, is actually not the way that most people would implement it.
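A back-of-the-envelope count shows how quickly this adds up. The 120 disparities and the 11 by 11 window come from the numbers above; the 640 by 480 image size and the 30 frames per second rate are assumptions added for the sake of the arithmetic.

```python
ROWS, COLS = 480, 640      # assumed image size (not stated in the lecture)
DISPARITIES = 120          # from the lecture
WINDOW = 11                # 11 x 11 support region, from the lecture
FRAMES_PER_SECOND = 30     # assumed video rate

# One absolute difference per window pixel, per disparity, per image pixel.
ops_per_frame = ROWS * COLS * DISPARITIES * WINDOW * WINDOW
ops_per_second = ops_per_frame * FRAMES_PER_SECOND

print(f"{ops_per_frame:,} window comparisons per frame")
print(f"{ops_per_second:,} per second at {FRAMES_PER_SECOND} frames per second")
```

Under these assumptions the naive loop needs about 4.5 billion absolute differences per frame, which is why a straightforward implementation was out of reach for real-time use on older hardware.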
There's a slightly better way to do it. This is literally MATLAB code, so for anybody in computer vision this is the MATLAB to compute stereo on an image. You can see the main thing is I'm actually looping first over the disparities and then I'm looping over effectively rows and columns, doing them all at once. The reason for doing this, without going into details, is you actually save a huge number of operations in the image. So if you ever do implement stereo, and I know there are some people in this room who are interested in it, don't ever do this. It's the wrong thing to do. Do it this way. It's the right way to do it. If you do this you can actually get pretty good performance out of the system.
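The MATLAB code on the slide is not part of the transcript, so the following is not that code. It is a Python sketch of the same idea as described: loop over disparities on the outside, and for each disparity handle all rows and columns at once. The saving comes from computing each per-pixel absolute difference only once per disparity and summing windows with a running sum (an integral image here, which is an implementation choice made for this example). The function names and the zero-padded border handling are assumptions.

```python
import numpy as np


def box_sum(img, half):
    """Sum over a (2*half+1)^2 window at every pixel, using an integral image.
    Borders are zero-padded, so windows near the edge are only partial sums."""
    k = 2 * half + 1
    padded = np.pad(img, half, mode="constant")
    ii = np.pad(padded, ((1, 0), (1, 0)), mode="constant").cumsum(0).cumsum(1)
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]


def match_fast(left, right, max_disp, half=2):
    """SAD block matching with the disparity loop outermost."""
    left = left.astype(float)
    right = right.astype(float)
    rows, cols = left.shape
    best_cost = np.full((rows, cols), np.inf)
    disp = np.zeros((rows, cols), dtype=int)
    for d in range(max_disp + 1):
        # Left column c is compared with right column c - d, for all pixels at once.
        diff = np.abs(left[:, d:] - right[:, :cols - d])
        cost = box_sum(diff, half)
        better = cost < best_cost[:, d:]
        best_cost[:, d:][better] = cost[better]
        disp[:, d:][better] = d
    return disp


# Same kind of synthetic pair as a sanity check: a pure 3-pixel shift.
rng = np.random.default_rng(0)
right_img = rng.random((20, 30))
left_img = np.roll(right_img, 3, axis=1)
disp_fast = match_fast(left_img, right_img, max_disp=6, half=2)
```

With this ordering the cost per disparity no longer depends on the window size, which is where most of the saved operations come from.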
Just one last twist to this. So one of the things that's gonna happen if you ever do this is you're gonna be happy because it's gonna work reasonably well, probably pretty quickly. And then you're gonna be unhappy because you're gonna see it's gonna work well in a few places and it's not gonna work in other places. Can someone in computer vision give me an example of a place where this algorithm is not gonna work or is unlikely to work? A situation? A very simple situation.
Student:[Inaudible]
Guest Instructor (Gregory Hager):Well, if the... but I'm assuming rectification.
Guest Instructor (Gregory Hager):Yeah. The boundaries are gonna be a problem, right? Because I'm gonna get occlusion relationships that will change the pixels. So that's good. Occlusions are bad. How about this? I look at a white sheet of paper. What am I going to be able to match between the two images? Almost nothing, right? So you put a stereo algorithm in this room, it's gonna do great on the chairs, except at the boundaries. But there's this nice white wall back there and there's nothing to match. So it's not gonna work everywhere. It's only going to work in a few places. How can you tell when it's working? It's one thing to have an algorithm that does something reasonable. It's another to know when it's actually working.
The answer turns out to be a simple check: I can match from left to right and I can match from right to left. Now, if the system's working right they should both give the same answer, right? I can match from here to here or here to here. It shouldn't make any difference. Well, when you have good structure in the image that will be the case, but if you don't have good structure in the image it turns out you usually are just getting random answers. And I said something like 100 disparities, right? So the odds that you pick the same number twice out of 100 are really pretty small. So it turns out that almost always the disparities differ, and you can detect that fact, and it's called the left-right check.
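A left-right check can be sketched in a few lines of Python. The sketch assumes two disparity maps have already been computed, one matching left to right and one matching right to left, with the convention that a left pixel at column c and disparity d corresponds to the right pixel at column c - d. The function name and the tolerance parameter `tol` are choices made for this example.

```python
import numpy as np


def left_right_check(disp_left, disp_right, tol=1):
    """Return a boolean mask of left-image pixels whose two matches agree."""
    rows, cols = disp_left.shape
    c = np.arange(cols)[None, :]
    r = np.arange(rows)[:, None]
    target = c - disp_left                      # where each left pixel lands in the right image
    in_bounds = (target >= 0) & (target < cols)
    back = disp_right[r, np.clip(target, 0, cols - 1)]
    return in_bounds & (np.abs(disp_left - back) <= tol)


# Toy example: a constant disparity of 3, with one corrupted left-image pixel.
disp_left = np.full((4, 10), 3)
disp_right = np.full((4, 10), 3)
disp_left[1, 8] = 7          # a "random answer", as on a textureless region
valid = left_right_check(disp_left, disp_right)
```

If the wrong answers really are close to random over about 100 disparity levels, two independent matches agree exactly only about one time in a hundred, which is the intuition behind why the check rejects most bad pixels.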
On a multicore processor it's great because you have one core doing left to right and one core doing right to left. At the end they just meet up and you check their answers. Here are some disparity maps. These are actually taken from an article by Corke and Banks some years ago just comparing different metrics. Here I just happen to have two. One is so-called SSD, sum of squared differences. The other is zero-mean normalized cross-correlation. A couple of just quick things to point out. So, like, on the top images you can see that the two metrics that they chose did about the same. You can also see that you're only getting about maybe 50 percent data density. So, again, they've done the left-right check. Half the data is bad.
So kind of another thing to keep in mind. Fifty percent good data. Not 100 percent, not 20 percent either. Fifty percent good data, maybe. Second row, why is there such a difference here? Well, there's a brightness difference. I don't know if you can see, but the right image is darker than the left. Well, if you're just matching pixels, that brightness difference shows up as just a difference it's trying to account for. In the right column they've done so-called zero-meaning. They've subtracted the mean of the neighborhood of the pixels. That gets rid of brightness differences, pulls them into better alignment in the brightness range, and so they get much better density. This is just an artificial scene. Probably the most interesting thing there is that it's an artificial scene. You've got perfect photometric data and it still doesn't do perfectly because of these big areas where there's no texture. It doesn't know what to do. So it's producing random answers. The left-right check throws it out.
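The effect of zero-meaning is easy to see on a single pair of windows. The Python sketch below compares plain SAD with a zero-mean SAD on a patch and a copy of the same patch that is uniformly 40 gray levels brighter. The patch contents and the offset of 40 are made-up values for illustration.

```python
import numpy as np


def sad(a, b):
    """Plain sum of absolute differences."""
    return np.abs(a - b).sum()


def zsad(a, b):
    """Zero-mean SAD: subtract each window's own mean before comparing."""
    return np.abs((a - a.mean()) - (b - b.mean())).sum()


rng = np.random.default_rng(1)
patch = rng.integers(0, 200, size=(5, 5)).astype(float)
brighter = patch + 40.0          # same structure, uniformly brighter

plain_score = sad(patch, brighter)       # 25 pixels * 40 = 1000
zero_mean_score = zsad(patch, brighter)  # essentially zero
```

The plain metric sees a large mismatch even though the two windows show the same structure, while the zero-mean version cancels the constant offset. It does not correct for gain or contrast changes; normalized measures such as zero-mean normalized cross-correlation address those as well.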
Student:So left-right check does not have any one guarantee?
Guest Instructor (Gregory Hager):There are no guarantees. That's correct.
Guest Instructor (Gregory Hager):It could make a mistake, but the odds that it makes a mistake, it turns out, are quite small. I should say it's actually a very reliable measure because usually it's gonna pick the wrong pixels. Occasionally, by chance, it'll say yes, but those will be isolated pixels typically. Usually if you're making a mistake you don't make a mistake on a pixel. You make a mistake on an area, and what you're trying to do is really kind of throw that area away. But a good point. It's not perfect. But actually in the world of vision it's one of the nicer things that you get for almost free. I'll just say that these days a lot of what you see out there is real-time stereo. So starting really about ten years ago people realized that if they're smart about how they implement the algorithms they can start to get close to real time.
Now you can buy systems that run pretty much in software and produce 30 frames a second of stereo data. Now, just to mention, I said that the data is not perfect. Well, the data's not perfect, and as a result, you could imagine what we'd like to do is take stereo and build a geometric model. All right. I'd like to do manipulation. I want to take this pan and I want to have a 3-D model and then I want to generate or manipulate in 3-D. Well, you saw those images. Stereo typically doesn't produce data that's good enough to give you high-precision, completely dense 3-D models.
It's an interesting research problem and, in fact, people are working on it, but it's a great modality if you just want a kind of rough 3-D description of the world. And you can run it in real time and you're now getting kind of real-time coarse 3-D. And so I think that's why right now real-time stereo is really getting a lot of interest, because it gives you this coarse, wide-field 3-D data, 30 frames a second. It lets you just do interesting things. I just thought I would throw in one example of what you can do. So this is actually something we did about ten years ago, I guess. You're a mobile robot and you want to detect obstacles and you're running on a floor. So what's an obstacle? Well, a positive obstacle is something that I'm gonna run into, or it could be a negative obstacle. I'm coming to the edge of the stairs and I don't want to fall down the stairs. So I'm gonna use stereo to do this, and so the main observation is that since I assumed I'm more or less operating on a floor, I've got a ground plane. And it turns out planar structures in the world are effectively planar structures in disparity space. So it's very easy to detect this big plane and to remove it, and once you've removed that big plane anything that's left has gotta be an obstacle, positive or negative. It doesn't matter what it is. So you run stereo, you remove the ground plane, and if there's something left, that's something you want to worry about avoiding.
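The ground-plane idea can be sketched as follows. A plane in the world becomes a plane in disparity space because substituting X = x Z / f, Y = y Z / f and Z = f B / d into a plane equation leaves the disparity d linear in the pixel coordinates x and y. The Python below fits d = a*x + b*y + c to a floor-only disparity image and then flags pixels that deviate from it. The lecture does not say how the plane was estimated in the original system; the least-squares fit on an obstacle-free frame, the function names, and the threshold are all assumptions for this example, and a robust fit would likely be needed on real data.

```python
import numpy as np


def fit_disparity_plane(disp, valid):
    """Least-squares fit of d = a*x + b*y + c over the valid pixels."""
    rows, cols = disp.shape
    y, x = np.mgrid[0:rows, 0:cols]
    A = np.column_stack([x[valid], y[valid], np.ones(int(valid.sum()))])
    coeffs, *_ = np.linalg.lstsq(A, disp[valid], rcond=None)
    return coeffs


def obstacle_mask(disp, valid, coeffs, thresh=1.0):
    """Pixels whose disparity differs from the ground plane by more than thresh.
    Positive residual: closer than the floor. Negative residual: a drop-off."""
    rows, cols = disp.shape
    y, x = np.mgrid[0:rows, 0:cols]
    ground = coeffs[0] * x + coeffs[1] * y + coeffs[2]
    return valid & (np.abs(disp - ground) > thresh)


# Synthetic floor: disparity grows toward the bottom of the image (closer to the robot).
rows, cols = 40, 60
yy, xx = np.mgrid[0:rows, 0:cols]
floor = 0.05 * yy + 2.0
all_valid = np.ones((rows, cols), dtype=bool)
plane = fit_disparity_plane(floor, all_valid)

# New frame: same floor plus a box that is 5 disparity levels closer.
frame = floor.copy()
frame[10:20, 20:30] += 5.0
obstacles = obstacle_mask(frame, all_valid, plane)
```

A flat object lying on the floor, such as a newspaper, has the same disparity as the floor under it and so leaves no residual, which is what distinguishes this from simple appearance-based background subtraction.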
So here's a little video showing this. So there's a real-time stereo system. So, first, we put down something that's an obstacle, it shows up. Here's something which, if you were just doing something like background subtraction, you might say that that newspaper is an obstacle. It's different than the floor, right? So you would drive around it because it could be something you don't want to run into. But here, since we've got this ground plane removal going, we can see that this is very clearly an obstacle. That is very clearly something that's just attached to the floor and disappears and you can go by it. And this is something real cheap, easy, simple to implement.
I think that really is the great value of stereo in robotics right now, today: most stereo algorithms can give you this coarse sense of what's around you and where you're going. And you can use it for downstream computations. In the world of stereo research there are a lot of other issues that people are trying to deal with. How do you increase the density, increase the precision, deal with photometric issues like shiny objects, and deal with these differences between the images, such as brightness and lack of texture? How do I deal with the fact that in some places there is a lack of texture? How do I infer some sort of depth there? And also geometric ambiguities, like occlusion boundaries. How do I deal with the fact that occasionally there will be parts of the image that the left camera sees and the right camera doesn't see?
So there's ongoing research there. Most of these methods, I'll just say, try to solve stereo not as this local region-matching problem I mentioned, but as a global optimization problem. And so there's a lot of work on different global optimization algorithms. And there is hope that ultimately stereo will get to the point that it really can do the sort of thing I mentioned. I'm gonna pick this up and I really want to get a 3-D model and use that for manipulation. Probably not there today. I should just say that there's one simple way to get a huge performance boost out of your stereo algorithm, something that people often do. Can anybody guess one way to just take a stereo algorithm and make it work a whole lot better?
Think about how I could change this sheet of paper to be something that I could actually do matching on.
Guest Instructor (Gregory Hager):You want to add texture. How could you add texture?
Guest Instructor (Gregory Hager):There you go. Just put a little light projector on the top of your stereo system and you'll be amazed at how well it works. Suddenly this thing, which you couldn't match before, becomes the world's best place to do stereo because you get to choose the texture and you get to match it. So that's the other thing that people have looked a lot into. You'll see in the literature structured-light stereo. Another way to get better performance out of stereo. Okay. So that's stereo. Again, the message here is that by using two cameras you can get, at this point, data density and accuracy that still exceed pretty much anything else you can imagine in the laser range finding world. It's not as reliable as laser range finding, and that's probably the thing that is still the main topic of research. Any quick questions on that before I shift gears? Okay.
So I'm a robot, I'm running around, I've got real-time stereo. I don't run into things anymore. If somebody walks in front of me I scurry away as quickly as I can, so that I don't hurt them. But I have no clue where I am in the world, or I have no clue what's in the world around me. So if you said go over to the printer and get my printout, you know, where is the printer? Where am I? Where is your office? Who are you? So how can we solve those problems? And, in fact, this, I think, is an area of computer vision that I would say in the last decade has undergone a true revolution. Ten years ago, if I had talked about object recognition in this lecture, we really had no clue. There were kind of some interesting things going on in the field, but we had no clue. And today there are people who would claim that at least certain classes of object recognition problems are solved problems.
We actually know very well how to build solutions and there are, actually, commercial solutions out there. So what is the problem with object recognition? Well, it's a chicken and egg problem. If I want to recognize an object there are many unknowns. So I look at an image. I don't know the identity of the object. I don't know where it is. And most importantly perhaps, I don't know what's being presented to me. So I don't happen to have a... oh, here we go. So if I say find my cell phone in the image, you don't know if you're gonna see the front of the cell phone, the back of the cell phone, the top of the cell phone. You don't know if you're gonna see half of the cell phone hidden behind something else. You don't know what the lighting on the cell phone is gonna be.
Huge unknowns in the appearance, let alone finding it in the image and segmenting it and then actually doing the identification. So there's a sense in which if I could segment the object, if I could say here's where the object is in the image, then solving recognition and pose would be fairly easy. Or if I told you the pose of the object, solving segmentation and recognition would be easy to do, or if I tell you what the object is, finding it and figuring out its pose is easy to do. Doing all of them at the same time is hard. So for a long time people tried to use geometry. So maybe the right thing to do is to have a 3-D model of my cell phone and to use my stereo vision to recognize it.
Now, we just said stereo's not real reliable, right? Probably not good enough to recognize objects. So another set of people said, well, how about if we recognize it from appearance? So let's just take pictures. The way I'm gonna recognize my cell phone is I've just got 30 pictures of my cell phone and you find it in the image. So you can do that and, in fact, you can see here, this is from somewhere about ten years ago. They're doing pretty well on a database of about 100 objects. But you notice some other things about the database. Black background. The objects actually only have, in this case, one degree of freedom. It's rotation in the plane. So, yes, they've got 100 objects, but the number of images they can see is fairly small.
No occlusion, no real change in lighting. So this is interesting. In some sense it generated a lot of excitement because it's probably the first system that could ever do 100 objects. But it did it in this very limited circumstance. And so the question was, well, how do we bridge that gap? How do we get from hundreds to thousands and do it in the real world? Really the answer, you can almost think of it as combining both the geometry and the appearance-based approach. So the observation is that views are a very strong constraint. Giving you all these different views and recognizing from views works for a hundred objects, but it's just hard to predict a complete view. It's hard to predict what part of the cell phone I'm gonna see. It's hard to get enough images of it to be representative. It still doesn't solve the problem of occlusion if I don't see the whole thing.
So views seem to be in the right direction, just not quite there. So Cordelia Schmid did a thesis in 1997 with Roger Mohr where she tried a slightly different approach. She said, well, what if instead of storing complete views we store, think of it as interesting aspects of an object? If you were gonna store a face what are the interesting aspects? Well, things like the eyes and nose and the mouth. The cheeks are probably not that interesting. There's not much information there. But the eyes and the nose and the mouth, they tell you a lot. Or my cell phone, you've got all sorts of little textures.
So what if we just stored, if you think of it this way, thumbnails of an object? So my cell phone is not a bunch of images. It's a few thousand thumbnails. And now suppose that I can make that feature detection process very repeatable. So if I show you my cell phone in lots of different ways you get the same features back every time. Now, suddenly things start to look interesting because the signature of an object is not the image, but it's these thumbnails. I don't have to have all of the thumbnails. If I get half the thumbnails maybe I'll be just fine. If the thumbnails don't care about rotation in the plane that's good. If they don't care about scale, even better.
So it really, it starts to become a doable approach and, in fact, this is really what has revolutionized this area. And, in particular, there is a set of features called SIFT, scale-invariant feature transform, which I know you've learned about in computer vision if you've had it, developed by David Lowe, which has really become pretty much the industry standard at this point and, in fact, you can download this from his website and build your own object recognition system if you want. Let me just talk a little bit about a few details of the approach. So I said features is where we want to go. Well, there are two things that we need to get good features. One is we need good detection repeatability. I need a way of saying there are features on this wall at this orientation, features on this wall at this orientation.
I should find the same features. So detection has to be invariant to image transformations. And I have to represent these features somehow. And probably just using a little thumbnail is not the best way to go, right? Because a thumbnail, if I take my cell phone and I rotate it a little bit out of plane, or I rotate it even in plane, the image changes a lot and I'd like to not have to represent every possible appearance of my cell phone under every possible orientation. So we need to represent them in an invariant way and we need lots of them. And when I say lots I don't mean ten. I mean a thousand. We want lots of them.
So SIFT features solve this problem and they do it in the following way. They do a set of filtering operations to find features where the detection is invariant to rotation and scale. It does a localization process to find a very precise location for each one. It assigns an orientation to the feature, so now if I redetect it I can always assign the same orientation to that feature and cancel out rotations in the image. It builds a keypoint descriptor out of this information, and a typical image yields about 2,000 stable features. And there's some evidence that suggests that to recognize an object like my cell phone you only need three to six of them. So from a few thousand features you only need a fraction of a percent and you're there.
So, again, just briefly the steps. A set of filtering operations. What they're trying to do is to find features that have both a local structure that's a maximum of an objective function and a size or a scale that's a maximum of an objective function. So they're really doing a 3-dimensional search for a maximum. Once they have that they say, ah ha, this area's a feature. Keypoint descriptors. What they do is they compute the so-called gradient structure of the image. So this allows them to assign a direction to features and so, again, by having an assigned orientation they can get rid of rotation in the image. So if you do that and you run it on an image you get confusing figures like this.
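To make the orientation-assignment step concrete, here is a minimal sketch, assuming a grayscale patch stored as a NumPy array. It is not Lowe's implementation: the function name `dominant_orientation` and the 36-bin histogram are illustrative choices, and real SIFT also applies Gaussian weighting and peak interpolation, which are omitted here.

```python
import numpy as np

def dominant_orientation(patch, num_bins=36):
    """Return the dominant gradient direction of a patch, in radians in [0, 2*pi).

    Hypothetical helper for illustration: builds a histogram of gradient
    directions weighted by gradient magnitude and returns the center of
    the strongest bin.
    """
    patch = np.asarray(patch, dtype=float)
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x).
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    hist, edges = np.histogram(
        angle, bins=num_bins, range=(0, 2 * np.pi), weights=magnitude
    )
    peak = int(np.argmax(hist))
    return 0.5 * (edges[peak] + edges[peak + 1])

# A patch that brightens from left to right has its gradient along +x,
# so the dominant orientation falls in the first histogram bin.
ramp_x = np.tile(np.arange(16.0), (16, 1))
theta_x = dominant_orientation(ramp_x)
theta_y = dominant_orientation(ramp_x.T)  # same patch rotated: gradient along +y
```

Once every feature carries such an angle, a descriptor can be computed relative to it, which is what lets a redetected feature match regardless of in-plane rotation.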
What they've done is they've drawn a little arrow in this image for every detected feature. So you can think of a long arrow as being a big feature, so a large-scale feature, and small arrows being fine detailed features. And the direction of the arrow is this orientation assignment that they give. And so you can see in this house picture, it's not even that high of a res picture. It's 233 by 189 and they've got 832 original keypoints filtered down to 536 when they did a little bit of throwing out what they thought were bad features. So lots of features and lots of information is being memorized, but discretely now. Not the whole image, just discretely.
If you take those features and you try to match the features it turns out also they're very discriminative. So if I look at the difference in match value between two features that do match and two features that don't match it's about a factor of two typically. So there's enough signal there that you can actually get matches pretty reliably. Now, I mentioned geometry. So right now an object would just be a suitcase of features. So if I'm gonna memorize my cell phone, and I just said I gave you some thumbnails. So most systems actually build that into a view. So I don't just say my cell phone is just a bag of features, but it's a bag of features with some spatial relationship among them. And so now if I match a feature up here it tells me something about what to expect about features down there.
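One common way to exploit that gap between good and bad matches is a nearest-neighbor ratio test: accept a match only when the best candidate is clearly closer than the second best. The sketch below assumes descriptors are plain NumPy vectors; the name `match_features` and the 0.8 threshold are assumptions for illustration, not something stated in the lecture.

```python
import numpy as np

def match_features(query, database, max_ratio=0.8):
    """Match each query descriptor to its nearest database descriptor.

    A match is kept only if the nearest distance is less than max_ratio
    times the second-nearest distance (an assumed threshold), so that
    ambiguous features are thrown away instead of voting wrongly.
    Returns a list of (query_index, database_index) pairs.
    """
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)
    # Pairwise Euclidean distances, shape (num_query, num_database).
    dists = np.linalg.norm(query[:, None, :] - database[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    rows = np.arange(len(query))
    best, second = order[:, 0], order[:, 1]
    keep = dists[rows, best] < max_ratio * dists[rows, second]
    return [(int(q), int(best[q])) for q in rows[keep]]

database = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
query = np.array([
    [0.0, 0.0, 0.9, 0.1],   # clearly closest to database entry 2
    [0.5, 0.5, 0.0, 0.0],   # equally close to entries 0 and 1: ambiguous
])
matches = match_features(query, database)
```

The ambiguous second query is rejected, which is the behavior you want when only a handful of reliable matches are needed.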
Or more generally if I see a bunch of feature matches I can now try to compute an object pose that's consistent with all of them. And so, in fact, that's how the feature matching works. It uses something called the Hough transform, which you can think of as a voting technique. Just generally voting is a good thing. I'll just say as an aside, this whole thing that I've been talking about, all we're trying to do is set up voting. And we're really trying to have an election where we're not going down to the August convention to make a decision of who's winning the primaries. This is an election that we want to win on the first try. So these features are very good features. They do very discriminative matching of objects. You add a little bit of geometry and suddenly from a few feature matches you're getting pose and identity of an object from a very small amount of information.
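The voting idea can be sketched in a few lines. This is a deliberately reduced version, assuming the pose is only a 2-D image translation; a full system would also vote over scale and orientation. The function name `vote_for_translation` and the bin size are hypothetical.

```python
import numpy as np
from collections import Counter

def vote_for_translation(model_pts, scene_pts, bin_size=10.0):
    """Hough-style voting for a 2-D translation.

    Each matched pair (model point, scene point) votes for the offset
    that would carry the model onto the scene. Offsets are quantized
    into bins; the fullest bin wins, and the translation is refined by
    averaging the offsets that fell into it.
    """
    model_pts = np.asarray(model_pts, dtype=float)
    scene_pts = np.asarray(scene_pts, dtype=float)
    offsets = scene_pts - model_pts
    bins = np.floor(offsets / bin_size).astype(int)
    votes = Counter(map(tuple, bins))
    winner, count = votes.most_common(1)[0]
    inliers = np.all(bins == np.array(winner), axis=1)
    return offsets[inliers].mean(axis=0), int(count)

model = np.array([[10, 10], [20, 40], [35, 15], [50, 60], [70, 30]], dtype=float)
scene = model + np.array([52.0, 33.0])   # object shifted in the scene
scene[4] = [-130.0, 37.0]                # one wrong match
translation, support = vote_for_translation(model, scene)
```

The single bad match lands in its own bin and is simply outvoted, so four consistent matches are enough to recover the offset.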
So here are a couple of results from David Lowe's original paper. So you can see there's just a couple of objects here. A little toy train and a froggy. And there is the scene. And if you're just given that center image I think even a person has to look a little bit before you find the froggy and the toy train. And on the right are the detected froggy and toy train. Including a froggy that's been almost completely hidden behind that dark black object. You just see his front flipper and his back flipper, but you don't see anything else. And the system's actually detected it. In this case, two instances, I think it, well, no, actually it's one. One box. So we've got one instance. So it's even got to realize that even though it's occluded it's one object out there.
So, I mean, these are, again, I think it's fair to say, fairly remarkable results considering where the field was at that time. Since then there's been a cottage industry of how can we make this better, faster, higher, stronger? So this happens to be work by Ponce and Rothganger where they try to extend it by using better geometric models and slightly richer features. So they have 51 objects and they can have any number of objects in a scene. And this is the sort of recognition performance that you're starting to see now. So we're not talking about getting it right 60 percent of the time or 70 percent of the time. They're getting 90-plus percent recognition rates on these objects.
Now, of course, I've been talking about object recognition. Just to point out that you can think of object recognition as there's an object in front of you and I want to know its identity and its pose, or you can think of the object as the world and I want to know my pose inside this big object called the world. So, for example, if I'm outside I might see a few interesting landmarks and recognize or remember those landmarks using features. And now when I drive around the world I'll go and I'll look for those same features again and use them to decide where I am. And so, in fact, this is, again, work out of UBC where they're literally using that same method to model the world. I'm in this room and I see a bunch of features. I go out in the hallway, I see a bunch of features. I go in the AI lab, I see a bunch of features.
Store all those features and now as I'm driving around in the world I look for things that I recognize. If I see one then I know where I am relative to where I was before. Moreover, remember I said that for stereo to get geometry we didn't have to actually calibrate our stereo system. We could have one camera and it could just walk around. We could compute this so-called epipolar geometry automatically and then we could do stereo. So, in fact, what they've done in this map is, from one camera as they're driving around, they're computing not just the identity, but the geometry of all these features around them. So they can build a real 3-D map, just as I could build with the laser range finder, but now just by matching features in images.
And, in fact, we edited a joint issue of IJCV and IJRR about, I guess, six, eight months ago, and probably half the papers in that special issue ended up being how can you use this technique to map the world in different variations and flavors? So, again, it's a technique which really in many ways is there. You can download it from the web practically and put it on your mobile robot and make it run. And, in fact, this is my favorite result. So this is the work of Ryan Eustice, who was a postdoc at Hopkins with Louis Whitcomb, who does a lot of underwater robotics. So this is the Titanic. This is actually a Ballard expedition where they flew over the Titanic with a camera and the goal was obviously to get a nice set of images of the Titanic.
But the problem is that underwater it's really difficult to do very precise localization and odometry. And so what Ryan did was he took these techniques and he built effectively a mapping system, a very high precision mapping system, that was able to take these images, use the images to localize the underwater robot and then put the images together into a mosaic, and so this is a mosaic of the Titanic as they flew over it. You can see the numbers up there. They actually ran for about 3.1 kilometers. They have, what is it? How many images? There it is. Over 3,000 images, 3,500 images matched successfully, computed the motion of the robot successfully, filtered all of this together and were able to produce this mosaic. So really an impressive, impressive system.
Guest Instructor (Gregory Hager):They actually have, so from this you, actually, do get the 3-D geometry. In this mosaic they've basically projected it down, but they are computing the 3-D geometry for this set of features that they're able to use. And, actually, the little red versus brown up there, the red, I believe, is the original odometry that they thought they had on the robot and the brown is actually the corrected odometry that they computed, or vice versa. I don't know which is which now. I can't remember if they did the two separate pieces or they did just one piece, but really impressive work. Very nice. Also, it mentions here doing a Kalman filter, that he built a special-purpose Kalman filter that operated on the space of reconstructed images.
So that is kind of the next piece of this puzzle or this, I guess I had one other thing in here. So a lot of people are interested in 3-D now. So this is Peter Allen also putting together images. Here he's just showing the range data, but you can imagine if you have range and appearance now you can actually do interesting things using both 3-D and appearance information. I know there's some work going on here at Stanford in that also. So that's kind of Chapter 2. So now a set of techniques that not only let me avoid running into things in the world, but a set of techniques that let me say, well, where am I? And where are some things that I'm interested in?
So you can now actually imagine phrasing the problem "I want to pick up the cell phone" and you could actually have a system that recognizes the cell phone and is able to say, hey, there's the thing that I want to pick up. So I'll just finish up with what I thought was the last piece of this puzzle. Namely, how do I pick it up? I'm not gonna tell you exactly how to pick it up. There's a lot of interesting and hard problems in figuring out how to put my fingers on this object to actually pick it up, but at least let's talk a little bit about the hand-eye coordination it takes for me to actually reach over and grab this thing or, even better, if I do that, I don't want to drop this. If I do that, how do I actually catch it again? Which luckily I did, otherwise it would be a very sad lecture.
So I'm gonna talk about this in two pieces. So one piece is gonna be visual tracking. So we're now really moving to the domain where I want to think about moving objects in the world and having precise information about how they're moving, how they're changing. So visual tracking is an area that attacks that problem and I know you saw a video maybe a week ago of a humanoid robot that was playing badminton or ping-pong or volleyball, volleyball I think it was. And so I think Professor Khatib had already explained that they're doing some simple visual tracking of this big colored thing coming at them and using that to do the feedback.
Well, those big colored things are nice. Unfortunately my cell phone is not Day-Glo orange so it's hard to just use color as the only thing that you can deal with, but tracking has been a problem of interest for a long time. Tracking people, tracking faces, tracking expressions, all sorts of different tracking. So what I think is interesting is first to say, well, what do you mean by tracking to begin with? It's kind of cool to write a paper that says tracking of X, but no one's ever defined what tracking is. So I have a very simple definition of visual tracking, which simply says I'm gonna start out with a target, you know, my face is gonna be the canonical target here.
So at time zero for some reason you've decided that's the thing you want to track. And the game in town is to know something about where it is at time T. And the something about where it is is something that you in principle get to pick. So there's gonna be a configuration space for this object. The simplest thing is your big orange ball. It's just round, so it's got no orientation. It just has a position in the image. So its configuration is just where the heck is the orange ball? But you can imagine my cell phone has an orientation, so presumably orientation might be part of the configuration, or if I start to rotate out of plane you get those out-of-plane rotations.
In fact, if it's a rigid object how many degrees of freedom must it have? They'd better know the answer to this.
Guest Instructor (Gregory Hager):Six, yes. Believe me there's no trick questions and he knows what he's doing, so if he told you it's six it really is six. There's no question there. Six. So, sure, if this is a rigid object in principle there must be six degrees of freedom that describe it. Though, of course, if it's my arm then it's got more degrees of freedom. I wonder how many more it has. Anyway, okay. So this is gonna be a configuration space for this object and ultimately that's what we care about, that configuration space. The problem is that the image we get depends on this configuration space in some way. And so here I'm gonna imagine for the moment that I can predict an image if I knew its configuration, if I knew the original image.
So you can imagine this is like the forward kinematics of your robot. I give you a kinematic structure and I give you some joint values and now you can say ah ha, here's a new kinematic configuration for my object. So the problem then is I'm gonna think of a tracking problem. So I know the initial configuration. I know the configuration at time T. I know the original image. Now, what I'd like to do is to compute the change in parameters or even better just the parameters themselves. I don't know what D stands for, so don't ask me what D stands for. I want to compute the new configuration at time T plus one from the image at time T plus one and everything else I've seen.
Or another way to think of this is look, I've said I believe I can predict the appearance of an object from zero to T. I can also think of it the other way around. I can take the image at time T and if I knew the configuration I could predict what it would have looked like when we started, and now I can try to find the configuration that best explains the starting image. So this really is effectively a stabilization problem. I'm gonna try to pick a set of parameters that always make what I'm predicting look as close to the original template as possible. So in this case I'm gonna take the face and unrotate it and try to make it look like the original face. So my stabilization point now is an image.
And so this gives rise to a very natural sort of notion of tracking where I actually use my little model that I described, my prediction model, to take the current image, apply the current parameters, and produce what's hopefully something that looks like the image of the object to start with. So if I started with my cell phone like this and later on it looks like that, I'm gonna take that image, I'm gonna resample it so hopefully it looks like that again. If it doesn't look like that there's gonna be some difference. I'm gonna take that difference, run it through something. Hopefully that something will tell me a change in parameters. I'll accumulate this change in parameters and suddenly I've updated my configuration to be right.
Now, what's interesting about this, perhaps, was, just skip over this for the moment. So for a planar object we can use a very simple configuration, which turns out to be a so-called affine model. So how do I solve that stabilization problem? Well, again, I said I'm gonna start out with this predictive model, which is kind of like your kinematics, and if I want to go from a kinematic description talking about positions in space to velocities in space what do I use? Jacobian, imagine that. Hey, we've got kinematics in the rigid body world. We've also got kinematics in image space; let's take a Jacobian.
If we take a Jacobian we're now relating what? Changes in configuration space to changes in appearance. Just like the Jacobian in robotics relates changes in configuration space to changes in Cartesian position. So there you go. So I'm gonna take a Jacobian; it's gonna be a very big Jacobian. So the number of pixels in the image might be 10,000. So I'm gonna have 10,000 rows and however many configuration parameters. So 10,000 by six. So it's a very big Jacobian, but it's a Jacobian nonetheless. And we know how to take those.
Well, now I've got a way of relating change in parameter to change in the image. Suppose I measure an error in the image, which is kind of locally like a change in the image. So an error in the alignment. Well, suppose I've effectively inverted that Jacobian. Now, again, I have to use a pseudoinverse because I've got this big tall matrix instead of a square matrix. Well, so I take this incremental error that I've seen in my alignment, go backwards through the Jacobian, and lo and behold it gives me a change in the configuration. And so I close the loop by literally doing an inverse Jacobian. In fact, it's the same thing that you use to control your robot to a position in Cartesian space through the Jacobian. Really no difference.
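The pseudoinverse update can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the "image" is a smooth analytic function called `scene` rather than real pixels, the configuration has only two parameters (x and y translation) instead of six, and the Jacobian is computed once from the template and reused at every step.

```python
import numpy as np

def scene(x, y):
    """Hypothetical smooth appearance function standing in for a real image."""
    return (np.exp(-((x - 3.0) ** 2 + (y + 1.0) ** 2) / 20.0)
            + 0.5 * np.sin(0.3 * x) * np.cos(0.2 * y))

# Template pixel grid (21 x 21 = 441 pixels) and the template itself.
ys, xs = np.mgrid[-10:11, -10:11].astype(float)
template = scene(xs, ys)

# Image Jacobian: one row per pixel, one column per configuration parameter.
# For pure translation the columns are just the x and y image gradients.
gy, gx = np.gradient(template)
J = np.column_stack([gx.ravel(), gy.ravel()])   # shape (441, 2): tall, not square
J_pinv = np.linalg.pinv(J)                      # pseudoinverse, shape (2, 441)

def observe(p, true_shift):
    """Current image resampled at the template pixels displaced by p.

    When p equals the true shift of the object, this reproduces the template,
    which is the stabilization condition described in the lecture.
    """
    return scene(xs + p[0] - true_shift[0], ys + p[1] - true_shift[1])

def track(true_shift, iterations=30):
    """Accumulate parameter changes until the stabilized image matches the template."""
    p = np.zeros(2)
    for _ in range(iterations):
        error = (template - observe(p, true_shift)).ravel()  # alignment error per pixel
        p = p + J_pinv @ error                               # back through the Jacobian
    return p

estimate = track(np.array([1.0, -0.6]))
```

The loop has the same shape as resolved-rate control of a manipulator: measure an error, map it through an inverse Jacobian, and accumulate the resulting change in configuration.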
So it really is a set point control problem. Again, I won't go into details. Right now this is a huge, big time-varying Jacobian. It turns out that you can show, and this is work that we did in, the name slips my mind. Assume you also did work showing that you can make this essentially a time-invariant system, which is just a way of implementing things very fast. What does a Jacobian look like? Well, the cool thing about images is you can look at Jacobians because they are images. So this is actually what the columns of my Jacobian look like. So this is the Jacobian, if you look at the image, for a change in X direction, motion in X. And it kind of makes sense. You see it's basically getting all of the changes in the image along the rows. Y is getting a change along the columns. Rotation is kind of getting this velocity field in this direction, so on and so forth.
So that's what a Jacobian looks like if you've never seen a Jacobian before. It turns out that what I showed you is for planar objects. You can do this for 3-D. So my nose sticks out a lot. If I were to just kind of view my face as a photograph and I go like this it doesn't quite work right. So I can deal with 3-D by just adding some terms to this Jacobian and, in fact, you'll notice, what can I say? I've got a big nose and so that's what comes out in the Jacobian of my face, my nose. Tells you which direction my face is pointing.
Again, we can deal with illumination and, this is actually probably a little more interesting, I can also deal with occlusion while I'm tracking because if I start to track my cell phone and I go like this, well, lo and behold there's some pixels that don't fit the model. So what I do is I add a little so-called re-weighting loop that just detects the fact that some things are now out of sync and ignores that part of the image. So you put it all together and you get something that looks not like that. So just so you know what you're seeing, remember I said this is a stabilization problem?
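A re-weighting loop of this kind can be sketched as iteratively reweighted least squares. This is a generic robust-estimation sketch under stated assumptions, not the specific loop used in the tracker: the Jacobian is random synthetic data, the hard inlier threshold `k` and the median-based scale estimate are assumed choices, and the "occlusion" is simulated by corrupting a block of rows.

```python
import numpy as np

def robust_update(J, error, iterations=10, k=2.5):
    """Solve J @ dp ~= error while ignoring rows (pixels) that do not fit.

    Each pass solves a weighted least-squares problem, then gives weight 0
    to pixels whose residual exceeds k times a robust scale estimate
    (1.4826 * median absolute residual) and weight 1 to the rest.
    """
    weights = np.ones(len(error))
    for _ in range(iterations):
        sw = np.sqrt(weights)
        dp, *_ = np.linalg.lstsq(J * sw[:, None], error * sw, rcond=None)
        residual = error - J @ dp
        scale = 1.4826 * np.median(np.abs(residual)) + 1e-12
        weights = (np.abs(residual) < k * scale).astype(float)
    return dp, weights

# Synthetic stand-in for a tracking step: 200 pixels, 6 configuration parameters.
rng = np.random.default_rng(0)
J = rng.normal(size=(200, 6))
dp_true = np.array([0.5, -0.3, 0.1, 0.0, 0.2, -0.1])
error = J @ dp_true + 0.01 * rng.normal(size=200)
error[:40] += 5.0   # simulate an occluder covering the first 40 pixels

naive = np.linalg.pinv(J) @ error          # plain pseudoinverse, pulled off by the occluder
robust, weights = robust_update(J, error)  # re-weighted solution
```

The occluded pixels end up with zero weight, so the update is computed only from the part of the image that still fits the model.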
So if I'm tracking something, right? I should be stabilizing the image. I should be subtracting all the changes out. So that little picture in the middle is gonna be my stabilized face. I'm gonna start by tracking my face and this is actually the big linear system I'm solving. My Jacobian, which actually includes motion and includes some illumination components too, which I didn't talk about. So I'm just showing you a Jacobian. You can kind of see a little frame flashing. This is a badly made video. It's back when I was young and uninitiated. And so now you can see I'm running the tracking. This is just using planar tracking. So as I tip my head back and forth and move around it's doing just fine.
Scale is just fine because I'm computing all the configuration parameters that have to do with distance. I'm not accounting for facial expressions, so I can still make goofy faces and they come through just fine. Now, I'm saying to an unseen partner turn on the lights, so I think some lights flash on and off. Yep, there we go. So it's just showing you can actually model those changes in illumination that we talked about in stereo too through some magic that only I know and I'm not telling anyone. At least no one in this room. In fact, it's not hard. It turns out that for an object like your face if you just take about a half a dozen pictures under different illuminations and use that as a linear basis for illumination, that'll work just fine for illumination.
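The linear-basis idea amounts to a small least-squares fit. The sketch below assumes the basis images and the new image are already aligned NumPy arrays of the same size; the function name `fit_illumination` and the random stand-in images are hypothetical, used only so the example runs on its own.

```python
import numpy as np

def fit_illumination(basis_images, image):
    """Explain an image as a linear combination of a few basis images.

    basis_images: sequence of equally sized arrays, e.g. about six pictures
    of the same object under different lighting.
    Returns the mixing coefficients and the reconstructed image.
    """
    B = np.stack([np.asarray(b, dtype=float).ravel() for b in basis_images], axis=1)
    coeffs, *_ = np.linalg.lstsq(B, np.asarray(image, dtype=float).ravel(), rcond=None)
    return coeffs, (B @ coeffs).reshape(np.shape(image))

# Stand-in data: six random "basis pictures" and a new image lit as a mix of them.
rng = np.random.default_rng(1)
basis = [rng.random((8, 8)) for _ in range(6)]
true_mix = np.array([0.7, 0.0, 0.2, 0.0, 0.1, 0.3])
new_image = sum(c * b for c, b in zip(true_mix, basis))

mix, relit = fit_illumination(basis, new_image)
```

In a tracker, the basis images can simply be appended as extra columns next to the motion Jacobian, so lighting and motion are solved for in the same linear system.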
Notice here that I'm turning side to side and it clearly doesn't know anything about 3-D. So you can actually make my nose grow like Pinocchio by just doing the right thing. So I'll just let this run a little longer. Okay. So, now what I did was I put in the 3-D model and so the interesting thing now is you see my nose is stock still, so I actually know enough about the 3-D geometry of the face in 3-D configurations that I'm canceling all the changes due to configuration out of the image. As a side effect I happen to know where I'm looking too. So if you look at the back side, it's telling you at any point in time what direction the face is looking.
And here I'm just kind of pushing it. Eventually as you start to get occlusions it starts to break down, obviously, because I haven't modeled the occlusions. And I wish I could fast-forward this, but it's before the date of fast forwarding. Here my face is falling apart. He wasn't supposed to be there.
Guest Instructor (Gregory Hager):It happens. It's a faculty member at Yale saying hey, what are you doing? And this is just showing what happens if you don't deal with occlusion in vision. You can see that I'm kind of knocking this thing out and it comes back and then eventually it goes kaboom. And now we're doing that occlusion detection so I'm saying hey, what things match and what things don't match? And there you go. Cup, sees a cup. Says that's not a face.
So you can take these ideas and you can then push them around in lots of different ways. This is actually using 3-D models. Here we're actually tracking groups of individuals and regrouping them dynamically as things go on. Here is probably the most extreme case. So this is actually tracking two da Vinci tools during a surgery where we learned the appearance of the tools actually as we started. So there are 18 degrees of freedom in this system. So it's actually tracking in an 18-degree-of-freedom configuration space during the surgery. Okay. Very last thing. I have ten minutes and I'm racing for home now.
So I can track stuff, cool. So what? It's fun, but the thing I said I wanted to do eventually was to finally manipulate something. I want to use all this visual information and I want to pick up the stupid cell phone and call my friends and say the vision lecture is finally over in robotics. We can go out and do something else. But the question is, how do I want to do that? So I've got cameras. They're producing all sorts of cool information. I've got a robot that I want to make drive around. Where do I drive it to or how do I drive it? So what should I put in that box? Any suggestions? You can assume I've got two cameras. So I've got stereo. I've got pretty much anything you've seen. What's anybody think about, I don't care which way you want to think of it. What could you put in that box? What information would you use and what would you put in the box?
It's gonna be on the mid-term. The simplest thing you could imagine, right? If I said I've got two cameras I can actually, with those two cameras, measure a point in space. I can actually calibrate those cameras to the robot and so I could just say, hey, go to this point in space. End of story. Good thing? Bad thing? Good or bad thing? Anybody think of why it could be good or bad? Yeah?
Student:It's bad because if you run into anything on the way then you can't really accommodate for that.
Guest Instructor (Gregory Hager):Right. So you're not monitoring. Monitoring in real time, right? So that would get rid of at least that problem. What if my robot's not a real stiff robot? What if my kinematics aren't great? Like, turns out the da Vinci kinematics aren't great. So I could reach out to a point in space, but maybe my arm goes here or there instead, right? So the cameras tell me to go somewhere, but it's not really closing the loop. So what if I do one better? What if I compute the position of my finger, track it, let's say, and I compute the position of the phone in 3-D space. Now I can actually close the loop. I can say I want to make this distance zero and we could write down a controller that would actually do that.
Pick your favorite controller. I know Oussama has some ideas of what they should be, but pick your favorite. And that will work. And, in fact, that will work pretty darn well it turns out. But suppose that my cameras are miscalibrated. And, in fact, suppose that I say, well, what I want you to do is to go along the line defined by the edge of the cell phone. I want you to be here for some reason. Turns out you can show that if you do it in position space or reconstructed space and your cameras aren't perfectly calibrated you can actually get errors. In fact, you can get arbitrary errors if you want to. It's not real likely, but it can happen. So there's one other possibility, which is, I'm looking at this thing and I'm looking at this thing, what if I close the loop in the image space?
What if I just write my controller on the image measurements themselves? Well, it turns out if you do that, and what this is called is encoding, so if you can encode the task you want to do, like touch this point of the cell phone to my finger, and do it in the image space, not in reconstructed space, well, you've defined an error that doesn't mention calibration, right? It just says make these two things coincident in the images. If you can close that loop stably, think of the Jacobian, again, for example, you can actually drive the system to a particular point and you've never said anything about calibration in your error function, which means that even if the cameras are miscalibrated you go there.
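A toy simulation shows why an image-space error is insensitive to calibration. All the numbers and names below are assumptions for illustration: the two cameras are modeled as one linear map `A_true` from a 3-D point to four image coordinates, the robot is assumed to move its fingertip exactly as commanded, and the miscalibrated model `A_guess` is the true map with a 25 percent scale error and a 10 degree rotation error.

```python
import numpy as np

# True (unknown to the controller) stacked two-camera model: 3-D point -> 4 image coordinates.
A_true = np.array([[1.0, 0.0, 0.2],
                   [0.0, 1.0, 0.1],
                   [0.8, 0.0, -0.5],
                   [0.0, 1.0, 0.3]])

# Miscalibrated model the controller believes in.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
A_guess = 1.25 * A_true @ R

def reconstruct_and_go(x_target):
    """Open loop: reconstruct the target with the wrong model and move there once."""
    y_target = A_true @ x_target                 # what the cameras really measure
    return np.linalg.pinv(A_guess) @ y_target    # where the robot thinks the target is

def image_space_servo(x_start, x_target, gain=0.5, steps=100):
    """Closed loop: drive the image-space error between fingertip and target to zero."""
    y_target = A_true @ x_target
    x = x_start.copy()
    for _ in range(steps):
        image_error = y_target - A_true @ x      # measured in the images, no calibration used
        x = x + gain * np.linalg.pinv(A_guess) @ image_error
    return x

target = np.array([0.5, -0.3, 1.0])
open_loop_miss = np.linalg.norm(reconstruct_and_go(target) - target)
servo_miss = np.linalg.norm(image_space_servo(np.zeros(3), target) - target)
```

The wrong model only changes how quickly and along what path the fingertip converges; the place where the error is zero is defined by the images themselves, so the servo loop still ends up on the target while the reconstruct-and-go strategy misses.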
In fact, there's pretty good evidence that's what you do. You don't sit there and try to figure out the kinematics of your arm and the position in space and then kind of close your eyes and say go there. You're watching, right? And you're actually using visual space and we know this because I can put funny glasses on your eyes and after a while you still get pretty good at getting your fingers together. So, again, I'm running out of time. I won't go into great detail, but the interesting question is really when can you do this encoding? When can I write things down in the image domain? And the answer, again, depends a little bit on what you mean by cameras, but suffice it to say you can do a set of interesting tasks just by doing things in the image space and closing the loop in the image space.
And the interesting fact is that a lot of, sort of, tasks that you might imagine, like putting a screwdriver on a screw or putting a disk in a disk drive, you can write it all in the image space and you don't need to calibrate the cameras or you don't need well calibrated cameras. I'll have to say, so this is, why did I ever get into this? Because I was sitting in this stupid lab at Yale and I started to do this tracking and just for the heck of it I built this robot controller to do vision. So you see I'm tracking and I'm controlling here. And like the usual cynical young faculty member, I never expected this thing to work the first time. I hadn't calibrated the cameras, I just guessed what the calibration was. I just threw the code together and turned this thing on and it worked. I mean, it worked within a half a millimeter.
It wasn't like it just worked. It was right. And then I started thinking about it and I realized, of course, it worked. I didn't need to calibrate the cameras, and so then we actually spent the next few years figuring out why it was that I could get this kind of accuracy out of a system where I literally put the cameras down on the table, looked at it, and said I think they're about a foot apart, and ran the system. And this is the moral equivalent of that. So it's out there doing some stuff and I'm doing the moral equivalent of pulling your eye out of your head and moving it over here and saying, okay, see if you can still do whatever you were doing.
And just to prove you can do useful things with it we had to actually do something with a floppy, so there you go. You can also see how long ago this was by the form factor of the Macintosh it's putting the floppy disk into. Anyway, all right. So I'm about out of time, but I hope what I've convinced you of is that at least a lot of these basic capabilities we've got. We've got stereo, real time, gross geometry, we can recognize objects, we can recognize places, we can build maps out of it, we can track things, and we even know how to close loops in a way that is robust, so we don't have to worry about having finely tuned vision systems to make it work.
So why aren't we running around with robots playing baseball with each other? Well, I've given you kind of the simple version of the world. Obviously if I give you complex geometry, objects you haven't seen before, it's not clear we really know how to pick up a random object, but we're hopefully getting close. A lot of the world is deformable. It's not rigid. Lots of configuration space. How do I talk about tracking it or manipulating it? And a lot of things are somewhere in between. Rigid objects on a tray, which, yes, I could turn like this, but it really doesn't accomplish the purpose in mind.
So understanding those physical relationships. In the real world there's a lot of complexity to the environment. It's not my cell phone sitting on an uncluttered desk. Well, I'd be happy if my desk were that uncluttered. And I'm telling you to go and find something on it and manipulate it. So complexity is still a huge issue, and it's not just complexity in terms of what's out there. It's complexity in what's going on. People walking back and forth and up and down. Things changing, things moving. So imagine trying to build a map when people are moving through the corridors all the time.
In fact, again, I know this is something of interest to you, human-computer interaction. I can track people, so now, in principle, I can reach out and touch people. What's the safe way to do it? When do I do it? How do I do it? What am I trying to accomplish by doing so? How do you actually take these techniques but add a layer on top of them, which is really social interaction? And I don't know if you've noticed, but I think there's a research aspect to these, but there's also a market aspect. At what point does it become interesting to do it? What's the first killer app for actually picking things up and moving them around?
It's cool to do it, but can you actually make money at it? And then the last thing, and I said this at the beginning. The real question is when you're gonna be able to build a system where you don't preprogram everything. It's one thing to program it to pick up my cell phone. It's another to program it to pick up stuff and then at some point have it learn about cell phones and say go figure out how to pick up this cell phone and do it safely, and by the way don't scratch the front because it's made out of glass.
So, again, there's a lot of work going on, but I think this is really the place where I have to stop and say I have no idea how we're gonna solve those problems. I know how to solve the problems I've talked about so far, but I think this is where things are really open-ended. And there are a lot of cross-cutting challenges of just building complex systems and putting them together so they work. So I'll just close then by saying the interesting thing is all of this that I've talked about is getting more and more real. This chart, I'll just tell you, is dating myself, but I built my first visual tracking system in my last year of grad school because I wanted to get out and I needed to get something done, and it ran on something called a MicroVAX II, and it ran at, I think, ten hertz on a machine that cost $20,000, so it cost me about $2,000 a cycle to get visual tracking to work.
So I still have that algorithm today, in fact. I just kept running it as I got new machines, so those numbers are literally dollars per hertz, dollars per cycle of vision that I could get out of the system. So it went from $2,000 to, when I finally got tired of doing it about seven years ago, 20 cents a hertz. So literally for pocket change I could have visual tracking up and running. So all of the vectors are pointed in the right way in terms of technology. In terms of knowledge, I think we've learned a lot in the last decade. I mean, it's cool to live now and see all of this stuff that's actually happening.
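The dollars-per-hertz figure quoted above is just machine cost divided by tracking rate. A small sketch of that arithmetic follows. The function name `dollars_per_hertz` is hypothetical, and only the $20,000, ten hertz, and 20 cents figures come from the lecture. The cost and rate of the later machines are not given, so they are not reconstructed here.

```python
def dollars_per_hertz(machine_cost_dollars, tracking_rate_hz):
    """Cost of one hertz of visual tracking on a given machine."""
    if tracking_rate_hz <= 0:
        raise ValueError("tracking rate must be positive")
    return machine_cost_dollars / tracking_rate_hz

# First system from the lecture: a $20,000 machine tracking at about 10 Hz.
first_system = dollars_per_hertz(20_000, 10)

# Last figure quoted in the lecture, already expressed in dollars per hertz.
last_system = 0.20

# How many times cheaper a hertz of tracking became (on the order of 10,000x).
improvement_factor = first_system / last_system

print(f"first system: ${first_system:,.0f} per hertz")
print(f"improvement:  about {improvement_factor:,.0f}x")
```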
I think the real challenge is putting it together, so that if you look at an interesting set of objects and an interesting set of tasks, like be my workshop assistant, which is something I proposed about seven years ago, you could actually build something that would literally go out and say, ah ha, I recognize that screwdriver, and he said he wanted the big screwdriver, so I'll pick that up and I'll put it on the screw or I'll hand it to him or whatever, and, oh, I've never seen this thing before, but I can always figure out enough to pick it up and hand that over and say, what is this? And when he says it's a plier, I'll know what it is. So I think the interesting message is that the pieces are there, but nobody has put them together yet. So maybe one of you will be one of the people to do so.
So I think I'm out of time and I think I've covered everything I said I would cover, so if there are any questions I'll take questions, including after class. Thank you very much.
Instructor (Oussama Khatib):Thank you so much.
[End of Audio]
Duration: 76 minutes