How Do Self-Driving Cars Use Computer Vision to See?

In today’s world, the demand for autonomous robots and vehicles is rising rapidly, and Simultaneous Localisation and Mapping (SLAM) is getting wider attention. Autonomous vehicles carry a bundle of sensors such as cameras, Lidar, and Radar.

These sensors analyze the environment around the vehicle before it makes any crucial decision about its next state of motion. A localization map is created from the Lidar and camera data. It can be a 2D or a 3D map. The purpose of the map is to identify the static objects around the autonomous vehicle, such as buildings and trees. Dynamic objects are removed by discarding all Lidar points that fall within the bounding boxes of detected dynamic objects.

Static objects that don’t interfere with the vehicle, such as the drivable surface or tree branches, are also removed. The remaining points are used to establish an occupancy grid, from which we can plan a collision-free path for the vehicle. One of the significant elements of SLAM is the 3D mapping of the environment, which helps autonomous robots understand their surroundings the way a human does. Depth cameras, or RGB-D cameras, prove valuable for this.
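The filtering and gridding steps above can be sketched in a few lines of NumPy. This is only an illustration: the function name `build_occupancy_grid`, the height thresholds, the grid extent, and the cell size are all assumptions made for the example, not values from any particular vehicle stack.

```python
import numpy as np

def build_occupancy_grid(points, cell_size=0.5, extent=20.0, ground_z=0.2, max_z=2.5):
    """Mark grid cells that contain at least one obstacle point.

    points: (N, 3) array of x, y, z in metres, with the vehicle at the origin.
    Points at or below ground_z (drivable surface) or at or above max_z
    (e.g. overhanging branches) are dropped before gridding.
    All thresholds here are hypothetical example values.
    """
    points = np.asarray(points, dtype=float)
    keep = (points[:, 2] > ground_z) & (points[:, 2] < max_z)
    keep &= (np.abs(points[:, 0]) < extent) & (np.abs(points[:, 1]) < extent)
    obstacles = points[keep]

    n_cells = int(round(2 * extent / cell_size))
    grid = np.zeros((n_cells, n_cells), dtype=bool)
    # Shift so the vehicle sits at the grid centre, then convert to cell indices.
    idx = np.floor((obstacles[:, :2] + extent) / cell_size).astype(int)
    idx = np.clip(idx, 0, n_cells - 1)
    grid[idx[:, 0], idx[:, 1]] = True
    return grid

points = np.array([
    [5.0, 0.0, 0.05],   # road surface: removed
    [5.0, 1.0, 1.00],   # wall of a building: kept
    [3.0, -2.0, 4.00],  # high branch: removed
])
grid = build_occupancy_grid(points)
print(grid.shape, int(grid.sum()))
```

A path planner would then search only through the cells that are still `False`.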

To navigate efficiently, autonomous vehicles require a frame of reference. They observe the surrounding environment using computer vision algorithms to outline a map of their surroundings and traverse the track. 3D reconstruction uses computer vision to observe the outside surroundings as a depth-based 3D point cloud.

Therefore, the basic principle lies at the junction of 3D reconstruction and autonomous navigation. The growing interest in 3D solutions calls for a complete solution that can perceive the surroundings and build a 3D projection of them.

Computer vision algorithms have long been used to automate robotics and to produce 3D designs. The simultaneous localization and mapping problem has been studied for a long time, and plenty of research is being carried out to find efficient methods for mapping.

Current research in this domain often employs expensive cameras to produce disparity and depth maps; these are more accurate but costly. Other methods use stereo-vision cameras to determine the depth of surrounding objects, which is then used to produce 3D point clouds.
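For a rectified stereo pair, depth follows from disparity through the standard relation Z = f · B / d, where f is the focal length in pixels, B is the baseline between the two cameras, and d is the disparity in pixels. Here is a minimal sketch of that conversion; the function name and the camera numbers are made up for illustration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) with Z = f * B / d.

    Pixels with zero or negative disparity have no valid match,
    so their depth is reported as infinity.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera: 700 px focal length, 10 cm baseline.
disparity = np.array([[70.0, 35.0],
                      [7.0, 0.0]])
depth = disparity_to_depth(disparity, focal_px=700.0, baseline_m=0.1)
print(depth)
```

Note how a large disparity means a nearby object, and a small disparity means a distant one.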

Types of Environment Representation Maps

  • Localization Map: It is created using a set of LIDAR points or camera image features as the car moves. This map, along with GPS, IMU, and odometry, is used by the localization module to estimate the precise position of the autonomous vehicle. As new LIDAR and camera data are received, they are compared with the localization map, and a measurement of the autonomous vehicle’s position is created by aligning the new data with the existing map.
  • Occupancy Grid Map: This map uses a continuous set of LIDAR points to build a map of the environment that indicates the location of all static objects. It is used to plan a safe, collision-free path for the autonomous vehicle.
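To make the "aligning the new data with the existing map" step more concrete, here is a small sketch that recovers a 2D vehicle pose from matched points. It assumes the correspondences between scan points and map points are already known, which is a big simplification: scan-matching methods such as ICP or NDT have to estimate those matches as well. The function name and the landmark coordinates are hypothetical.

```python
import numpy as np

def estimate_rigid_transform_2d(scan_pts, map_pts):
    """Least-squares rotation R and translation t with R @ scan + t ~= map.

    Uses the SVD-based (Kabsch) solution and assumes scan_pts[i]
    corresponds to map_pts[i].
    """
    scan_pts = np.asarray(scan_pts, dtype=float)
    map_pts = np.asarray(map_pts, dtype=float)
    scan_c = scan_pts.mean(axis=0)
    map_c = map_pts.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (scan_pts - scan_c).T @ (map_pts - map_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection sneaking in instead of a rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = map_c - R @ scan_c
    return R, t

# Made-up ground truth: vehicle at (2, 1) in the map, heading 30 degrees.
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([2.0, 1.0])

map_pts = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 5.0]])
# The same landmarks as seen from the vehicle frame.
scan_pts = (map_pts - t_true) @ R_true

R_est, t_est = estimate_rigid_transform_2d(scan_pts, map_pts)
heading_deg = np.degrees(np.arctan2(R_est[1, 0], R_est[0, 0]))
print(t_est, heading_deg)
```

The recovered translation and heading are the position measurement that the localization module would fuse with GPS, IMU, and odometry.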

It is important to note that the presence of dynamic objects in the point cloud hinders its accurate reconstruction. These dynamic objects prevent a faithful remodeling of the surroundings, so it is important to formulate a solution that tackles this problem.

The main idea is to identify these dynamic objects using deep learning. Once they are identified, the points enclosed by each bounding box can be discarded. In this way, the reconstructed model consists entirely of static objects.
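Discarding the points inside detected boxes is easy to express with NumPy masks. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, z_min, x_max, y_max, z_max)`; this box format and the function name are assumptions for the example, and a real detector may output oriented boxes that need an extra rotation step.

```python
import numpy as np

def remove_points_in_boxes(points, boxes):
    """Return only the points that lie outside every bounding box.

    points: (N, 3) array of x, y, z coordinates.
    boxes: iterable of (x_min, y_min, z_min, x_max, y_max, z_max),
           assumed to be axis-aligned for simplicity.
    """
    points = np.asarray(points, dtype=float)
    keep = np.ones(len(points), dtype=bool)
    for box in boxes:
        lo = np.asarray(box[:3], dtype=float)
        hi = np.asarray(box[3:], dtype=float)
        inside = np.all((points >= lo) & (points <= hi), axis=1)
        keep &= ~inside
    return points[keep]

points = np.array([
    [1.0, 1.0, 1.0],   # building corner
    [5.0, 5.0, 1.0],   # on a moving car
    [10.0, 0.0, 0.5],  # tree trunk
])
car_box = (4.0, 4.0, 0.0, 6.0, 6.0, 2.0)  # hypothetical detection
static_points = remove_points_in_boxes(points, [car_box])
print(static_points)
```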

The RGB-D camera can measure depth using an IR sensor. The output is image data (the RGB values) and depth data (the range of the object from the camera). The depth has to be accurate, since any mismatch can cause a fatal accident. For this reason, the cameras are calibrated so that they yield accurate measurements of the surroundings. Depth maps are usually used to validate the accuracy of the calculated depth values.
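Once a depth image is available, each pixel can be turned into a 3D point using the pinhole camera model: X = (u − cx) · Z / fx and Y = (v − cy) · Z / fy. The sketch below shows that back-projection; the intrinsics `fx`, `fy`, `cx`, `cy` come from calibration in practice, and the values used here are toy numbers chosen only to keep the arithmetic readable.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (metres) into an (N, 3) point cloud.

    Pixels with depth <= 0 are treated as missing readings and skipped.
    Points are returned in the camera frame, in row-major pixel order.
    """
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.column_stack([x, y, z])

depth = np.array([[2.0, 2.0],
                  [0.0, 4.0]])  # one pixel has no reading
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)
```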

The depth map is a grayscale output of the surroundings in which objects closer to the camera have brighter pixels and those farther away have darker pixels. The image data obtained from the camera is passed to the object detection module, which identifies the dynamic objects present in the frame.
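The "closer is brighter" convention for depth maps can be reproduced with a simple linear mapping. This is one possible visualization, not a standard: the function name, the choice of `max_range`, and the decision to paint missing readings black are assumptions for the example.

```python
import numpy as np

def depth_to_grayscale(depth, max_range):
    """Map depth in metres to 8-bit grayscale, near = bright, far = dark.

    Depths beyond max_range are clipped to black. Pixels with
    depth <= 0 (no reading) are also painted black.
    """
    depth = np.asarray(depth, dtype=float)
    clipped = np.clip(depth, 0.0, max_range)
    gray = np.round(255.0 * (1.0 - clipped / max_range)).astype(np.uint8)
    gray[depth <= 0] = 0
    return gray

depth = np.array([[2.0, 4.0],
                  [10.0, 0.0]])
gray = depth_to_grayscale(depth, max_range=10.0)
print(gray)
```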

So, How Do We Identify These Dynamic Objects, You May Ask?

Here, a deep learning neural network is trained to identify the dynamic objects. The trained model runs over each frame received from the camera. In the simplest approach, any frame that contains an identified dynamic object is skipped. But there is a problem with this solution: skipping the entire frame throws away the useful information in the rest of the frame.

To tackle this, only the pixels inside the bounding box are eliminated, while the surrounding pixels are retained. However, in applications such as self-driving vehicles and autonomous delivery drones, the solution is taken to another level. Recall that we get a 3D map of the surroundings using LIDAR sensors.

After that, a deep learning model (a 3D CNN) is used to eliminate objects in a 3D frame (x, y, z axes). These neural network models produce two kinds of output. The first is the prediction output, which is a probability or likelihood of the identified object. The second is the bounding box coordinates. Remember, all this is happening in real time, so it is extremely important to have infrastructure that can support this kind of processing.
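The two outputs described above, a probability and a set of box coordinates, are typically consumed together: low-confidence predictions are dropped, and the boxes of the remaining dynamic objects are handed to the point-removal step. The sketch below shows one way to represent and filter such detections. The `Detection` class, the label names, and the 0.5 threshold are all hypothetical and not tied to any specific detector.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # predicted class name
    score: float  # probability that the object is really there
    box: tuple    # (x_min, y_min, z_min, x_max, y_max, z_max)

# Hypothetical set of classes treated as dynamic.
DYNAMIC_LABELS = ("car", "pedestrian", "cyclist")

def confident_dynamic(detections, min_score=0.5):
    """Keep detections that are dynamic and confident enough to act on."""
    return [
        d for d in detections
        if d.score >= min_score and d.label in DYNAMIC_LABELS
    ]

detections = [
    Detection("car", 0.92, (4.0, 4.0, 0.0, 6.0, 6.0, 2.0)),
    Detection("pedestrian", 0.30, (1.0, 2.0, 0.0, 1.5, 2.5, 1.8)),  # too uncertain
    Detection("building", 0.99, (10.0, 0.0, 0.0, 20.0, 8.0, 12.0)),  # static
]
kept = confident_dynamic(detections)
boxes_to_remove = [d.box for d in kept]
print(boxes_to_remove)
```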

Apart from this, computer vision also plays an important role in identifying street signs. Several models run in conjunction to detect street signs of various types: speed limit, caution, speed breaker, etc. Again, a trained deep learning model is used to identify these vital signs so that the vehicle can act accordingly.

For Lane Line Detection, Computer Vision is Applied in a Similar Way

The task is to produce the coefficients of the equation of a lane line. A lane line can be represented by a first-, second-, or third-order polynomial. A first-order equation is simply a linear equation of the type y = mx + n (a straight line). Higher-order equations include higher powers of x and represent curves.
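As a quick illustration of what "producing the coefficients" means, the sketch below fits first- and second-order polynomials to a handful of lane points with `numpy.polyfit`. The points are synthetic and generated from known equations so the recovered coefficients can be checked by eye; in a real system they would come from detected lane pixels.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Straight lane: y = 0.5x + 2
y_line = 0.5 * x + 2.0
line_coeffs = np.polyfit(x, y_line, deg=1)    # [m, n]

# Curved lane: y = 0.1x^2 + 0.5x + 2
y_curve = 0.1 * x**2 + 0.5 * x + 2.0
curve_coeffs = np.polyfit(x, y_curve, deg=2)  # [a, b, c], highest power first

# Evaluate the fitted curve at a new position.
y_at_5 = np.polyval(curve_coeffs, 5.0)
print(line_coeffs, curve_coeffs, y_at_5)
```

`numpy.polyfit` returns coefficients from the highest power down, so the last entry is always the constant term.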

Datasets are not always consistent in how they label lane lines, and they do not always provide lane line coefficients. Furthermore, we may also want to identify the nature of the line (solid, dashed, etc.). There are numerous characteristics we may want to detect, and it is very difficult for a single neural network that only regresses coefficients to generalize across all of them. A common way to resolve this dilemma is to employ a segmentation approach.

In segmentation, the purpose is to assign a class to each pixel of an image. In this method, every lane corresponds to a class, and the neural network model aims to produce an image in which each lane has its own unique color.
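The last step of such a pipeline, going from per-pixel class scores to a colored lane image, can be sketched as follows. It assumes the network outputs a score array of shape (classes, height, width); the palette, the class numbering, and the tiny 2×2 "image" are invented for the example.

```python
import numpy as np

# Hypothetical palette: class 0 = background, 1 = first lane, 2 = second lane.
PALETTE = np.array([
    [0, 0, 0],      # background: black
    [255, 0, 0],    # lane 1: red
    [0, 0, 255],    # lane 2: blue
], dtype=np.uint8)

def colorize_lanes(class_scores):
    """Pick the best class per pixel and look up its color.

    class_scores: (C, H, W) array of per-class scores for each pixel.
    Returns the (H, W) class map and the (H, W, 3) color image.
    """
    class_map = np.argmax(class_scores, axis=0)
    return class_map, PALETTE[class_map]

# Toy network output for a 2x2 image.
scores = np.zeros((3, 2, 2))
scores[1, 0, 0] = 0.9  # top-left pixel belongs to lane 1
scores[0, 0, 1] = 0.7  # top-right is background
scores[0, 1, 0] = 0.6  # bottom-left is background
scores[2, 1, 1] = 0.8  # bottom-right belongs to lane 2

class_map, colors = colorize_lanes(scores)
print(class_map)
```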


Here we discussed the general applications of computer vision in the domain of autonomous vehicles. Hope you enjoyed this article. 
