Deep learning-based sensor fusion for situational awareness differs markedly from classical mathematical modeling. While the core tasks of perception, prediction, and planning remain the same, deep learning tackles situational awareness in an integrated manner: all tasks are considered jointly, and meaningful representations are learned directly from training data. Compared to classical mathematical modeling, deep learning can build a richer representation of the environment and achieve superior performance. However, its success depends on the quality of the data pipeline, as well as on the MLOps and deployment solutions that orchestrate the system and enable systematic model training and evaluation.
With deep learning, one way to fuse data is to build multiple hierarchical levels of models that are, roughly speaking, responsible for the perception, prediction, and planning tasks crucial to any autonomous operation. Typically, the lower levels of the stack handle tasks such as detecting objects of interest, segmenting the environment into drivable terrain and other classes, and producing depth estimates from individual cameras. In mathematical terms, neural networks compress the raw sensor information into so-called feature maps, which are used to produce interpretable, actionable outputs from the model, for instance, the locations of surrounding vehicles. These feature maps can be reused in subsequent models to fuse information across sensors, for example, to build a 360-degree bird's-eye-view map of the surroundings by stitching together observations from all sensors.
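To make the idea concrete, here is a minimal sketch of this hierarchy in Python with numpy. The function names and the pooling "backbone" are illustrative stand-ins of our own invention: a real system would use a learned convolutional network per camera and geometry-aware feature warping, not simple pooling and side-by-side stitching.

```python
import numpy as np

def extract_feature_map(image: np.ndarray, pool: int = 4) -> np.ndarray:
    """Toy stand-in for a learned backbone: compress an (H, W) image into a
    coarser feature map by average pooling. Real models learn this mapping."""
    h, w = image.shape
    return (image[: h - h % pool, : w - w % pool]
            .reshape(h // pool, pool, w // pool, pool)
            .mean(axis=(1, 3)))

def stitch_birds_eye_view(per_camera_features: list) -> np.ndarray:
    """Naively place each camera's feature slice side by side to cover 360
    degrees; real pipelines warp features using camera calibration."""
    return np.concatenate(per_camera_features, axis=1)

# Six surround-view cameras, each producing a 64x64 "image".
cameras = [np.random.rand(64, 64) for _ in range(6)]
bev = stitch_birds_eye_view([extract_feature_map(c) for c in cameras])
print(bev.shape)  # (16, 96): one coarse 360-degree feature map
```

The point of the sketch is the data flow: raw per-sensor input is compressed into a compact feature map, and those maps, not the raw pixels, are what later stages fuse.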
These outputs in turn form the inputs for subsequent models that aggregate information across time. Here, the task could be to predict how other moving agents will behave in order to support action planning. Concretely, the car would first perceive pedestrians, then predict that they are about to cross the road, and finally plan the most suitable action, e.g. slowing down or giving way.
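The perceive-predict-plan loop above can be sketched as follows. This is a deliberately simple baseline, constant-velocity extrapolation plus a rule-based planner, standing in for the learned motion-prediction and planning models a production stack would use; all names and thresholds are hypothetical.

```python
import numpy as np

def predict_positions(track: np.ndarray, horizon: int) -> np.ndarray:
    """Constant-velocity extrapolation of an agent's observed (x, y) track,
    a classic baseline for learned motion-prediction models."""
    velocity = track[-1] - track[-2]
    return np.array([track[-1] + velocity * (t + 1) for t in range(horizon)])

def plan_action(ego_lane_y: float, predicted: np.ndarray,
                margin: float = 1.0) -> str:
    """Rule-based planner: slow down if any predicted position comes
    within `margin` meters of the ego vehicle's lane."""
    if np.any(np.abs(predicted[:, 1] - ego_lane_y) < margin):
        return "slow_down"
    return "keep_speed"

# A pedestrian walking toward the road (ego lane at y = 0),
# observed at two timesteps as (x, y) positions in meters.
track = np.array([[5.0, 4.0], [5.0, 3.0]])
future = predict_positions(track, horizon=4)   # y steps: 2, 1, 0, -1
print(plan_action(ego_lane_y=0.0, predicted=future))  # slow_down
```

The temporal aggregation is what distinguishes this stage from single-frame perception: the prediction only exists because the pedestrian was tracked across frames.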
Sensor data can also be fused in other ways: feeding data from multiple sensors directly into a single neural network (early fusion), or building explicit connections between neural networks dedicated to different tasks in the modeling stack. As an example of the former, multiple camera views of a scene can be used for depth estimation, or a combination of lidar and images can improve 3D object detection or semantic segmentation. Mid-level fusion means feature sharing between models responsible for different but related tasks, for example, object detection modules that serve adjacent cameras with a partially overlapping field of view, i.e. the area the cameras cover in the real world. Finally, late fusion combines the outputs of sensor- or task-specific models as inputs to later stages of modeling; this is needed, for instance, when building the bird's-eye-view map mentioned above or when utilizing a multi-camera-derived point cloud for 3D object detection.
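The three fusion levels can be contrasted in a few lines of code. This is an illustrative sketch with made-up shapes and helper names: early fusion stacks raw modalities before any model runs, mid-level fusion shares features in an overlapping field of view, and late fusion merges per-sensor outputs.

```python
import numpy as np

def early_fusion(camera: np.ndarray, lidar_depth: np.ndarray) -> np.ndarray:
    """Early fusion: stack raw modalities channel-wise so one network
    sees both from the start."""
    return np.stack([camera, lidar_depth], axis=0)  # (2, H, W) input

def mid_fusion(feat_left: np.ndarray, feat_right: np.ndarray,
               overlap: int) -> np.ndarray:
    """Mid-level fusion: average features in the overlapping field of view
    of two adjacent cameras, keeping the non-overlapping parts as-is."""
    shared = (feat_left[:, -overlap:] + feat_right[:, :overlap]) / 2
    return np.concatenate(
        [feat_left[:, :-overlap], shared, feat_right[:, overlap:]], axis=1)

def late_fusion(detections_per_sensor: list) -> list:
    """Late fusion: merge per-sensor detection lists into one output set
    (a real system would deduplicate, e.g. with non-maximum suppression)."""
    return [d for dets in detections_per_sensor for d in dets]

cam, depth = np.zeros((8, 8)), np.ones((8, 8))
fused_input = early_fusion(cam, depth)                       # (2, 8, 8)
fused_feats = mid_fusion(np.ones((4, 6)), np.ones((4, 6)), overlap=2)  # (4, 10)
fused_dets = late_fusion([[("car", 0.9)], [("pedestrian", 0.8)]])
print(fused_input.shape, fused_feats.shape, fused_dets)
```

The trade-off the sketch hints at: the earlier the fusion, the richer the cross-sensor interaction but the more tightly coupled (and calibration-sensitive) the models become; late fusion keeps components modular at the cost of discarding shared low-level detail.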
Interested in discussing deep learning-based sensor fusion with our experts?
Get in touch with Pertti Hannelin, our VP of Business Development at firstname.lastname@example.org or via LinkedIn.