Lyft: Self-Driving Research in Review: ICCV 2019

Lyft Level 5

By the Level 5 Research Team: Peter Ondruska, Luca Del Pero, Qi Dong, Anastasia Dubrovina and Guido Zuidhof

ICCV, along with CVPR, is one of the main international computer vision conferences. It takes place every two years, and this year’s edition was held in Seoul, South Korea. Of the 1,076 papers presented, we summarized the ones most relevant to the self-driving space.

Fast Point R-CNN
Paper from Tencent & University of Hong Kong
This paper presents a new state of the art on KITTI for fast lidar-based 3D detection. It merges voxel and point-cloud representations in a two-stage approach to draw on the strengths of both: the voxel representation for initial proposals, and the point-cloud representation for refinement.
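The two-stage idea is straightforward to outline. Below is a minimal, illustrative sketch (not the paper's code): `voxelize`, `points_in_box`, `proposal_net`, and `refine_net` are hypothetical stand-ins for the real components.

```python
import numpy as np

def voxelize(points, voxel_size=0.2, grid=(400, 400)):
    """Scatter lidar points (N, 3) into a coarse bird's-eye-view occupancy grid."""
    bev = np.zeros(grid, dtype=np.float32)
    ij = (points[:, :2] / voxel_size).astype(int)
    valid = (ij[:, 0] >= 0) & (ij[:, 0] < grid[0]) & (ij[:, 1] >= 0) & (ij[:, 1] < grid[1])
    bev[ij[valid, 0], ij[valid, 1]] = 1.0
    return bev

def points_in_box(points, box):
    """Crude axis-aligned crop around a proposal box given as (cx, cy, cz, l, w, h)."""
    center, size = np.asarray(box[:3]), np.asarray(box[3:])
    return points[np.all(np.abs(points - center) <= size / 2, axis=1)]

def two_stage_detect(points, proposal_net, refine_net):
    """Stage 1: coarse proposals from the voxel grid; stage 2: refine on the raw points."""
    bev = voxelize(points)
    proposals = proposal_net(bev)            # list of candidate 3D boxes
    refined = []
    for box in proposals:
        inside = points_in_box(points, box)  # keep the fine geometry the grid discards
        refined.append(refine_net(inside, box))
    return refined
```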

Range Adaptation for 3D Object Detection in LiDAR
Paper from Volvo & Duke University
The performance of existing 3D detection networks degrades with distance as the lidar input becomes sparser. This paper proposes a way to combat this and adapt the network's behavior based on distance, leading to improved performance on distant objects.

M3D-RPN: Monocular 3D Region Proposal Network for Object Detection [code]
Paper from Michigan State University
This paper explores a new method for single-shot monocular 3D object detection that significantly improves the state of the art on KITTI.

Vehicle Detection With Automotive Radar Using Deep Learning on Range-Azimuth-Doppler Tensors
Paper from Qualcomm
This team explores an approach for detecting vehicles directly in radar output. Unlike other methods, the authors propose processing raw sensor measurements, leading to improved performance.
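To make the idea concrete, here is a rough sketch (our own illustration, not the paper's architecture) of a network that consumes a range-azimuth-Doppler tensor directly, treating the Doppler bins as input channels; the tensor shape and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

num_doppler_bins = 32        # assumed tensor layout: (Doppler, Range, Azimuth)
backbone = nn.Sequential(
    nn.Conv2d(num_doppler_bins, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),        # per-cell vehicle "objectness" score
)

rad_tensor = torch.randn(1, num_doppler_bins, 128, 64)   # dummy raw radar frame
detection_map = backbone(rad_tensor)                     # (1, 1, 128, 64) heat map over range-azimuth
```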

Joint Monocular 3D Vehicle Detection and Tracking [code]
Paper from UC Berkeley
Detection and tracking are usually modeled independently of each other in a modular AV stack. This work proposes an end-to-end neural network that learns to do both using only a single camera.

How Do Neural Networks See Depth in Single Images?
Paper from Technische Universiteit Delft
Depth-from-single-image methods have improved significantly in recent years. The authors investigate how these methods work, concluding that they overfit to the vertical position of objects in the image. This can lead to errors if the camera is mounted at a different position on the car.
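A quick way to see what such a probe might look like is sketched below (a hedged illustration in the spirit of the finding, not the paper's exact protocol; `depth_model` is an assumed callable mapping an image to a per-pixel depth map).

```python
import numpy as np

def vertical_shift_probe(image, depth_model, row, col, shift_px=50):
    """Compare predicted depth at a pixel before and after shifting the image down."""
    shifted = np.roll(image, shift_px, axis=0)            # crude vertical shift
    d_orig = depth_model(image)[row, col]
    d_shift = depth_model(shifted)[row + shift_px, col]
    # If the model keys on appearance/size, d_orig and d_shift should be close;
    # a large gap suggests it relies on the object's vertical position in the frame.
    return d_orig, d_shift
```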

Robust Multi-Modality Multi-Object Tracking [code]
Paper from Nanyang Technological University & SenseTime
Multi-modal perception (detecting and tracking objects in lidar and camera) is usually done through modular, multi-stage approaches, each with its own assumptions. The proposed method merges these individual steps into one end-to-end trainable system, leading to improved performance on KITTI.

Exploring the Limitations of Behavior Cloning for Autonomous Driving [code]
Paper from Toyota Research Institute
Trajectory planning for autonomous vehicles (AVs) is still mostly a hand-crafted effort. One way to apply machine learning to this problem is to imitate existing drivers. The authors investigate the limitations of this approach and present a new dataset for evaluating the performance of learned policies.
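For readers unfamiliar with behavior cloning, the core training step is just supervised regression onto logged expert actions. A minimal, illustrative PyTorch sketch follows (the toy policy, input size, and action format are our assumptions, not the paper's setup).

```python
import torch
import torch.nn as nn

policy = nn.Sequential(              # toy policy: flattened image -> (steer, throttle)
    nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def bc_step(images, expert_actions):
    """One supervised step: imitate the logged expert actions."""
    pred = policy(images)                       # (B, 2) predicted controls
    loss = loss_fn(pred, expert_actions)        # penalize deviation from the expert
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy data: a batch of 8 RGB 64x64 frames and expert controls.
loss = bc_step(torch.randn(8, 3, 64, 64), torch.randn(8, 2))
```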

Towards Learning Multi-Agent Negotiations via Self-Play
Paper from Apple
Reinforcement learning is potentially a powerful way for robots to learn how to act without human hand-coding, but can be difficult to apply in real-world robotics. In this work, the authors use simulation to evaluate a reinforcement learning-based system that uses self-play to learn how to merge.

PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction [code]
Paper from York University
This paper presents a new dataset for predicting the intention and future trajectory of pedestrians from camera data. The authors also provide baseline models for these tasks.

DAGMapper: Learning to Map by Discovering Lane Topology
Paper from Uber ATG
Creating HD semantic maps for AVs is mostly a manual effort. This paper presents an automatic way to create semantic maps for highway driving that achieves 89% correct topology.

GSLAM: A General SLAM Framework and Benchmark [code]
Paper from Northwestern Polytechnical University, China
SLAM systems have developed significantly over the past decade. This paper proposes a unified framework to compare different components of existing methods and help the development of new ones.

LPD-Net: 3D Point Cloud Learning for Large-Scale Place Recognition and Environment Analysis
Paper from The Chinese University of Hong Kong
Lidar-based localization in a pre-built map is a key component of an AV stack. In this paper, the authors propose a learned global descriptor that identifies which part of the map the vehicle is in.
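At inference time, such a global descriptor is typically used for nearest-neighbor retrieval against a pre-built map. Below is a minimal sketch of that retrieval step (illustrative only; `descriptor_net` stands in for a learned model like LPD-Net and is an assumption).

```python
import numpy as np

def build_map_index(submap_clouds, descriptor_net):
    """Precompute one global descriptor per map submap; returns an (M, D) array."""
    return np.stack([descriptor_net(pc) for pc in submap_clouds])

def localize(query_cloud, map_descriptors, descriptor_net):
    """Return the index (and distance) of the map submap closest to the query scan."""
    q = descriptor_net(query_cloud)                       # (D,) global descriptor
    dists = np.linalg.norm(map_descriptors - q, axis=1)
    return int(np.argmin(dists)), float(dists.min())
```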

SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences [website] [code]
Paper from University of Bonn, Germany
This paper presents a new dataset that provides semantic annotations of the point clouds in all KITTI sequences. This can be used to develop semantic classification for maps and perception systems.

DBUS: Human Driving Behavior Understanding System
Paper from University of Southern California & Didi Chuxing
Crowd-sourcing is one way to collect large quantities of human driving data to train AV systems. Making this data searchable and useful, however, is non-trivial. To solve this problem, the authors present an end-to-end system for processing, indexing, and querying large amounts of driving data in order to retrieve relevant scenarios.

Large Scale Multimodal Data Capture, Evaluation and Maintenance Framework for Autonomous Driving Datasets [code]
Paper from Intel
Self-driving is not only a challenge for artificial intelligence, but also a significant engineering and data-management challenge. This paper describes a case study on storing and processing large quantities of AV data and the challenges involved.

Meta-Sim: Learning to Generate Synthetic Datasets [code]
Paper from University of Toronto & NVIDIA
Synthesizing data for challenging conditions is one of the ways to train or test an AV stack. In this work, the authors propose a generative model that can synthesize scenes based on user text input.

Advanced Pedestrian Dataset Augmentation for Autonomous Driving
Paper from Czech Technical University in Prague
Synthesizing data for challenging conditions is one of the ways to train or test an AV stack. The authors propose the use of GANs to generate data for the pedestrian detection problem.
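Such synthetic data usually enters training simply by being mixed into the real set. A minimal sketch of that mixing step (our illustration; `generator` and the mixing ratio are assumptions, not the paper's method):

```python
import random

def augmented_training_set(real_samples, generator, synth_ratio=0.3):
    """Mix GAN-generated samples into the real training set.

    real_samples: list of (image, annotations) pairs.
    generator: assumed callable that returns one synthetic (image, annotations) pair.
    """
    n_synth = int(len(real_samples) * synth_ratio)
    synthetic = [generator() for _ in range(n_synth)]
    mixed = real_samples + synthetic
    random.shuffle(mixed)
    return mixed
```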

[→ Read more papers from ICCV]

If this post caught your interest, we’re hiring! Check out our open roles here, and be sure to follow our blog for more technical content.
