MagicDrive: Street View Generation with Diverse 3D Geometry Control

Abstract

Recent advancements in diffusion models have shown remarkable promise in data synthesis, enhancing a wide range of 2D perception tasks. However, achieving precise control in street view generation for 3D perception tasks remains a formidable challenge. Specifically, when Bird's-Eye View (BEV) is adopted as the sole condition for street view generation, height control becomes the major hurdle, even though it is indispensable for accurately representing object dimensions, occlusion patterns, and road surface elevations, particularly for 3D object detection. In this paper, we introduce MagicDrive, a novel street view generation framework that incorporates diverse 3D geometry controls, including camera poses, road maps, 3D bounding boxes, and textual descriptions, by employing customized encoding strategies. A cross-view attention module is further introduced to ensure consistency across the multiple camera views. The versatility of MagicDrive enables high-quality street-view data synthesis that accurately reflects diverse 3D geometry controls, benefiting 3D perception tasks such as BEV segmentation and 3D object detection.

Diverse Generation

Latent interpolation with the same geometric conditions

Given a fixed set of geometric conditions, MagicDrive can generate an unlimited number of diverse street views. Here we randomly sample 10 initial noise tensors and perform Spherical Linear Interpolation (Slerp) between them, yielding 100 noise tensors for generation (similar to Figure 6 in the DDIM paper).
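The interpolation step can be sketched as follows. This is a minimal illustration and not the released MagicDrive code: the latent shape `(4, 8, 8)`, the helper names `slerp` and `interpolate_loop`, and the choice to walk the 10 key noises as a closed loop with 10 steps per pair (which is one way to arrive at 100 noises) are all assumptions made for the example.

```python
import numpy as np


def slerp(z0, z1, t):
    """Spherical linear interpolation between two noise tensors of equal shape."""
    a, b = z0.ravel(), z1.ravel()
    # Angle between the two tensors, treated as flat vectors.
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel tensors: fall back to plain linear interpolation.
        return (1.0 - t) * z0 + t * z1
    w0 = np.sin((1.0 - t) * omega) / sin_omega
    w1 = np.sin(t * omega) / sin_omega
    return w0 * z0 + w1 * z1


def interpolate_loop(keys, steps):
    """Slerp between consecutive key noises, wrapping from the last to the first.

    Assumption: the key noises are visited as a closed loop, so
    len(keys) * steps noises are produced in total.
    """
    out = []
    n = len(keys)
    for i in range(n):
        z0, z1 = keys[i], keys[(i + 1) % n]
        for k in range(steps):
            out.append(slerp(z0, z1, k / steps))
    return np.stack(out)


rng = np.random.default_rng(0)
keys = rng.standard_normal((10, 4, 8, 8))  # 10 initial noises, hypothetical latent shape
noises = interpolate_loop(keys, steps=10)  # 100 noises, one per generated sample
```

Each of the 100 noises would then be passed to the sampler together with the same geometric conditions, so that only the initial noise varies across the generated street views.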

Image sequences for continuous scenes

Using MagicDrive, one can generate diverse street-view images even for similar scene annotations. We show generations conditioned on continuous annotation sequences.

scene-0012: see road changes with the ego car's direction

scene-0105: see object distance change

scene-0103: see various semantics

Diverse 3D Geometry Control

Object fine-grained control

MagicDrive controls the position of an object precisely while keeping the other objects unchanged. In this example, the vehicle (in the bounding box) moves from left to right.


Multi-level controls

MagicDrive supports controls at multiple levels: the road BEV map, object bounding boxes, the camera pose, and a textual description.

Downstream Support

Images generated by MagicDrive can be used for data augmentation, supporting both BEV segmentation and 3D object detection tasks.

Extension

Video Generation

By fine-tuning MagicDrive with Tune-a-Video on the nuScenes training set, MagicDrive can generate temporally consistent frames while preserving diversity. We generate 7 frames at a time, where only the first and last frames have bounding boxes for control.
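The control layout for one clip can be pictured as a per-frame mask. The sketch below only illustrates which frames carry bounding-box conditions. The helper name `box_control_mask` is hypothetical, and how the fine-tuned model handles the frames without boxes is not described here.

```python
import numpy as np

NUM_FRAMES = 7  # from the text: 7 frames are generated at a time


def box_control_mask(num_frames):
    """Return a boolean mask that is True only for the first and last frames.

    Hypothetical helper: True marks a frame whose bounding boxes are
    supplied as a condition, False marks a frame generated without them.
    """
    mask = np.zeros(num_frames, dtype=bool)
    mask[0] = True
    mask[-1] = True
    return mask


mask = box_control_mask(NUM_FRAMES)
```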

[UPPER] Generated video 1 from MagicDrive. [LOWER] Ground truth for video 1.

[UPPER] Generated video 2 from MagicDrive. [LOWER] Ground truth for video 2.

[UPPER] Generated video 3 from MagicDrive. [LOWER] Ground truth for video 3.

BibTeX

@inproceedings{gao2023magicdrive,
  title={MagicDrive: Street View Generation with Diverse 3D Geometry Control},
  author={Gao, Ruiyuan and Chen, Kai and Xie, Enze and Hong, Lanqing and Li, Zhenguo and Yeung, Dit-Yan and Xu, Qiang},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}