Debiasing Scores and Prompts of 2D Diffusion for View-consistent Text-to-3D Generation (NeurIPS 2023)

TL;DR

We propose Debiased Score Distillation Sampling (D-SDS), an efficient technique to address the Janus problem. It combines two components: Score Debiasing, which clips the scores of diffusion models with a linearly increasing threshold, and Prompt Debiasing, which removes words that conflict with view prefixes (e.g., 'smiling' with 'back view'). Applying these techniques to existing text-to-3D generation frameworks such as DreamFusion, SJC, and Magic3D eliminates artifacts such as multiple faces, horns, and ears from the generated 3D objects, resulting in more view-consistent objects.

SDS

Debiased-SDS (Ours)

"a small kitten", "a majestic giraffe with a long neck", "a cute and chubby panda munching on bamboo" (SJC)

"a colorful toucan with a large beak" (IF-DreamFusion, ThreeStudio Implementation)

"a playful and cuddly kitten with big eyes" (Magic3D, ThreeStudio Implementation)

"a flamingo standing on one leg in shallow water" (ProlificDreamer, ThreeStudio Implementation)

Abstract

Existing score-distilling text-to-3D generation techniques, despite their considerable promise, often encounter the view inconsistency problem. One of the most notable issues is the Janus problem, where the most canonical view of an object (e.g., face or head) appears in other views. In this work, we explore existing frameworks for score-distilling text-to-3D generation and identify the main cause of the view inconsistency problem: the embedded bias of 2D diffusion models. Based on these findings, we propose two approaches to debias the score-distillation frameworks for view-consistent text-to-3D generation. Our first approach, called score debiasing, involves cutting off the score estimated by 2D diffusion models and gradually increasing the truncation value throughout the optimization process. Our second approach, called prompt debiasing, identifies conflicting words between user prompts and view prompts using a language model, and adjusts the discrepancy between view prompts and the viewing direction of an object. Our experimental results show that our methods improve the realism of the generated 3D objects by significantly reducing artifacts and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency with little overhead.

Score Debiasing

Visualization of the magnitude of the estimated score during the optimization.

When the perturbed-and-denoised image produced by diffusion models differs greatly from the rendered image in certain pixels, we often observe a high magnitude in the 2D score. This can result in the generation of unwanted elements like extra legs, beaks, horns, or faces. To tackle this issue, we apply a clipping technique to the 2D scores, removing these unnecessary artifacts. We start with a low threshold and progressively increase it, ensuring that we preserve the fine details of the shapes while getting rid of unwanted elements.
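The clipping schedule described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the official implementation: the function names, the element-wise clamp, and the values of `c_min` and `c_max` are all assumptions chosen for the example. In a real pipeline the score would be a tensor and the clamp would typically be a single tensor operation.

```python
def linear_threshold(step, total_steps, c_min, c_max):
    """Linearly interpolate the clipping threshold from c_min to c_max.

    c_min and c_max are hypothetical hyperparameters for this sketch.
    """
    if total_steps <= 1:
        return c_max
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return c_min + (c_max - c_min) * frac


def clip_score(score, threshold):
    """Clamp every element of a flat score list to [-threshold, threshold].

    Element-wise clamping is an assumption of this sketch.
    """
    return [max(-threshold, min(threshold, s)) for s in score]


# Toy score: the large-magnitude entries stand in for artifact-prone pixels.
score = [0.2, -0.1, 5.0, -3.0]

# Early in optimization the threshold is low, so large scores are cut hard.
early = clip_score(score, linear_threshold(0, 100, c_min=0.5, c_max=2.0))

# Late in optimization the threshold is higher, so more detail passes through.
late = clip_score(score, linear_threshold(99, 100, c_min=0.5, c_max=2.0))
```

Small score values pass through unchanged at every step, while the large ones are limited most strongly at the start, when coarse shape is being decided.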

Prompt Debiasing

Samples from Stable Diffusion given a text prompt with contradiction.

Despite "Back view of" being given in the prompt, the word "smiling" biases diffusion models towards the front view of the object. We therefore remove conflicting words so that the prompt is consistent with the viewing direction of the object. Specifically, we use a large language model to calculate pointwise mutual information (PMI) and identify the words in the user prompt that conflict with the view prompt.
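As a rough illustration of the PMI criterion, the sketch below scores each word against a view prompt using PMI(word, view) = log(P(word | view) / P(word)) and drops words whose score falls below a threshold. Everything here is an assumption made for the example: the function names, the threshold, the rule that a low PMI marks a conflict, and above all the probabilities, which are made-up numbers standing in for values that a language model would supply.

```python
import math


def pmi(p_word_given_view, p_word):
    """Pointwise mutual information between a word and a view prompt."""
    return math.log(p_word_given_view / p_word)


def debias_prompt(words, view, p_word, p_word_given_view, threshold):
    """Keep only the words whose PMI with the view is at least `threshold`.

    Treating low-PMI words as conflicting is an assumption of this sketch.
    """
    kept = []
    for w in words:
        if pmi(p_word_given_view[(w, view)], p_word[w]) >= threshold:
            kept.append(w)
    return kept


# Made-up probabilities for illustration only (not model outputs).
p_word = {"smiling": 0.02, "dog": 0.05}
p_word_given_view = {
    ("smiling", "back view"): 0.002,  # far less likely given a back view
    ("dog", "back view"): 0.05,       # unaffected by the view
}

kept = debias_prompt(
    ["smiling", "dog"], "back view", p_word, p_word_given_view, threshold=-1.0
)
```

With these toy numbers, "smiling" has a PMI of log(0.1) and is removed for the back view, while "dog" has a PMI of 0 and is kept.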

Debiased-SDS Framework

Overall illustration of our framework.

Using the above two debiasing techniques, we propose a simple and efficient debiased score-distilling text-to-3D generation framework. First, we perform prompt debiasing to make the prompts consistent with the viewing direction of an object. Then, we perform score debiasing to remove the artifacts in the generated 3D objects. Note that our framework is easily applicable to any score-distilling text-to-3D generation framework, such as DreamFusion, SJC, or Magic3D. We provide an implementation of our techniques applied to DreamFusion and SJC, which can be found in our official repository for D-SDS. For further details, please refer to our paper.

More Results

SDS

Debiased-SDS (Ours)

"an elegant teacup with delicate floral designs" (ProlificDreamer, ThreeStudio Implementation)

"a colorful toucan with a large beak" (Magic3D, ThreeStudio Implementation)

"a baby bunny, sitting on top of a stack of pancakes" (IF-DreamFusion, ThreeStudio Implementation)

"a playful and cuddly kitten with big eyes" (IF-DreamFusion, ThreeStudio Implementation)

"a kangaroo wearing boxing gloves" (IF-DreamFusion, ThreeStudio Implementation)

BibTeX

@article{hong2023debiasing,
      title={Debiasing scores and prompts of 2d diffusion for robust text-to-3d generation},
      author={Hong, Susung and Ahn, Donghoon and Kim, Seungryong},
      journal={arXiv preprint arXiv:2303.15413},
      year={2023}
}