<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="soon">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>DiffPano: Scalable and Consistent Text to Panorama
Generation with Spherical Epipolar-Aware Diffusion</title>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-PYVRSFMDRL');
</script>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="stylesheet" href="./static/css/comparison.css">
<!-- <link rel="icon" href="./static/images/favicon.svg"> -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<!-- authors -->
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">DiffPano: Scalable and Consistent Text to Panorama
Generation with Spherical Epipolar-Aware Diffusion</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
Anonymized Authors
</div>
</div>
</div>
</div>
</div>
</section>
<!-- teaser.png -->
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<img id="teaser" height="100%" src="./static/images/teaser.png" alt="DiffPano teaser."/>
<p>
<b>DiffPano enables scalable and consistent panorama generation (e.g., room switching) from unseen text descriptions and camera poses.</b>
Each column represents the generated multi-view panoramas, switching from one room to another.
</p>
</div>
</div>
</section>
<!-- horizontal rule -->
<hr>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Diffusion-based methods have achieved remarkable results in 2D image and 3D object generation.
However, the generation of 3D scenes and even $360^{\circ}$ images remains constrained by the limited number of scene datasets,
the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images.
To address these issues, we first establish a large-scale panoramic video-text dataset
containing millions of consecutive panoramic keyframes with corresponding panoramic depths,
camera poses, and text descriptions. Then, we propose a novel text-driven panoramic generation framework, termed DiffPano,
to achieve scalable, consistent, and diverse panoramic scene generation. Specifically,
benefiting from the powerful generative capabilities of stable diffusion,
we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset.
We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images.
Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images with given unseen text descriptions and camera poses.
</p>
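The LoRA fine-tuning mentioned above can be illustrated with a minimal NumPy sketch. The dimensions, initializations, and the plain linear layer are illustrative assumptions, not the paper's actual training code: LoRA freezes a pretrained weight W and learns only a scaled low-rank update.

```python
import numpy as np

# Minimal LoRA sketch with illustrative dimensions (not the paper's actual
# training code): the frozen weight W is augmented with a low-rank update
# (alpha / rank) * B @ A, and only A and B would be trained.
d_in, d_out, rank, alpha = 64, 64, 4, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus scaled low-rank update.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer initially matches the frozen one.
assert np.allclose(lora_forward(x), W @ x)
```

Because B starts at zero, fine-tuning begins exactly at the pretrained model and only drifts as the adapter weights are updated, which is what makes LoRA practical on a small panoramic dataset.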
</div>
</div>
</div>
<!--/ Abstract. -->
<hr>
<!-- dataset. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Panoramic Video-Text Dataset Pipeline</h2>
<div class="content has-text-justified">
<img src="./static/images/dataset_pipeline.jpg"
class="framework-image"
alt="DiffPano dataset."/>
<p>
<b>Panoramic Video Construction and Caption Pipeline.</b>
We use the Habitat Simulator to randomly sample positions within scenes of the Habitat-Matterport 3D (HM3D) dataset
and render six-face cube maps, which are then interpolated and stitched together into panoramas with clear tops and bottoms.
To generate precise text descriptions for the panoramas, we first use BLIP2 to caption each of the rendered cube-map faces,
and then employ an LLM to summarize these captions into accurate and complete descriptions.
Furthermore, the Habitat Simulator allows us to render images along camera trajectories within the HM3D scenes,
enabling the creation of a dataset that simultaneously includes camera poses, panoramas, panoramic depths, and their corresponding text descriptions.
</p>
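The cube-map-to-equirectangular stitching step above can be sketched as follows. The face keys and in-face orientation conventions here are our own illustrative assumptions (Habitat's renderer defines its own axis conventions), and nearest-neighbor sampling stands in for the interpolation used in the real pipeline:

```python
import numpy as np

# Illustrative cube-map -> equirectangular stitching (nearest-neighbor).
# Face keys and orientations are assumptions, not Habitat's conventions.
def cubemap_to_equirect(faces, height, width):
    S = faces['+x'].shape[0]
    jj = (np.arange(width) + 0.5) / width
    ii = (np.arange(height) + 0.5) / height
    lon, lat = np.meshgrid(jj * 2 * np.pi - np.pi, np.pi / 2 - ii * np.pi)
    # Unit viewing direction for every output pixel.
    dirs = np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)
    out = np.zeros((height, width) + faces['+x'].shape[2:], dtype=faces['+x'].dtype)
    dominant = np.argmax(np.abs(dirs), axis=-1)  # axis each ray mostly points along
    for axis, name in enumerate('xyz'):
        for sign in (+1, -1):
            key = ('+' if sign > 0 else '-') + name
            mask = (dominant == axis) & (sign * dirs[..., axis] > 0)
            if not mask.any():
                continue
            d = dirs[mask]
            a, b = [k for k in range(3) if k != axis]  # the two in-face axes
            u = d[:, a] / np.abs(d[:, axis])           # both in [-1, 1]
            v = d[:, b] / np.abs(d[:, axis])
            col = np.clip(((u + 1) / 2 * S).astype(int), 0, S - 1)
            row = np.clip(((1 - v) / 2 * S).astype(int), 0, S - 1)
            out[mask] = faces[key][row, col]
    return out
```

Each output pixel is mapped to a viewing direction on the sphere, assigned to whichever cube face that direction points through, and sampled from it; because every direction hits exactly one face, the top and bottom of the panorama are filled in as cleanly as the sides.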
</div>
</div>
</div>
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Panoramic Video-Text Dataset</h2>
<div class="video-row-container">
<video controls src="./static/videos/dataset/00819-6D36GQHuP8H_0.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/dataset/00899-58NLZxWBSpk_0.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
</div>
</div>
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Comparison of Datasets and Text Descriptions</h2>
<img src="./static/images/dataset_comparisons.jpg"
class="dataset_comparisons"
alt="DiffPano dataset_comparisons."/>
<p>
<b>Comparisons between PanFusion and Ours.</b>
PanFusion uses BLIP2 to directly generate text descriptions for panoramas, which are often <b>only four or five words long and overly concise</b>.
The CLIP Score (CS) cannot reflect the accuracy of such text descriptions, and the PanFusion dataset suffers from <b>blurry tops and bottoms</b>.
In contrast, our panoramic video dataset construction pipeline first generates text descriptions for perspective images using BLIP2
and then uses an LLM to summarize them, yielding <b>more detailed text descriptions</b>.
At the same time, <b>the tops and bottoms of our panoramas are clear, and our dataset is larger (millions of panoramic keyframes).</b>
We also provide the camera pose and the corresponding panoramic depth map for each panorama.
</p>
</div>
</div>
<hr>
<!-- Method. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Method</h2>
<div class="content has-text-justified">
<img src="./static/images/framework.png"
class="framework-image"
alt="DiffPano overview."/>
<p>
<b>DiffPano Framework.</b>
The DiffPano framework consists of a single-view panoramic diffusion model and a multi-view diffusion model built on spherical epipolar-aware attention.
It supports both text-to-panorama and text-to-multi-view panorama generation.
</p>
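As a rough illustration of the spherical epipolar geometry that such attention relies on, the sketch below projects a viewing ray from one panorama into another's equirectangular (longitude, latitude) coordinates. The pose convention is our own assumption and this is not the authors' implementation; the idea is only that cross-panorama attention can be restricted to directions along this curve.

```python
import numpy as np

# Illustrative spherical epipolar geometry (assumed pose convention, not the
# paper's code): 3D points along a viewing ray in panorama A are projected
# into panorama B's equirectangular (longitude, latitude) coordinates.
def spherical_epipolar_curve(d_a, R, t, depths):
    d_a = d_a / np.linalg.norm(d_a)
    pts_a = depths[:, None] * d_a[None, :]  # points along the ray in A's frame
    pts_b = pts_a @ R.T + t                 # assumed convention: x_b = R x_a + t
    dirs_b = pts_b / np.linalg.norm(pts_b, axis=1, keepdims=True)
    lon = np.arctan2(dirs_b[:, 1], dirs_b[:, 0])
    lat = np.arcsin(np.clip(dirs_b[:, 2], -1.0, 1.0))
    return lon, lat
```

With identity rotation and zero translation the curve collapses to the pixel's own direction, while a nonzero baseline sweeps out an arc of a great circle on B's viewing sphere, which is the spherical analogue of a perspective epipolar line.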
</div>
</div>
</div>
<hr>
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Text to Single-View Panorama</h2>
<h2 class="title is-4">Comparisons</h2>
<div class="content has-text-justified">
<img src="./static/images/text2pano.jpg"
class="framework-image"
alt="DiffPano comparison."/>
<p>
<b>Text to Panorama Comparison: Ours vs. PanFusion vs. Text2Light.</b>
Compared with PanFusion, our method generates panoramas with clear tops and bottoms,
whereas PanFusion's results are blurry in those regions.
Compared with Text2Light, our method has better left-right consistency.
</p>
</div>
<h2 class="title is-4">Diversity</h2>
<div class="content has-text-justified">
<div class="video-row-container">
<video controls src="./static/videos/text2pano/Diversity/diversity0.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/text2pano/Diversity/diversity1.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
<div class="video-row-container">
<video controls src="./static/videos/text2pano/Diversity/diversity2.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/text2pano/Diversity/diversity7.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
<p>
<b>Diversity of Text to Panorama Generation with Our Method.</b>
Given the same text prompt like "A cozy living room with wooden floors and a couch",
our method can generate diverse and consistent panoramas.
</p>
</div>
<h2 class="title is-4">Generalizability</h2>
<div class="content has-text-justified">
<div class="video-row-container">
<video controls src="./static/videos/text2pano/Generalizability/generalizability1.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/text2pano/Generalizability/generalizability4.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
<div class="video-row-container">
<video controls src="./static/videos/text2pano/Generalizability/generalizability6.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/text2pano/Generalizability/generalizability8.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
<p>
<b>Generalizability of Text to Panorama Generation with Our Method.</b> Although our method is trained only on indoor scene datasets,
it can still generate outdoor panoramic scenes conditioned on text, which shows that our method has a certain degree of generalization.
In the future, we can explore outdoor scene reconstruction to increase the diversity of the training data and further improve the generalization of our method.
</p>
</div>
<h2 class="title is-4">Single-View vs Multi-View Panorama Generation</h2>
<div class="content has-text-justified">
<div class="video-row-container">
<video controls src="./static/videos/single-multi/single.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/single-multi/single_multi.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
<p>
<b>Comparison of Single-View and Multi-View Panorama Generation.</b>
A single-view panorama covers only a local area and misses much of the scene.
Multi-view panoramas are controllably generated from camera poses and can fill in the parts missing from a single view.
</p>
</div>
</div>
</div>
<hr>
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Text to Multi-View Panorama</h2>
<h2 class="title is-4">Comparisons</h2>
<div class="content has-text-justified">
<img src="./static/images/text2mvpano_comp.jpg"
class="framework-image"
alt="DiffPano mv comparison."/>
<p>
<b>Text to Multi-View Panorama Comparison: Ours vs. Modified MVDream.</b>
Experiments show that DiffPano converges more easily than MVDream and generates more consistent multi-view panoramas.
"MVDream×2" denotes MVDream trained for twice as many iterations as DiffPano.
</p>
</div>
<h2 class="title is-4">Qualitative Results</h2>
<div class="content has-text-justified">
<img src="./static/images/text2mvpano1.jpg"
class="framework-image"
alt="DiffPano mv comparison."/>
<img src="./static/images/text2mvpano2.jpg"
class="framework-image"
alt="DiffPano mv comparison."/>
<img src="./static/images/text2mvpano3.jpg"
class="framework-image"
alt="DiffPano mv comparison."/>
<p>
<b>Text to Multi-View Panorama of Our Method.</b>
DiffPano enables scalable and consistent panorama generation (e.g., room switching) from unseen text descriptions and camera poses.
Each column represents the generated multi-view panoramas, switching from one room to another.
</p>
</div>
</div>
</div>
</div>
</section>
<hr>
<!-- Comparisons -->
<section class="section">
<div class="container is-centered has-text-centered is-max-desktop">
<h2 class="title is-3" style="text-align: center;">Text to Panoramic Video Generation</h2>
<h2 class="title is-4">Scalability</h2>
<div class="video-row-container">
<video controls src="./static/videos/pano_video/video1.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/pano_video/video2.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
<div class="video-row-container">
<video controls src="./static/videos/pano_video/video3.mp4" type="video/mp4" autoplay muted loop playsinline></video>
<video controls src="./static/videos/pano_video/video4.mp4" type="video/mp4" autoplay muted loop playsinline></video>
</div>
<p>
<b>Text to Panoramic Video Generation with Our Method.</b>
Our method can generate longer panoramic videos via image-conditioned panorama generation, demonstrating scalability.
To generate a panoramic video, we first generate multi-view panoramas of different rooms with large pose changes conditioned on the text,
and then run image-to-multi-view panorama generation conditioned on the previously generated panoramas,
extending the sequence into a longer panoramic video.
</p>
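The autoregressive extension described above can be sketched as a sliding-window loop. The helper functions below are hypothetical stubs standing in for DiffPano's text- and image-conditioned multi-view models, NOT the authors' actual API; only the windowing logic is the point:

```python
# Hypothetical sketch of the autoregressive extension described above; the
# helpers stand in for text- and image-conditioned multi-view models and
# are NOT the authors' actual API.
def text_to_multiview(prompt, poses):
    return [f'{prompt}@{p}' for p in poses]            # stub: one frame per pose

def image_to_multiview(cond_frames, poses):
    return [f'{cond_frames[-1]}->{p}' for p in poses]  # stub: conditioned frames

def generate_long_pano_video(prompt, poses, window=8, overlap=2):
    frames = text_to_multiview(prompt, poses[:window])  # bootstrap from text
    while len(frames) < len(poses):
        start = len(frames) - overlap
        # Condition the next window on the last few generated panoramas.
        nxt = image_to_multiview(frames[-overlap:], poses[start:start + window])
        frames.extend(nxt[overlap:])                    # keep only the new frames
    return frames
```

Each iteration re-generates a short overlap region for conditioning and appends only the new frames, so the sequence can be extended to arbitrarily many poses without growing the model's context.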
</div>
</section>
</body>
</html>