<!DOCTYPE HTML>
<!--
Based on
Spatial by TEMPLATED
templated.co @templatedco
Released for free under the Creative Commons Attribution 3.0 license (templated.co/license)
-->
<html>
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-117339330-4"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-117339330-4');
</script>
<title>
Taming Transformers for High-Resolution Image Synthesis
</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="stylesheet" href="assets/css/main.css" />
</head>
<body class="landing">
<!-- Banner -->
<section id="banner" style="background-attachment:scroll;">
<h2>
Taming Transformers for High-Resolution Image Synthesis
</h2>
<p>
<a href="https://github.com/pesser">Patrick Esser</a>&ast;,
<a href="https://github.com/rromb">Robin Rombach</a>&ast;,
<a href="https://hci.iwr.uni-heidelberg.de/Staff/bommer">Bj&ouml;rn Ommer</a><br/>
<a href="https://www.iwr.uni-heidelberg.de/">IWR, Heidelberg University</a>
</p>
</section>
<!-- One -->
<section id="one" class="wrapper style1">
<div class="container 75%">
<div class="image fit captioned align-left"
style="margin-bottom:2em; box-shadow:0 0;
text-align:justify">
<img src="paper/teaser.png" alt="" style="border:0px solid black"/>
<strong>TL;DR:</strong>
We combine the efficiency of convolutional approaches with
the expressivity of transformers by introducing a
convolutional <em>VQGAN</em>, which learns a codebook of
context-rich visual parts, whose composition is modeled
with an autoregressive transformer.
</div>
<div class="row 200%">
<div class="6u 12u$(medium) vert-center" style="margin:1% 0">
<div class="container 25%">
<div class="image fit captioned align-center"
style="margin-bottom:0em; box-shadow:0 0">
<a href="paper/paper.pdf">
<img src="paper/paper.jpg" alt="" style="border:1px solid black"/>
</a>
<a href="https://arxiv.org/abs/2012.09841">arXiv</a>
<div class="headerDivider"></div>
<a href="paper/paper.bib">BibTeX</a>
<div class="headerDivider"></div>
<a href="https://github.com/CompVis/taming-transformers">GitHub</a>
<br/>
&ast; equal contribution
</div>
</div>
</div>
<div class="6u$ 12u$(medium)">
<h1>Abstract</h1>
<p style="text-align: justify">
Designed to learn long-range interactions on sequential data, transformers
continue to show state-of-the-art results on a wide variety of tasks. In
contrast to CNNs, they contain no inductive bias that prioritizes local
interactions. This makes them expressive, but also computationally infeasible
for long sequences, such as high-resolution images. We demonstrate how
combining the effectiveness of the inductive bias of CNNs with the
expressivity of transformers enables
them to model and thereby synthesize high-resolution images.
We show how to (i) use CNNs to learn a context-rich vocabulary of
image constituents, and in turn (ii) utilize transformers to efficiently
model their composition within high-resolution images.
Our approach is readily applied to conditional synthesis tasks, where both
non-spatial information, such as object classes, and spatial information,
such as segmentations, can
control the generated image.
In particular, we present the first results on semantically-guided synthesis
of megapixel images with transformers.
</p>
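<!--
A minimal, hypothetical sketch of the two-stage pipeline summarized above. The names
`vqgan`, `transformer`, `quantize_to_indices` and `decode_indices` are placeholders
for illustration only, not the API of the released code:

import torch

def image_to_tokens(vqgan, image):                 # image: (B, 3, H, W); `vqgan` is a trained model (hypothetical interface)
    features = vqgan.encoder(image)                # CNN features, e.g. (B, C, H/16, W/16)
    indices = vqgan.quantize_to_indices(features)  # nearest codebook entries per spatial position
    return indices.flatten(1)                      # (B, L): a short sequence of discrete visual tokens

@torch.no_grad()
def sample_image(vqgan, transformer, cond_tokens, seq_len):
    tokens = cond_tokens                           # e.g. class label or segmentation tokens for conditional synthesis
    for _ in range(seq_len):                       # autoregressive sampling, one codebook token at a time
        logits = transformer(tokens)[:, -1]        # distribution over codebook entries for the next position
        next_token = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return vqgan.decode_indices(tokens[:, -seq_len:])  # decode the sampled tokens back to pixel space
-->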
</div>
</div>
<!--
<p style="text-align:center">Related work <br/><a
href="https://compvis.github.io/iin/">"A Disentangling
Invertible Interpretation Network for Explaining Latent
Representations"</a></p>
</div>
-->
</section>
<!-- Two -->
<section id="two" class="wrapper style2 special">
<div class="container">
<header class="major">
<h2>Results</h2>
<p>and applications of our model.</p>
</header>
__TEMPLATE_STRING__
</div>
</section>
<!-- Related works -->
<section id="three" class="wrapper style1">
<div class="container 75%">
<div class="row 200%">
<div class="12u">
<h4>Related Work on Modular Compositions of Deep Learning Models</h4>
</div>
<div class="12u">
<h6>
<a href="https://compvis.github.io/net2net/">
Network-to-Network Translation with Conditional Invertible Neural Networks
</a>
</h6>
</div>
<div class="3u 12u$(medium)">
<div class="image fit align-center">
<a href="https://compvis.github.io/net2net/">
<img src="https://compvis.github.io/net2net/paper/teaser.png" style="max-width:25em; margin:auto" />
</a>
</div>
</div>
<div class="9u 12u$(medium)">
<p align="justify" style="line-height: 1.0em; font-size:0.8em">
Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate between different existing representations and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models to provide text-to-image generation, which neither of both experts can perform on their own.
</p>
</div>
<div class="12u">
<h6>
<a href="https://compvis.github.io/invariances/">
Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs
</a>
</h6>
</div>
<div class="3u 12u$(medium)">
<div class="image fit align-center">
<a href="https://compvis.github.io/invariances/">
<img src="https://compvis.github.io/invariances/images/overview.jpg" style="max-width:25em; margin:auto" />
</a>
</div>
</div>
<div class="9u 12u$(medium)">
<p align="justify" style="line-height: 1.0em; font-size:0.8em">
To tackle increasingly complex tasks, it has become an essential ability of neural networks to learn abstract representations. These task-specific representations and, particularly, the invariances they capture turn neural networks into black box models that lack interpretability. To open such a black box, it is, therefore, crucial to uncover the different semantic concepts a model has learned as well as those that it has learned to be invariant to. We present an approach based on INNs that (i) recovers the task-specific, learned invariances by disentangling the remaining factor of variation in the data and that (ii) invertibly transforms these recovered invariances combined with the model representation into an equally expressive one with accessible semantic concepts. As a consequence, neural network representations become understandable by providing the means to (i) expose their semantic meaning, (ii) semantically modify a representation, and (iii) visualize individual learned semantic concepts and invariances. Our invertible approach significantly extends the abilities to understand black box models by enabling post-hoc interpretations of state-of-the-art networks without compromising their performance.
</p>
</div>
<!-- /related works -->
</div>
</section>
<!-- Four -->
<section id="four" class="wrapper style3 special"
style="background-attachment:scroll;background-position:center bottom;">
<div class="container">
<header class="major">
<h2>Acknowledgement</h2>
<p>
This page is based on a design by <a href="http://templated.co">TEMPLATED</a>.
</p>
</header>
</div>
</section>
<!-- Scripts -->
<script src="assets/js/jquery.min.js"></script>
<script src="assets/js/skel.min.js"></script>
<script src="assets/js/util.js"></script>
<script src="assets/js/main.js"></script>
</body>
</html>