							<!DOCTYPE HTML>
<!--
  Based on
	Spatial by TEMPLATED
	templated.co @templatedco
	Released for free under the Creative Commons Attribution 3.0 license (templated.co/license)
-->
<html>
	<head>
    <!-- Global site tag (gtag.js) - Google Analytics -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=UA-117339330-4"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());

      gtag('config', 'UA-117339330-4');
    </script>

    <title>
      Taming Transformers for High-Resolution Image Synthesis
    </title>
		<meta charset="utf-8" />
		<meta name="viewport" content="width=device-width, initial-scale=1" />
		<link rel="stylesheet" href="assets/css/main.css" />
	</head>
	<body class="landing">

		<!-- Banner -->
			<section id="banner" style="background-attachment:scroll;">
        <h2>
          Taming Transformers for High-Resolution Image Synthesis
        </h2>
        <p>
        <a href="https://github.com/pesser">Patrick Esser</a>&ast;, 
        <a href="https://github.com/rromb">Robin Rombach</a>&ast;,
        <a href="https://hci.iwr.uni-heidelberg.de/Staff/bommer">Bj&ouml;rn Ommer</a><br/>
        <a href="https://www.iwr.uni-heidelberg.de/">IWR, Heidelberg University</a>
        </p>
			</section>

			<!-- One -->
				<section id="one" class="wrapper style1">
					<div class="container 75%">
                    <div class="image fit captioned align-left"
                                style="margin-bottom:2em; box-shadow:0 0;
                                text-align:justify">
                      <img src="paper/teaser.png" alt="" style="border:0px solid black"/>
                      <strong>TL;DR:</strong>
                      We combine the efficiency of convolutional approaches with
                      the expressivity of transformers by introducing a
                      convolutional <em>VQGAN</em>, which learns a codebook of
                      context-rich visual parts, whose composition is modeled
                      with an autoregressive transformer.
                    </div>
						<div class="row 200%">
							<div class="6u 12u$(medium) vert-center" style="margin:1% 0">
                  <div class="container 25%">


                    <div class="image fit captioned align-center"
                                style="margin-bottom:0em; box-shadow:0 0">
                      <a href="paper/paper.pdf">
                        <img src="paper/paper.jpg" alt="" style="border:1px solid black"/>
                      </a>
                      <a href="https://arxiv.org/abs/2012.09841">arXiv</a>
                      <div class="headerDivider"></div>
                      <a href="paper/paper.bib">BibTeX</a>
                      <div class="headerDivider"></div>
                      <a href="https://github.com/CompVis/taming-transformers">GitHub</a>
                      <br/>
                      &ast; equal contribution
                    </div>

                  </div>
							</div>
							<div class="6u$ 12u$(medium)">
                <h1>Abstract</h1>
                <p style="text-align: justify">
  Designed to learn long-range interactions on sequential data, transformers
  continue to show state-of-the-art results on a wide variety of tasks.  In
  contrast to CNNs, they contain no inductive bias that prioritizes local
  interactions. This makes them expressive, but also computationally infeasible
  for long sequences, such as high-resolution images.  We demonstrate how
  combining the effectiveness of the inductive bias of CNNs with the
  expressivity of transformers enables
  them to model and thereby synthesize high-resolution images.
  We show how to (i) use CNNs to learn a context-rich vocabulary of
  image constituents, and in turn (ii) utilize transformers to efficiently
  model their composition within high-resolution images.
  Our approach is readily applied to conditional synthesis tasks, where both
  non-spatial information, such as object classes, and spatial information,
  such as segmentations, can
  control the generated image.
  In particular, we present the first results on semantically-guided synthesis
  of megapixel images with transformers.
                </p>
							</div>
						</div>
            <!--
          <p style="text-align:center">Related work <br/><a
             href="https://compvis.github.io/iin/">"A Disentangling
             Invertible Interpretation Network for Explaining Latent
           Representations"</a></p>
            -->
					</div>
				</section>

			<!-- Two -->
				<section id="two" class="wrapper style2 special">
					<div class="container">
						<header class="major">
							<h2>Results</h2>
							<p>and applications of our model.</p>
						</header>

            __TEMPLATE_STRING__

				  </div>
				</section>

<!-- related works !-->

				<section id="one" class="wrapper style1">
					<div class="container 75%">
						<div class="row 200%">
<div class="12u">
  <h4>Related Work on Modular Compositions of Deep Learning Models</h4>
</div>

<div class="12u">
  <h6>
    <a href="https://compvis.github.io/net2net/">
      Network-to-Network Translation with Conditional Invertible Neural Networks
    </a>
  </h6>
</div>
<div class="3u 12u$(medium)">
  <div class="image fit align-center">
    <a href="https://compvis.github.io/net2net/">
      <img src="https://compvis.github.io/net2net/paper/teaser.png" alt="Teaser figure for Network-to-Network Translation" style="max-width:25em; margin:auto" />
    </a>
  </div>
</div>
<div class="9u 12u$(medium)">
  <p align="justify" style="line-height: 1.0em; font-size:0.8em">
  Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate between different existing representations and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models, to provide text-to-image generation, which neither expert can perform on its own.
  </p>
</div>

<div class="12u">
  <h6>
    <a href="https://compvis.github.io/invariances/">
      Making Sense of CNNs: Interpreting Deep Representations &amp; Their Invariances with INNs
    </a>
  </h6>
</div>
<div class="3u 12u$(medium)">
  <div class="image fit align-center">
    <a href="https://compvis.github.io/invariances/">
      <img src="https://compvis.github.io/invariances/images/overview.jpg" alt="Overview figure for Making Sense of CNNs" style="max-width:25em; margin:auto" />
    </a>
  </div>
</div>
<div class="9u 12u$(medium)">
  <p align="justify" style="line-height: 1.0em; font-size:0.8em">
  To tackle increasingly complex tasks, it has become an essential ability of neural networks to learn abstract representations. These task-specific representations and, particularly, the invariances they capture turn neural networks into black box models that lack interpretability. To open such a black box, it is, therefore, crucial to uncover the different semantic concepts a model has learned as well as those that it has learned to be invariant to. We present an approach based on INNs that (i) recovers the task-specific, learned invariances by disentangling the remaining factor of variation in the data and that (ii) invertibly transforms these recovered invariances combined with the model representation into an equally expressive one with accessible semantic concepts. As a consequence, neural network representations become understandable by providing the means to (i) expose their semantic meaning, (ii) semantically modify a representation, and (iii) visualize individual learned semantic concepts and invariances. Our invertible approach significantly extends the abilities to understand black box models by enabling post-hoc interpretations of state-of-the-art networks without compromising their performance.
  </p>
</div>


<!-- /related works !-->
						</div>
					</div>
				</section>


			<!-- Four -->
				<section id="four" class="wrapper style3 special"
          style="background-attachment:scroll;background-position:center bottom;">
					<div class="container">
						<header class="major">
							<h2>Acknowledgement</h2>
              <p>
              This page is based on a design by <a href="http://templated.co">TEMPLATED</a>.
              </p>
						</header>
					</div>
				</section>

		<!-- Scripts -->
			<script src="assets/js/jquery.min.js"></script>
			<script src="assets/js/skel.min.js"></script>
			<script src="assets/js/util.js"></script>
			<script src="assets/js/main.js"></script>

	</body>
</html>