A team of MIT researchers has revealed that neural network components previously thought to serve only as encoders can actually perform sophisticated image generation and manipulation tasks on their own.
The research, presented at the International Conference on Machine Learning (ICML 2025) in Vancouver, demonstrates that one-dimensional (1D) tokenizers—neural networks that compress visual information into sequences of discrete tokens—possess untapped generative capabilities, allowing them to stand in for a dedicated image generator in certain tasks.
Led by graduate student Lukas Lao Beyer from MIT's Laboratory for Information and Decision Systems (LIDS), the team discovered that manipulating individual tokens within these compressed representations produces specific, predictable changes in the resulting images. "This was a never-before-seen result, as no one had observed visually identifiable changes from manipulating tokens," Lao Beyer explained.
The researchers found that replacing a single token could change an image's apparent resolution from low to high, adjust how blurry the background is, change brightness levels, or even alter the pose of an object in the image. This discovery opens new possibilities for efficient image editing through direct token manipulation.
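The editing workflow this describes—encode, swap one token, decode—can be illustrated with a toy stand-in. Everything below is hypothetical: the real tokenizer and detokenizer are large neural networks, while here the "decoder" is just a fixed linear map and the codebook is random. The sketch only shows that a single-token substitution changes the whole decoded output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a 1D tokenizer pipeline (hypothetical sizes):
# 32 tokens drawn from a 4096-entry codebook of 64-dim embeddings.
NUM_TOKENS, CODEBOOK, DIM = 32, 4096, 64
codebook = rng.normal(size=(CODEBOOK, DIM))           # token embeddings
detok = rng.normal(size=(NUM_TOKENS * DIM, 16 * 16))  # toy linear "detokenizer"

def decode(tokens):
    """Map a 32-token sequence to a flat 16x16 'image' (toy linear decoder)."""
    emb = codebook[tokens].reshape(-1)  # concatenate the 32 token embeddings
    return emb @ detok                  # (256,) flat image

tokens = rng.integers(0, CODEBOOK, size=NUM_TOKENS)
img_before = decode(tokens)

# Replace a single token, mimicking the paper's editing experiments,
# and observe that the decoded output changes globally.
edited = tokens.copy()
edited[5] = (edited[5] + 1) % CODEBOOK
img_after = decode(edited)

print(np.abs(img_after - img_before).mean())  # nonzero: one token altered the image
```

In the real system the change is not arbitrary noise, as it is here: the researchers found that particular tokens map to identifiable attributes such as blur, brightness, or pose.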
More significantly, the MIT team demonstrated a novel approach to image generation that requires only a 1D tokenizer and a decoder (also called a detokenizer), guided by an off-the-shelf neural network called CLIP. This system can convert one image type to another—for example, transforming a red panda into a tiger—or generate entirely new images from random token values that are iteratively optimized.
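The generation loop sketched above—start from random tokens, decode, score with CLIP, and keep improving the tokens—can be illustrated with a toy optimizer. This is a deliberately simplified stand-in: the decoder is a random linear map, the "CLIP score" is just distance to a fixed target vector, and the search is random coordinate-wise replacement rather than the gradient-based optimization a real system would use.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical miniature setup: 32 tokens, 512-entry codebook.
NUM_TOKENS, CODEBOOK, DIM = 32, 512, 16
codebook = rng.normal(size=(CODEBOOK, DIM))
detok = rng.normal(size=(NUM_TOKENS * DIM, 256)) / 100  # toy linear decoder

def decode(tokens):
    return codebook[tokens].reshape(-1) @ detok  # (256,) flat "image"

target = rng.normal(size=256)  # stand-in for the image CLIP would reward

def score(img):
    # Stand-in for a CLIP image-text similarity score (higher is better).
    return -np.linalg.norm(img - target)

tokens = rng.integers(0, CODEBOOK, size=NUM_TOKENS)  # random starting tokens
start = best = score(decode(tokens))

# Iteratively optimize: try replacing one token at a time, keep improvements.
for _ in range(200):
    i = rng.integers(NUM_TOKENS)
    cand = tokens.copy()
    cand[i] = rng.integers(CODEBOOK)
    s = score(decode(cand))
    if s > best:
        tokens, best = cand, s

print(start, "->", best)  # score improves over the run
```

The point of the sketch is the architecture, not the optimizer: generation happens without any trained generator network, only a tokenizer/detokenizer pair plus an external scoring model guiding the token values.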
The approach builds upon a 2024 breakthrough from Technical University of Munich and ByteDance researchers, who developed a method to compress 256×256-pixel images into just 32 tokens, compared to the 256 tokens typically used by previous tokenizers. The MIT innovation demonstrates that these highly compressed representations contain rich semantic information that can be leveraged for creative applications.
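The compression figures quoted here are easy to verify with arithmetic. A conventional 2D tokenizer that splits a 256×256 image into 16×16-pixel patches (a typical patch size, assumed here) produces a 16×16 grid of tokens:

```python
# Token-count comparison for a 256x256 image.
side, patch = 256, 16
tokens_2d = (side // patch) ** 2  # conventional 2D tokenizer: 16 * 16 patches
tokens_1d = 32                    # the 2024 1D tokenizer described above

print(tokens_2d, tokens_1d, tokens_2d // tokens_1d)  # 256 32 8
```

So the 1D tokenizer shortens the sequence eight-fold, which is what makes downstream generation and editing on these representations so much cheaper.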
The research team includes Tianhong Li from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), Xinlei Chen from Facebook AI Research, MIT Professor Sertac Karaman, and MIT Associate Professor Kaiming He. Their findings suggest a more computationally efficient future for AI image generation, which is projected to become a billion-dollar industry by the end of this decade.