A team at Carnegie Mellon University led by PhD student Sukjun Hwang, together with researcher Brandon Wang and professor Albert Gu, has developed a groundbreaking AI architecture called H-Net that could transform how language models process text and other sequential data.
Traditional language models rely on tokenization, a pre-processing step that breaks text into smaller units according to fixed, hand-designed rules. This approach creates fundamental limitations, particularly for languages without clear word boundaries and for specialized domains like genomics. H-Net removes this constraint with a dynamic chunking mechanism that learns, during training, the most effective way to segment raw text.
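To make the idea concrete, the sketch below shows one way a learned boundary scorer could work, in the spirit of the paper's approach: a position is flagged as a likely chunk boundary when its representation differs sharply from its predecessor's. The class name, projections, and threshold are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryScorer(nn.Module):
    """Toy boundary scorer: adjacent byte states that look dissimilar
    suggest the start of a new chunk. Illustrative only."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) byte-level hidden states
        q = self.q_proj(h[:, 1:])    # each position...
        k = self.k_proj(h[:, :-1])   # ...compared against its predecessor
        sim = F.cosine_similarity(q, k, dim=-1)
        p = 0.5 * (1.0 - sim)        # low similarity -> high boundary probability
        first = torch.ones(h.size(0), 1, device=h.device)  # position 0 always opens a chunk
        return torch.cat([first, p], dim=1)

scorer = BoundaryScorer(64)
probs = scorer(torch.randn(2, 16, 64))   # (2, 16) boundary probabilities
chunk_starts = probs > 0.5               # threshold picks segment starts
```

Because the scores come out of differentiable projections, the segmentation can be trained jointly with the rest of the network rather than frozen in advance.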
The researchers' paper, published on arXiv on July 10 and updated on July 15, 2025, demonstrates that H-Net achieves nearly a 4x improvement in data efficiency on DNA sequences compared with conventional approaches. The system also performs well across multiple languages, with particularly strong results on Chinese text and programming code.
What makes H-Net revolutionary is its ability to learn content- and context-dependent segmentation strategies without explicit supervision. The model operates directly on bytes and uses a hierarchical network structure that can be iterated over multiple stages, allowing it to model different levels of abstraction. This design enables H-Net to match the performance of token-based Transformers twice its size.
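A rough sketch of that hierarchy follows, under the simplifying assumption of fixed stride-2 pooling in place of H-Net's learned, content-dependent chunking; the class names and layer choices are hypothetical, not the released code.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Toy hierarchical stage: a fine-resolution encoder, downsampling to a
    coarser sequence, an inner network (possibly another Stage), upsampling
    back, and a fine-resolution decoder. H-Net replaces the fixed stride-2
    pooling shown here with boundaries learned end to end."""
    def __init__(self, d_model: int, inner: nn.Module):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.inner = inner

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)                       # (B, L, D) fine resolution
        coarse = h[:, ::2]                        # fixed stride-2 "chunking"
        coarse = self.inner(coarse)               # model coarser abstractions
        up = coarse.repeat_interleave(2, dim=1)   # "dechunk" back toward length L
        return self.decoder(h + up[:, : h.size(1)])

# Stages nest: bytes -> stage 1 -> stage 2 -> core model.
core = nn.TransformerEncoderLayer(64, nhead=4, batch_first=True)
model = Stage(64, Stage(64, core))
out = model(torch.randn(2, 32, 64))               # byte-level input embeddings
print(out.shape)                                  # torch.Size([2, 32, 64])
```

Each nesting step lets the inner network operate over coarser, more abstract units while the outer layers handle raw bytes, which is how stacking stages captures multiple levels of abstraction.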
Beyond text, H-Net opens possibilities for processing continuous-valued sequences like audio and video, potentially enabling better multimodal AI systems. The researchers have made their code publicly available on GitHub, allowing other researchers and developers to build upon their work.
"Overcoming tokenization is not about tokenizers, but about learning abstractions," wrote Albert Gu in a blog post explaining the project. "Discovering a tool that can do this will unlock new capabilities." As AI systems continue to evolve, H-Net represents a significant step toward more flexible, efficient, and capable models that can better understand the complexities of human language and other sequential data.