publications
My publications in reverse chronological order.
2025
- FlashRNN: I/O-Aware Optimization of Traditional RNNs on modern hardware. Korbinian Pöppel, Maximilian Beck, and Sepp Hochreiter. In The International Conference on Learning Representations (ICLR), 2025
While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs, as well as modern variants like sLSTM, do have these capabilities, at the cost of strictly sequential processing. While this is often seen as a strong limitation, we show how fast these networks can get with our hardware optimization FlashRNN in Triton and CUDA, optimizing kernels to the register level on modern GPUs. We extend traditional RNNs with a parallelization variant that processes multiple RNNs with smaller hidden states in parallel, similar to the head-wise processing in Transformers. To enable flexibility on different GPU variants, we introduce a new optimization framework for hardware-internal cache sizes, memory and compute handling. It models the hardware in a setting with polyhedral-like constraints, including the notion of divisibility. This speeds up the solution process in our ConstrINT library for general integer constraint satisfaction problems (integer CSPs). We show that our kernels achieve 50x speed-ups over a vanilla PyTorch implementation and allow 40x larger hidden sizes compared to our Triton implementation. We will open-source our kernels and the optimization library to boost research in the direction of state-tracking-enabled RNNs and sequence modeling. (A sketch of the head-wise parallelization idea follows the BibTeX entry below.)
@inproceedings{poeppel2025flashrnn,
  title     = {Flash{RNN}: I/O-Aware Optimization of Traditional {RNN}s on modern hardware},
  author    = {P{\"o}ppel, Korbinian and Beck, Maximilian and Hochreiter, Sepp},
  booktitle = {The International Conference on Learning Representations (ICLR)},
  url       = {https://openreview.net/forum?id=l0ZzTvPfTw},
  year      = {2025},
}
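To make the head-wise parallelization mentioned in the abstract above more concrete, here is a minimal PyTorch sketch (not the FlashRNN Triton/CUDA kernels themselves): several small LSTMs with independent hidden states run side by side within one layer, analogous to heads in attention. The class and argument names (HeadwiseLSTM, num_heads, head_dim) are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch of head-wise RNN parallelization (not FlashRNN's kernels).
import torch
import torch.nn as nn


class HeadwiseLSTM(nn.Module):
    def __init__(self, input_dim: int, num_heads: int, head_dim: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        # Input projection produces gate pre-activations for every head at once.
        self.w_in = nn.Linear(input_dim, num_heads * head_dim * 4)
        # Recurrent weights are block-diagonal across heads:
        # one (4*head_dim, head_dim) matrix per head, applied as a batched matmul.
        self.w_rec = nn.Parameter(torch.randn(num_heads, 4 * head_dim, head_dim) * 0.02)

    def forward(self, x):  # x: (batch, seq_len, input_dim)
        B, T, _ = x.shape
        h = x.new_zeros(B, self.num_heads, self.head_dim)
        c = x.new_zeros(B, self.num_heads, self.head_dim)
        gates_in = self.w_in(x).view(B, T, self.num_heads, 4 * self.head_dim)
        outputs = []
        for t in range(T):  # strictly sequential over time, parallel over heads
            rec = torch.einsum("bnh,ngh->bng", h, self.w_rec)
            i, f, z, o = (gates_in[:, t] + rec).chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(z)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h.reshape(B, -1))
        return torch.stack(outputs, dim=1)  # (batch, seq_len, num_heads * head_dim)


y = HeadwiseLSTM(input_dim=64, num_heads=8, head_dim=32)(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 256])
```

The time loop stays strictly sequential, but all heads advance with one batched matmul per step; FlashRNN optimizes exactly this kind of structure down to the register level on GPU.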
- Vision-LSTM: xLSTM as Generic Vision Backbone. Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, and 2 more authors. In The International Conference on Learning Representations (ICLR), 2025
Transformers are widely used as generic backbones in computer vision, despite initially being introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures. (A sketch of the alternating traversal follows the BibTeX entry below.)
@inproceedings{alkin2024visionlstm,
  title     = {Vision-LSTM: xLSTM as Generic Vision Backbone},
  author    = {Alkin, Benedikt and Beck, Maximilian and P{\"o}ppel, Korbinian and Hochreiter, Sepp and Brandstetter, Johannes},
  booktitle = {The International Conference on Learning Representations (ICLR)},
  url       = {https://nx-ai.github.io/vision-lstm/},
  year      = {2025},
}
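As a reading aid for the alternating traversal described above, the sketch below flips the patch sequence for every second block. The inner block is a stand-in nn.GRU, not the actual xLSTM block, and AlternatingDirectionStack is an illustrative name.

```python
# Illustrative sketch of ViL's alternating block directions: odd-numbered blocks
# see the patch sequence in its original order, even-numbered blocks see it reversed.
import torch
import torch.nn as nn


class AlternatingDirectionStack(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(depth)]
        )

    def forward(self, tokens):  # tokens: (batch, num_patches, dim)
        x = tokens
        for idx, block in enumerate(self.blocks):
            flip = idx % 2 == 1  # 1-based even blocks (2nd, 4th, ...) run bottom-to-top
            if flip:
                x = torch.flip(x, dims=[1])
            out, _ = block(x)
            x = x + out  # residual connection
            if flip:
                x = torch.flip(x, dims=[1])  # restore the original patch order
        return x


patches = torch.randn(2, 196, 192)  # e.g. 14x14 patches with 192-dim embeddings
print(AlternatingDirectionStack(dim=192, depth=4)(patches).shape)
```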
2024
- xLSTM: Extended Long Short-Term Memory. Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, and 6 more authors. In Advances in Neural Information Processing Systems (NeurIPS), 2024
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories; in particular, they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, and (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling. (A sketch of the mLSTM recurrence follows the BibTeX entry below.)
@inproceedings{beck2024xlstm,
  title     = {xLSTM: Extended Long Short-Term Memory},
  author    = {Beck*, Maximilian and P{\"o}ppel*, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, G{\"u}nter and Brandstetter, Johannes and Hochreiter, Sepp},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  url       = {https://arxiv.org/abs/2405.04517},
  year      = {2024},
}
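The mLSTM mentioned in the abstract, with a matrix memory, covariance update rule, and exponential gating, can be sketched per step as below. This is a simplified single-head reading aid: input/output projections, batching, and the parallel training formulation are omitted, and the log-space stabilizer state m is one common way to keep exponential gates numerically stable.

```python
# Simplified single-head sketch of an mLSTM-style recurrence:
# a matrix memory C updated with an outer-product (covariance) rule,
# exponential gating, and a running-max stabilizer m. Not the reference code.
import torch


def mlstm_step(C, n, m, q, k, v, i_pre, f_pre, o_gate):
    """One step. C: (d, d) matrix memory, n: (d,) normalizer, m: scalar stabilizer,
    q/k/v: (d,) query/key/value, i_pre/f_pre: scalar gate pre-activations,
    o_gate: (d,) output gate after a sigmoid."""
    d = q.shape[0]
    k = k / d**0.5                                   # scale the key
    m_new = torch.maximum(f_pre + m, i_pre)          # log-space stabilizer
    i_gate = torch.exp(i_pre - m_new)                # stabilized exponential input gate
    f_gate = torch.exp(f_pre + m - m_new)            # stabilized forget gate
    C_new = f_gate * C + i_gate * torch.outer(v, k)  # covariance update rule
    n_new = f_gate * n + i_gate * k
    denom = torch.clamp(torch.abs(n_new @ q), min=1.0)
    h = o_gate * (C_new @ q) / denom                 # normalized readout
    return C_new, n_new, m_new, h


d = 8
C = torch.zeros(d, d)
n = torch.zeros(d)
m = torch.tensor(0.0)
for _ in range(5):
    q, k, v = torch.randn(3, d)
    C, n, m, h = mlstm_step(C, n, m, q, k, v,
                            i_pre=torch.randn(()), f_pre=torch.randn(()),
                            o_gate=torch.sigmoid(torch.randn(d)))
print(h.shape)  # torch.Size([8])
```

Because the state update contains no nonlinearity across time, this recurrence can also be unrolled in a parallel form for training, which is what makes the mLSTM fully parallelizable.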
- A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks. Thomas Schmied, Thomas Adler, Vihang Patil, and 6 more authors. In arXiv, 2024
In recent years, there has been a trend in the field of Reinforcement Learning (RL) towards large action models trained offline on large-scale datasets via sequence modeling. Existing models are primarily based on the Transformer architecture, which results in powerful agents. However, due to slow inference times, Transformer-based approaches are impractical for real-time applications such as robotics. Recently, modern recurrent architectures, such as xLSTM and Mamba, have been proposed that exhibit parallelization benefits during training similar to the Transformer architecture while offering fast inference. In this work, we study the aptitude of these modern recurrent architectures for large action models. Consequently, we propose a Large Recurrent Action Model (LRAM) with an xLSTM at its core that comes with linear-time inference complexity and natural sequence length extrapolation abilities. Experiments on 432 tasks from 6 domains show that LRAM compares favorably to Transformers in terms of performance and speed. (A sketch of the constant-cost recurrent inference step follows the BibTeX entry below.)
@inproceedings{schmied2024lram,
  title     = {A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks},
  author    = {Schmied, Thomas and Adler, Thomas and Patil, Vihang and Beck, Maximilian and P{\"o}ppel, Korbinian and Brandstetter, Johannes and Klambauer, G{\"u}nter and Pascanu, Razvan and Hochreiter, Sepp},
  booktitle = {arXiv},
  url       = {https://arxiv.org/abs/2410.22391},
  year      = {2024},
}
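To illustrate why a recurrent action model yields linear-time inference, the hedged sketch below keeps a fixed-size recurrent state so each environment step costs the same regardless of episode length, in contrast to an attention-based policy that re-attends over a growing history. A GRUCell stands in for the xLSTM core, and RecurrentPolicy is an illustrative name, not the paper's implementation.

```python
# Sketch: constant per-step inference cost with a fixed-size recurrent state.
import torch
import torch.nn as nn


class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim, hidden)   # placeholder for the xLSTM core
        self.head = nn.Linear(hidden, act_dim)

    def initial_state(self, batch: int):
        return torch.zeros(batch, self.cell.hidden_size)

    @torch.no_grad()
    def act(self, obs, state):
        # Constant work per step: the state size does not grow with episode length.
        state = self.cell(obs, state)
        return self.head(state), state


policy = RecurrentPolicy(obs_dim=16, act_dim=4)
state = policy.initial_state(batch=1)
for _ in range(1000):                 # per-step cost stays flat as the episode grows
    obs = torch.randn(1, 16)
    action, state = policy.act(obs, state)
print(action.shape)  # torch.Size([1, 4])
```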
2023
- Addressing Parameter Choice Issues in Unsupervised Domain Adaptation by Aggregation. Marius-Constantin Dinu, Markus Holzleitner, Maximilian Beck, and 7 more authors. In The International Conference on Learning Representations (ICLR), 2023
We study the problem of choosing algorithm hyper-parameters in unsupervised domain adaptation, i.e., with labeled data in a source domain and unlabeled data in a target domain, drawn from a different input distribution. We follow the strategy of computing several models using different hyper-parameters and subsequently computing a linear aggregation of the models. While several heuristics exist that follow this strategy, methods are still missing that rely on thorough theories for bounding the target error. To this end, we propose a method that extends weighted least squares to vector-valued functions, e.g., deep neural networks. We show that the target error of the proposed algorithm is asymptotically not worse than twice the error of the unknown optimal aggregation. We also perform a large-scale empirical comparative study on several datasets, including text, images, electroencephalogram, body sensor signals and signals from mobile phones. Our method outperforms deep embedded validation (DEV) and importance weighted validation (IWV) on all datasets, setting a new state of the art for solving parameter choice issues in unsupervised domain adaptation with theoretical error guarantees. We further study several competitive heuristics, all outperforming IWV and DEV on at least five datasets. However, our method outperforms each heuristic on at least five of seven datasets. (A generic sketch of importance-weighted least-squares aggregation follows the BibTeX entry below.)
@inproceedings{dinu2023aggregationmethod,
  title     = {Addressing Parameter Choice Issues in Unsupervised Domain Adaptation by Aggregation},
  author    = {Dinu, Marius-Constantin and Holzleitner, Markus and Beck, Maximilian and Nguyen, Hoan Duc and Huber, Andrea and Eghbal-zadeh, Hamid and Moser, Bernhard A. and Pereverzyev, Sergei and Hochreiter, Sepp and Zellinger, Werner},
  booktitle = {The International Conference on Learning Representations (ICLR)},
  year      = {2023},
  url       = {https://openreview.net/forum?id=M95oDwJXayG},
}
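As a rough illustration only: the sketch below performs a generic importance-weighted least-squares aggregation of several models' vector-valued predictions on labeled source data. The paper's actual estimator, weighting scheme, and theoretical guarantees go beyond this; all variable names here are assumptions made for the example.

```python
# Generic sketch: linear aggregation of K candidate models by importance-weighted
# least squares. The coefficients c minimize sum_i w_i * ||sum_k c_k f_k(x_i) - y_i||^2.
import numpy as np

rng = np.random.default_rng(0)
n, d_out, K = 200, 3, 5                 # source samples, output dim, number of models

preds = rng.normal(size=(K, n, d_out))  # f_k(x_i): predictions of the K candidate models
y = rng.normal(size=(n, d_out))         # source labels
w = rng.uniform(0.5, 2.0, size=n)       # importance weights (e.g. density ratios)

# Stack model predictions into a design matrix with one row per (sample, output coord)
# and solve the weighted normal equations for the aggregation coefficients c (shape K).
F = preds.transpose(1, 2, 0).reshape(n * d_out, K)
Y = y.reshape(n * d_out)
W = np.repeat(w, d_out)                 # same weight for every coordinate of a sample

A = F.T @ (W[:, None] * F)
b = F.T @ (W * Y)
c = np.linalg.solve(A, b)

aggregated = np.tensordot(c, preds, axes=(0, 0))   # sum_k c_k f_k(x): shape (n, d_out)
print(c.round(3), aggregated.shape)
```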
2022
- Few-Shot Learning by Dimensionality Reduction in Gradient Space. Martin Gauch, Maximilian Beck, Thomas Adler, and 10 more authors. In The Conference on Lifelong Learning Agents, 2022
We introduce SubGD, a novel few-shot learning method based on the recent finding that stochastic gradient descent updates tend to live in a low-dimensional parameter subspace. In experimental and theoretical analyses, we show that models confined to a suitable predefined subspace generalize well for few-shot learning. A suitable subspace fulfills three criteria across the given tasks: it (a) allows us to reduce the training error by gradient flow, (b) leads to models that generalize well, and (c) can be identified by stochastic gradient descent. SubGD identifies these subspaces from an eigendecomposition of the auto-correlation matrix of update directions across different tasks. Demonstrably, we can identify low-dimensional suitable subspaces for few-shot learning of dynamical systems, which have varying properties described by one or a few parameters of the analytical system description. Such systems are ubiquitous among real-world applications in science and engineering. We experimentally corroborate the advantages of SubGD on three distinct dynamical systems problem settings, significantly outperforming popular few-shot learning methods both in terms of sample efficiency and performance. (A minimal sketch of the subspace identification and projected updates follows the BibTeX entry below.)
@inproceedings{gauch2022subgd,
  title     = {Few-Shot Learning by Dimensionality Reduction in Gradient Space},
  author    = {Gauch, Martin and Beck, Maximilian and Adler, Thomas and Kotsur, Dmytro and Fiel, Stefan and Eghbal-zadeh, Hamid and Brandstetter, Johannes and Kofler, Johannes and Holzleitner, Markus and Zellinger, Werner and Klotz, Daniel and Hochreiter, Sepp and Lehner, Sebastian},
  booktitle = {The Conference on Lifelong Learning Agents},
  year      = {2022},
  editor    = {Chandar, Sarath and Pascanu, Razvan and Precup, Doina},
  url       = {https://proceedings.mlr.press/v199/gauch22a.html},
}
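A minimal sketch of the SubGD idea as described above: gather parameter-update directions across tasks, eigendecompose their auto-correlation matrix, and confine gradient steps on a new task to the span of the leading eigenvectors. Details such as how the eigenvalue spectrum enters the update are simplified away here, and all variable names are illustrative.

```python
# Sketch of subspace identification from update directions and projected SGD updates.
import numpy as np

rng = np.random.default_rng(0)
num_params, num_tasks, k = 50, 20, 5

# Update directions (e.g. final minus initial parameters) gathered from training on several tasks.
updates = rng.normal(size=(num_tasks, num_params))

# Auto-correlation matrix of the update directions and its eigendecomposition.
corr = updates.T @ updates / num_tasks              # (num_params, num_params)
eigvals, eigvecs = np.linalg.eigh(corr)
U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]       # top-k eigenvectors span the subspace


def project(grad):
    """Restrict a gradient to the identified low-dimensional subspace."""
    return U @ (U.T @ grad)


# Few-shot fine-tuning on a new task: plain SGD, but with projected gradients.
theta = rng.normal(size=num_params)
for _ in range(100):
    grad = rng.normal(size=num_params)              # stand-in for a real task gradient
    theta -= 0.01 * project(grad)
print(U.shape, project(grad).shape)
```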