Maximilian Beck
Working on efficient Large Language Models with sub-quadratic complexity.
ELLIS PhD Student at Johannes Kepler University Linz, Institute for Machine Learning
I am a fourth-year PhD student at the Institute for Machine Learning at the Johannes Kepler University (JKU) Linz, advised by Mr. LSTM himself, Sepp Hochreiter. I work on efficient, RNN-inspired architectures for Large Language Models with sub-quadratic complexity.
I obtained my bachelor's and master's degrees in Mechatronics and Information Technology with a focus on Control Theory from the Karlsruhe Institute of Technology (KIT) in 2017 and 2021, respectively. From 2018 to 2019, I spent two amazing semesters abroad studying Computer Engineering at San José State University (SJSU) in the heart of Silicon Valley.
During my bachelor's, I worked at the Institute for Production Science (wbk) at KIT, focusing on Automation Technology. After my time in San José, I joined the autonomous driving division at the FZI Research Center for Information Technology, where I contributed the visibility computation package, written in C++, to their driving simulator. For my master's thesis, I developed a Monte-Carlo Tree Search motion planning algorithm that explicitly accounts for a vehicle's uncertainty about its environment.
In 2021, I was accepted as an ELLIS PhD student at JKU Linz. During my first 1.5 years, I focused on Few-Shot Learning, Meta-Learning, and Domain Adaptation. I also became very interested in studying the Loss Landscapes of Deep Learning and their properties, such as Mode Connectivity.
With the rise of ChatGPT in 2022, I pivoted towards Large Language Models (LLMs). While the impressive performance of LLMs is the main driver of today's hype around generative AI, they have a major drawback: their compute costs scale quadratically with growing input length. The reason is that virtually all of today's LLMs are based on the Transformer architecture, with the quadratic Attention mechanism at its core.
Before the introduction of Transformers, LSTMs, which scale only linearly with input length during inference, were the state of the art in Natural Language Processing. In our current project, we extend the LSTM with the most recent tricks of the trade of modern LLMs, aiming to challenge the dominance of Transformer models. This work is funded by the newly founded company NXAI.
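To make the contrast concrete, here is a minimal NumPy sketch (a toy illustration of mine, not the xLSTM or any production code): full self-attention materializes a T×T score matrix, so compute and memory grow quadratically with sequence length T, whereas a simple recurrent model carries a fixed-size state forward one token at a time, i.e. linear in T.

```python
import numpy as np

def attention(Q, K, V):
    # Full self-attention: the T x T score matrix makes compute and
    # memory grow quadratically with sequence length T.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # (T, d)

def recurrent(X, W_h, W_x):
    # A generic RNN-style recurrence (stand-in for an LSTM cell):
    # one fixed-size state update per token, so O(T) overall.
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in X:                                     # T sequential steps
        h = np.tanh(W_h @ h + W_x @ x_t)
        outputs.append(h)
    return np.stack(outputs)

T, d = 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
print(attention(X, X, X).shape)                       # (8, 4), via a (8, 8) score matrix
print(recurrent(X, rng.normal(size=(d, d)), rng.normal(size=(d, d))).shape)  # (8, 4)
```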
news
| Date | News |
|---|---|
| Jul 26, 2024 | We presented the xLSTM at 3 workshops at ICML 2024, including an Oral at ES-FOMO@ICML2024. |
| Jul 01, 2024 | I presented the xLSTM on 1littlecoder's YouTube channel. |
| Jun 27, 2024 | ELISE wrap-up conference 2024 in Helsinki. I presented the xLSTM as an ELLIS PhD Spotlight presentation. |
| May 08, 2024 | It's out! We published the xLSTM on arXiv. |
| Apr 08, 2024 | My fourth PhD talk, presenting first results of the xLSTM. |