Master projects/internships - Leuven
Aligning future system design with the ever-increasing complexity and scale of large language models
The rapid growth of AI models, particularly large language models (LLMs), has dramatically increased computational demands, requiring compute systems to scale in both performance and power efficiency. One promising direction is specialized system architecture and technology shaped by workload characteristics; the architecture of AI models is thus now driving the design of compute infrastructure. Given the vast scale of transformer-based LLMs and the substantial volume of training data, data centers with tens of thousands of GPUs are required for distributed training of these models. Finding an optimal setup involves extensive preliminary exploration of network architecture, hyperparameters, and distributed parallelism strategies, making this process highly time- and energy-intensive. An analytical modeling framework, in contrast, enables quick analysis and evaluation of how workloads or algorithms interact with a computing system, creating opportunities for HW-SW co-design. Furthermore, a versatile analytical performance modeling framework can guide the development of next-generation systems and technologies tailored to LLMs.
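To make the appeal of analytical modeling concrete, a back-of-the-envelope sketch is shown below. It assumes the commonly cited rule of thumb of roughly 6 FLOPs per parameter per token for dense transformer training; the function name and all numbers are illustrative, not measured values from any real system:

```python
# First-order estimate of dense-transformer training time.
# Assumes ~6 FLOPs per parameter per token (forward + backward pass);
# numbers below are illustrative, not measured.

def training_days(params: float, tokens: float,
                  num_gpus: int, peak_flops: float,
                  mfu: float = 0.4) -> float:
    """Estimated wall-clock days at model FLOP utilization `mfu`."""
    total_flops = 6.0 * params * tokens
    sustained = num_gpus * peak_flops * mfu       # FLOP/s actually delivered
    return total_flops / sustained / 86_400       # 86,400 seconds per day

# Example: a 70B-parameter model on 2T tokens, 1024 GPUs at 1e15 FLOP/s peak.
days = training_days(70e9, 2e12, num_gpus=1024, peak_flops=1e15, mfu=0.4)
```

Even such a crude model lets one scan GPU counts or utilization targets in microseconds, whereas measuring each configuration would take cluster-scale experiments.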
In the computer system architecture (CSA) department, we have developed a performance modeling framework for distributed LLM training and inference, named Optimus [1]. This framework is designed for performance prediction and design exploration of LLMs across various compute platforms, such as GPUs and TPUs. Optimus supports cross-layer analysis, spanning from algorithms to hardware technology, and facilitates automated design space exploration and co-optimization across multiple layers of the computational stack. It emulates the task graph of LLM implementations and integrates key features, including various mapping and parallelism strategies (data, tensor, pipeline, sequence), activation recomputation techniques, collective communication patterns, and KV caching.
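As an illustration of the kind of first-order reasoning such a framework automates, the sketch below shows how tensor- and pipeline-parallel degrees partition a model's weight memory across GPUs. This is a hypothetical helper written for this posting, not the Optimus API:

```python
# Minimal sketch of how parallelism degrees partition a model's weights
# across devices; hypothetical helper, not part of the Optimus framework.

def per_gpu_weight_bytes(params: float, tp: int, pp: int,
                         bytes_per_param: int = 2) -> float:
    """Weight bytes held by one GPU. Tensor parallelism (tp) splits each
    layer, pipeline parallelism (pp) splits layers into stages; data
    parallelism replicates weights, so it does not appear here."""
    return params * bytes_per_param / (tp * pp)

# Example: 70B params in fp16, tensor-parallel 8, pipeline-parallel 4.
gb = per_gpu_weight_bytes(70e9, tp=8, pp=4) / 2**30
```

A full framework layers communication, activation, and recomputation models on top of partitioning arithmetic like this.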
Built upon the Optimus framework, this internship project will focus on analytical performance modeling of one or more state-of-the-art optimization algorithms, including:
- FlashAttention [2]
- ZeRO memory optimizations [3]
- Mixture-of-experts scaling with GShard [4]
- PyTorch fully sharded data parallelism (FSDP) [5]
- MatMul-free language modeling [6]
Research efforts may involve algorithmic analysis, study of the PyTorch implementations of the above features, and detailed profiling to enable performance prediction.
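For instance, ZeRO [3] admits a simple closed-form memory model: with mixed-precision Adam, each parameter accounts for roughly 16 bytes (2 B fp16 weights, 2 B fp16 gradients, 12 B fp32 optimizer states), and each successive ZeRO stage shards one more of these components across the data-parallel group. A minimal sketch of that model (an illustrative helper, not part of Optimus):

```python
# Analytical per-GPU memory model for ZeRO (Rajbhandari et al., 2020) with
# mixed-precision Adam: per parameter, 2 B fp16 weights + 2 B fp16 gradients
# + 12 B fp32 optimizer states (master weights, momentum, variance).

def zero_bytes_per_gpu(params: float, dp: int, stage: int) -> float:
    if stage == 0:   # plain data parallelism: everything replicated
        return 16 * params
    if stage == 1:   # shard optimizer states across the dp group
        return 4 * params + 12 * params / dp
    if stage == 2:   # additionally shard gradients
        return 2 * params + 14 * params / dp
    if stage == 3:   # additionally shard the fp16 weights
        return 16 * params / dp
    raise ValueError("stage must be 0-3")

# Example: 7B parameters, 64-way data parallelism, ZeRO stage 3.
gb3 = zero_bytes_per_gpu(7e9, dp=64, stage=3) / 2**30
```

The internship would extend exactly this style of model with communication and timing terms so that memory savings can be weighed against their performance cost.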
The requirements for the ideal candidate:
[1] J. Kundu et al., Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference, IISWC, 2024.
[2] T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022.
[3] S. Rajbhandari et al., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Supercomputing, 2020.
[4] D. Lepikhin et al., GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, ICLR, 2021.
[5] Y. Zhao et al., PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, arXiv:2304.11277, 2023.
[6] R.-J. Zhu et al., Scalable MatMul-free Language Modeling, arXiv:2406.02528, 2024.
Type of Project: Internship; Thesis
Master's degree: Master of Science; Master of Engineering Science
Duration: 6 - 9 months
For more information or to apply, please contact Wenzhe Guo (wenzhe.guo@imec.be) and Joyjit Kundu (joyjit.kundu@imec.be).
Imec allowance will be provided.