Performance modeling and analysis of large language models on distributed systems

Master projects/internships - Leuven

Aligning future system design with the ever-increasing complexity and scale of large language models

Motivation

The rapid growth of AI models, particularly large language models (LLMs), has significantly increased computational demands, requiring compute systems to scale in both performance and power efficiency. One potential solution is to specialize the system architecture or technology for the characteristics of the workload. As a result, the architecture of AI models is now driving the design of compute infrastructure. Given the vast scale of transformer-based LLMs and the substantial volume of training data, distributed training of these models requires data centers with tens of thousands of GPUs. Finding an optimal setup involves extensive preliminary exploration of network architecture, hyperparameters, and distributed parallelism strategies, which makes the process highly time- and energy-intensive. In contrast, an analytical modeling framework enables quick analysis and evaluation of how workloads or algorithms interact with a computing system, creating opportunities for hardware-software (HW-SW) co-design. Furthermore, a versatile analytical performance modeling framework can guide the development of next-generation systems and technologies tailored for LLMs.

Project description

In the computer system architecture (CSA) department, we have developed Optimus [1], a performance modeling framework for distributed LLM training and inference. The framework is designed for performance prediction and design exploration of LLMs across various compute platforms, such as GPUs and TPUs. Optimus supports cross-layer analysis, spanning from algorithms to hardware technology, and facilitates automated design space exploration and co-optimization across multiple layers of the computational stack. It emulates the task graph of LLM implementations and integrates key features, including various mapping and parallelism strategies (data, tensor, pipeline, sequence), activation recomputation techniques, collective communication patterns, and KV caching.
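
To give a flavor of what analytical performance modeling means here, the following is a minimal Python sketch, not the Optimus implementation or its API. It assumes a simple roofline bound for each matrix multiplication and the standard bandwidth-only cost of a ring all-reduce, and it combines them for one tensor-parallel MLP block split in the Megatron style. The names `Device`, `gemm_time`, `ring_allreduce_time`, and `tp_mlp_forward_time`, as well as all hardware numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical accelerator description (all values are assumptions)."""
    peak_flops: float  # peak throughput, FLOP/s
    mem_bw: float      # off-chip memory bandwidth, bytes/s
    link_bw: float     # per-link interconnect bandwidth, bytes/s


def gemm_time(m: int, n: int, k: int, dev: Device, bytes_per_elem: int = 2) -> float:
    """Roofline lower bound for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2.0 * m * n * k
    # Assumes each operand and the result cross the memory interface once.
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return max(flops / dev.peak_flops, traffic / dev.mem_bw)


def ring_allreduce_time(nbytes: float, workers: int, dev: Device) -> float:
    """Bandwidth term of a ring all-reduce; latency is ignored."""
    if workers == 1:
        return 0.0
    return 2.0 * (workers - 1) / workers * nbytes / dev.link_bw


def tp_mlp_forward_time(tokens: int, hidden: int, tp: int, dev: Device,
                        bytes_per_elem: int = 2) -> float:
    """Forward time of one MLP block under tensor parallelism of degree tp.

    Assumes a 4x expansion, a column-split first GEMM, a row-split second
    GEMM, one all-reduce on the block output, and that tp divides 4 * hidden.
    """
    ffn_per_worker = 4 * hidden // tp
    compute = (gemm_time(tokens, ffn_per_worker, hidden, dev, bytes_per_elem)
               + gemm_time(tokens, hidden, ffn_per_worker, dev, bytes_per_elem))
    comm = ring_allreduce_time(tokens * hidden * bytes_per_elem, tp, dev)
    return compute + comm


if __name__ == "__main__":
    # Illustrative numbers only; they do not describe a specific product.
    dev = Device(peak_flops=300e12, mem_bw=2e12, link_bw=300e9)
    for tp in (1, 2, 4, 8):
        t = tp_mlp_forward_time(tokens=8192, hidden=8192, tp=tp, dev=dev)
        print(f"tp={tp}: {t * 1e3:.3f} ms")
```

Sweeping `tp` in this way shows the basic trade-off such a model exposes: compute time shrinks with the parallelism degree while the all-reduce term grows toward a fixed cost. A real framework would add latency terms, kernel efficiency, and overlap of communication with computation.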

Building on the Optimus framework, this internship project will focus on analytical performance modeling of one or more state-of-the-art optimization techniques, including:

FlashAttention algorithms [2] to reduce memory accesses
Mixture of experts (MoE) [3] for efficient model scaling
ZeRO parallelism [4] for memory optimization
Fully sharded data parallel (FSDP) [5] for sharding the model across data-parallel workers
MatMul-free LLMs [6]
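
As an illustration of the kind of closed-form model such a study might start from, the sketch below estimates per-GPU memory for model states under the ZeRO stages, following the accounting in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for fp32 optimizer states. The function name `zero_model_state_bytes` is hypothetical, and activations, buffers, and fragmentation are deliberately left out.

```python
def zero_model_state_bytes(num_params: float, dp_degree: int, stage: int,
                           k: int = 12) -> float:
    """Per-GPU bytes for model states under ZeRO with mixed-precision Adam.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned across data-parallel workers
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params        # fp16 parameters
    grads = 2.0 * num_params         # fp16 gradients
    optim = float(k) * num_params    # fp32 weights, momentum, variance
    if stage >= 1:
        optim /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        params /= dp_degree
    return params + grads + optim


if __name__ == "__main__":
    # Example configuration used in the ZeRO paper: 7.5B parameters, 64 workers.
    for stage in range(4):
        gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

FSDP with full sharding partitions the same three kinds of state, so a first-order memory estimate for it can reuse the stage 3 expression. The communication cost of each stage would need a separate model.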

Research efforts might involve algorithmic analysis, understanding the PyTorch implementations of the above features, and detailed profiling to enable performance prediction.

The requirements for the ideal candidate:

Proficiency in Python
Experience with the PyTorch framework
Knowledge of hardware microarchitectures (e.g., GPUs)
Strong understanding of LLM architectures and implementations

References

[1] J. Kundu et al., Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference, IISWC, 2024.
[2] T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022.
[3] D. Lepikhin et al., GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, ICLR, 2021.
[4] S. Rajbhandari et al., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Supercomputing, 2020.
[5] Y. Zhao et al., PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, arXiv:2304.11277, 2023.
[6] R.-J. Zhu et al., Scalable MatMul-free Language Modeling, https://arxiv.org/pdf/2406.02528, 2024.

Type of Project: Internship; Thesis

Master's degree: Master of Science; Master of Engineering Science

Duration: 6 - 9 months

For more information or to apply, please contact Wenzhe Guo (wenzhe.guo@imec.be) and Joyjit Kundu (joyjit.kundu@imec.be).

An imec allowance will be provided.
