AI Seminar Series: Nouamane Tazi

Apr 22, 2025

Nouamane Tazi

Affiliation: Hugging Face

22nd April 2025, 3:00PM - 4:00PM (GST)

Title: The Ultra-Scale Talk: Scaling Training to Thousands of GPUs
Abstract: Training large language models (LLMs) demands more than raw compute: it requires infrastructure, strategy, and a deep understanding of parallelism. What begins as a single-GPU prototype must eventually scale across thousands of devices, with each step introducing new complexity. This talk dives into the practicalities of ultra-scale training. We'll explore how 5D parallelism (spanning the data, tensor, pipeline, context, and expert dimensions) makes it possible to stretch a single training run across massive GPU clusters. Along the way, we'll cover performance tuning, communication patterns, and architecture choices that impact throughput and hardware efficiency. Scaling isn't just about size; it's about doing more with what you have. From case studies and benchmarks to design trade-offs and tooling insights, this webinar offers a comprehensive look at what it really takes to train state-of-the-art models at scale. This session is designed for engineers, researchers, and practitioners who want to move beyond “it fits on one GPU” toward infrastructure that trains trillion-parameter models efficiently and at speed. https://huggingface.co/spaces/nanotron/ultrascale-playbook
