AI Seminar Series - Loubna Ben Allal

Aug 21, 2024
Loubna Ben Allal

21st August, 2024, 3:00 PM - 4:00 PM (GST)

 

Title: The age of synthetic data and small LLMs
Affiliation: Hugging Face
Abstract: Recent advancements in Large Language Models (LLMs) have primarily stemmed from improved pre-training datasets, with minimal changes to the transformer architecture. As models have become more powerful, synthetic data has emerged as a new approach for generating training data, primarily for fine-tuning but also for pre-training. In pre-training, synthetic data can serve two key purposes: building effective classifiers for web content filtering, as demonstrated in our FineWeb-Edu project, and generating pre-training samples, as illustrated in our work with the Cosmopedia dataset. In this talk, we will go over the process of building these datasets and how they led to the development of the SmolLM models, a series of compact yet powerful LLMs.
Bio: Loubna Ben Allal is a Machine Learning Engineer in the Science team at Hugging Face, where she leads efforts on synthetic data for pre-training and small LLMs. Previously, she worked on large language models for code and was a core member of the BigCode team behind The Stack datasets and StarCoder models for code generation. Loubna holds master's degrees in Mathematics and Deep Learning from Ecole des Mines de Nancy and ENS Paris-Saclay.