Loubna Ben Allal
21st August 2024, 3:00 - 4:00 PM (GST)
Title: The age of synthetic data and small LLMs
Affiliation: Hugging Face
Abstract: Recent advancements in Large Language Models (LLMs) have primarily stemmed from improved pre-training datasets, with minimal changes to the transformer architecture. As models have become more powerful, synthetic data has emerged as a new approach for generating training data, primarily for fine-tuning but also for pre-training. In pre-training, synthetic data can serve two key purposes: building effective classifiers for web content filtering, as demonstrated in our FineWeb-Edu project, and generating pre-training samples, as illustrated in our work on the Cosmopedia dataset. In this talk, we will go over the process of building these datasets and how they led to the development of the SmolLM models, a series of compact yet powerful LLMs.
Bio: Loubna Ben Allal is a Machine Learning Engineer on the Science team at Hugging Face, where she leads efforts on synthetic data for pre-training and small LLMs. Previously, she worked on large language models for code and was a core member of the BigCode team behind The Stack datasets and the StarCoder models for code generation. Loubna holds master's degrees in Mathematics and Deep Learning from Ecole des Mines de Nancy and ENS Paris-Saclay.