Terra Incognita

Machine Learning

Large language model data pipelines and Common Crawl (WARC/WAT/WET)

Erik Desmazieres’s “La Bibliothèque de Babel”. 1997.

We have been training language models (LMs) for years, but finding valuable resources about the data pipelines commonly used to build the datasets for training these models is paradoxically challenging. It may be because we often take it for granted that these datasets exist (or at least existed? As replicating them is becoming increasingly difficult). However, one must consider the numerous decisions involved in creating such pipelines, as it can significantly impact the final model’s quality, as seen recently in the struggle of models aiming to replicate LLaMA (LLaMA: Open and Efficient Foundation Language Models). It might be tempting to think that now, with large models that can scale well, data is becoming more critical than modeling, since model architectures are not radically changing much. However, data has always been critical.

This article provides a short introduction to the pipeline used to create the data to train LLaMA, but it allows for many variations and I will add details about other similar pipelines when relevant, such as RefinedWeb (The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only) and The Pile (The Pile: An 800GB Dataset of Diverse Text for Language Modeling). This article is mainly based on the pipeline described in CCNet (CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data) and LLaMA’s paper, both from Meta. CCNet was developed focusing on the data source that is often the largest one, but also the most challenging in terms of quality: Common Crawl.

The big picture

The entire pipeline of CCNet (plus some minor modifications made by LLaMA’s paper) can be seen below. It has the following stages: data source, deduplication, language, filtering, and the “is-reference” filtering which was added in LLaMA. I will go through each one of them in the sections below.

*Visual overview of the CCNet pipeline with some modifications done in LLaMA. Click to enlarge*.

Let’s dive into it !

(more…)

03/06/202304/06/2023 by Christian S. Perone

Machine Learning

Feste: composing NLP tasks with automatic parallelization and batching

I just released Feste, a free and open-source framework with a permissive license that allows scalable composition of NLP tasks using a graph execution model that is optimized and executed by specialized schedulers. The main idea behind Feste is that it builds a graph of execution instead of executing tasks immediately, this graph allows Feste to optimize and parallelize it. One main example of optimization is when we have multiple calls to the same backend (e.g. same API), Feste automatically fuses these calls into a single one and therefore it batches the call to reduce latency and improve backend inference leverage of GPU vectorization. Feste also executes tasks that can be done in parallel in different processes, so the user doesn’t have to care about parallelization, especially when there are multiple frameworks using different concurrency strategies.

Project page: https://feste.readthedocs.io/en/latest/design.html
Github: https://github.com/perone/feste

09/03/202309/03/2023 by Christian S. Perone

Machine Learning

Couple of recent publications in uncertainty estimation and autonomous vehicles

Just sharing some recent publications I’ve been involved recently:

L2M: Practical posterior Laplace approximation with optimization-driven second moment estimation

ArXiv: https://arxiv.org/abs/2107.04695 (ICML 2021 / UDL)

Uncertainty quantification for deep neural networks has recently evolved through many techniques. In this work, we revisit Laplace approximation, a classical approach for posterior approximation that is computationally attractive. However, instead of computing the curvature matrix, we show that, under some regularity conditions, the Laplace approximation can be easily constructed using the gradient second moment. This quantity is already estimated by many exponential moving average variants of Adagrad such as Adam and RMSprop, but is traditionally discarded after training. We show that our method (L2M) does not require changes in models or optimization, can be implemented in a few lines of code to yield reasonable results, and it does not require any extra computational steps besides what is already being computed by optimizers, without introducing any new hyperparameter. We hope our method can open new research directions on using quantities already computed by optimizers for uncertainty estimation in deep neural networks.

CW-ERM: Improving Autonomous Driving Planning with Closed-loop Weighted Empirical Risk Minimization

ArXiv: https://arxiv.org/abs/2210.02174 (NeurIPS 2022 / ML4AD / ICRA 2023 under review)

Project page: https://woven.mobi/cw-erm

The imitation learning of self-driving vehicle policies through behavioral cloning is often carried out in an open-loop fashion, ignoring the effect of actions to future states. Training such policies purely with Empirical Risk Minimization (ERM) can be detrimental to real-world performance, as it biases policy networks towards matching only open-loop behavior, showing poor results when evaluated in closed-loop. In this work, we develop an efficient and simple-to-implement principle called Closed-loop Weighted Empirical Risk Minimization (CW-ERM), in which a closed-loop evaluation procedure is first used to identify training data samples that are important for practical driving performance and then we these samples to help debias the policy network. We evaluate CW-ERM in a challenging urban driving dataset and show that this procedure yields a significant reduction in collisions as well as other non-differentiable closed-loop metrics.

SafePathNet: Safe Real-World Autonomous Driving by Learning to Predict and Plan with a Mixture of Experts

ArXiv: https://arxiv.org/abs/2211.02131 (NeurIPS 2022 / ML4AD / ICRA 2023 under review)

Project page: https://woven.mobi/safepathnet

The goal of autonomous vehicles is to navigate public roads safely and comfortably. To enforce safety, traditional planning approaches rely on handcrafted rules to generate trajectories. Machine learning-based systems, on the other hand, scale with data and are able to learn more complex behaviors. However, they often ignore that agents and self-driving vehicle trajectory distributions can be leveraged to improve safety. In this paper, we propose modeling a distribution over multiple future trajectories for both the self-driving vehicle and other road agents, using a unified neural network architecture for prediction and planning. During inference, we select the planning trajectory that minimizes a cost taking into account safety and the predicted probabilities. Our approach does not depend on any rule-based planners for trajectory generation or optimization, improves with more training data and is simple to implement. We extensively evaluate our method through a realistic simulator and show that the predicted trajectory distribution corresponds to different driving profiles. We also successfully deploy it on a self-driving vehicle on urban public roads, confirming that it drives safely without compromising comfort.

10/11/2022 by Christian S. Perone

Machine Learning, Programming

Tutorial on using LLVM to JIT PyTorch fx graphs to native code (x86/arm/risc-v/wasm) (Part I – Scalars)

In 2009 I started playing with LLVM for some projects (data structure jit, for genetic programming, jit for tensorflow graphs, etc), and in these projects I realized how powerful LLVM design was at the time (and still is): using an elegant IR (intermediate representation) with an user-facing API and modularized front-ends and backends with plenty of transformation and optimization passes. Nowadays, LLVM is the main engine behind many compilers and JIT compilation and where most of the modern developments in compilers is happening.

On a related note, PyTorch is doing an amazing job of exposing more of the torch tracing system and its IR and graphs, not to mention their work on recent fusers and TorchDynamo. In this context, I was doing a small test to re-implement Shine, but using ATen ops for tensors and realized that there were not many educative tutorials on how to use LLVM to JIT PyTorch graphs, so this is a quick series (if time helps there will be more following posts) on how to use LLVM (python bindings) to go from PyTorch graphs (as traced by torch.fx) to LLVM IR and native code.

Detour – PyTorch NNC (Neural Net Compiler)

PyTorch itself also has a compiler that uses LLVM to generate native code for subgraphs that the fuser identifies. This is also called NNC (Neural Net Compiler) or Tensor Expressions (TE) as well, you can read more about them here in the C++ API tutorial. One thing to note though is that default binaries you get from PyTorch do not come linked to the LLVM libraries, so you need to compile it by yourself and enable LLVM backend, otherwise it won’t use LLVM to do the JIT compilation/optimization of the subgraphs. Let’s take a look at it first before starting our tutorial

(more…)

01/09/202201/09/2022 by Christian S. Perone

Uncategorized

Arduino WAN, Helium network and cryptographic co-processor

I was recently interested in the intersection of Machine Learning and RF and I was taking a look into LoRa modulation, which is based on Chirp Spread Spectrum (CSS), and ended up getting to know more about the Helium network. I still think that the most stupid piece of technology behind crypto mining is spending GPU/CPU/ASIC cycles to do proof-of-work (PoW), but in the Helium network, they did something quite interesting, which was to switch to something useful such as the proof-of-coverage instead of generating heat and burning energy. Therefore we can say that the miners are doing something useful by providing radio coverage, instead of purely generating heat.

(more…)

13/01/202213/01/2022 by Christian S. Perone

Uncategorized

Episuite: epidemiology in Python

I’m proud to announce Episuite, an open-source project with a suite of tools and components for epidemiology in Python. It is an initiative trying to fill the gap that we have in the Python ecosystem for epidemiology frameworks.

Documentation: https://perone.github.io/episuite/

Repository: https://github.com/perone/episuite

21/03/2021 by Christian S. Perone

Uncategorized

Talk: Gradient-based optimization for Deep Learning

This weekend I gave a talk at the Machine Learning Porto Alegre Meetup about optimization methods for Deep Learning. In this material you will find an overview of first-order methods, second-order methods and some approximations of second-order methods as well about the natural gradient descent and approximations to it. I took some long nights to prepare this material, so I hope you like it! You can download the PDF of the slides by clicking on the top-right menu.

View Fullscreen

– Christian S. Perone

22/11/2020 by Christian S. Perone

Uncategorized

Visualizing sample simplex trajectories in Deep Learning

Softmax is a distribution over choices, it maps a vector into the probability simplex that is defined as $\Delta_{n-1}=\{p\in\mathbb{R}^n\; \vert\; 1^\top p = 1 \; \; {\rm and} \;\; p \geq 0 \}$ , where the sum of all elements of the vector must equal 1. Softmax is used a lot in classification and I thought it would be interesting to visualize (when possible, on lower dimensions) the trajectories of individual samples in that simplex as predicted by the network while the network is being trained.

In the animations below you’ll see the trajectories of the sample individual sample (from the test set) over the simplex of 3 classes (dog, cat, horse) from CIFAR-10 and using a simple shallow CNN both with Adam and SGD. Each frame is generated after 10 optimization steps and the video is from 4 epochs with CIFAR-10 dataset with only the 3 aforementioned classes.

Trajectory of a CNN using Adam with LR of 0.001

Trajectory of a CNN using SGD with LR of 0.001 and momentum

19/09/2020 by Christian S. Perone

The big picture

L2M: Practical posterior Laplace approximation with optimization-driven second moment estimation

CW-ERM: Improving Autonomous Driving Planning with Closed-loop Weighted Empirical Risk Minimization

SafePathNet: Safe Real-World Autonomous Driving by Learning to Predict and Plan with a Mixture of Experts

Detour – PyTorch NNC (Neural Net Compiler)

Trajectory of a CNN using Adam with LR of 0.001

Trajectory of a CNN using SGD with LR of 0.001 and momentum

Tags