Machine Learning, Programming

VectorVFS: your filesystem as a vector database

PS: thanks for all the interest! Here are some discussions about VectorVFS as well:
Hacker News: discussion thread
Reddit: discussion thread

When I released EuclidesDB in 2018, which was the first modern vector database (before Milvus, Pinecone, etc.), I still felt a piece of simple retrieval software was missing: something local and easy to use, without requiring a daemon, any other sort of server, or external indexes. After spending quite some time trying to figure out the best way to store embeddings in the filesystem, I arrived at the design of VectorVFS.


The main goal of VectorVFS is to store data embeddings (vectors) in the filesystem itself, without requiring an external database. We don't want to change the file contents, and we also don't want to create extra loose files in the filesystem. What we want is to store the embeddings on the files themselves without touching their data. How can we accomplish that?

It turns out that all major Linux filesystems (e.g. ext4, Btrfs, ZFS, and XFS) support a feature called extended attributes (also known as xattrs). Depending on its size, this metadata is stored in the inode itself (in a reserved space at the end of each inode) rather than in the data blocks. Some filesystems impose limits on the attributes; ext4, for example, requires them to fit within a single filesystem block (e.g. 4 KB).

That is exactly what VectorVFS does: it embeds files (right now only images) using an encoder (the Perception Encoder for the moment) and stores the resulting embedding in the file's extended attributes. This way we don't need to create any extra metadata files, the embedding is automatically linked to the file that was embedded, and there is no risk of the embedding being copied by mistake. It seems almost like a perfect solution for storing embeddings and doing retrieval, one that has been sitting in many filesystems but was never explored for this purpose before.
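Conceptually, storing and reading an embedding through xattrs takes only a few lines of Python. This is a minimal sketch, not VectorVFS's actual code: the attribute name `user.vectorvfs.embedding` and the float32 packing are illustrative assumptions.

```python
import os
import struct
import tempfile

def pack_embedding(vec):
    # Serialize floats as little-endian float32 so the vector fits in an xattr value.
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack_embedding(data):
    return list(struct.unpack(f"<{len(data) // 4}f", data))

embedding = [0.12, -0.53, 0.98]
blob = pack_embedding(embedding)

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
try:
    if hasattr(os, "setxattr"):  # Linux-only API
        # "user.vectorvfs.embedding" is a hypothetical attribute name.
        os.setxattr(path, "user.vectorvfs.embedding", blob)
        stored = os.getxattr(path, "user.vectorvfs.embedding")
    else:
        stored = blob
except OSError:
    # Some filesystems (e.g. tmpfs) reject user.* xattrs; fall back for the demo.
    stored = blob
finally:
    os.unlink(path)

restored = unpack_embedding(stored)
```

A retrieval pass would then simply walk the filesystem, read each file's attribute with `os.getxattr`, and rank files by similarity to a query embedding.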

If you are interested, here is the documentation on how to install and use it. Contributions are welcome!


Machine Learning, Math

The geometry of data: the missing metric tensor and the Stein score [Part II]

Credit: ESA/Webb, NASA & CSA, J. Rigby. / The James Webb Space Telescope captures gravitational lensing, a phenomenon that can be modeled using differential geometry.

Note: This is a continuation of the previous post, Thoughts on Riemannian metrics and its connection with diffusion/score matching [Part I], so if you haven’t read it yet, please consider reading it first, as I won’t re-introduce in depth the concepts (e.g., the two scores) I described there. This article became a bit long, so if you are already familiar with metric tensors and differential geometry, you can skip the first part.

I was planning to write a paper about this topic, but my spare time is limited, so I decided it would be much more fun and educational to write this article in the form of a tutorial. If you liked it, please consider citing it:

Cite this article as: Christian S. Perone, "The geometry of data: the missing metric tensor and the Stein score [Part II]," in Terra Incognita, 12/11/2024, https://blog.christianperone.com/2024/11/the-geometry-of-data-part-ii/.

(more…)

Article, Machine Learning, Philosophy

Notes on Gilbert Simondon’s “On the Mode of Existence of Technical Objects” and Artificial Intelligence

Happy new year! This is the first post of 2025, and this time it is not a technical article (but it is about the philosophy of technology 😄).

Gilbert Simondon (1924-1989). Photo by Le Monde.

This is a short opinion article sharing some notes on the book by the French philosopher Gilbert Simondon, “On the Mode of Existence of Technical Objects” (Du mode d’existence des objets techniques), from 1958. Despite his significant contributions, Simondon still (and incredibly) remains relatively unknown, and it seems to me that this is partly due to the delayed translation of his works. I recently realized that his philosophy of technology aligns very well with an actionable understanding of AI/ML. His insights illuminated a lot for me about how we should approach modern technology and what cultural and societal changes are needed to view AI as an evolving entity that can be harmonised with human needs. This perspective offers an alternative to the current cultural polarization between technophilia and technophobia, which often leads to alienation and misoneism. I think this work from 1958 provides more enlightening and actionable insights than many contemporary discussions of AI and machine learning, which often prioritise media attention over public education. Simondon’s book is very dense and was very difficult to read (I found it more difficult than Heidegger’s work on the philosophy of technology), so in my quest to simplify it, I may be guilty of oversimplification in some cases.

(more…)

Machine Learning, Philosophy

Generalisation, Kant’s schematism and Borges’ Funes el memorioso – Part I

Introduction

Portrait of Immanuel Kant by Johann Gottlieb Becker, 1768.

One of the most interesting, but also obscure and difficult parts of Kant’s critique is schematism. Every time I reflect on generalisation in Machine Learning and how concepts should be grounded, it always leads to the same central problem of schematism. Friedrich H. Jacobi said that schematism was “the most wonderful and most mysterious of all unfathomable mysteries and wonders …” [1], and Schopenhauer also said that it was “famous for its profound darkness, because nobody has yet been able to make sense of it” [1].

It is very rewarding, however, to realize that it is impossible to read Kant without relating much of his revolutionary philosophy to the difficult problems we face (and have always faced) in AI, especially regarding generalisation. The first edition of the Critique of Pure Reason (CPR) was published more than 240 years ago, so historical context is often required to understand Kant’s writing; to make things worse, there is much debate and little consensus among Kant scholars. Even with these difficulties, however, it remains one of the most relevant and worthwhile works of philosophy to read today.

(more…)

Machine Learning, Math

Thoughts on Riemannian metrics and its connection with diffusion/score matching [Part I]

Surfaces of different Gaussian curvature. Image by Nicoguaro.

We are so used to Euclidean geometry that we often overlook the significance of curved geometries and the methods for measuring things that don’t reside on orthonormal bases. Just as understanding physics and the curvature of spacetime requires Riemannian geometry, I believe a profound comprehension of Machine Learning (ML) and data is also not possible without it. There is an increasing body of research that integrates differential geometry into ML. Unfortunately, the term “geometric deep learning” has predominantly become associated with graphs. However, modern geometry offers much more than just graph-related applications in ML.

I was reading Sander Dieleman’s excellent article about different perspectives on diffusion, so I thought it would be fun to try to contribute a bit with a new perspective.

(more…)

Machine Learning

Feste: composing NLP tasks with automatic parallelization and batching

I just released Feste, a free and open-source framework with a permissive license that allows scalable composition of NLP tasks using a graph execution model that is optimized and executed by specialized schedulers. The main idea behind Feste is that it builds an execution graph instead of executing tasks immediately; this graph lets Feste optimize and parallelize the work. One key optimization happens when there are multiple calls to the same backend (e.g. the same API): Feste automatically fuses these calls into a single batched one, reducing latency and letting the backend leverage GPU vectorization for inference. Feste also runs tasks that can be done in parallel in different processes, so the user doesn’t have to worry about parallelization, especially when multiple frameworks use different concurrency strategies.
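The call-fusion idea can be illustrated with a tiny deferred-execution sketch. This is not Feste's real API, just the concept: tasks are recorded instead of executed, and tasks sharing the same backend are batched into a single call.

```python
from collections import defaultdict

class Task:
    # A deferred call: nothing runs until the graph is executed.
    def __init__(self, backend, arg):
        self.backend, self.arg = backend, arg

def run(tasks):
    # Fuse tasks that target the same backend into one batched call.
    groups = defaultdict(list)
    for i, task in enumerate(tasks):
        groups[task.backend].append((i, task.arg))
    results = [None] * len(tasks)
    for backend, items in groups.items():
        idxs, args = zip(*items)
        for i, out in zip(idxs, backend(list(args))):  # one call per backend
            results[i] = out
    return results

calls = []
def fake_api(batch):          # stand-in for a remote NLP backend
    calls.append(len(batch))
    return [s.upper() for s in batch]

tasks = [Task(fake_api, "a"), Task(fake_api, "b"), Task(fake_api, "c")]
results = run(tasks)          # three tasks, but only one backend call
```

Three separate tasks end up served by a single batched backend call, which is exactly the kind of latency saving the scheduler aims for.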

Project page: https://feste.readthedocs.io/en/latest/design.html
Github: https://github.com/perone/feste

Machine Learning

A couple of recent publications in uncertainty estimation and autonomous vehicles

Just sharing some recent publications I’ve been involved in:

L2M: Practical posterior Laplace approximation with optimization-driven second moment estimation

ArXiv: https://arxiv.org/abs/2107.04695 (ICML 2021 / UDL)

Uncertainty quantification for deep neural networks has recently evolved through many techniques. In this work, we revisit Laplace approximation, a classical approach for posterior approximation that is computationally attractive. However, instead of computing the curvature matrix, we show that, under some regularity conditions, the Laplace approximation can be easily constructed using the gradient second moment. This quantity is already estimated by many exponential moving average variants of Adagrad such as Adam and RMSprop, but is traditionally discarded after training. We show that our method (L2M) does not require changes in models or optimization, can be implemented in a few lines of code to yield reasonable results, and it does not require any extra computational steps besides what is already being computed by optimizers, without introducing any new hyperparameter. We hope our method can open new research directions on using quantities already computed by optimizers for uncertainty estimation in deep neural networks.
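The core idea can be shown with a toy one-dimensional sketch. This is not the exact L2M recipe, just an illustration of reusing the optimizer's accumulated gradient second moment (the `exp_avg_sq` quantity that Adam/RMSprop maintain) as a curvature proxy for a diagonal Laplace posterior; the dataset size and prior precision below are made up.

```python
import math

# Toy 1-D problem: loss L(w) = 0.5 * (w - 3)^2, minimized by an Adam-like update.
w, v, beta2 = 0.0, 0.0, 0.999
N, prior_prec = 100, 1.0  # hypothetical dataset size and prior precision

for t in range(2000):
    g = w - 3.0                              # gradient of the loss
    v = beta2 * v + (1 - beta2) * g * g      # second moment, as Adam/RMSprop keep
    lr = 0.1 / math.sqrt(t + 1)              # decaying step size
    w -= lr * g / (math.sqrt(v) + 1e-8)

# Diagonal Laplace approximation: precision ~ N * second moment + prior precision.
post_var = 1.0 / (N * v + prior_prec)
post_std = math.sqrt(post_var)
```

After training, `v` is normally discarded; here it is recycled into an approximate posterior variance at no extra computational cost, which is the spirit of the method.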

CW-ERM: Improving Autonomous Driving Planning with Closed-loop Weighted Empirical Risk Minimization

ArXiv: https://arxiv.org/abs/2210.02174 (NeurIPS 2022 / ML4AD / ICRA 2023 under review)

Project page: https://woven.mobi/cw-erm

The imitation learning of self-driving vehicle policies through behavioral cloning is often carried out in an open-loop fashion, ignoring the effect of actions on future states. Training such policies purely with Empirical Risk Minimization (ERM) can be detrimental to real-world performance, as it biases policy networks towards matching only open-loop behavior, showing poor results when evaluated in closed-loop. In this work, we develop an efficient and simple-to-implement principle called Closed-loop Weighted Empirical Risk Minimization (CW-ERM), in which a closed-loop evaluation procedure is first used to identify training data samples that are important for practical driving performance, and then these samples are used to help debias the policy network. We evaluate CW-ERM in a challenging urban driving dataset and show that this procedure yields a significant reduction in collisions as well as other non-differentiable closed-loop metrics.
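The weighting principle can be sketched numerically. All numbers below are illustrative: samples flagged as important by a hypothetical closed-loop evaluation simply receive a larger weight in the ERM objective.

```python
# Per-sample open-loop imitation losses (made-up values).
losses = [0.2, 1.5, 0.4, 0.9]
# Flags from a hypothetical closed-loop evaluation of the policy.
important = [False, True, False, True]

# Upweight samples that matter for closed-loop driving performance.
weights = [2.0 if flag else 1.0 for flag in important]
weighted_loss = sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Plain ERM for comparison: every sample weighted equally.
plain_loss = sum(losses) / len(losses)
```

The weighted objective shifts the training signal towards the samples that the closed-loop evaluation identified as performance-critical.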

SafePathNet: Safe Real-World Autonomous Driving by Learning to Predict and Plan with a Mixture of Experts

ArXiv: https://arxiv.org/abs/2211.02131 (NeurIPS 2022 / ML4AD / ICRA 2023 under review)

Project page: https://woven.mobi/safepathnet

The goal of autonomous vehicles is to navigate public roads safely and comfortably. To enforce safety, traditional planning approaches rely on handcrafted rules to generate trajectories. Machine learning-based systems, on the other hand, scale with data and are able to learn more complex behaviors. However, they often ignore that agents and self-driving vehicle trajectory distributions can be leveraged to improve safety. In this paper, we propose modeling a distribution over multiple future trajectories for both the self-driving vehicle and other road agents, using a unified neural network architecture for prediction and planning. During inference, we select the planning trajectory that minimizes a cost taking into account safety and the predicted probabilities. Our approach does not depend on any rule-based planners for trajectory generation or optimization, improves with more training data and is simple to implement. We extensively evaluate our method through a realistic simulator and show that the predicted trajectory distribution corresponds to different driving profiles. We also successfully deploy it on a self-driving vehicle on urban public roads, confirming that it drives safely without compromising comfort.
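At inference time, the selection step described above can be sketched as minimizing a cost that combines a safety term with the predicted probability of each candidate trajectory. The candidates, probabilities, costs, and the exact cost form below are all illustrative assumptions, not the paper's actual formulation.

```python
import math

# (trajectory label, predicted probability, safety cost) for made-up candidates.
candidates = [
    ("keep_lane",   0.70, 0.1),
    ("change_left", 0.25, 0.5),
    ("hard_brake",  0.05, 0.0),
]

def total_cost(prob, safety_cost, alpha=1.0):
    # Penalize unsafe plans, but also plans the model considers unlikely.
    return safety_cost - alpha * math.log(prob)

# Select the planning trajectory with the minimum combined cost.
best = min(candidates, key=lambda c: total_cost(c[1], c[2]))
```

Here the likely and reasonably safe lane-keeping plan wins over the perfectly safe but very unlikely hard brake, illustrating how the predicted distribution and safety trade off in a single cost.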

Uncategorized

[pt-br] Flood data from Rio Grande do Sul (RS) in 2024

It is very sad to see the devastating floods that have hit Rio Grande do Sul in recent years. I decided to write this post to try to better understand the scale and impact of these events using some satellite photos and recent flood data. Most of the recent images are from MODIS (Moderate Resolution Imaging Spectroradiometer), which I wrote a post about in 2009 (article here), and which uses two satellites (Terra and Aqua) to provide near-daily low-resolution coverage of the entire Earth.

We are in the middle of an unprecedented tragedy; on the other hand, this is a unique moment for data collection by researchers and the government, with the hope of improving the modeling of these hydrological processes to develop flood warning and forecasting systems. We have never before observed these complex processes in the rivers of Rio Grande do Sul; this moment is extremely important for the future of RS.

PS: I will try to keep this post updated with new images.

Image license notices:

Sentinel images: contains modified Copernicus Sentinel data 2024 processed by Sentinel Hub.
MODIS (Aqua/Terra): NASA/OB.DAAC.