Year: 2025

Machine Learning

Diffusion Elites: surprisingly good, simple and embarrassingly parallel

Introduction

Hi! I was going to publish this content on arXiv, but I decided to write a blog post this time so I can write it a bit more informally =)

It is not a secret that Diffusion models have become the workhorses of high-dimensional generation: start with Gaussian noise and, through a learned denoising trajectory, you get high-fidelity images, molecular graphs, or robot trajectories that look (uncannily) real. I recently wrote extensively about diffusion and its connection with the data manifold metric tensor as well, so if you are interested please take a look at it.

Now, for many engineering and practical tasks we care less about “looking real” and more about maximising a task-specific score: a reward from a simulator, a chemistry docking metric, a CLIP consistency score, human preference, etc. Even though we can use guidance and do constrained sampling from the model, that usually requires differentiable objectives. Evolution-style search methods (CEM, CMA-ES, etc.), however, can shine in that regime, but naively applying them in the raw object space wastes most samples on absurd or invalid candidates and takes a long time to converge to a reasonable solution.

I have been experimenting on some personal projects with something that we can call “Diffusion Elites”, which aims to close this gap by letting a pre-trained diffusion model provide the prior and letting an adapted Cross-Entropy Method (CEM) steer the search inside its latent space instead. I found that this works quite well for many domains, and it is also an impressively flexible method with so much to explore (I will talk about some cases later).

The method is as simple as the following:

  1. Draw a population of latent vectors from a Gaussian
  2. Run one full denoise pass to turn each latent into a structured object
  3. Score every object with any reward function
  4. Keep the top-K “elite” latents, refit a new Gaussian to them, and iterate

This four-step loop inherits the robust, gradient-free advantages of evolutionary search while guaranteeing that every candidate lives on the diffusion model’s data manifold. In practice that means fewer wasted evaluations, faster convergence, and dramatically higher-quality solutions for many tasks (e.g. planning, design, etc.).
In the rest of this post I will unpack Diffusion Elites in detail, from the algorithm’s theoretical roots to hands-on code and examples. Diffusion Elites shows that you can explore a diffusion model and turn it into a powerful black-box optimizer.

Diffusion Elites

Algorithm

The algorithm of Diffusion Elites is described below:

Algorithm 1 — Diffusion Elites (DE)
Inputs
\(f:\mathbb{R}^d\!\to\!\mathcal{X}\) — fixed decoder
\(R:\mathcal{X}\!\to\!\mathbb{R}\) — reward function
\(N\in\mathbb{N}\) — population size
\(\rho\in(0,1)\) — elite fraction
\(T\in\mathbb{N}\) — iterations
\(\varepsilon>0\) — jitter
Initialise
\(\mu_0=\mathbf{0}_d,\;\Sigma_0=I_d\)
for \(t=0,\dots,T-1\) do
Sample \(z_i^{(t)}\sim\mathcal{N}(\mu_t,\Sigma_t)\) for \(i=1,\dots,N\)
Decode \(x_i=f(z_i^{(t)})\)
Score \(r_i=R(x_i)\)
Select \(\gamma_t=\operatorname{quantile}_{1-\rho}\{r_1,\dots,r_N\}\)
\(E_t=\{z_i^{(t)}:r_i\ge\gamma_t\}\)
Update \(\mu_{t+1}=|E_t|^{-1}\!\sum_{z\in E_t}z\)
\(\Sigma_{t+1}=|E_t|^{-1}\!\sum_{z\in E_t}(z-\mu_{t+1})(z-\mu_{t+1})^{\top}+\varepsilon I_d\)
end for
Output
\(x_\text{best}=\arg\max_{i,t}r_i\)

 

It looks complicated, but it is actually quite simple. You start with a standard Gaussian for the latents and sample a population of N samples from it, \(z_i^{(t)}\sim\mathcal{N}(\mu_t,\Sigma_t)\). Then you run your pre-trained Diffusion model, \(x_i=f(z_i^{(t)})\), which gives you a sample from the model given the latent \(z_i^{(t)}\). Next you compute your reward \(r_i=R(x_i)\), which can really be anything and doesn’t need to be differentiable as in other approaches doing guidance. Finally, you select a proportion (e.g. 10%) of the top-K individuals, refit your Gaussian distribution (e.g. using MLE), and keep iterating on that. You can also see a visual diagram of the algorithm below:

If you use a simple diagonal Gaussian, you can fit the algorithm in a few lines of PyTorch code as in the example below, which might help you understand what it looks like:

import torch

shape = (batch_size, sample_size)
mu  = torch.zeros((1, sample_size))
std = torch.full_like(mu, 1.0)
min_std = 0.01
n_iters = 10
elite_frac = 0.1   # keep the top 10% as elites
diffusion = (...)  # A pre-trained Diffusion model

elite_size = max(1, int(batch_size * elite_frac))

for it in range(n_iters):
    # 1. sample a population of initial latents from the current Gaussian
    eps0 = mu + std * torch.randn(shape)

    # 2. decode: full denoising pass from each latent to a structured sample
    samples = diffusion.sample(shape, eps0, steps=60, eta=0.0)

    # 3. reward and ranking (higher reward is better)
    rewards = reward_function(samples)
    elite_idx = torch.topk(rewards, elite_size, largest=True).indices
    elite_eps = eps0[elite_idx]

    # 4. refit the diagonal Gaussian to the elite latents
    mu  = elite_eps.mean(dim=0, keepdim=True)
    var = ((elite_eps - mu) ** 2).mean(dim=0, keepdim=True)
    std = torch.sqrt(var + 1e-6).clamp(min=min_std)

    # decoded elite samples, e.g. for logging or as the final output
    elite_samples = samples[elite_idx]

I haven’t seen this algorithm being used with a Diffusion model before, but I think it is a very interesting and simple algorithm. I saw something similar in Diffusion-ES, which also runs gradient-free search over diffusion samples, but it mutates trajectories by partial noising/denoising at every ES generation, while Diffusion Elites (DE) stays in the initial Gaussian latent and uses a clean CEM (Cross-Entropy Method) loop, giving simpler theory.

Interesting properties

Now, there are a few interesting properties in this algorithm which I will describe below.

Parallelization

Note that the sampling from either the Gaussian or the Diffusion model can be done in parallel: the samples are independent, so you can run your Diffusion model distributed across as many nodes as you want, and you can also take advantage of the largest batch size you can afford. The reward computation is equally easy to distribute and can be computed on distributed nodes too (just like the Diffusion decoding). This makes the method very flexible and amenable to island models, for example, where you evolve populations distributed across islands and can optionally exchange individuals over time. This creates a large distributed search which I think is similar to the approach taken in AlphaEvolve, although they don’t seem to be exchanging individuals across islands.
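
Just to make the parallelism concrete, here is a minimal sketch of distributing the scoring step with Python’s standard library; reward_function is the same black-box placeholder used in the code above, and the chunking and worker count are arbitrary illustrative choices:

from concurrent.futures import ProcessPoolExecutor

import torch

def score_chunk(chunk):
    # each worker scores an independent chunk of decoded samples;
    # reward_function is the same black-box reward used in the main loop
    return reward_function(chunk)

def parallel_rewards(samples, n_workers=8):
    # split the decoded population into chunks and score them in parallel
    chunks = torch.chunk(samples, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(score_chunk, chunks))
    return torch.cat(results)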

Highly flexible reward

Note that there is no differentiability requirement for the reward here; you can think about it as a black box. This allows you to do all sorts of things, a few examples include:

  • Using a world model in the reward, so for instance, you can do a roll-out in the world model and compute rewards or metrics
  • Using any kind of simulator (e.g. physics, fluids, etc) where you will be evaluating your sample afterwards
  • If you are generating images, you can use a VLM or even an object detector to make sure the image contains the elements you are looking for, a kind of structured guidance that goes beyond prompting
  • You can use any LLM to evaluate your sample depending on what your sample is and provide a reward for it
  • You can even use preferences here through a reward model, etc
  • If your reward happens to be differentiable, you can still use guidance together with another non-differentiable reward; there is nothing in the algorithm that would block you from doing this
  • The choices are really endless

And as I mentioned above, you can do all of this in parallel: you can easily distribute the reward function computation, so you can afford very expensive computations there.

Speed-ups

While you are searching during the iteration loop of Diffusion Elites, you don’t have to run too many denoising steps: you are basically ranking samples, and you rarely need a high-quality sample just for the sake of evaluation. You can do only a few DDIM steps, for example, and then run a higher number of steps only for the final samples you are interested in after convergence. By doing this you speed up the inner denoising loop of the Diffusion sampling.
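
As a concrete illustration, reusing the placeholder diffusion.sample(...) interface from the snippet above (the step counts here are arbitrary choices, not tuned values):

# cheap decode inside the CEM loop: just enough quality to rank candidates
samples = diffusion.sample(shape, eps0, steps=8, eta=0.0)

# after convergence, re-decode only the elite latents with many more steps
final_samples = diffusion.sample(elite_eps.shape, elite_eps, steps=100, eta=0.0)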

Expressive but low-dimensional search

Latent spaces of modern LDMs (Latent Diffusion Models) are typically downsampled by a factor of 4-8 per dimension relative to the raw pixel/action space, so working in the latent space is much more efficient than working in the structured sample space.

Stay in the manifold

I always like to think about Diffusion as learning the main building block of the metric tensor for the data manifold, so when we search in the latent space of a Diffusion model and decode through it, we are staying within the realm of the data manifold.

Flexible modelling of the latent distribution

You obviously don’t need to use only a Gaussian for re-fitting the latent distribution: you can start with a standard Gaussian and later refit a mixture, use ensembles, or whatever other option you might want to consider (there are a lot of papers with many interesting choices for it) that can explore the latent space better, at the trade-off of perhaps needing more iteration steps to converge.
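
As a rough sketch of what that could look like, here the elite latents are refit with a small Gaussian mixture using scikit-learn instead of a single diagonal Gaussian; the number of components and the choice of scikit-learn are my own illustrative choices, not something the algorithm requires:

import torch
from sklearn.mixture import GaussianMixture

# refit a 3-component diagonal mixture on the elite latents
gmm = GaussianMixture(n_components=3, covariance_type="diag", reg_covar=1e-6)
gmm.fit(elite_eps.reshape(elite_eps.shape[0], -1).cpu().numpy())

# the next population is sampled from the mixture instead of a single Gaussian
eps0_np, _ = gmm.sample(n_samples=batch_size)
eps0 = torch.from_numpy(eps0_np).float().reshape(shape)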

Experiments and visualizations

To help visualize the search for solutions and the modelling, I’m going to generate a random dataset of trajectories using a simple kinematic model, then train a Diffusion model on it using a 1D CNN UNet with attention, a v-prediction objective, and DDIM sampling, so nothing fancy there, just a regular Diffusion model as also used for audio signals. After training the Diffusion model, I will then use Diffusion Elites to find a trajectory that reaches a goal that I will predefine.

This is what the randomly generated dataset looks like:

Each sample is a smooth arc generated by the planar unicycle model \(\dot x = v\cos\theta,\; \dot y = v\sin\theta,\; \dot\theta = \omega\) with a random heading \(\theta_0\sim\mathcal U[-\pi/8,\pi/8]\); I also draw a constant linear speed \(v\) from a uniform range and a constant curvature \(\kappa\) from a zero-mean Gaussian, giving a fixed yaw-rate \(\omega = v\kappa\). Integrating this system for a short horizon with a small time-step gives us a perfectly smooth straight line when \(\kappa\approx 0\) and left or right circular arcs as \(|\kappa|\) grows. This is just a randomly sampled dataset for experimentation, so I haven’t worried much about it, but you can really use any kind of data here; this is just for the sake of visualization.
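
In case a concrete version helps, here is a small sketch of this kind of generator; the ranges, horizon, and time-step below are arbitrary illustrative choices and not necessarily the ones I used:

import numpy as np

def sample_trajectory(horizon=64, dt=0.05):
    # planar unicycle: x' = v cos(theta), y' = v sin(theta), theta' = omega
    theta = np.random.uniform(-np.pi / 8, np.pi / 8)   # random initial heading
    v     = np.random.uniform(0.2, 1.0)                # constant linear speed
    kappa = np.random.normal(0.0, 1.5)                 # constant curvature
    omega = v * kappa                                   # fixed yaw rate

    x, y = 0.0, 0.0
    traj = []
    for _ in range(horizon):
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += omega * dt
        traj.append((x, y))
    return np.asarray(traj)   # shape (horizon, 2)

dataset = np.stack([sample_trajectory() for _ in range(10_000)])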

Now, let’s come up with a simple task of reaching a goal. We want to set a goal position, for example \(g = (0.4, 0.2)\), that the end of the trajectory should reach. If, however, we try to find a solution using Diffusion inpainting (just fixing the beginning and the end of the trajectory towards the goal), we will see something like below:

This is not great, as there are a lot of breaks in the trajectory and it looks completely off the manifold.
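
Before showing the Diffusion Elites result, the reward for this task can be as simple as the negative distance between the endpoint of each decoded trajectory and the goal; here is a minimal sketch of such a reward, assuming decoded samples of shape (batch, horizon, 2) as in the trajectories above:

import torch

goal = torch.tensor([0.4, 0.2])

def reward_function(samples):
    # samples: (batch, horizon, 2) decoded trajectories
    # reward: negative Euclidean distance from each trajectory endpoint to the goal
    endpoints = samples[:, -1, :]
    return -torch.linalg.norm(endpoints - goal, dim=-1)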

If we use Diffusion Elites though, this is what you will see at every iteration:

 

In the video above, you are seeing the entire population in gray and the elites coloured according to their distance to the goal. There is a lot to see here. Note that in the first iteration the trajectories are all over the place, which makes sense because that is what we trained the model with; but the fraction of samples (the top-K trajectories) that we select (I’m using only a 10% fraction) narrows more and more towards a set of solutions that are very close to the goal we wanted to reach. Note how quickly it converges as well: in a few steps we can already see a lot of solutions reaching the goal.

When you consider that you can use any sort of (non-differentiable) reward to guide the search and that you can parallelize the evaluation of each sample, you start to see how interesting this looks. There is a lot to explore for sure, and I’m planning to try it on other domains as well. The nice thing about Diffusion Elites is that you can rely on more than a decade of convergence theory and improvements on top of the cross-entropy method, and the best part is that you can implement it in a few lines of Python. I’m planning to start an open-source repository with the barebones for using it with any Diffusion model in a more structured way.

Hope you enjoyed it!

– Christian S. Perone

Cite this article as: Christian S. Perone, "Diffusion Elites: surprisingly good, simple and embarrassingly parallel," in Terra Incognita, 09/07/2025, https://blog.christianperone.com/2025/07/diffusion-elites/.
Machine Learning

TorchStation Prototype V1 – GPUs panel

I finally had some time over the holidays to complete the first panel of the TorchStation. The core idea is to have a monitor box that sits on your desk and tracks distributed model training. The panel shown below is a prototype for displaying GPU usage and memory. I’ll continue to post updates as I add more components. The main challenge with this board was power: the LED bars alone drew around 1.2A (with everything lit at full brightness), so I had to use an external power supply and share a common ground with the MCU; for the panel itself I used 3mm matte PLA. Wiring was the worst part: this panel alone required at least 32 wires, but the panel will hide them quite well. I’m planning to support up to 8 GPUs per node, which aligns with the typical maximum found in many cloud instances. Here is the video, which was quite tricky to capture because the camera’s exposure metering kept changing due to the LEDs (the video doesn’t do justice to how cool these LEDs are, they are very bright and clear even in daylight):

For the interface I’m using an Arduino Mega (which uses the ATmega2560) and Grove ports to make it easier to connect all of this, but I had to remove all the VCCs from the ports to be able to supply power from an external power supply; in the end it looks like the diagram below:

                 ┌────────── PC USB (5V, ≤500 mA)
                 │
            +5 V │
                 ▼
     ┌─────────────────┐
     │  Arduino Mega   │  Data pins D2…D13, D66…D71 → LED bars
     └─────────────────┘
                ▲  GND (common reference)
                │
   ┌────────────┴──────────────┐
   │ 5V, ≥3 A switching PSU    │  ← external PSU
   └───────┬───────────┬───────┘
           │           │
           │ +5V       │ GND
           ▼           ▼
┌─────────────────────────────────┐
│ Grove Ports (VCC rail)          │ <– external 5V injected here
│ 8 × LED Bar cables              │
└─────────────────────────────────┘
Machine Learning, Programming

VectorVFS: your filesystem as a vector database

PS: thanks for all the interest, here are some discussions about VectorVFS as well:
Hacker News: discussion thread
Reddit: discussion thread

When I released EuclidesDB in 2018, which was the first modern vector database before Milvus, Pinecone, etc., I still ended up missing a simple piece of software for retrieval that could be local, easy to use, and not require a daemon or any other sort of server or external indexes. After quite some time trying to figure out what would be the best way to store embeddings in the filesystem, I ended up with the design of VectorVFS.


The main goal of VectorVFS is to store the data embeddings (vectors) in the filesystem itself without requiring an external database. We don’t want to change the file contents either, and we also don’t want to create extra loose files in the filesystem. What we want is to store the embeddings on the files themselves without changing their data. How can we accomplish that?

It turns out that all major Linux file systems (e.g. Ext4, Btrfs, ZFS, and XFS) support a feature called extended attributes (also known as xattr). This metadata is stored in the inode itself (in a reserved space at the end of each inode) and not in the data blocks (depending on the size). Some file systems might impose limits on the attributes; Ext4, for example, requires them to fit within a file system block (e.g. 4 KB).

That is exactly what VectorVFS does: it embeds files (right now only images) using an encoder (the Perception Encoder for the moment) and then stores this embedding in the extended attributes of the file in the filesystem, so we don’t need to create any other file for the metadata. The embedding is also automatically linked directly to the file that was embedded, and there is no risk of the embedding being copied by mistake. It seems almost like a perfect solution for storing embeddings and doing retrieval, one that has been there in many filesystems but was never explored for that purpose before.
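
Just to make the mechanism concrete, this is roughly what storing and reading an embedding via extended attributes looks like with Python’s os module on Linux; the attribute name and the use of NumPy here are my own illustrative choices, not necessarily what VectorVFS does internally:

import os
import numpy as np

path = "photo.jpg"
embedding = np.random.randn(512).astype(np.float32)  # stand-in for a real encoder output

# store the embedding bytes in a user xattr, kept on the file's inode
# (512 float32 values = 2 KB, within Ext4's one-block limit)
os.setxattr(path, "user.demo.embedding", embedding.tobytes())

# later, read it back without ever touching the file's data blocks
restored = np.frombuffer(os.getxattr(path, "user.demo.embedding"), dtype=np.float32)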

If you are interested, here is the documentation on how to install and use it; contributions are welcome!

 

Article, Machine Learning, Philosophy

Notes on Gilbert Simondon’s “On the Mode of Existence of Technical Objects” and Artificial Intelligence

Happy new year! This is the first post of 2025, and this time it is not a technical article (but it is about philosophy of technology 😄)

Gilbert Simondon (1924-1989). Photo by LeMonde.

This is a short opinion article to share some notes on the book by the French philosopher Gilbert Simondon called “On the Mode of Existence of Technical Objects” (Du mode d’existence des objets techniques) from 1958. Despite his significant contributions, Simondon still (and incredibly) remains relatively unknown, and it seems to me that this is partly due to the delayed translation of his works. I realized recently that his philosophy of technology aligns very well with an actionable understanding of AI/ML. His insights illuminated a lot for me on how we should approach modern technology and what cultural and societal changes are needed to view AI as an evolving entity that can be harmonised with human needs. This perspective offers an alternative to the current cultural polarization between technophilia and technophobia, which often leads to alienation and misoneism. I think that this work from 1958 provides more enlightening and actionable insights than many contemporary discussions of AI and machine learning, which often prioritise media attention over public education. Simondon’s book is very dense and it was very difficult to read (I found it more difficult than Heidegger’s work on philosophy of technology), so in my quest to simplify it, I might be guilty of simplism in some cases.
