How fast is rust? Simulating 200,000,000 particles

The Challenge. Simulate 100m particles in rust with an Apple m1 on battery.

A while back I set out to simulate as many particles as possible in pure javascript. I got 20 million at somewhat real-time speed. Months have gone by but one question always sat at the back of my mind, day and night, taunting my little engineer ego.

How much faster would rust or another systems language be?

I am no systems language expert. As a matter of fact I have made it a point to avoid system languages. They're scary. All those types and weird keyboard symbols. What the heck does having an &, *, or *&***| by a variable even mean? I know I don't know.

Still, I think now is the time to take a neophyte's stab at the rust monster and simulate buckets and buckets of particles, as many as possible. At least 100m.

My expectations are high. Everyone tells me I should get a 10x performance bump because compiled languages are just so darn fast, blazingly fast even. Right? Right.

There is only one way to find out though.

tangent cam - why i love javascript and you should too

One of the reasons I stick to js land so often is because of how portable it is. If it runs in the browser, it will run everywhere a browser does which is well...just about everywhere. All an end user needs to do is type in a url. No 500 megabyte app. No platform tax. No developer fees. No installs.

I love it.

Distribution is powerful.

System language land is not as sharable. Like at all. Yes, it can compile to almost anything but those roads are usually covered with miles of * between the source code and a user's target platform. Not everyone is an engineer. Not everyone has a desktop.

This means that most of the examples will not run on your phone. In theory they could, but there is no easy way to download a native app without first genuflecting to Apple or Google. Since multi-threading is a must, web is also off the table as wasm multi-threading is not there yet.

If you would like the code to compile and run for yourself, here is a gist of it all.

Javascript is absolutely stupidly trivial to distribute. You cannot go wrong. Feel free to complain about web standards and browser implementation quirks but I'll take them any day if comes with the ease of distribution the web offers.

That said, we are rust now and javascript is an evil sick blight on this earth begging to be replaced by a real programming language. And I for one curse all those who peddle in such rubbish.

Javascript. Eww, sick, gross.

Time to simulate particles...the rust way.

The First Step

To keep things somewhat of an apples-to-apples comparison the simulation will remain largely unchanged from the js land version.

There are a bunch of particles with an x, y, dx, and dy. You will be able to tap and swipe around to pull particles using some kind of gravity approximation. These particles will then be drawn to the screen as single pixels using the cpu. As much work as possible must remain on the cpu to keep in the spirit of the challenge.

I know nothing about rust. Like literally nothing. The internet tells me that the best rust engineers have blue hair and anime pfps. Since I have neither blue hair nor an anime pfp I think it prudent to start with a single-threaded simulation. But before that, how exactly does one draw pixels to the screen in rust?

windows all the way down

System languages usually do not come with much out of the box. I want to create a window, draw pixels to it, and handle some input. Pretty basic if you ask me. Sadly, without some kind of library, this becomes a pain to handle yourself. This is in large part because each platform has its own * on managing a window and input. I don't want to learn all that. I want to paint some pixels.

Luckily, there is minifb which will provide cross-platform bindings to let me paint pixels and handle input without issue.

It is worth making a mental note here. There is a chance that this library has some gotchas with how it draws pixels or handles input. There are always gotchas.

I already ran into one where I wanted to have the current fps on the title of the window. I updated the title once a second but any time it updated, it would cause the simulation to stutter. No idea why. Very surprised setting the title is so performance-heavy. I am sure it is a skill issue because 10 years ago java had no such issue setting window titles without stuttering.

rendering

I am going to avoid graphics land as much as possible. This isn't about how fast the GPU is but how fast the cpu is. The idea should be simple. Build a framebuffer of pixels based on a simulation and display it to the user.

This seems pretty easy.

let mut buffer: Vec<u32> = vec![0; WIDTH * HEIGHT];
let mut window = Window::new("Safu", WIDTH, HEIGHT, WindowOptions::default())
  .unwrap_or_else(|e| {
      panic!("{}", e);
  });

First, create a big buffer for pixel data based on window width/height and then create the window itself.

I have no idea what the unwrap_or_else is. I think it is a way of handling errors which for me is to panic.

It seems like rust needed to differentiate itself so they use | instead of () for closure params. Why the pipe? I am sure it is for better compression because a pipe is a single character used for both opening and closing rather than the traditional () which is two different characters. Every little bit counts. 10 points to Slytherin because rust is certainly house Slytherin.

while window.is_open() && !window.is_key_down(Key::Escape) {
    // simulation code

    buffer.iter_mut().for_each(|pixel| *pixel = 0);
    // render code
    window.update_with_buffer(&buffer, WIDTH, HEIGHT).unwrap();
}

While the window is open, simulate, clear the buffer, draw particles, and finally blit to the screen.

For those like myself wondering what that unwrap is about. As far as I can tell, you can return optional errors and values as a single type. The unwrap is a way to pull out any values and errors. Since I don't care about errors, I am not going to handle an error value and will let it panic if something goes wrong.

memory layout

To be fast you need to write in rust AND keep data as tightly packed as possible and as close to the cpu as possible. This is because cpu's can crunch numbers much faster if the next computation's data is waiting in the cpu's cache or even better, in another register.

And thank fucking god rust supports structs with real primitive data types.

struct Particle {
    x: f32,
    y: f32,
    dx: f32,
    dy: f32,
}

Creating a bunch of these bad girls looks like this.

fn generate_particles(count: usize, width: usize, height: usize) -> Vec<Particle> {
    let mut rng = rand::thread_rng();
    let mut particles = Vec::with_capacity(count);

    for _ in 0..count {
        particles.push(Particle {
            x: rng.gen_range(0.0..width as f32),
            y: rng.gen_range(0.0..height as f32),
            dx: rng.gen_range(-1.0..1.0),
            dy: rng.gen_range(-1.0..1.0),
        });
    }

    particles
}

I get why usize and isize exist but I am still annoyed by it.

the meat and potatoes

Cool, now time to port over the simulation.

for particle in particles.iter_mut() {
    if mouse_pressed {
        let dx = mouse_pos.0 - particle.x;
        let dy = mouse_pos.1 - particle.y;
        let distance = (dx * dx + dy * dy).sqrt();

        if distance > 0.2 {
            let inv_gravity = GRAVITY_STRENGTH / distance;
            particle.dx += dx * inv_gravity * 3.0;
            particle.dy += dy * inv_gravity * 3.0;
        }
    }

    let friction = friction_per_second.powf(delta_time);
    particle.dx *= friction;
    particle.dy *= friction;
    particle.x += particle.dx * delta_time;
    particle.y += particle.dy * delta_time;
}

Simple. If mouse is down, add some gravity then apply friction and update the position.

And the render loop.

for particle in &particles {
    let x = particle.x as usize;
    let y = particle.y as usize;
    if x < WIDTH && y < HEIGHT && x >= 0 && y >= 0 {
        let red = (particle.x / WIDTH as f32 * 255.0 * 0.8) as u32;
        let green = (particle.y / HEIGHT as f32 * 255.0 * 0.8) as u32;
        let blue = (255.0*0.6) as u32;
        let color = (red << 16) | (green << 8) | blue;
        buffer[y * WIDTH + x] = color;
    }
}

This sets a fixed color where color is based on the particle position just like the js land version.

And it works. Here are 5m particles.

I am not a fan of what it feels like to pull particles around. It feels....mushy.

What about performance?

Running in release mode with level 3 opt mode at 1 million particles is great with 240 fps all on a single Apple m1 core. There are some dips down to the 160s sometimes which seems odd but it always bounces back up. Fps is not the best metric to use though and I think it is worth learning how to profile with rust. Also, up to this point, I have been writing in basically notepad so maybe I should also add a little bit of tooling too.

Before that, time to crank up the particle count. How about 10m? Idling it is at 52 fps. Awesome. That is a great sign. However, when I apply gravity it can drop down to low 30s. If I make the particle bounce using a few if statements, it drops another 5 fps. This is amazing because even before busting out the profiler we know there is room for potentially far more. Assuming this is spread to multiple cores, 40m is already on the horizon as the Apple m1 has 4 performance cores.

One issue though is that the simulation broke down a bit. It got super super slow when the delta time should have smoothed things out to still remain interactive.

It turns out I was missing applying delta time here.

if distance > 0.2 {
    let inv_gravity = GRAVITY_STRENGTH / distance;
    particle.dx += dx * inv_gravity * 3.0 * delta_time; // fixed
    particle.dy += dy * inv_gravity * 3.0 * delta_time; // fixed
}

Scaling correctly using the delta time makes the simulation much nicer at lower frame rates. However, now it performs much worse. Like 15fps at 10m particles worse. What is going on here? This question will be answered later on but to foreshadow a bit something something simd something something the compiler didn't like that.

Interestingly, setting a new variable called dt right before the simulation loop "fixed" the issue and it once again performs as it did before. This is pretty odd to me but I don't question my betters.

let now = Instant::now();
let delta_time = now.duration_since(last_frame_time).as_secs_f32();
last_frame_time = now;

if let Some((mx, my)) = window.get_mouse_pos(minifb::MouseMode::Clamp) {
    mouse_pos = (mx, my);
}

mouse_pressed = window.get_mouse_down(MouseButton::Left);

buffer.iter_mut().for_each(|pixel| *pixel = 0);
let dt = delta_time; // why am I fast but my twin above is slow? 
for particle in particles.iter_mut() {
    //sim code
}

rust profiling

How do I profile a rust app? I have no idea. I know that in js land it is stupid stupid easy to profile code. Like mind blowingly easy. That is the bar rust needs to hit or more so I want it to hit that bar. I want it to be stupid easy.

First I find perf a linux tool. Hold up, there isn't a rust blessed standard tool? Instruments I guess is the one to use on a mac but my god did they really suggest something from xcode? I tried using Instruments and as I would expect of any dev tool from Apple it crashed my computer. Like the mac version of a bsod crash not only the app. I know I am likely doing something wrong but I still find it hilarious. I don't miss iOS dev.

There is a nice tool that was pretty easy to spin up called flamegraph. It wasn't as easy as js land but not too far off. Sadly, it only provides function timing as a...flamegraph. I know the main function takes up all the time. I can refactor the code to have functions which do most of the work but that isn't what I want. I would like to know line by line what is taking up time.

For now, I am going to have to roll a bit of my own here. I will add some code to keep track of the last 1k or so frame times and how long the drawing and simulation take respectively. This isn't too hard to add.

let mut simulation_times = vec![];
let mut drawing_times = vec![];
...
mainLoop() {
    let sim_start = Instant::now();
    //sim
    simulation_times.push(sim_start.elapsed().as_secs_f32());
    if simulation_times.len() > MAX_PERF_SAMPLE_FRAMES {
        simulation_times.remove(0);
    }

    let draw_start = Instant::now();
    // draw stuff
    drawing_times.push(draw_start.elapsed().as_secs_f32());
    if drawing_times.len() > MAX_PERF_SAMPLE_FRAMES {
        drawing_times.remove(0);
    }

    // once a second print
    let avg_simulation_time: f32 = simulation_times.iter().sum::<f32>() / simulation_times.len() as f32;
    let avg_drawing_time: f32 = drawing_times.iter().sum::<f32>() / drawing_times.len() as f32;
    println!("Avg Simulation Time: {:.5}", avg_simulation_time*1000.0);
    println!("Avg Drawing Time: {:.5}", avg_drawing_time*1000.0);
}

This shows that 5m particles idle is about 5.6ms on the sim and 8ms rendering. Adding gravity brings the sim to 6.5ms. If I make it so particles combine their color rather than being fully lit this slows rendering down to 15ms. This is all on a single thread which means that moving the simulation to other threads should free up more budget but just like in js land, rendering is going to be the tricky optimization.

I did dig a bit into getting a real profiler setup and I couldn't get it to work. I found docs about profiling the rust compiler which were detailed but also not what I am looking for.

I know doing something more robust with like p99 frametimes etc would be cool but good enough tm wins the day more often than not with me.

rust memory

I wanted to try creating a big'ol array of bytes and store particle data that way rather than a struct to see if that makes any difference. Rust by default stores stuff on the stack and it took me way too long to figure out how to allocate a giant chunk of memory on the heap. Like, I guess I can box things but even then if you try and gobble up too much memory rust will stack overflow.

I guess this fixes it.

let mut particles: Box<[MaybeUninit<f32>]> = Box::new_uninit_slice(PARTICLE_COUNT*4);
for i in 0..(PARTICLE_COUNT * 4) {
    particles[i].write(0.0);
}
let mut particles: Box<[f32]> = unsafe { std::mem::transmute(particles) };

This is great but like why do I have to do this? Why can I not do this

let mut particles: Box<[f32]> = Box::new([0..PARTICLE_COUNT*4]);

And it turns out I can or pretty close to it.

let mut particles: Box<[f32]> = vec![0.0; PARTICLE_COUNT * 4].into_boxed_slice();

My guess is vec handles the memory itself and who knew that into_boxed_slice is like exactly what I wanted.

After coming off a python grinder I find rust to feel very "pythonic". Both have that flow where something obnoxiously confusing is actually completely obvious if one only knew the standard library from memory. I understand the power of going heavy into iterators but I prefer loops and directly indexing into an array. It feels cozy to me and looks the same across most languages, bounds checking be damned.

Ok, time to multi-thread.

The Second Step

Naturally, rust should be trivial to multi-thread right? right?!?!?!?

First, I tried the obvious answer, rayon a library for parallel iterators. And it worked but it was not using all the threads and I saw zero performance change. It should have been as easy as replacing the iter_mut with a par_iter_mut. And while it did compile, it was clearly not running in parallel. I could have done something wrong. I don't know.

To confirm if rayon was the issue, I got gyptiy to spit out a multi-threaded..."something" that I got to compile and while I am in way beyond my rust skill there are many things that look way off. The code spawns threads every frame and creates alot of large buffers in said threads every frame. It does run measurably faster but not as fast as it should be. I honestly have no idea what is going on in the gypity code. There are barriers, mutexes, Arcs, channels, I even had to manually add an unsafe in there too to get it to compile.

I know some of those words and even though I read the docs on each one, I still have no idea what is going on. There are wayyyyy too many constructs here. This seems so silly. I suppose if I learned tokio or rayon properly they would make this easier. I think I have been spoiled with how easy js land is so it may be worth sticking to plain rust as a learning experience.

To give the rust monster its due I only had a segfault when I went unsafe and forgot to consume the messages being passed from worker threads in the main thread. Pretty nice I suppose but a far cry from fearless because as far as I can tell there is no way to avoid the unsafeness of what I want to do and it seems rayon and friends are unsafe wrappers?

I digress.

I refactored the gyptiy code a bit removing what I don't think I needed. The idea is to use a mpsc channel rust provides as I think it is the closest match for what I am doing. Multiple producers (workers) with a single consumer (main thread).

let (tx, rx) = mpsc::channel();
let chunk_size = PARTICLE_COUNT / THREAD_COUNT;
let particles_ptr = particles.as_mut_ptr();

for t in 0..THREAD_COUNT {
    let tx = tx.clone();
    let mouse_pos = mouse_pos;
    let mut buffer = vec![0 as u8; WIDTH * HEIGHT];

    let start = t * chunk_size * PARTICLE_STRIDE;
    let end = if t == THREAD_COUNT - 1 {
        PARTICLE_COUNT * PARTICLE_STRIDE
    } else {
        (t + 1) * chunk_size * PARTICLE_STRIDE
    };

    unsafe {
        // get a slice based on pointer to particle data and chunk size.
        let particles_chunk = std::slice::from_raw_parts_mut(particles_ptr.add(start), end - start);

        thread::spawn(move || {
            for particle in particles_chunk.chunks_mut(PARTICLE_STRIDE) {
                // sim code
            }

            tx.send(buffer).unwrap();
        });
    }
}

drop(tx);
p_count_buffer.iter_mut().for_each(|c| *c = 0);
for local_buffer in rx {
    for (i, &count) in local_buffer.iter().enumerate() {
        p_count_buffer[i] = p_count_buffer[i].saturating_add(count);
    }
}

Pretty simple, I think? Create some threads, give them a chunk to work on, and wait until they are all done. From what I can tell, when you iterate over the buffers in rx that causes the main thread to block until all the threads are done. One change I made was to only store particle counts since it saves memory. The unsafe bit is basically just about grabbing slices of particle data based on the thread number and a chunk size as you cannot borrow the same thing multiple times.

Still, I think what I WANT to do is create a channel with some threads once before the simulation starts and then send messages to them to simulate a chunk of particles. The main thread can wait for them to finish before building and pushing a new frame to the screen.

It may be worth double buffering to avoid blocking on the worker threads but it seems like it takes only a few ms to render the frames. This is a massive improvement from the js land version which I am happy to see. Js land was about 10x slower when merging the local particle counts together.

Speaking of rendering, this is the current code.

for (i, &count) in p_count_buffer.iter().enumerate() {
    let x = i % WIDTH;
    let y = i / WIDTH;
    let r = (x as f32 / WIDTH as f32 * 12.0 * count as f32) as u8;
    let g = (y as f32 / HEIGHT as f32 * 12.0 * count as f32) as u8;
    let b = (10.0 * count as f32) as u8;

    buffer[i] = ((r as u32) << 16) | ((g as u32) << 8) | (b as u32);
}

Pull out the x and y locations based on index and then multiply the color channels by the count. If it is zero, it will be set to 0 for all channels which is black. There is no need to zero out the frame buffer since it will always be overwritten based on the particle count.

Here are 20m particles.

This is a start. In the next version, I am going to clean up the code more and prep it for some possible simd in the simulation. From my reading on the subject, it seems it will help to take another look at how the simulation data is stored.

The third step

It took a bit but I managed to slap together a simple thread pool. I don't understand why it works but the compiler eventually stopped complaining.

Pretty basic concept though.

pub struct ThreadPool {
    workers: Vec<Worker>,
    sender: mpsc::Sender<Message>,
    job_count: Arc<(Mutex<usize>, Condvar)>,
}

Have a bunch of workers and a count of how many outstanding jobs there are. I spent most of the time trying to figure out how to make the main thread wait for all the workers to complete. I don't care about any return results because I am living in an unsafe world.

pub fn wait_for_completion(&self) {
    let (lock, cvar) = &*self.job_count;
    let mut count = lock.lock().unwrap();
    while *count > 0 {
        count = cvar.wait(count).unwrap();
    }
}

I guess there is a condvar which is a better way to block a thread for an event? I am not sure but this works. The amount of "boxing" things in Arc and mutex I did is nasty. The typing is also a monster and I wasn't even going for generics. I know there is a better way but I don't know that way. Not yet at least.

Using it is easy. Pass a function in with ye'old move command to capture needed scope and then wait.

let pool = ThreadPool::new(THREAD_COUNT);
// other code

pool.execute(move || {
    // sim code
});

pool.wait_for_completion();

//render code

That is pretty damn simple. If I got return data working, I think it would be easier than js land. Fearless concurrency indeed.

I extracted the buffer that was being created on every frame per thread into a shared buffer which is created only once. Access is shared across other threads like the particle data.

let mut shared_pos_buffer = vec![0.0; PARTICLE_COUNT * PARTICLE_STRIDE].into_boxed_slice();
let mut shared_vel_buffer = vec![0.0; PARTICLE_COUNT * PARTICLE_STRIDE].into_boxed_slice();
// a flat map of particle grid counts per thread.
let mut shared_p_count_buffer = vec![0 as u8; WIDTH * HEIGHT * THREAD_COUNT].into_boxed_slice();

shared_p_count_buffer.iter_mut().for_each(|c| *c = 0);  
for t in 0..THREAD_COUNT {
    
    // calc start and end index Kinda ugly. 
    let p_data_start = t * chunk_size * PARTICLE_STRIDE;
    let p_data_end = if t == THREAD_COUNT - 1 {
        PARTICLE_COUNT * PARTICLE_STRIDE
    } else {
        (t + 1) * chunk_size * PARTICLE_STRIDE
    };
    let p_count_start = t * WIDTH * HEIGHT;
    let p_count_end = if t == THREAD_COUNT - 1 {
        WIDTH * HEIGHT
    } else {
        (t + 1) * WIDTH * HEIGHT
    };

    unsafe {
        let pos_chunk = std::slice::from_raw_parts_mut(shared_pos_buff_ptr.add(p_data_start), p_data_end - p_data_start);
        let vel_chunk = std::slice::from_raw_parts_mut(shared_vel_buff_ptr.add(p_data_start), p_data_end - p_data_start);
        let p_count_chunk = std::slice::from_raw_parts_mut(shared_p_count_buff_ptr.add(p_count_start), p_count_end - p_count_start);
        let friction = friction_per_second.powf(delta_time);
        pool.execute(move || {
            for (pos, vel) in pos_chunk.chunks_mut(2).zip(vel_chunk.chunks_mut(2)) {
                // sim the world
            }
        });
    }
}

I also broke up the particle data into arrays as the internet told me this was the right way to prep for simd. The particle count grid has a grid per thread which is a stupid amount of memory but prevents locks.

The heavy focus on iterators in rust, much like python, lends itself to some pretty interesting code. For example, I never would have guessed there was a method called zip which iterates over two other iterators simultaneously. I would have used array indexing and called it good. My understanding is that the iterator style makes it easier for compiler optimizations. I find it far far more difficult to read unless you know what all the magic iterator utils do in which case it is lovely. All these changes had only a small impact on performance.

Here are 40m particles with non-additive color blending.

Simulating 100m particles spends 2.3ms drawing and 70ms simulating. Adding or removing computations in the simulation doesn't make much of a change in frame times which makes me think it isn't a number crunching problem. Perhaps it is a cache issue due to how data is accessed in the particle count buffer. This was a similar problem in the js land version I wrote previously.

If I remove writing to the particle count buffer in the workers, then it should mean less memory is read/written to in the cache per iteration which should give a performance boost, at least in theory. In practice, it only shaved 8ms off the simulation time. In js land this made a much larger difference.

I even tried going full crazy and letting threads read and write to a single particle count grid fully unsafe. Embrace the undefined behavior if it means performance gains right? Sadly, it did not mean performance gains. It ran almost as fast as before with some crazy flashing. Which perhaps makes sense as each thread is still reading and writing from the same amount of memory only some of that memory is now very spicy.

Instruments would surely provide ample illumination to what specifically is the bottle-neck but as it bsods me I am going to have to think outside the box. And by that I mean windows.

How well does this perform natively compiled on my desktop AMD cpu? I assumed it would be much much faster as there should be a decent chance rust simds something, anything really. It is bad. Like really bad. And yes before anyone says it, I compiled a release build.

How bad? It is slower than my js land version bad. And while that gives for a great title "js faster than rust" clearly that is not the full story What gives?

Doing a little bit of digging and it seems the underlying issue is the same one my earlier version written in js land had. Both of them cache thrash when writing to the particle grid count and they are memory-limited due to the low L1 cache size of my AMD CPU. I confirmed this by first removing writing to the particle grid count which gave a 3x boost in performance. Then I removed writing to the velocity buffer which shaved another 10ms off.

The m1 chip has a very very large cache which is why it isn't as impacted by the unordered writes to the particle grid count. The large cache size also helps with memory throughput since it can keep more data closer to the cpu. This is also the case in the js land version. It makes sense that it shows up again in a rust version and should in theory show up in any other language as the constraint is pretty far down the stack.

One interesting note is that my desktop fps was rock steady where as the Apple m1 could bounce up and down by as much as 50% of the average. I think this is because of the efficiency cores in the m1. As far as I know, I cannot choose which core to run a thread on and since I break up the work to be equal among all threads I end up being as slow as the slowest core. If I used smaller chunks and then distributed those to the worker pool, that may fix the issue.

I don't think there is much more on the table as it stands. There is only one option left, simd.

The fourth step

It has been over a decade since I wrote a line of assembly and I plan keep it that way. Nothing personal assembly but you are gross to look at. Luckily, from all the blogs, docs, and googles I have read, it shouldn't be that bad.

What is simd? It makes the cpu go brrr. I kid. simd is basically gpu programming for a cpu but without any fancy shader language so it can kind of look a bit like assembly. From what I can tell, there is still no real standardization which is likely why it is not more common. This means for each target arch you want to support, you must use their specific instruction set.

The first hurdle is that simd is not stable in rust or maybe it is. or isn't? I don't think it is? It is all very confusing to me. But it looks like you gotta go rustup nightly to compile anything. I am already unsafe so I think nightly fits right in.

The next hurdle is that there are like 3 maybe 4 competing simd toolkits? I think? There are some crates which try to give you generic building blocks but everything I read gives me that "unstable" feeling. That edgy vibe. The feeling that if something breaks or doesn't work, there is a non-zero chance you have done everything correctly and shit is just broken.

There be dragons.

Before I get into the conversion process, note that I am skeptical I will see any real gain here. My understanding is that this would improve the computation time or crunch numbers faster. It does not change bandwidth. Given bandwidth looks to be the largest constraint, this is unlikely to give a big boost. But as they say, there is only one way to find out...

SIMDeez

First, I would highly recommend this article. It is a nice write up on adding simd for path tracing in rust. The downside is that it is pushing a decade old. Ya I know. Time flys. Don't think about it. Just roll with it.

I wish I could give a nice simd breakdown tutorial here. My take away is that it is tedious and not the kind I like. The idea is to load data for multiple calculations into simd registers and then perform math on the data. Then when you are done, you read back out the result.

How do ifs and branches work you ask? Good question. There are specific instructions used for certain operations like greater than, less than, etc. The result is stored in something janky called a mask. I don't really get the details but you can ask the mask result for the larger of two values back.

This kinda gives you branching flow but is awkward to use.

There is a standardized process to it all which many compilers are often able to perform themselves. I wouldn't be surprised if both the earlier rust code and the js land version were getting some kind of simd action. Already they both perform far better than I would expect.

One other nasty little hicup is that your data may not nicely line up with whatever the bit-width is of the simd registers for the target arch. This means you will either need to pad if there is not enough data to fill a simd register or simulate the overflow normally without simd.

Man, and people wonder why no one simds their code. What a pain.

Here is what I got to compile using #![feature(portable_simd)] and a nightly build. I "think" portable_simd should handle different archs and chunks_exact_mut has a remainder support but I am going to ignore it for now to keep the code smaller. Assuming I keep the particle count as a multiple of 2 then there should be no remainder anyways. This is great.

let gravity_vec = f32x4::splat(GRAVITY_STRENGTH);
let delta_time_vec = f32x4::splat(delta_time);
let friction_vec = f32x4::splat(friction);
let mouse_pos_vec = f32x4::from_array([mouse_pos.0, mouse_pos.1, mouse_pos.0, mouse_pos.1]);
let threshold_vec = f32x4::splat(2.0);

for (pos, vel) in pos_chunk.chunks_exact_mut(4).zip(vel_chunk.chunks_exact_mut(4)) {
    let mut pos_vec = f32x4::from_slice(pos);
    let mut vel_vec = f32x4::from_slice(vel);

    if mouse_pressed {
        let delta = mouse_pos_vec - pos_vec;
        let distance_sq = delta * delta;
        let distance_vec: f32x4 = (simd_swizzle!(distance_sq, [1, 0, 3, 2]).add(distance_sq)).sqrt();
        let mask = f32x4::simd_gt(distance_vec, threshold_vec);
        let inv_gravity = gravity_vec / distance_vec;
        let force = delta * inv_gravity * delta_time_vec;
        vel_vec = mask.select(vel_vec + force, vel_vec);
    }

    vel_vec *= friction_vec;
    pos_vec += vel_vec * delta_time_vec;

    pos.copy_from_slice(pos_vec.as_array());
    vel.copy_from_slice(vel_vec.as_array());
}

This is doing 2 particle calculations at a time. I could also use f32x8 to do 4 particles at a time. There is a little trick in the swizzling to make sure the right distances show up for the mask. Updating the particle count is left out because of how long it was and awkward it is but it was faster to do in a single loop rather than done after the first. Big O notation be damned. This has almost no difference on performance, at the current scale of 10s of millions of particles. However, when I scale up to truly massive scales, there is a difference.

At 180m particles the multi-threaded version would take 118ms to simulate and 2ms to render at 9 fps. When gravity is applied by tapping around, it drops to 6 fps taking 174ms to simulate and 2ms to render. This means that there is a number crunching wall being hit. It is not as obvious when there are only a few million particles.

The multi-threaded simd version takes 114ms to simulate and 1.75ms to render. When gravity is applied it takes 120ms to simulate and 1.75ms to render. The rendering time on both versions bounces between 1.7-2.2ms but it is more stable on the simd version. It is more smooth now even if there is little change in the max fps. This seems like a memory wall now.

I confirmed this by adding 2 more floats to the particle data to store the original starting position which tanked performance across almost all scales above 10m particles.

Both versions use about 1.58 gigs of ram at 180m particle scale. Make of that what you will.

`godbolt` is god

I think it would be worth spending a little more time looking at what is currently being compiled. The godbolt website is pretty crazy once you figure out how to use it. It will let you select a compiler and config to use and then show you the resulting assembly. This is awesome.

I am using a few crates and godbolt only supports a tiny number so I cannot blindly copy all my code into their online editor. After some tinkering it seems the best flow is to extract out hot path code and compile that alone. In my case it would be to extract out the simulation loop into a function. It is also worth noting you need to have some of your code public because godbolt basically packages everything as a library which is how it shows you the assembly. If not, the compiler will remove all the code because it is all private and you will get a strange message.

I started with the first single-threaded version.

pub struct Particle {
    x: f32,
    y: f32,
    dx: f32,
    dy: f32,
}

pub fn simulate_particles(
    particles: &mut [Particle],
    mouse_pos: (f32, f32),
    mouse_pressed: bool,
    gravity_strength: f32,
    delta_time: f32,
    friction_per_second: f32,
) {
    for particle in particles.iter_mut() {
        if mouse_pressed {
            let dx = mouse_pos.0 - particle.x;
            let dy = mouse_pos.1 - particle.y;
            let distance = (dx * dx + dy * dy).sqrt();

            if distance > 0.2 {
                let inv_gravity = gravity_strength / distance;
                particle.dx += dx * inv_gravity * 3.0;
                particle.dy += dy * inv_gravity * 3.0;
            }
        }

        let friction = friction_per_second.powf(delta_time);
        particle.dx *= friction;
        particle.dy *= friction;
        particle.x += particle.dx * delta_time;
        particle.y += particle.dy * delta_time;
    }
}

The output assembly has movaps vector instructions among many others. Here is an interesting section of the output.

mulps %xmm0, %xmm5
leaq -8(%rbx), %rax
movaps %xmm5, %xmm4
mulps %xmm7, %xmm4
addps %xmm3, %xmm4
movlhps %xmm5, %xmm4
movups %xmm4, -8(%rbx)
addq $16, %rbx
addq $16, %rax

What the astute observer would note is the movups command. As far as I can tell, it is used to align unaligned memory. This shouldn't be happening if our data is properly aligned. With a little reading there is a common and easy fix. Just add #[repr(align(16))] to your struct. The size would depend on what simd instruction set is supported. From my understanding 16 is the standard.

Applying this to the struct.

#[repr(align(16))]
pub struct Particle {
    x: f32,
    y: f32,
    dx: f32,
    dy: f32,
}

yields assembly which no longer has any movups. In practice this made a 5m single-threaded simulation go from 100fps to 116fps. This earlier code didn't have specific ms timings but there was a noticable bump.

This is both exciting because simd is indeed happening but also disappointing because it also means manually adding simd is not always likely to give a huge boost. However, being able to see the assembly is a fun tool in the belt and there is a surprising result in the multi-threaded code's assembly.

First, I need to extract out the simulation logic into a function which can be compiled in godbolt.

fn simulate_particles(
    vel_buff: &mut [f32],
    pos_buff: &mut [f32],
    p_count_buff: &mut [u8],
    mouse_pos: (f32, f32),
    mouse_pressed: bool,
    gravity_strength: f32,
    delta_time: f32,
    friction: f32,
    width: usize,
    height: usize,
) {
    for (pos, vel) in pos_buff.chunks_mut(2).zip(vel_buff.chunks_mut(2)) {
        if mouse_pressed {
            let dx = mouse_pos.0 - pos[0];
            let dy = mouse_pos.1 - pos[1];
            let distance = (dx * dx + dy * dy).sqrt();

            if distance > 2.0 {
                let inv_gravity = gravity_strength / distance;
                vel[0] += dx * inv_gravity * 8.0 * delta_time;
                vel[1] += dy * inv_gravity * 8.0 * delta_time;
            }
        }

        vel[0] *= friction;
        vel[1] *= friction;
        pos[0] += vel[0] * delta_time;
        pos[1] += vel[1] * delta_time;

        let x = pos[0] as usize;
        let y = pos[1] as usize;
        if x >= width || y >= height || x <= 0 || y <= 0 {
            continue;
        }

        let buffer_index = y * width + x;
        let p_count = p_count_buff[buffer_index];
        p_count_buff[buffer_index] = p_count.saturating_add(1);
    }
}

Thatsalota arguments. No matter. Take a look at this section of the output.

movq %rbx, %r8
movl %r9d, %ebx
cvttss2si %xmm10, %r9
movq %r9, %rbp
sarq $63, %rbp
movaps %xmm10, %xmm11
subss %xmm7, %xmm11
cvttss2si %xmm11, %r13
andq %rbp, %r13

There is a movaps in there but there are no addps ops or other common vector opperations. Now, what happens if I comment out the particle grid buffer code like so?

// let x = pos[0] as usize;
// let y = pos[1] as usize;
// if x >= width || y >= height || x <= 0 || y <= 0 {
//     continue;
// }

// let buffer_index = y * width + x;
// let p_count = p_count_buff[buffer_index];
// p_count_buff[buffer_index] = p_count.saturating_add(1);

And indeed there is now a addps in there but only the one and it happens at this line.

vel[0] *= friction;
vel[1] *= friction;
pos[0] += vel[0] * delta_time;
pos[1] += vel[1] * delta_time;

It is getting simd but only that one line. The single-threaded version is able to simd all of the code but only with the alignment being set. I tried pulling out setting the particle grid counts into its own loop but that performed worse than setting it in the loop. This lends again to memory speed being a wall as going over all the particles again seems to be pretty costly even if there is zero work.

I am not sure how to take a look at the output assembly when compiling locally. I know it is possible but this will do for now. It is nice to see that there was some simd action going on and likely why manually doing it didn't yield crazy results.

Having an even somewhat interactive simulation at 180m particles is pretty extreme. At 20m particles rust is about 4-5x as fast as the js land version. The rust version also is more stable and uses maybe 1/4 of the memory. At 100m particles the js land version has a hard time even starting where as the rust version is only about 20% slower than js land is at 20m. Both scaled almost linearly when both the simulation and rendering were multi-threaded. An amazing feat for both js land with v8/JsCore but also rust. It is pretty impressive.

After about 20m particles adding more isn't that interesting. You have to make them super dim or the screen is all white.

Take a look at different scales.

What if the simulation was more...interesting?

The fifth step

The idea was to make a physarum simulation also known as a slime mold simulation. However, I want something you the reader can actually interact with. This meant compiling to wasm which meant dropping multi-threading. This is fine as once I got a simple version of physarum going I found the bottle-neck to be hugely limited by the blurring and sensing code as they perform GPU like texture sampling. Most simulations you come across are done on the GPU which is WAY faster at sampling data from a texture than the CPU is. simding may not have as much of an impact and truthfully I got stuck trying to figure out how to simd the code. I will rely on the compiler to do the magic.

Give it a try. You can tap around to pull spore "agents" towards you but also push those closest away.

It is a little like boids in that there is emergent behavior from simple rules. I will leave it to the reader to look up how the simulation specifically works.

It was more difficult than I thought getting js and rust compiled wasm to play nice. This article was my reference among others. The tldr; is to use wasm-bindgen which generates some glue for you. The windowing library I was using before doesn't work for the web so I had to pretty heavily refactor the code.

This is the gist.

pub fn init(simulation config) {

}

pub fn simulate(inputData, deltatime) {
    //sim
    //render to framebuffer
    return frame_buff_ptr;
}

Then in js land I setup the wasm instance and wire up some event handlers and a render function. The render function has the most interesting line of code.

function render(timestamp = 0) {
    const dt = (timestamp - prev)/1000;
    prev = timestamp;
    const frame_buff_ptr = simulator.simulate(mouseX, mouseY, mousePressed, 0.016);
    const buff = new Uint8Array(instance.memory.buffer, frame_buff_ptr, canvas.width * canvas.height*4);
    // ^ the most interesting line of code
    for(let i = 0; i < frameBuffer.length;i++) {
        const r = buff[i*4];
        const g = buff[i*4+1];
        const b = buff[i*4+2];
        const a = buff[i*4+3];
        frameBuffer[i] = (r << 24) | (g << 16) | (b << 8) | a;
    }
    ctx.putImageData(imageData, 0, 0);
    ctx.drawImage(canvas, 0, 0);
    requestAnimationFrame(render);
}

What that line does is create a typed array view into the linear memory of the wasm instance based on a the pointer wasm sends back. The pointer is the starting point for the frame data. The length is known based on the width, height, and pixel size.

I struggled a bit with this because in the rust code I don't set an alpha channel on the frame buffer and for some reason I thought that meant the frame buffer was storing 3 bytes per pixel when it was in fact storing 4 bytes per pixel. The resulting images of this assumption were...interesting.

The other way to do this would be to have rust manage the canvas and input. To each their own.

While a physarum simulation may not be the best showcase of great number crunching power it is interesting to watch for a few minutes.

one step forward two steps back

The code from all the iterations of this little toy project are located in this gist.

In the end, I was able to get 200m particles at 8 fps and 100m at 16fps which is almost as fast as js land at 20m. I am 100% convinced there is a crab out there well versed in the art of rust who could eek out another 2x bump maybe even more. Which means rust is in fact 10x faster than javascript on both v8 and JsCore.

Until next time.

I found out the m1's branch prediction is not as good as my AMD cpu's. The m1 takes a massive 4x performance hit when particles are outside the screen bounds. This is because of an if in the render loop which prevents indexing out of bounds. This whole time I thought it was cache thrashing and it wasn't, at least not on the m1. To fix this I could make the particles bounce off the edges or wrap around which may be more predictable but also require a branch.

Or...

for particle in &particles {
    let x = particle.x as usize;
    let y = particle.y as usize;
    let red = (particle.x / WIDTH as f32 * 255.0 * 0.8) as u32;
    let green = (particle.y / HEIGHT as f32 * 255.0 * 0.8) as u32;
    let blue = (255.0*0.5) as u32;
    let color = (red << 16) | (green << 8) | blue;
    let idx = (y.max(0).min(HEIGHT-1) * WIDTH + x.max(0).min(WIDTH-1));
    buffer[idx] = color;
}

If I accept a bit of jank, this completely solves the issue. No if as I think there isn't a branch for min/max'ing. I am sure there could be a way to min/max out of bounds particles to a single pixel which makes it almost a one for one match to the original culling code.

I love these kinds of janky solutions. My AMD desktop cpu didn't have this issue which makes me think the branch prediction is better. I don't know for sure though. Still, with this the single-threaded version can now handle 15m particles at almost 40fps and then gravity drops it down to 25fps.

Crazy for a single core.

Truly, until next time.

last time

how not to run a saas company