Phase Locked State Machines

2023-05-26

I work a lot with systems that talk to other systems. Technically, this puts them in the class of distributed systems, but unlike databases or something similarly complex, it's usually some kind of PC (laptop, webserver), talking to some kind of embedded system (a USB device, a sensor).

Regardless of complexity, these systems need to talk to each other to achieve some goal: obtaining sensor data, changing configuration values, etc. For many simple cases like these, we can think of these systems as "phase locked" - they are two systems that move state-to-state at the same time.

This post explores a technique I've found, and that I think is useful (and fun?) for specifying protocols in Rust. Prefer to jump straight to the code? It's here on github.

A useful example

My go-to example of a "two system problem" is a bootloader:

This is harder to write than you might think

Writing code for this kind of problem usually means developing three separate components:

This means I write three chunks of software in three places, and if I change any one part, I need to make sure the other two are consistent.

Yes, I know I'm writing three things. But that is frustrating, because I'm not writing three things: I'm writing one thing that exists in three places:

I run into problems that are roughly this shape, but I haven't been able to find a good way of specifying directly in code. Often I end up drawing the sequence up on paper.

For state machines, this means drawing state machine diagrams, one for the client and one for the host. A client diagram could look like this:

For writing down the sequence of messages, this means drawing sequence diagrams, and making sure it is consistent with the state machines. The sequence diagram could look like this:

But I don't love this: docs don't always live with the code, and I am usually iterating pretty quickly when I am designing things off the cuff. Ideally, I'd like to be able to lean on the compiler to help make sure any changes I make between the three parts end up being consistent across the entire design.

Let's switch to looking at code

Let me make some assertions:

So how about some async rust code that describes our bootloader state machine?

async fn bootload(something: &mut Something) -> Result<(), ()> {
    // start the bootloading, and get the image size
    let image_end = something.start().await?;

    // While there are more sectors to erase/fill...
    while let Some(start) = something.next_sector() {
        // ...erase the sector...
        something.erase_sector(start, SECTOR_SIZE).await?;
        let sector_end = start + SECTOR_SIZE;
        let mut now = start;

        // ...while there is more data to write in this sector...
        while (now < sector_end) && (now < image_end) {
            // ... write the chunk to disk
            now += something.write_next_chunk().await?;
        }
    }

    // All done, now boot.
    something.boot().await?;

    Ok(())
}

That seems like a decent representation of the system I described above.

Let's look at this one chunk at a time:

let image_end = something.start().await?;

We start the bootloading.

while let Some(start) = something.next_sector() { /* ... */ }

We iterate over the number of sectors. Both the host and client will know how many sectors need to be written.

// ...erase the sector...
something.erase_sector(start, SECTOR_SIZE).await?;

We need to erase a given sector.

let sector_end = start + SECTOR_SIZE;
let mut now = start;

// ...while there is more data to write in this sector...
while (now < sector_end) && (now < image_end) {
    // ... write the chunk to disk
    now += something.write_next_chunk().await?;
}

We need to send manageable chunks of the new firmware image, until we have either filled this now-empty sector, or until we have finished transferring the whole firmware image.

// All done, now boot.
something.boot().await?;

Once we are done transferring, we should prompt the client to boot into its new firmware.

Okay that's cool but what?

Note that there are no mentions of "client" or "host" in the code above, yet it still describes a state machine that is suitable for both the host and client.

I could copy and paste this code to both projects, or I could instead somehow make this generic over whether this state machine is running on a host, or running on a client.

Like I said before, this is what traits are for. Let's turn each of the function calls above into trait methods instead. We can think of each of these functions as a state transition:

trait TraitMachine {
    const SECTOR_SIZE: usize;
    const CHUNK_SIZE: usize;

    fn next_sector(&mut self) -> Option<usize>;

    // These are all the joint state transitions, basically?
    async fn start(&mut self) -> Result<usize, ()>;
    async fn erase_sector(&mut self, start: usize, len: usize) -> Result<(), ()>;
    async fn write_next_chunk(&mut self) -> Result<usize, ()>;
    async fn boot(&mut self) -> Result<(), ()>;
    async fn abort(&mut self) -> Result<(), ()>;
}

Now we can make our async function generic over that trait:

async fn bootload<TM: TraitMachine>(tm: &mut TM) -> Result<(), ()> {
    // start the bootloading, and get the image size
    let image_end = tm.start().await?;

    // While there are more sectors to erase/fill...
    while let Some(start) = tm.next_sector() {
        // ...erase the sector...
        tm.erase_sector(start, TM::SECTOR_SIZE).await?;
        let sector_end = start + TM::SECTOR_SIZE;
        let mut now = start;

        // ...while there is more data to write in this sector...
        while (now < sector_end) && (now < image_end) {
            now += tm.write_next_chunk().await?;
        }
    }

    // All done, now boot.
    tm.boot().await?;

    Ok(())
}

Implementing the steps as separate pieces

But what does it look like to specify separate behavior for each entity?

On the Host

Let's look at the start method on the host:

impl TraitMachine for Host {
    // ...
    async fn start(&mut self) -> Result<usize, ()> {
        self.channel
            .send(Host2Client::Start {
                total_size: self.image.len(),
            })
            .await?;
        match self.channel.recv().await? {
            Client2Host::Starting => Ok(self.image.len()),
            _ => Err(()),
        }
    }
    // ...

We:

On the Client

Now let's look at the start method on the client:

impl TraitMachine for Client {
    // ...
    async fn start(&mut self) -> Result<usize, ()> {
        match self.channel.recv().await? {
            Host2Client::Start { total_size } if total_size <= Self::TOTAL_SIZE => {
                self.image_len = Some(total_size);
                self.channel.send(Client2Host::Starting).await?;
                Ok(total_size)
            }
            _ => Err(()),
        }
    }
    // ...
}

We:

Mission Accomplished?

Okay, but is this actually "better" than other techniques?

I dunno. I haven't seen anyone do it like this before, and it seems neat that I can use out-of-the-box tools of Rust to achieve this.

The whole demo is here, and you can run it for yourself to see how it works.

I don't think this could be a "library", but more like a "pattern", similar to how things like xtask work.

Things this pattern does well

There are a couple of things that I think this pattern definitely helps with:

Things this pattern doesn't address

There are definitely some drawbacks to this pattern:

Wrapping up

Should you use this? I don't know! Maybe try it out, and let me know how it goes!

I think it is neat, and is maybe the first step towards a more general, or more useful, solution.

Need help with building something in Rust? Maybe I can help!.