Use State Machines!

FSMs are not as complex as you think and they make your code better.

Richard Clayton

Stumbling my way through the great wastelands of enterprise software development.

What is a State Machine?

Finite State Machines (FSM, or in the context of this post, simply "State Machines") are a methodology for modeling the behavior of an entity with an established lifecycle. The lifecycle is defined by an enumerated set of states known at the time of implementation (this is where the term "finite" comes from).

Silvrback blog image

Imagine we are designing a shopping cart encapsulated in an Order entity. For an Order, we will use the initial state creating; in this state, the customer adds items to the order. When the customer is done, they will "check out" causing the order model to transition to finalizing, where no more items can be added to the cart. As you can see, the "states" of an entity also define the available behavior within that state. For instance, adding items to an order can only happen in the creating state.

Valid state transitions form a directed graph: each state can transition to zero or more other states, including back to a previous state. Transitions are triggered by external stimuli (an API call, an event, or a timer). For instance, the user may decide to cancel an order. She should be able to do that if the order has not been shipped or delivered. The stimulus to cancel the order would be the customer pressing the "Cancel Order" button. If the order is not in a state that can be canceled, the state machine should reject the attempt to transition.
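
To make the graph idea concrete, here is a minimal sketch in plain JavaScript. The status names match the Order example developed later in this post; the specific edges (for example, paid to processing) and the canTransition helper are assumptions made for illustration.

```javascript
// Hypothetical transition graph: each status lists the statuses it may move to.
const transitions = Object.freeze({
  creating: ['finalizing', 'cancelled'],
  finalizing: ['creating', 'paid', 'cancelled'],
  paid: ['processing', 'cancelled'],
  processing: ['shipped', 'cancelled'],
  shipped: ['delivered'],
  delivered: [],
  cancelled: [],
});

// Returns true when the graph has an edge from `from` to `to`.
function canTransition(from, to) {
  return (transitions[from] || []).includes(to);
}

console.log(canTransition('paid', 'cancelled'));    // true
console.log(canTransition('shipped', 'cancelled')); // false
```

A shipped order has no edge to cancelled, so the cancel stimulus is rejected, which is the behavior described above.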

Why should I use a State Machine?

Every entity is a state machine. Entities by definition (DDD terminology) are identifiable, contain mutable properties, and possess a lifecycle of at least one state: "created". We often add the "updated" and "deleted" states providing the entity full CRUD capabilities. Most of the time these states are implicit. There is often no need to model the entity's status - its existence or absence models its binary states (created and deleted, respectively). However, more often we model the lifecycle of an entity in subtle ways:

Timestamps for the creation and last modification of the entity.
Flags that change the behavior of an entity. For example, a "soft delete" flag may render an entity unretrievable in a list operation.
Publishing events in any scenario is an indication of a lifecycle.

Entities with only generic CRUD functionality are not very interesting - probably anemic. Entities should have behavior, and that behavior is constrained by the life cycle of the entity. For instance, Orders in the creating state allow items to be added, but when the Order is finalizing, that behavior is no longer valid.

The Finite-State-Machine pattern is a formalization of an entity's life cycle and thus, forces us to think about our models in terms of behavior. The consequence is that we tend to design better systems when we use the pattern. The process of discovery helps us identify the behaviors of each state. Behaviors expand into actions or "intents" clients can request from the entity. State transitions indicate events that need to be published.

Despite the benefits of state machines, many engineers have developed a distaste for the pattern. I think this sentiment comes from mistaking the Finite State Machine model for common FSM implementation frameworks and patterns. From what I've seen of existing implementations, I largely agree with detractors.

Many FSM frameworks suffer from one or more of the following problems:

Libraries do too much - many libraries handle the state transitions, updating models in the database, publishing events, running pre/post handlers, etc. This makes troubleshooting errors more complex because you don't actually control the lifecycle, the library does.
Force unnatural implementation patterns - domain logic becomes embedded in the FSM implementation. For instance, I would not want to couple my business logic to Machina.js.
Coupled with communication - many FSM frameworks are built into Actor systems, the most famous being Erlang and Akka actors.

I really like Erlang and Akka actors including their FSM implementations. However, I understand how complex those frameworks can be. I don't think most applications need the power of an actor framework.

FSM patterns can also be nauseating depending on your taste for Object-Oriented (OOP) or Functional Programming (FP). The common OOP methodology is the State Pattern, which requires the creation of a subclass for every state of an entity:

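class Order {
  constructor(repository, initialState) {
    // Copy the persisted fields onto the instance and keep a
    // reference to the repository for saving.
    Object.assign(this, initialState, { repository });
  }
  // subclasses call this to update object state on record
  // and in database.
  async setState(state) {}
}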

class CreatedOrder extends Order {
  async addItem(item){}
  async finalize(){}
}

class FinalizedOrder extends Order {
   async pay(){}
}

In FP, the pattern is most easily implemented with pattern matching (implemented in Scala since JavaScript does not have pattern matching yet):

// Returns Nothing so it can be used in any branch of a match.
def invalid(
  state: Order.Statuses,
  behavior: BehaviorContext
) : Nothing = {
  throw new InvalidBehaviorForStateException(state, behavior)
}

def addItemToOrder(order: Order, context: AddItem) : Order = {
  // ...
}

def finalizeOrder(order: Order, context: Finalize) : Order = {
  // ...
}

def processPayment(order: Order, context: Pay) : Order = {
  // ...
}

def transition(
  order: Order,
  behavior: BehaviorContext
) : Order = {
  order.status match {
    case Order.Statuses.Creating =>
      behavior match {
        // Typed patterns bind the context at its specific type.
        case ctx: AddItem => addItemToOrder(order, ctx)
        case ctx: Finalize => finalizeOrder(order, ctx)
        case _ => invalid(order.status, behavior)
      }
    case Order.Statuses.Finalizing =>
      behavior match {
        case ctx: Pay => processPayment(order, ctx)
        case _ => invalid(order.status, behavior)
      }
    // Statuses with no behaviors defined here reject everything.
    case _ => invalid(order.status, behavior)
  }
}

The other common alternative is using the Actor model to implement FSM, where the Actor's receive function (routing) is swapped for each state:

/* I'm going to show this in JavaScript in case you are not familiar
with Scala or Erlang.  However, realize that Node.js doesn't have a
popular Actor framework (that I'm aware of). */

class Order extends Actor {
  constructor() {
    // A derived class must call super() before using `this`.
    super();
    this.become('creating');
  }

  become(status) {
    // 'creating' becomes 'receiveCreating'
    const funcName = [
      'receive',
      status[0].toUpperCase(),
      status.slice(1),
    ].join('');
    this.receive = this[funcName].bind(this);
  }

  async receiveCreating(message) {
    switch (message.type) {
      case 'add-item': return this.addItem(message.payload);
      case 'finalize':  return this.finalize(message.payload);
      default: return this.eventNotApplicableForState(message);
    }
  }

  async receiveFinalizing(message) {
    switch (message.type) {
      case 'pay': return this.pay(message.payload);
      default: return this.eventNotApplicableForState(message);
    }
  }

  async finalize(finalizationContext) {
    // Do whatever the finalize transition must do.
    this.become('finalizing');
    return 'whatever';
  }
}

When should I use a State Machine?

Just because I'm an advocate for State Machines doesn't mean I think they're appropriate for all use cases. Simple entities probably don't need the complexity of a formalized state-transition model. But what constitutes a simple entity versus one that would require a state machine?

There are clear signs that an entity could benefit from a state-transition model:

Entity has a formal "status" property that tracks its current state.
Behavior changes based on status (e.g. some methods cannot be executed in certain statuses).
Events are published as a result of changes to the entity's state.
Spaghetti code in the domain model, specifically a lot of conditional logic in "generic-update" functions.
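
As a hypothetical illustration of the last smell, a generic update function tends to accumulate status checks like the ones below. The updateOrder function and the shape of its changes argument are invented for this sketch.

```javascript
// Hypothetical "generic-update" function: every rule about what may
// change, and when, is tangled into one set of conditionals.
function updateOrder(order, changes) {
  if (changes.items) {
    if (order.status !== 'creating') {
      throw new Error('Items can only change while creating');
    }
    order.items = changes.items;
  }
  if (changes.status) {
    if (changes.status === 'paid' && order.status !== 'finalizing') {
      throw new Error('Only finalizing orders can be paid');
    }
    if (
      changes.status === 'cancelled' &&
      ['shipped', 'delivered', 'cancelled'].includes(order.status)
    ) {
      throw new Error('Order can no longer be cancelled');
    }
    order.status = changes.status;
  }
  return order;
}
```

Every new rule adds another branch to the same function, and nothing in its signature tells a caller which changes are legal in which status.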

If your code reeks of these code smells, you should consider implementing a simple state machine. Fortunately, it's not hard to implement a simple FSM using plain code (no frameworks). In the next section, I will demonstrate the creation of a minimalist state machine in JavaScript.

Implementing State Machines

Before you write any code, diagram the states of your entity and the valid transitions between those states. Refer to the diagram at the beginning of the post for a simple example. Most of the effort in implementing a state machine is understanding the model. Remember, state machines are just an advanced formalization of that model!

Next, consider what behaviors should be present for each state. Perhaps list them on the state diagram. If a behavior transitions the state, note which states it will transition to (usually only one, but it's possible that conditional parameters might cause the state machine to go to others).

Finally, detail the events that will occur as a result of state transitions with the entity. Define each event and its associated properties.
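
The result of these three steps can be captured as plain data before any behavior is written. The sketch below is one possible notation, not part of the implementation that follows; it covers only the behaviors and event names used in this post, and the orderDesign and behaviorsFor names are assumptions.

```javascript
// Design notes as data: for each behavior, the statuses it is valid in,
// the status it leads to (null = no status change), and the event it emits.
const orderDesign = Object.freeze({
  addItem: { from: ['creating'], to: null, event: 'order.item.added' },
  finalize: {
    from: ['creating'],
    to: 'finalizing',
    event: 'order.status.finalizing',
  },
  modifyOrder: { from: ['finalizing'], to: 'creating', event: 'order.reopened' },
  pay: { from: ['finalizing'], to: 'paid', event: 'order.status.paid' },
  cancel: {
    from: ['creating', 'finalizing', 'paid', 'processing'],
    to: 'cancelled',
    event: 'order.status.cancelled',
  },
});

// List the behaviors available in a given status.
function behaviorsFor(status) {
  return Object.keys(orderDesign).filter((name) =>
    orderDesign[name].from.includes(status)
  );
}

console.log(behaviorsFor('finalizing')); // [ 'modifyOrder', 'pay', 'cancel' ]
```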

Implementing the state machine in code is fairly easy. State machines tend to have the following code components:

A mechanism to prevent the improper state transitions.
- This can be a "transition map" if you choose to model transitions generically (not recommended)
- Transitions can be isolated to valid behaviors of the current state (recommended).
A mechanism to prevent accessing behaviors outside of the current state.
Behaviors associated with specific states.
A mechanism for notifying observers of state transitions.

As an example, let's turn the Order entity into a state machine. Assume all of the code is in the same file order.js. I've broken up the code so it's easier to understand.

First, let's enumerate our states.

const Statuses = Object.freeze({
  Creating: 'creating',
  Finalizing: 'finalizing',
  Paid: 'paid',
  Processing: 'processing',
  Shipped: 'shipped',
  Delivered: 'delivered',
  Cancelled: 'cancelled',
});

Here is our model, defined with Mongoose (an active-record client for Mongo):

const mongoose = require('mongoose');
const { Schema } = mongoose;

const LineItem = new Schema({
  product: String,
  count: Number,
});

const Order = new Schema({
  customer: String,
  discountCode: String,
  description: String,
  status: {
    type: String,
    default: Statuses.Creating,
    enum: Object.values(Statuses),
  },
  items: [LineItem],
});

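// Make the statuses enumeration available on the
// Order "class" (i.e. Order.Statuses)
Object.assign(Order.statics, { Statuses });

// Mongoose compiles the schema when mongoose.model() is called, so in
// the real order.js this line belongs at the end of the file, after the
// behaviors below have been added to the schema.
module.exports = mongoose.model('Order', Order);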

1. Model borrowed from a previous article: The case against the generic use of PATCH and PUT in REST APIs.

2. If you are using Node.js, I wrote an article on how to model enumerations in JavaScript.

Next, we will create simple functions to guard against the use of behaviors unassociated with the current state:

// Maybe we have an error specific to this condition.
const { InvalidBehaviorForStateError } = require('../error');

function assertBehavior(
  behavior,
  currentStatus,
  ...expectedStatuses
) {
  // Throw unless the current status is one of the expected statuses.
  if (!expectedStatuses.includes(currentStatus)) {
    throw new InvalidBehaviorForStateError(behavior, {
      currentStatus,
      expectedStatuses,
    });
  }
}
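
Since ../error is not shown in this post, the sketch below assumes a minimal InvalidBehaviorForStateError and demonstrates how the guard behaves. The error's constructor signature is inferred from the call above; everything else about the class is an assumption.

```javascript
// Assumed shape of the error; the real '../error' module is not shown.
class InvalidBehaviorForStateError extends Error {
  constructor(behavior, { currentStatus, expectedStatuses }) {
    super(
      `Cannot "${behavior}" while "${currentStatus}"; ` +
        `expected one of: ${expectedStatuses.join(', ')}`
    );
    this.name = 'InvalidBehaviorForStateError';
    this.behavior = behavior;
    this.currentStatus = currentStatus;
    this.expectedStatuses = expectedStatuses;
  }
}

// The guard: throw unless the current status is one of the expected ones.
function assertBehavior(behavior, currentStatus, ...expectedStatuses) {
  if (!expectedStatuses.includes(currentStatus)) {
    throw new InvalidBehaviorForStateError(behavior, {
      currentStatus,
      expectedStatuses,
    });
  }
}

assertBehavior('addItem', 'creating', 'creating'); // passes silently

try {
  assertBehavior('addItem', 'finalizing', 'creating');
} catch (err) {
  console.log(err.message);
  // Cannot "addItem" while "finalizing"; expected one of: creating
}
```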

Finally, implement the Behaviors for each state. I will add these behaviors as instance methods on Order for simplicity, but they can also be modeled as pure functions in separate files.

// Eventing platform, whatever it may be.
const eventBus = require('./event-bus');

// Instance methods live on schema.methods. Use `function` rather than
// arrow functions so that `this` is the Order document.
Order.methods.addItem = async function(item) {
  // Ensure we are in the proper state
  assertBehavior('addItem', this.status, Statuses.Creating);
  // Modify the Order -- this is an example, so don't harass me
  // about not validating the item, etc.
  this.items.push(item);
  // Save the order
  await this.save();
  // Notify observers
  eventBus.publish('order.item.added', {
    orderId: this._id,
    item,
  });
  return this;
};

Order.methods.finalize = async function() {
  assertBehavior('finalize', this.status, Statuses.Creating);
  // This is the state transition!  Remember, we don't need a
  // transition map because the behaviors isolate what
  // transitions can occur!
  this.status = Statuses.Finalizing;
  await this.save();
  // Notify observers
  eventBus.publish('order.status.finalizing', { orderId: this._id });
  return this;
};

Order.methods.modifyOrder = async function() {
  assertBehavior('modifyOrder', this.status, Statuses.Finalizing);
  // return back to 'creating'
  this.status = Statuses.Creating;
  await this.save();
  eventBus.publish('order.reopened', { orderId: this._id });
  return this;
};

Order.methods.pay = async function(paymentInfo) {
  assertBehavior('pay', this.status, Statuses.Finalizing);
  // run the payment
  this.status = Statuses.Paid;
  await this.save();
  eventBus.publish('order.status.paid', {
    orderId: this._id,
    paymentInfo,
  });
  return this;
};

Order.methods.cancel = async function() {
  assertBehavior(
    'cancel',
    this.status,
    Statuses.Creating,
    Statuses.Finalizing,
    Statuses.Paid,
    Statuses.Processing
  );
  const oldStatus = this.status;
  // Maybe reverse the payment if status === paid
  this.status = Statuses.Cancelled;
  await this.save();
  eventBus.publish('order.status.cancelled', {
    orderId: this._id,
    oldStatus,
  });
  return this;
};

And that's a simple state machine. You probably noticed a bunch of repetitive code, particularly the pattern:

// 1. Validate state
assertBehavior('behavior', this.status, ...validStatuses);
// 2. Execute code
doSomething();
// 3. Change status
this.status = newStatus;
await this.save();
// 4. Notify state change or significant event
eventBus.publish(topic, event);

If you desire, you can abstract this common code into a function, say Order.methods.transition. In that sense, you would be building a more generalized state machine framework for your entities, reducing the boilerplate. The important thing is that you find the right balance of domain logic and framework code that meets your needs.
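
One possible shape for such a helper is sketched below. It is framework-free so it can run on its own: save and publish are passed in as stand-ins for this.save() and eventBus.publish, and the defineBehavior name and its option names are assumptions rather than an established API.

```javascript
// Builds a behavior function out of the four repeated steps.
function defineBehavior({ name, from, to, topic, execute }) {
  return async function behavior(entity, payload, { save, publish }) {
    // 1. Validate state
    if (!from.includes(entity.status)) {
      throw new Error(`Cannot "${name}" while "${entity.status}"`);
    }
    // 2. Execute code (optional domain logic for this behavior)
    if (execute) execute(entity, payload);
    // 3. Change status (when the behavior has a target) and persist
    const oldStatus = entity.status;
    if (to) entity.status = to;
    await save(entity);
    // 4. Notify observers
    publish(topic, { orderId: entity.id, oldStatus, payload });
    return entity;
  };
}

const finalize = defineBehavior({
  name: 'finalize',
  from: ['creating'],
  to: 'finalizing',
  topic: 'order.status.finalizing',
});

// Usage with in-memory stand-ins for persistence and eventing.
const published = [];
const deps = {
  save: async () => {},
  publish: (topic, event) => published.push({ topic, event }),
};

finalize({ id: 'o-1', status: 'creating' }, undefined, deps).then((order) => {
  console.log(order.status, published[0].topic);
  // finalizing order.status.finalizing
});
```

The trade-off is the one described above: the more of the lifecycle the helper owns, the closer it gets to the do-everything libraries criticized earlier.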

Tips for Implementing State Machines

Now that you understand how to implement a state machine, I would like to offer some tips that will steer you away from trouble:

1. Don't use more than one state machine per model

In the case of Order, you might be tempted to model the state of the payment, order processing, or the delivery process either by expanding the list of statuses or creating new variables to track the subprocesses. Simply put, don't model those state machines in the Order model!

The behavior of the delivery system, warehouse, or payment processor is not a concern of the Order. Trying to fit the lifecycles of those subprocesses onto this model makes no sense, and that's because these subprocesses are really nested or related aggregate models, with Order as the root aggregate.

2. Handle nested and related aggregates as separate state machines

As mentioned in the previous section, if you do have an aggregate entity (whether it's nested in the root model or referenced by ID), you should model its lifecycle outside of the root. This does not mean that the related or nested models are independent of the root entity. On the contrary, nested or related entities will absolutely be affected by the lifecycle of their parent.

For instance, if an order is canceled, Payments will need to be reversed and the warehouse will need to be notified to stop packaging the order and return the stock to inventory. The key here is to separate the logic into separate entities, or perhaps, completely different microservices (which is what I would do in this instance).

3. Entity behaviors should map to calls against your API

It should be obvious, but now that you have defined the behaviors for each state, you now know what types of actions should be exposed in your API. Each behavior should map 1:1 with a gRPC rpc method, a RESTful "intent" (refer to my article on modeling intents with REST), asynchronous event, or some other atomic command invoked by a client.
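
As a rough sketch, the mapping can be written down as a table from API operation to behavior. The routes below are hypothetical and no particular web framework is assumed; the behavior names are the ones defined on Order earlier, and dispatch is an illustrative helper.

```javascript
// Hypothetical REST-style intents, each mapped to exactly one behavior.
const intents = new Map([
  ['POST /orders/:id/items', 'addItem'],
  ['POST /orders/:id/finalize', 'finalize'],
  ['POST /orders/:id/reopen', 'modifyOrder'],
  ['POST /orders/:id/payment', 'pay'],
  ['POST /orders/:id/cancel', 'cancel'],
]);

// Invoke the behavior registered for a route; unknown routes are rejected.
function dispatch(route, order, payload) {
  const behavior = intents.get(route);
  if (!behavior) {
    throw new Error(`No intent registered for ${route}`);
  }
  return order[behavior](payload);
}
```

There is deliberately no generic "PUT /orders/:id" entry: a client can only ask for something the state machine knows how to accept or reject.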

4. Use events, not hooks

In my very humble opinion, state machines don't need middleware - the whole concept seems absurd to me. Middleware is typically used to transparently observe or modify the behavior of the action it's wrapping. The pattern makes sense for a web server where you don't want to burden an HTTP request handler (and by extension, the business logic) with knowing how to authenticate the request. In our case, however, the state machine is deep in the business layer.

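const machina = require('machina');

// machina.Fsm.extend() returns a constructor; `new machina.Fsm({...})`
// would create a single FSM instance instead.
const Order = machina.Fsm.extend({
  // ...
});

const order = new Order();

order.on('finalizing', (context) => {
  // do something
});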

Machina.js offers hooks, though I do not believe a failure in the hook causes the state machine to break.

I see three problems with hooks/middleware:

If a hook throws or returns an error, what happens with the state machine?

This is a philosophical topic (so there's no right answer), but I would argue that only a behavior of the state should be able to stop the state machine. It's hard to imagine a case where it makes sense to allow a middleware component to stop a transition.

Most hook logic is better implemented in the behavior it's wrapping.

I would argue that the business logic should handle rules related to the business process directly (in an imperative fashion). If you need to reach out to an external system (like a Policy Agent), inject the component into the controlling entity and use it directly in the behavior. Remember, this is your core business domain; it's hard to see anything other than tracing being a generic concern.

Hooks require observers to register for each instance of the state machine.

If you have an external component that needs to be notified of significant events in the lifecycle of your entity, it's better to use events. Hooks require the observing entity to register itself directly with the entity being observed. This becomes a real pain, particularly if your entities are transient (looked up in the DB, modified or returned to the caller, and then garbage collected). It means that you (or some framework) would need to register all listeners every time you create or retrieve an entity.

The better approach is to add indirection to the model using a PubSub (publish-subscribe) event delivery system. Using the PubSub pattern, the model only needs to be aware of how to publish events (as seen with the eventBus example). Using this pattern, subscribers register one handler with the eventBus to receive all events in that category. In the case of our payment processor or warehouse system, they could listen for the 'order.status.cancelled' and initiate their own internal processes for reversing payments or returning inventory to stock.

const eventBus = require('./event-bus');

// Original behavior
Order.methods.cancel = async function() {
  assertBehavior(
    'cancel',
    this.status,
    Statuses.Creating,
    Statuses.Finalizing,
    Statuses.Paid,
    Statuses.Processing
  );
  const oldStatus = this.status;
  // Maybe reverse the payment if status === paid
  this.status = Statuses.Cancelled;
  await this.save();
  eventBus.publish('order.status.cancelled', {
    orderId: this._id,
    oldStatus,
  });
  return this;
};

// Subscribers would simply hook into the event bus
eventBus.subscribe(
  'order.status.cancelled',
  async ({ orderId, oldStatus }) => {
    // Only orders that were already being processed have stock to return.
    if (oldStatus === Order.Statuses.Processing) {
      await returnItemsToInventory(orderId);
    }
  }
);
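
The ./event-bus module is left abstract in this post. For local experimentation, a minimal in-memory stand-in with the same publish and subscribe calls could look like the following sketch; createEventBus is an assumed name, and a production system would put a real broker behind the same interface.

```javascript
// Minimal in-memory publish/subscribe bus (illustrative stand-in only).
function createEventBus() {
  const handlers = new Map(); // topic -> array of handler functions
  const report = (topic) => (err) =>
    console.error(`Handler for ${topic} failed: ${err.message}`);

  return {
    // Register a handler once; it receives every event on the topic.
    subscribe(topic, handler) {
      if (!handlers.has(topic)) handlers.set(topic, []);
      handlers.get(topic).push(handler);
    },
    // Deliver the event to each subscriber. A failing subscriber is
    // logged and does not affect the publisher or other subscribers.
    publish(topic, event) {
      for (const handler of handlers.get(topic) || []) {
        try {
          // Promise.resolve also captures rejections from async handlers.
          Promise.resolve(handler(event)).catch(report(topic));
        } catch (err) {
          report(topic)(err);
        }
      }
    },
  };
}

const eventBus = createEventBus();
eventBus.subscribe('order.status.cancelled', ({ orderId }) => {
  console.log(`order ${orderId} was cancelled`);
});
eventBus.publish('order.status.cancelled', { orderId: 'o-1', oldStatus: 'paid' });
// logs: order o-1 was cancelled
```

Note that the subscriber registers once per topic, not once per Order instance, which is the advantage over hooks described above.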

Another amazing benefit of using events is that the PubSub system could forward events over the network, making them available to other services. This brings us to our last topic:

Distributed Business Processes with State Machines

The events of a state machine don't need to be confined to the process/service managing them. If you are developing a microservice architecture, events produced during state transitions are excellent ways of notifying other services of actions they may need to take. I consider this approach to be the ideal integration pattern for microservices:

Silvrback blog image

The supplier microservice remains blissfully unaware of its consumers. This removes cyclic dependencies between services which tend to couple them (turning microservices into microliths).

If you intend to develop a microservice architecture and distribute your business processes, please consider the following advice:

Use a messaging solution to distribute events

This may seem obvious, but if you distribute events between services use a reliable messaging solution like RabbitMQ, Kafka, SQS, NATS, IronMQ, etc. Outside of the obvious problems of trying to roll your own solution, I've seen engineers attempt to use point-to-point communication protocols (HTTP) and databases to synchronize events between services. These technologies are completely inappropriate for asynchronous, one-to-many notifications.

A consumer also needs to be able to consume all events related to a state machine (and potentially aggregates of that state machine). Therefore, it may not make sense to use a topic for each event type using a solution like Kafka. You may be able to get away with this approach in AMQP where you can bind multiple topics to a single queue, but I would argue this approach makes ops/maintenance more complex. Basically, understand the strengths and weaknesses of the messaging system.
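
One way to let a consumer see every event for a state machine is to publish all of them to a single stream and route on an event type field. The sketch below uses plain functions rather than a real broker client; the stream name orders, the envelope fields, and the helper names are all assumptions.

```javascript
// Wrap a state machine event in an envelope for a single shared stream.
function toEnvelope(type, orderId, data) {
  return { stream: 'orders', key: orderId, type, data };
}

// Consumer-side router: one subscription, many event types.
function createRouter(handlersByType) {
  const handlers = new Map(Object.entries(handlersByType));
  return function route(envelope) {
    const handler = handlers.get(envelope.type);
    // Event types this consumer does not care about are ignored.
    return handler ? handler(envelope) : undefined;
  };
}

const route = createRouter({
  'order.status.cancelled': (e) => `reverse payment for ${e.key}`,
  'order.status.paid': (e) => `record payment for ${e.key}`,
});

console.log(route(toEnvelope('order.status.cancelled', 'o-1', {})));
// reverse payment for o-1
```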

Event Cascade pattern

Once you have your messaging infrastructure in place, you can now implement Event Cascades with your distributed state machine. A cascade is simply a chain of events that trigger from an event produced by a top-level state machine. If we consider Order the top of the model hierarchy (aggregate root!!!!), an event like order.status.cancelled will affect models in other bounded contexts (Inventory, Warehouse, Payments, etc.).

Event Cascades are constructed when event handlers produce an event in response to another. For instance, when order.status.cancelled is received by the Warehouse Service, it would probably notify workers in the warehouse to stop the delivery and place the items back on the shelf. When a worker acknowledges the item being returned, the Warehouse Service would publish the warehouse.item.returned event. The subscribe->act->publish pattern represents one link in the chain.
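
A single link in such a chain can be sketched with Node's built-in EventEmitter standing in for the messaging system. The acknowledgeReturn function is hypothetical and represents the warehouse worker confirming the item is back on the shelf; the items field on the cancellation event is also an assumption.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the message broker shared by all services.
const bus = new EventEmitter();

// Hypothetical placeholder for the worker's acknowledgement.
function acknowledgeReturn(orderId, item) {
  console.log(`returned ${item} from order ${orderId}`);
}

// Warehouse Service: subscribe -> act -> publish.
bus.on('order.status.cancelled', ({ orderId, items }) => {
  for (const item of items) {
    acknowledgeReturn(orderId, item); // act
    bus.emit('warehouse.item.returned', { orderId, item }); // publish
  }
});

// Inventory Service: the next link reacts to the Warehouse event.
const restocked = [];
bus.on('warehouse.item.returned', ({ item }) => restocked.push(item));

// The Order state machine starts the cascade.
bus.emit('order.status.cancelled', { orderId: 'o-1', items: ['widget'] });
console.log(restocked); // [ 'widget' ]
```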

Event Cascades are excellent for simple synchronization tasks like updating cached values of aggregates in external domains or chaining simple state machine transitions.

Silvrback blog image

While the Event Cascade pattern is powerful, it does have some drawbacks. First, it is hard to debug due to the loose coupling and complexity of the messaging infrastructure. Another drawback is that the pattern does not handle failures well. For instance, if the Warehouse Service is incapable of returning an item, there's no easy way to stop the cancellation of the order. Event Cascades are a bad pattern if the business process you are modeling is transactional (rollback on failures). It's not impossible to design a triage process, but I guarantee it will be much harder than our next pattern: Coordination.

Coordination pattern for complex workflows

Coordination is the use of a workflow (i.e. BPM) to choreograph activities between one or more microservices modeling a complex business process. There are various workflow technologies, but all generally involve the use of an external service that manages the choreographed steps of the workflow. These steps might include calling an API endpoint, producing or waiting for an event, or executing a child workflow.

Consider the previous Event Cascade as a Coordination:

Silvrback blog image

Coordination provides better traceability and error handling than Event Cascades, but the pattern can be complicated to set up and use. However, coordinated workflows are worth the effort. Relying on Event Cascades to model complex business processes can be really difficult to manage, particularly if you need error handling.
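
To show why a coordinator makes failure handling easier, here is a deliberately small, in-process sketch: steps run in order, and if one fails, the completed steps are compensated in reverse. It only illustrates the idea and is not how Step Functions or Cadence are programmed; the step names and the runWorkflow helper are assumptions, and real steps would be asynchronous calls to other services.

```javascript
// Run steps in order; on failure, undo completed steps in reverse order.
function runWorkflow(steps, context) {
  const completed = [];
  for (const step of steps) {
    try {
      step.run(context);
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        if (done.compensate) done.compensate(context);
      }
      return { ok: false, failedStep: step.name, error: err.message };
    }
  }
  return { ok: true };
}

const log = [];
const cancelOrder = [
  {
    name: 'reversePayment',
    run: () => log.push('payment reversed'),
    compensate: () => log.push('payment re-applied'),
  },
  {
    name: 'returnStock',
    run: () => {
      throw new Error('item already shipped');
    },
  },
];

const result = runWorkflow(cancelOrder, { orderId: 'o-1' });
console.log(result.failedStep, log);
// returnStock [ 'payment reversed', 'payment re-applied' ]
```

Because one component sees every step, it can report exactly where the process failed and what was undone, which is what an Event Cascade cannot easily do.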

Don't roll your own!

If you are using AWS, I highly recommend looking at AWS Step Functions, as they are fully managed and have a simple programming model. If you are not on AWS, I have heard great things about Uber Cadence, though I have no personal experience using the framework.

Conclusion

State machines make business logic significantly more maintainable by forcing developers to think about the states and behaviors an entity will go through during its lifecycle. If you expose the state machine's behaviors as intents in your API, and publish transition events over a messaging system, you will have the capability of scaling your state machines across microservices. However, don't abuse the event system! Over-reliance on events leads to unmaintainable business processes. Instead, consider employing a workflow engine to provide managed Coordination of activities across microservices for better error handling and traceability.

Thank you for sticking around to read another long post! The content of this post comes directly from my experiences implementing state machines in our domain models at my last company. The advice was hard won from the many mistakes we made. If you have any questions, please don't hesitate to reach me on Twitter: @richardclayton.


March 17, 2018


Posted in: architecture nodejs
