Coauthored with Jonas Chapuis
Software, as opposed to hardware, is “soft”. We can easily change the code of an application to address new requirements.
In fact, we have to accommodate new requirements and quickly iterate on the code base to satisfy all our customers’ needs.
Unfortunately, changes in our API may break existing clients. Similarly, since our platform is made of micro-services, API changes in one service may result in breaking a downstream service.
How can we keep the freedom to change things without fearing that we break others?
In this article, we describe the different types of breakages that can happen, and the types of changes that are safe. Then, we present various tools we set up at Bestmile to continuously check that the changes we introduce in an API won’t break other services or clients, and that the changes we introduce in data structures serialized into event logs won’t break the consumers of these logs.
When two parties communicate, they need to write and read messages. As long as both parties use the same version of the message format, everything works fine.
However, things become more complicated if one node uses a different version of the format than the other node. As an example, consider a data type modeling a user, with a name and an age:
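A minimal sketch of this first version of the format:

```typescript
// Version 1 of the format: a user has a name and an age.
interface User {
  name: string;
  age: number;
}
```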
We use TypeScript to define the JSON data formats, since JSON is a popular serialization format.
What happens now if we change this format to include an additional field, email?
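The updated format would then carry the extra field (here added as a required field, which matters for the scenarios below):

```typescript
// Version 2 of the format: an email field is added.
interface User {
  name: string;
  age: number;
  email: string;
}
```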
Is this a breaking change? It depends on how the two nodes (let’s name them Alice and Bob) interact with this type.
If Alice writes a User by using the second format (which includes the email field), and Bob reads it by using the first format, it works: Bob only expects to find the name and age fields, he just discards the email field.
However, if Alice writes a User by using the first format (which does not include the email), and Bob reads it by using the second format, it fails: Bob expects an email field that is not there.
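To make this concrete, here is a minimal hand-written decoder for the second format (a sketch; real code would use generated codecs). It accepts a document written with the second format but rejects one written with the first format, because the email field is missing:

```typescript
interface UserV2 {
  name: string;
  age: number;
  email: string;
}

// Minimal hand-rolled decoder for the second format.
function decodeUserV2(json: unknown): UserV2 {
  const { name, age, email } = json as { name?: unknown; age?: unknown; email?: unknown };
  if (typeof name !== "string") throw new Error("missing or invalid 'name'");
  if (typeof age !== "number") throw new Error("missing or invalid 'age'");
  if (typeof email !== "string") throw new Error("missing or invalid 'email'");
  return { name, age, email };
}

// Written with the second format: decodes fine.
decodeUserV2(JSON.parse('{"name":"Alice","age":42,"email":"alice@example.com"}'));

// Written with the first format: throws, because 'email' is missing.
decodeUserV2(JSON.parse('{"name":"Alice","age":42}'));
```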
We see that compatibility issues depend not only on the changes we make to message formats (such as adding or removing fields), but also on whether we read or write such messages.
Here are some concrete examples.
The evolution of our system is described as a sequence of events. This means that a node reading the event log must be able to read old formats of the events. Here, changing the format of an event to add a new mandatory field would make it impossible to read the old events (which didn’t include that field), so this would be a breaking change. However, changing the format to remove a field would be fine, because that field would just be ignored when we read an old event that still includes it.
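As an illustration of the safe case, here is a sketch of a reader whose schema no longer contains a field (the event and field names are made up): an old event that still carries the removed field decodes without error, because the extra data is simply ignored.

```typescript
// Newer schema of an event, after the 'driverName' field was removed.
interface VehicleAssigned {
  vehicleId: string;
  missionId: string;
}

function decodeVehicleAssigned(json: unknown): VehicleAssigned {
  const { vehicleId, missionId } = json as { vehicleId?: unknown; missionId?: unknown };
  if (typeof vehicleId !== "string") throw new Error("missing 'vehicleId'");
  if (typeof missionId !== "string") throw new Error("missing 'missionId'");
  // Any other field (such as the removed 'driverName') is simply ignored.
  return { vehicleId, missionId };
}

// An old event that still contains the removed field decodes without error.
decodeVehicleAssigned(
  JSON.parse('{"vehicleId":"v-1","missionId":"m-1","driverName":"Ada"}')
);
```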
The Bestmile orchestration platform exposes a public HTTP API that can be used by our clients. The client/server relationship is interesting because their roles are dual: the server reads requests and writes responses, whereas the client writes requests and reads responses. What does this entail for the evolution of the API? Adding a mandatory field to a request entity breaks clients that still use the previous version of the endpoint, but adding a field to a response entity is fine (clients that are not aware of this new field will just ignore it). Conversely, removing a field from a response entity breaks clients, but removing a field from a request entity is fine. Adding a new response status code breaks clients, but removing a response status code is fine. These are just a few examples.
To make things even more complex, the situation is slightly different for the endpoints that are only used internally by our micro-services (i.e., endpoints that are not publicly exposed), because in that case we have the ability to update the clients (our micro-services) to use the new version of the endpoint before we update the server itself. This means that a change that would break external clients may not break our internal services if we carefully handle their deployment order.
As demonstrated above, multiple factors determine compatibility constraints, with varied consequences and strategies in case of breakage. In the face of this complexity, our first step was to agree on a convention to enumerate cases and recommended approaches. A data compatibility situation exists when some party produces the data (the writer), and another consumes it (the reader).
Backward compatibility
In the general sense, backward compatibility is understood as allowing for interoperability with a legacy system. Wikipedia has this definition: “A data format is said to be backward compatible with its predecessor if every message or file that is valid under the old format is still valid, retaining its meaning under the new format.”
This can lead to an interpretation where more recent readers can still decode messages written in the older format. However, in software, we tend to have a different intuition for backward compatibility, for instance with library or application versions. A new version of a software library is backward compatible when it can still be used by a program written against an earlier version. In such cases, we have a legacy reader (the program consuming the library) able to read a newer version.
When it comes to data, we naturally transfer this intuition and interpret backward compatibility as legacy implementations’ ability to gracefully read data output from newer versions and ignore new parts that are not understandable. In other words, we can have a legacy schema on the reader side and a new schema on the writer side.
This contrasts with the aforementioned definition and with how certain others (e.g., the Avro protocol) understand backward compatibility. Still, our intuition is so established around this notion that we prefer living with this discrepancy. This case is actually typical of endpoint responses. We also generally have an intuitive understanding of this: making changes to the response schema that are backward-compatible with existing clients means ensuring they keep decoding newer responses. This matches what happens when exporting the schema and auto-generated decoders as part of an API library (our practice with the endpoints4s library): a newer version of the library needs to be backward-compatible with the legacy one that existing clients are using.
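For instance, adding a field to a response is backward-compatible in this sense, since legacy clients simply ignore it (a sketch with made-up names):

```typescript
// Response schema known to legacy clients.
interface MissionStatusResponseV1 {
  missionId: string;
  status: string;
}

// Newer response schema written by the server. Legacy clients that only
// know V1 keep decoding these responses and ignore the extra field.
interface MissionStatusResponseV2 {
  missionId: string;
  status: string;
  estimatedArrival: string; // added in V2
}
```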
Forward compatibility
We understand forward compatibility as the reversed situation: newer implementations of a protocol can gracefully read data designed for an older version. This is typical of endpoint requests: the upgraded reader is the endpoint server processing the request coming in an older format, and the legacy writer is the client. Newer readers will ignore whatever was written in removed fields, they can understand a wider scope of values, and they will interpret new optional fields as absent. We can’t add a mandatory field as older writers aren’t able to provide it, and we can’t reduce the set of supported values in our types as we would receive some values we no longer understand.
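On the request side, a forward-compatible evolution could look like this sketch (made-up names): the new field is optional, so requests written with the legacy schema are still understood.

```typescript
// Request schema that legacy clients keep writing.
interface CreateMissionRequestV1 {
  vehicleId: string;
  destination: string;
}

// Newer request schema read by the upgraded server. The added field is
// optional, so it is interpreted as absent in requests written with V1.
// Adding it as a mandatory field would be a breaking change.
interface CreateMissionRequestV2 {
  vehicleId: string;
  destination: string;
  priority?: number;
}
```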
Full compatibility
Any combination of readers and writers of different versions still interoperates without issues, i.e., the format is both backward and forward compatible: newer readers can process data written in the older format, and legacy readers can process data written in the newer format.
Compatibility recipes
We have identified a set of common scenarios to guide us in picking the right compatibility types depending on who’s producing and consuming the data (client, server, message consumer, etc.).
Naming scheme
We have also decided to use a naming scheme for our data definitions when the version isn’t part of the payload. We embed both the version and the compatibility type in the names of all our DTO types, so it’s easy to figure out which modifications are safe, e.g., UserBackwardV1, ItemFullV3, LocationForwardV2, etc.
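In TypeScript, the convention looks like this (illustrative fields):

```typescript
// The DTO name embeds the compatibility type (Backward) and the version (V1):
// this type must only evolve in a backward-compatible way.
interface UserBackwardV1 {
  name: string;
  age: number;
}
```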
The previous section showed how complex it is to assess whether a change is breaking or not. Would it be possible to automate these compatibility checks? This section presents the two tools that we use to assess the compatibility of the changes we make.
We have already mentioned that we want to be able to read the events persisted in an event log forever, even if the format of these events has changed over time. Here is our process to check that we don’t introduce a breaking change.
- When we introduce an event type, we automatically generate a bunch of random instances of this event, which are serialized into JSON and saved to the filesystem.
- Our test suite checks that the JSON codecs successfully decode the JSON files, and that if we re-encode them, we get the same JSON documents as we initially read.
- Every time we change our event type definition, we automatically generate additional JSON files with new arbitrary instances of the event.
- Our test suite checks that we can successfully decode all the JSON files (including those containing events serialized with the old format).
This process is inspired by the circe-golden library. The difference with circe-golden is that we keep the JSON files generated with the old formats so that we can check that our current codecs are still able to decode the old events, whereas circe-golden tests fail if you introduce any change in your formats, even non-breaking changes.
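Our actual implementation lives in our Scala test suite, but the idea can be sketched in a few lines of TypeScript (the Codec interface and the helper names below are hypothetical):

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical codec: encodes/decodes between an event type and JSON.
interface Codec<A> {
  encode(a: A): unknown;
  decode(json: unknown): A;
}

// Every JSON file ever generated for this event type, including files
// produced with older formats, must still decode with the current codec;
// a breaking format change makes this throw.
function checkOldEventsStillDecode<A>(codec: Codec<A>, goldenDir: string): void {
  for (const file of fs.readdirSync(goldenDir)) {
    const json = JSON.parse(fs.readFileSync(path.join(goldenDir, file), "utf8"));
    codec.decode(json);
  }
}

// For a document generated with the current format, we additionally check
// that re-encoding the decoded value gives back the document we read.
function checkRoundTrip<A>(codec: Codec<A>, jsonDocument: string): void {
  const json = JSON.parse(jsonDocument);
  const reEncoded = codec.encode(codec.decode(json));
  if (JSON.stringify(reEncoded) !== JSON.stringify(json)) {
    throw new Error("re-encoded document differs from the original");
  }
}
```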
The second tool checks the changes we apply to our HTTP APIs. We use the endpoints4s library to define our HTTP APIs, so that we can automatically generate an OpenAPI specification. Then, we use the tool openapi-diff to assess the compatibility of our changes.
More precisely, here is what happens every time we submit a pull request to change something in a service.
- We generate the OpenAPI specification of the current version of the service, by using endpoints4s.
- We download the OpenAPI specification of the previous version of the service (we automatically publish the specification of all our deployments to our internal Sonatype repository).
- We compare the two specifications with openapi-diff. Since openapi-diff only performs one-way checks, we first compare the new API with the previous API (to detect backward incompatibilities), and then we compare the previous API with the new API (to detect forward incompatibilities). Detecting forward incompatibilities is useful to know whether we have to deploy the server before the clients, or if it doesn’t matter (see the sketch after this list).
- If we detect incompatibilities, we post a comment on the pull request with a detailed report. We don’t fail the CI build when we detect an incompatibility because, as we explained earlier, some incompatibilities are fine as long as they stay within our internal services and we take care of deploying the clients and servers in the correct order.
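The gist of the two-direction comparison can be sketched as follows; findBreakingChanges stands in for the actual openapi-diff invocation and is hypothetical.

```typescript
// Hypothetical stand-in for an openapi-diff run comparing two specifications;
// returns a description of each breaking change found in that direction.
declare function findBreakingChanges(spec: string, comparedTo: string): string[];

function compatibilityReport(previousSpec: string, newSpec: string): string {
  // New API compared with the previous one: backward incompatibilities
  // (existing clients would break).
  const backward = findBreakingChanges(newSpec, previousSpec);
  // Previous API compared with the new one: forward incompatibilities
  // (the deployment order of clients and servers matters).
  const forward = findBreakingChanges(previousSpec, newSpec);

  if (backward.length === 0 && forward.length === 0) {
    return "No incompatibility detected.";
  }
  return [
    "Backward incompatibilities:",
    ...backward,
    "Forward incompatibilities:",
    ...forward,
  ].join("\n");
}
```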
Delivering new features may require changing the format of the messages read or written by the components of a system. Sometimes, this leads to a situation where some parties read a message based on a format that is different from the format used to write the message. If a reader fails to decode a message, we say that the format of the message changed in an incompatible way.
In this article, we have described and analyzed various situations where readers and writers use different message formats. For each situation, we have listed the types of changes that are safe, and the ones that are not. Finally, we have presented a couple of tools we use to automatically check the compatibility of the changes we make.