Coauthored with Pascal Rodriguez and Gregory Marti
I remember our early production deployments at Bestmile, back in 2016. There were three main software components at the time: the “core” component, holding the seed of Bestmile’s Fleet Orchestration Platform; our homemade API Gateway, holding the plumbing around the core; and finally, the Dashboard web front-end.
This early version of the platform was not serving thousands of customers. Yet, the deployment felt uncomfortable. I remember these deployments for two reasons:
- We had to do it “after hours”, when our main customers had parked and shut down their autonomous shuttles, which put the deployment in the middle of the afterwork-beer time.
- We had to plan it carefully, and manually deploy each component individually in the right order, making it difficult to actually drink the beer.
Those were the early days. Since then, Bestmile’s platform has grown well beyond three components (30+ as of Q1 2020). While the complexity increased, our customers’ operations were also growing in size and requirements:
- Features (and, sometimes, bugs) kept pouring into the product requirements.
- Operations, driven by autonomous fleets and new business models that kept evolving and changing, brought new features.
- New, bigger customers required more technical abilities and proofs, better performance, and shorter time to market.
- Most importantly, there were soon to be no such things as “after hours” as businesses would run almost 24/7 around the world.
Continuous Development is one of those “non-functional” product requirements that are not visible, yet unavoidable to keep the platform running 24/7, while at the same time reducing the time for new features to come to life.
Overview — technical background
What is Continuous Development?
Continuous Development is an umbrella term usually including both Continuous Integration (CI) and Continuous Deployment (CD), as well as the processes (human and automated) around specifying, delivering, and operating software continuously.
CI is the technical process where software code is built, tested and validated automatically. It gives software engineers feedback on the quality of their development.
CD usually happens after CI and is the automated or semi-automated process of continuously pushing the result of the CI process to environments where the software can be operated — eventually to a production environment used by customers.
Why is it important?
At first, for someone not used to software development lifecycles, it might sound counter-intuitive to push code continuously to production in a system that is critical to daily transport operations. That means the product would be continuously changing, though not always visually changing. Still, the advantages of Continuous Development exceed the drawbacks.
For business teams, Continuous Development drastically reduces the time-to-market of new features and fixes. From specification to delivery, time can be reduced from months to days.
For product teams, more frequent and smaller changes mean quicker and more focused feedback from internal stakeholders and from customers. It also helps with innovation, when we don’t always know in advance what will work well and need to experiment in a real production setup. In a changing mobility market, that is gold.
For engineers, making sure production systems are always up and running leads to smaller changes in the code. Smaller changes are more stable and controllable. Smaller changes also mean easier rollbacks when something goes wrong. With a Continuous Development mindset (a.k.a. DevOps), developers are also owners of production systems. They see the result of their work sooner and can adapt faster.
What are the risks and drawbacks?
Quality
Continuous Development without a good testing strategy will lead to catastrophes in production. With the testing strategy becoming central, the role of the QA Engineer changes drastically in this setup. Instead of being a gatekeeper, the QA Engineer becomes a coach and a developer focused on quality. Automated test-suites become part of the product and cover the whole pyramid of tests, as “manual testing” becomes the exception in a Continuous Development setup. That is a significant effort distributed among developers and testers.
Tooling
Automated tests, automated deployments, and monitoring are essential to making CD work. Bringing new tools means supporting the infrastructure for them, or paying for 3rd parties. This costs both money and effort upfront.
What about engineers’ roles?
In the old days, each “role” had its own time in the process: Design, Architecture & Security, Development, Testing (QA, including Perf and Security), and Deployment.
With a Continuous Development process, all these roles are merging, and the associated responsibilities fall to all developers.
At Bestmile the QA role transformed from a “pre-deployment-gateway” role to a “quality strategist” role. Testing is now the responsibility of each developer and starts at the Unit Test level (bottom of the pyramid). The QA Engineer is here to identify the bigger risks, catch quality gaps as early as the specification process, and own the testing infrastructure and automation scripts.
Regarding deployments, each engineer has the right, duty, and responsibility to perform deployments in production. There is no deployment group.
Bestmile’s transformation towards continuous development
Continuous Development is not only about the tools. It’s also about the mindset. In the context of a startup, with very scarce resources (money and time), and with engineers who need to take care of a wide breadth of features and other non-functional requirements, Continuous Development becomes “one of the many things” on the table.
Transforming the way code is deployed requires the right mindset, teamwork, and specific cross-team processes that need to evolve with the product.
There is never one point in time where someone at Bestmile could say: “That’s it, we have Continuous Development set up and working”. Instead, it has become one of those non-functional platform features that we keep evolving along with the product.
The first step for this transformation is the mindset of the engineering team, and of all other actors such as product managers and customer-facing teams.
The mindset, and “the continuous development of continuous development”
First rule of the Continuous Development club: do not talk about it… err. Actually, you need to talk about it early, and all the time. The goal is to make not only the engineers, but also the larger team, understand how code changes are going to hit production, and what that means regarding SLAs, quality, impact on customers, documentation, and communication.
Below are a few rules we set that are part of every development effort. Spoiler: they are pretty standard in the software industry.
#1 — All developers publish their code in production
There is no handover to any dedicated “Deployment” team. By publishing to production, developers are more aware of the conditions necessary for their code to work. They are also in the best position to support any issues that follow.
By having the responsibility and the ownership to deploy to production, every developer will also make sure that they factor the right level of quality into their work. They have a very good incentive to block development work that would jeopardise production stability.
#2 — Feature promotion: have a clear and fixed promotion path from the developer’s machine to the production environment, even for hotfixes
To be able to automate testing, and to build predictability regarding when a feature reaches production, it is necessary to define a clear path. When is the feature being integrated? When is it tested and by which test phase? How do we roll back and hotfix, and what is the impact on the deployment process? Bypassing these promotion paths puts the quality at risk.
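As a rough illustration of what a fixed promotion path means in practice, here is a minimal sketch. The stage names, the `Build` class, and the `promote` function are all hypothetical (the article only names Staging and Production); the point is simply that a build can only move to the next stage, never skip one, and that a hotfix follows the same rule.

```python
from dataclasses import dataclass

# Hypothetical stage names, for illustration only.
PROMOTION_PATH = ["developer", "integration", "staging", "production"]


@dataclass
class Build:
    version: str
    stage: str = PROMOTION_PATH[0]
    is_hotfix: bool = False  # hotfixes get no shortcut


def promote(build: Build, target: str, tests_passed: bool) -> Build:
    """Move a build exactly one step along the path, or raise ValueError."""
    current = PROMOTION_PATH.index(build.stage)
    if current == len(PROMOTION_PATH) - 1:
        raise ValueError(f"{build.version} is already in {build.stage}")
    expected = PROMOTION_PATH[current + 1]
    if target != expected:
        raise ValueError(
            f"cannot go from {build.stage} to {target}; next stage is {expected}"
        )
    if not tests_passed:
        raise ValueError(f"tests failed, {build.version} stays in {build.stage}")
    return Build(build.version, target, build.is_hotfix)
```

With this sketch, `promote(Build("1.2.3", is_hotfix=True), "production", True)` raises an error: even a hotfix has to go through every stage in order.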
At Bestmile, feature promotion is done automatically, based on a GitOps process (see below).
[Figure: Bestmile’s feature promotion path]
#3 — Backward compatible by default
This is probably one of the most important rules, and also one of the costliest. When changes are applied to the code, developers need to identify any breaking change, not only in signatures (of a REST API, for example) but also in behavior.
At Bestmile, as development is happening fast, endpoints are all versioned independently. When a breaking change is necessary, the endpoint’s version is usually “bumped”, and compatibility with lower versions is guaranteed as much as possible.
As a rule of thumb:
- Adding properties or functionality: no migration needed
- Removing or renaming: you will need a new version
There are some corner cases, but all developers at Bestmile are aware of and follow backward compatibility principles.
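The rule of thumb above can be sketched as a small check on the property names of a response. This is only an illustration under simple assumptions: `breaking_changes` and `next_version` are hypothetical helpers, schemas are reduced to flat sets of property names, and behavioral changes (which also count as breaking) are not detected.

```python
def breaking_changes(old_fields: set[str], new_fields: set[str]) -> set[str]:
    """Properties existing clients may rely on that the new schema lacks.

    A rename shows up here as a removal (plus an unrelated addition).
    """
    return old_fields - new_fields


def next_version(current: int, old_fields: set[str], new_fields: set[str]) -> int:
    """Bump the endpoint version only when something was removed or renamed."""
    return current + 1 if breaking_changes(old_fields, new_fields) else current


v1 = {"id", "name", "status"}

# Adding a property: same version, no migration needed.
same = next_version(1, v1, v1 | {"eta"})

# Renaming "name" to "label": existing clients break, so a new version is needed.
bumped = next_version(1, v1, {"id", "label", "status"})
```

Here `same` stays at 1 and `bumped` becomes 2, while the old version keeps being served to clients that have not migrated yet.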
#4 — Dissociate deployments (to production) from deliveries (to clients)
Continuous Deployment of code means that the code is reaching production before the full readiness of an end-to-end feature. It’s especially true in bigger, more complex systems.
Some features might require multiple sprints to be fully ready for a customer to use. In a more traditional deployment format, features used to be ready-to-use when the code was deployed to production.
With a Continuous Deployment approach, some components might be ready sprints, if not months, in advance, while the full feature is not yet usable.
#5 — Feature flags
Feature flags are a mechanism that allows the user (or an admin) to enable or disable a feature for customers.
Linked to rule #4 above, this allows us to orchestrate the development of a feature at different speeds between teams. That way, even if the full feature is not available yet, we can already start using part of it, or we can enable it earlier for some preselected and trusted customers. This allows faster feedback loops. The drawback is that you need to account for these flags in advance, and document them.
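A minimal sketch of such a flag, assuming a simple in-memory model, could look like this. The `FeatureFlag` class, its fields, and the flag and customer names are hypothetical and do not describe Bestmile’s actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    description: str  # flags need to be documented
    enabled_for_all: bool = False
    enabled_customers: set[str] = field(default_factory=set)

    def is_enabled(self, customer_id: str) -> bool:
        """True if the feature is delivered to this customer."""
        return self.enabled_for_all or customer_id in self.enabled_customers


# The code is deployed to production for everyone,
# but delivered only to one trusted customer.
flag = FeatureFlag(
    name="new-booking-flow",
    description="Hypothetical feature still being built across several sprints",
    enabled_customers={"trusted-customer"},
)

early_access = flag.is_enabled("trusted-customer")
everyone_else = flag.is_enabled("another-customer")
```

Once the full feature is ready, setting `enabled_for_all` to `True` delivers it to every customer without a new deployment.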
What about the product design itself?
Customer-facing teams need to understand the orchestration of delivery vs. deployment to know when to communicate what to customers. On the other end, we found that product management teams have a big role to play in that orchestration.
Some leading companies in the software field call that the BusProdDevOps (Business+Product+Development+Operations) mentality, related to what was known in software as DevOps (Development+Operations).
This highlights the need for business and product to be aligned with how the software is built.
The technical stack
The details of what works for Bestmile at the moment are going to be shared in a separate article. Below is a summary (and spoiler!).
Bestmile leverages Kubernetes and Docker container technologies to orchestrate zero-downtime Continuous Deployments using out-of-the-box rolling update strategies.
But installing Kubernetes is only the first step. We had to prepare all our services to support this kind of deployment, and shape the tools that would fit our team processes.
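To give an intuition of why a rolling update avoids downtime, here is a deliberately simplified simulation. It is not the Kubernetes controller’s actual algorithm: `rolling_update` is a hypothetical function, new instances are assumed to become ready instantly, and the `max_surge` and `max_unavailable` parameters only mirror the spirit of the `maxSurge` and `maxUnavailable` settings of a Kubernetes Deployment.

```python
def rolling_update(replicas: int, max_surge: int = 1, max_unavailable: int = 0):
    """Yield (old_ready, new_ready) after each step of a simplified rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    old, new = replicas, 0
    yield old, new
    while old > 0 or new < replicas:
        # Start new instances, never exceeding replicas + max_surge in total.
        start = min(replicas - new, replicas + max_surge - (old + new))
        new += max(start, 0)
        # Stop old instances, keeping at least replicas - max_unavailable ready.
        stop = min(old, old + new - (replicas - max_unavailable))
        old -= max(stop, 0)
        yield old, new


# Three replicas, one extra instance allowed, none allowed to be missing.
steps = list(rolling_update(3, max_surge=1, max_unavailable=0))
```

In this run, old instances are replaced one at a time and the number of ready instances never drops below three, which is what lets customers keep using the platform during a deployment. In a real cluster this only holds if each service reports its readiness correctly and shuts down gracefully, which is part of the preparation work mentioned above.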
Stay tuned for more details of how we use and orchestrate those technologies in the follow-up part of this article.
Our takeaways so far
- Our time to market for most of our features is shorter than before, sometimes as short as a few days, even as complexity increased. This would not have been possible without the groundwork on Continuous Deployment done during those years.
- Deployments of the platform to production happen on average every two days. This is a significant achievement given the size of the team, and the size of our stack.
- The development work deployed is smaller, more stable, and easier to control. This also shows in the product specifications: the scope is smaller and more agile as a result.
- Being forced to be backward compatible is a frustrating consequence sometimes. But it also helped us orchestrate the work better between teams, and improved our quality. Not all teams and customers are forced to migrate right away when you have to push a change.
- As with most non-functional requirements, the ROI of some tools or investments in automation is very hard to sell upfront, especially when the time and effort are competing with features that are critical for winning in the market. It is even harder if the market (the mobility and public transport industry) is not yet aware of or used to software engineering processes such as Continuous Deployment.
- The main bottleneck is not necessarily technical. Tools exist today to help with technical bottlenecks. To deploy fast you also need to specify fast, and you need to plan the time to build your pipelines. Continuous Development is hard to sell when faced with clear feature gaps.
- Today the business and the product teams see the indirect business value in CI/CD, which helps us plan time to improve our processes and tooling. This requires human effort, training, and processes.
- Training and communication of teams and customers. There is a biased perception of causal effects that is hard to fight: “If today you deploy every month, and you bring down the production every two deployments, that means that if you deploy every day, you’re going to break the production every two days”
The last point proves to be wrong, as smaller incremental deployments tend to break less, are more stable, are easier to roll back, bring feedback sooner, and are also easier to fix (in case a rollback is not an option).
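The arithmetic behind that perception can be written down in a few lines. This is a toy model, not measured data: `expected_incidents` is a hypothetical helper, a month is taken as 30 days, and the idea that the risk of a deployment shrinks in proportion to the size of the change is an assumption made purely for illustration.

```python
def expected_incidents(deployments_per_month: int, failure_probability: float) -> float:
    """Expected number of production incidents per month."""
    return deployments_per_month * failure_probability


# Monthly deployments, breaking production every two deployments.
monthly = expected_incidents(1, 0.5)

# The biased reasoning: deploy daily, but keep the same risk per deployment.
naive_daily = expected_incidents(30, 0.5)

# Assumption: a daily deployment carries about 1/30th of the change,
# and therefore about 1/30th of the risk.
scaled_daily = expected_incidents(30, 0.5 / 30)
```

The naive reasoning gives 15 incidents per month, while under the scaled assumption the count stays at roughly one incident every two months, the same as with monthly deployments. The model leaves out the fact that each of those incidents concerns a much smaller change.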
Stability keeps improving as engineers get better at deploying, build habits and tools, and spread the mindset, skills, and processes around this deployment practice.
Issues still happen, but they are less critical, less frequent (or at least not more), and last for a shorter amount of time.
The state of our current Continuous Development process, in numbers
As of the publication of this article, our Platform and Dashboard total:
- 300+ deployments per month on Staging
- 20+ deployments per month on Production
- 9000+ Unit tests
Continuous Development of our Mobility Platform was originally published in Bestmile on Medium.