It’s 3pm PDT on June 2nd, 2020 and the world is watching. Riot’s new game VALORANT is about to launch.
Hundreds of engineers, architects, operators, product managers, designers, and PR folks are standing by. Any large-scale all-at-once launch is nerve-wracking, but our infrastructure team of around 120 people is ready. We go live.
Players immediately begin showing up, just like we had expected, modeled, and tested for. Instead of fire-fighting, the infrastructure team goes about their normal tasks as all green metrics stream in around us.
My name is Ala Shiban. I led the Cloud Services group at Riot Games, a centralized technology team which enabled hundreds of engineers to ship multiple new large-scale live services for over 180 million users around the globe. It took us 3 years, 50 people, and many cross-company alignment efforts, and we built a highly effective, proprietary, global hybrid cloud platform. It had the ability to describe a large set of microservices as a single versionable package that could be configured, deployed, and operated on top of our platform running in heterogeneous clouds. It also meant we could run the entire set of services on Riot’s 12+ data centers around the world, including AWS and Tencent in China.
When I left Riot, I looked back at all the great work we had done, and I thought to myself:
“We succeeded in such incredible ways… and we shouldn’t have needed to do any of it.”
Computer engineering has always been pushed to its limits to enable larger and more ambitious dreams. Cloud computing is no exception; parallel computing, cluster computing, grid computing, and edge computing are all constantly expanding what we think of as possible, which simultaneously makes the cloud harder to develop against.
The industry is now in a complexity-streamlining phase of cloud computing. The most evident examples are integrated solutions that optimize for certain workloads or development models: Google’s Anthos, AWS Outposts, Azure Stack Hub or the Hashistack.
Those solutions bundle together the building blocks necessary for larger-scale applications and systems, but they expose complicated low-level interfaces that developers and operators must learn, configure, assemble, and scale appropriately.
There’s a similar evolution toward reduced complexity happening in programming languages: punch cards, assembly, C, C++, Java … Continuous improvement keeps happening, but at some point, an architecture shift emerges that tackles the accumulation of complexity.
I’d like to take a look at a few principles that I view as critical for this architectural shift to emerge, and what we need from products to effectively take us into the new world of cloud computing.
To effectively reduce complexity, we need to absorb it outside of the purview of the developer or operator… not pass it around like a hot potato. Let’s take a look at a principled approach that should properly address complexity by design.
A solution should:
- maintain benefits from existing architectures
- keep tools and programming languages usable
- integrate with an ecosystem instead of trying to replace it
- ensure user code is recognizable, debuggable, and patchable–even in production
Maintain benefits from existing architectures
Incidents are part of any live system, distributed or not, and no company is immune. But in a monolithic world, observing and tracing application-level incidents tends to be more straightforward due to centralized instrumentation and the availability of cross-API context.
On one of the teams I worked on, it took several months on average to get a new developer onboarded and productive on a microservices-based architecture. An SDK team was put in place to simplify the process, but it was soon overrun by reasonable requests from feature teams. Papercuts increased over time as smaller features were harder to prioritize. The more the platform was adopted, the worse the problems became.
Everything is an evaluation of existing tradeoffs.
Microservices make it easy to have fault isolation, independent deployments, custom per-service environments and modular code, as well as team boundaries.
Monoliths make it easier to be productive, deploy and test features, trace errors, and create an integrated developer experience.
Startups and companies continuously attempt to superglue in new systems to solve old problems. We all used to boast about the number of services we wrote and operated, only to realize that thousands of microservices means thousands of puzzle pieces used in different, disconnected ways.
This is a tradeoff we’ve made as a community to gain the flexibility and benefits of microservices.
“A new architecture must maintain all the previous architecture’s benefits while reducing the complexity of gaining them”
Keep tools and programming languages usable
There are effective patterns that solve many of the problems in either architecture. You don’t usually find yourself asking how to call a profile API in a monolithic architecture. And you don’t usually ask who to talk to for a custom OS to run container code with microservices. But any transition between architectures makes those patterns difficult to use, as the underlying capabilities that made them easy are not necessarily there anymore.
I was once on a team where we couldn’t spin up a new test environment–let alone a local one–because it required pulling together hundreds of microservices, coordinating deployments, and divining which configurations should be set without the benefit of knowing each system. But in a monolithic architecture, you’re an F5 press away.
“A problem that’s been solved in a previous architecture can be solved the same way in the new architecture”
Integrate with an ecosystem instead of trying to replace it
Large companies have teams that all use different tech stacks, whether it’s due to team knowledge and familiarity, or because it was the right choice for the problem at that time. Friction becomes the norm once there’s a need to centralize efforts to gain economies of scale.
On another one of my teams, it took us 4 months to validate that a best-in-class observability SaaS solution would work for a diverse tech organization, because it required retrofitting and redeploying each one of the hundreds of mission-critical services to get the real value. The high time-to-value meant we couldn’t replace it down the line either, placing us in a horrible negotiating position once the contract needed to be renewed.
“A new architecture must integrate into existing ecosystem tools with a significantly lower time-to-value”
Ensure user code is recognizable, debuggable, and patchable
Live systems are the lifeblood of a business. Their maintainability and reliability make the biggest difference between a sustainable, agile environment and one that is constantly in a reactive, on-fire mode with no ability to move forward.
Solutions today streamline or absorb the complexity into expert systems–ones that require significant training and understanding to operate or patch. Several of those companies are explicitly launching managed services due to the difficulty of operating their solutions… so much so that their users are de facto locked in by sheer complexity.
A good test for whether a solution is operable is something called the phone call razor.
Here’s how it works. Let’s say there’s a SEV0 outage on your service, and you’re not sure what’s happening. Is your solution simple enough that you can easily decide if the bug is in your code or the abstraction? Once you figure it out, is it simple enough for you to work around the bug until it’s fixed upstream?
If it isn’t simple enough, you’re betting your entire company on a phone call with one specific vendor. That’s quite an illusion of self-sustainability.
“The new architecture and abstraction must be simple enough to leave, operate, and modify, especially in live outage scenarios”
The Next Cloud Computing Architecture Is Here
Cloud computing has truly reached peak complexity, with the dominant available architectures shifting this complexity from one location to another instead of addressing it at its source.
I believe the right solution requires a new architecture that follows key desirable design principles–ones that maintain benefits from previous architectures without requiring relearning tools and how to work.
In my next post, we’ll take a close look at how we built Klotho around these principles. I’ll be introducing the next cloud computing architecture that comes after monoliths, microservices, and serverless, a solution which fundamentally solves the complexity of cloud development without sacrificing everything we’ve grown to appreciate about it so far.