devscoach.

Effective DevOps Consultant Guide

2024-04-01

devops

consulting

As we near the end of 2023, I thought I would share what I think are the essential steps to being an effective DevOps consultant. Whether you are called a DevOps engineer, Infrastructure Engineer, or SRE, you want to solve your clients' problems smoothly and efficiently. This usually means keeping things simple, tracking important metrics, and being rock-solid.

Like children in the 50s, you should focus on not being noticed. While the product engineers are building new features and flashy UIs, the role of an SRE is to stay in the background and keep the whole operation running. Although some might find it thankless, I believe there is great beauty in a well-oiled machine.

The following three points, to me, cover the most important 80% of the work.

Embrace Simplicity in Solutions

It has been repeated ad nauseam, but the fewer moving parts, the fewer things that can break. It might be fun to try out "serverless" on your own time, but if a client is paying you, stick to the boring tech. What I find most engineers don't understand is the hidden cost of various solutions.

Hidden costs of infrastructure changes usually manifest themselves in one of the following ways:

Worse DX (Developer Experience) - How easy is it to run on their machine?
Maintenance Cost - How hard are bugs to track down, how hard are upgrades?
Scaling Cost - If we have 10x growth, is the price linear or exponential?
Complexity Cost - How hard is it to mentally model for an engineer?
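To make the Scaling Cost question concrete, here is a minimal sketch that asks "if load grows 10x, how many times bigger does the bill get?" Both pricing functions and all of their numbers are hypothetical, invented only for illustration. The quadratic curve is a stand-in for any cost that grows faster than load, not a claim about a real provider.

```python
def growth_multiplier(monthly_cost, current_load, growth=10):
    """How many times larger the bill gets when load grows by `growth`."""
    return monthly_cost(current_load * growth) / monthly_cost(current_load)


def per_request_pricing(requests_per_month):
    # Hypothetical: a flat $0.50 per million requests, so cost is linear in load.
    return 0.50 * requests_per_month / 1_000_000


def superlinear_pricing(requests_per_month):
    # Hypothetical: cost grows with the square of load (in millions of requests),
    # standing in for any setup that gets disproportionately pricier as it grows.
    millions = requests_per_month / 1_000_000
    return 0.05 * millions ** 2


current = 2_000_000  # assumed current load: 2 million requests per month
for pricing in (per_request_pricing, superlinear_pricing):
    print(pricing.__name__, round(growth_multiplier(pricing, current), 2))
```

With linear pricing, 10x the load means roughly 10x the bill. With the quadratic curve, the same growth means roughly 100x. Plugging a client's real pricing into a function like this is a cheap way to spot that second case before committing to a tool.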

When you are considering adding or refactoring a piece of infrastructure, ask yourself these questions:

What business problem will this fix?
What are the hidden costs of this solution?
Is there a simpler way to solve the problem using the existing infrastructure?

A LOT of developers, SREs, and CTOs will look at a fancy marketing landing page, see a value proposition that aligns with their problem, and decide to use that solution. This is exactly the point of those landing pages! To accurately weigh whether you should use a tool, you need to do -- gasp -- actual research.

So what is this actual research that I so sarcastically mentioned? Well, it involves:

Reading the documentation thoroughly
Doing a practical experiment with the tool
Looking through open GitHub/bug tracker issues (how long does it take the maintainer to respond, how many people are actively using it, are there serious bugs with the tech stack my company is using, etc.)

So what does this have to do with simplicity? Usually the simplest solution will have the fewest hidden costs, the fewest bugs, and will allow you to scale more easily later on. But I'm not going to be prescriptive here. Maybe you are a Kubernetes master, or have experience scaling to 100k users with Serverless. Personally, I like using Docker on ECS, a VPS with Kamal, or just bare metal (with Chef or Ansible). But whatever you choose should be boring to you, because it should be the most familiar. Familiarity, repeated, breeds excellence, and your clients want excellence.

Track Important Metrics

You can't improve what you don't measure. But more important than tracking everything is focusing on the subset of metrics that actually matter to the application you are monitoring. For example, tracking writes per second to the database is unimportant if you are working on a read-heavy application. I'm not saying you shouldn't track it, but it shouldn't be your focus.

At first, try to narrow down to the three most important metrics. As a rule of thumb you want to know: where requests are spending the majority of their time (p95 latency, avg. db query time, etc.), how the servers are handling the current load (mem, cpu, disk, etc.), and what is failing (5xx, dead procs, etc.). The metrics will be unique to the application you are working on, but following those guidelines should get you close to the goods.
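As a rough illustration of two of those numbers, here is a minimal sketch that reduces a window of requests to a p95 latency and a 5xx error rate. The `RequestLog` shape and the sample data are made up for this example, and the percentile uses the simple nearest-rank method. A real metrics stack will typically compute percentiles from histograms and may give slightly different values.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    """One served request (hypothetical shape, not tied to any real log format)."""
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


def summarize(requests):
    """Reduce a window of requests to one latency figure and one failure figure."""
    latencies = [r.latency_ms for r in requests]
    server_errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "p95_latency_ms": percentile(latencies, 95),
        "error_rate_5xx": server_errors / len(requests),
    }


# Made-up window: 100 requests taking 1..100 ms, with the 5 slowest returning 503.
window = [
    RequestLog(latency_ms=float(ms), status=503 if ms > 95 else 200)
    for ms in range(1, 101)
]
print(summarize(window))
```

The point of p95 over an average is that it shows what the slowest 5% of requests experience, which an average tends to hide.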

The thing that can set you apart from other backend devs and ops engineers is building metrics that can be effectively communicated to the rest of the business. I have seen plenty of SREs get upset when a product manager doesn't understand a dashboard they are looking at. The problem isn't that the product manager is dumb, as much as we would like to think that. The problem is that this knowledge isn't vital to their daily work. You need to package it in a way that makes sense to everyone involved. I suggest creating multiple dashboards:

One for yourself: All the little details that let you know things are humming along
One for the development team: The top 3 metrics and then other things they will care about: real page load time, 5xx errors, db call latency, etc.
One for the management/executive team: Number of concurrent active users, requests broken down by country (using IP data), uptime, monthly cost

Datadog and other hosted metrics services are easy to set up if your employer is willing to pay the fee. Otherwise, I suggest getting familiar with an open-source solution such as Grafana with Telegraf & InfluxDB, or Prometheus.

Be Rock Solid & Reliable

The true mark of a great SRE is reliability. I'm not talking about the reliability of the service -- which is table stakes -- I'm talking about you as an operator. Whether it is appreciated or not, the service's day-to-day operation rests on your work. When shit hits the fan, and it will, the company needs to have 100% confidence that you as a person will be able to fix it.

You are the company's insurance policy & firefighter. If you don't, or can't, fulfill your end of the bargain, you will lose business and miss out on potential referrals. Protecting your reputation as a rock-solid operator is a must for any DevOps consultant.

So how do you do that?

Be communicative about the current state of the infrastructure: how much load can it handle?
Be responsive, even on off hours, especially during an emergency.
Maintain a calm, mature attitude. SREs have to deal with tense situations, and staying calm goes a long way toward others viewing you as reliable.

Okay, that's all well and good, but how can you relax? How can you take time off? The answer is to teach this same reliable attitude to the engineers you are working with. One of the main reasons simplicity is at the beginning of this guide is that a simple solution can be documented, operated, and taught simply.

If you build an eight-headed hydra that requires a PhD in recursive YAML to understand, then you have failed, because now only you can operate it, and you will always be called to fix it. If, instead, you build a nice simple service with clear documentation, nicely written runbooks, and easy-to-understand metrics, then you and your company can rest easy.

It sounds like I'm telling you to work yourself out of a job. Maybe! But now you have an amazing reference, a reputation as a rock-solid operator, and the goodwill of a company that can scale (and pay you more later). Additionally, the more you impress people, the higher you can set your rates!

Conclusion

Maybe at the beginning I fooled you a bit by mentioning 2024. What does this have to do with 2024?! Absolutely nothing. These points are the guiding principles that any infrastructure engineer worth their salt should follow. New technologies are sweeping the world, AI workloads are increasing at a huge rate, and the year of the Linux desktop has finally arrived (not really). To wade through everything new and come out employable and relevant, you need a firm grasp of the foundations, which will pay dividends year over year.