When I first started at Traveloka, I realised that there were some good engineering practices, but that others could be improved. It was only through this crisis that we discovered how to embrace our inner engineers while working from home.
One of the first things I noticed was a lack of centralised documentation. How can you onboard an engineer remotely? You don’t have the luxury of a whiteboard to show them how things work. You also don’t have the ad-hoc time to ask questions and jump into pair programming with your peers.
So I introduced the RFC process. I wanted a way to centralise these documents, so I suggested we use the company wiki. I introduced the process by writing the first RFC myself, which also established the format.
    # RFC-NNN - [short title of solved problem and solution]

    * Status: [proposed | rejected | accepted | deprecated | … | superseded by [RFC-005](link to RFC)] <!-- optional -->
    * Deciders: [list everyone involved in the decision] <!-- optional -->
    * Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->

    Technical Story: [description | ticket/issue URL] <!-- optional -->

    ## Context and Problem Statement

    [Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

    ## Decision Drivers <!-- optional -->

    * [driver 1, e.g., a force, facing concern, …]
    * [driver 2, e.g., a force, facing concern, …]
    * … <!-- numbers of drivers can vary -->

    ## Considered Options

    * [option 1]
    * [option 2]
    * [option 3]
    * … <!-- numbers of options can vary -->

    ## Decision Outcome

    Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].

    ### Positive Consequences <!-- optional -->

    * [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
    * …

    ### Negative Consequences <!-- optional -->

    * [e.g., compromising quality attribute, follow-up decisions required, …]
    * …

    ## Pros and Cons of the Options <!-- optional -->

    ### [option 1]

    [example | description | pointer to more information | …] <!-- optional -->

    * Good, because [argument a]
    * Good, because [argument b]
    * Bad, because [argument c]
    * … <!-- numbers of pros and cons can vary -->

    ### [option 2]

    [example | description | pointer to more information | …] <!-- optional -->

    * Good, because [argument a]
    * Good, because [argument b]
    * Bad, because [argument c]
    * … <!-- numbers of pros and cons can vary -->

    ### [option 3]

    [example | description | pointer to more information | …] <!-- optional -->

    * Good, because [argument a]
    * Good, because [argument b]
    * Bad, because [argument c]
    * … <!-- numbers of pros and cons can vary -->

    ## Links <!-- optional -->

    * [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
    * … <!-- numbers of links can vary -->
To be clear, I didn’t come up with the format above. I actually took it from here. It’s a nice concise format with the pros and cons of each of your considered options and the outcome. When writing an RFC, you should have multiple approaches to the problem, and the team will collectively agree on the way forward.
My RFC suggestion is far from perfect. There have been several suggestions for the process and several comments. For example, a common complaint is that the wiki is not interactive enough and that Google Docs offers a better collaborative environment. We’ll work together to come up with an incrementally better solution for the future.
Another problem we faced was that we weren’t taking care of our services. We were having incidents but weren’t really following up on our post-mortem process. We saw repeating incidents without any clear action items.
One of the first things I did after an incident was to host a post-mortem retrospective. I tried to facilitate a blameless learning process. How can we improve and learn from our mistakes? What can we do better in the future? Which processes failed?
By having this retrospective, we made sure that the action items from each post-mortem made sense and that we investigated the root cause of each incident. This solidified our knowledge and shared it across the whole team (not just engineering, but also product, quality, and business).
One thing that did come out of these retrospectives is that engineers did not understand why we needed post-mortems. Why should we go over what happened in production? Why create all this documentation? A common reaction was that engineers felt the blame was being put solely on them, especially as the business started to look into these post-mortems and senior technology stakeholders became interested in the outcomes.
To answer why, we need to explain that we can learn from failure. By reviewing where we went wrong, we can figure out what to do to prevent similar issues in the future. If you never share the concurrency problems you ran into, how will others know to review their own systems and stay vigilant for the same issues appearing one day on their side?
Google has an impressive SRE culture, and from that, an impressive post-mortem culture. Amazon also has great attention to detail regarding post-mortems (from what I’ve heard). While we may not be Google or Amazon, we owe it to ourselves to embrace our inner engineers, figure out the root causes, and learn from our mistakes.
Something we recently started to help with our technical operations was a bi-weekly technical operations sync. This is only for my teams to make sure that they are monitoring the right things and keeping track of achievements and learnings.
I was actually inspired by Sebastian, who told us how he manages his teams and keeps them focused. While his format and methods are slightly different, I agree with many of his teams’ practices. A sync like this helps teams stay operationally focused and not lose sight of how our services are running.
The meeting agenda is as follows:
What: To help us become more focused on technical operations, I’ve created this bi-weekly meeting for just the teams under me (to begin with). We will focus on how we can improve our operational excellence. This includes things like dashboards, alerting, automation, and processes.
Who: You may nominate someone from your team to take your place each session and report back to you.
Wins: This is where we talk about improvements that we have made over the last two weeks.
- Did our availability go up?
- Did we create a new alert?
- Did we change a process within our team?
Retrospective: This is where we can all learn from our failures.
- Were there any incidents over the last two weeks?
- What were the MTTD and MTTR of the incidents?
- What were the learnings from the incidents?
Deep dive: Each session we will deep dive into a team’s technical operations. I will do a round-robin of each subdomain/team to begin with, and then we will randomize it. Be prepared.
- What are your dashboards?
- What are you measuring?
- How do these dashboards help you solve incidents?
- What are your key metrics?
- How do you run the on-call process? Is there a playbook?
- What logs do you turn to? How noisy are they?
- What alerts do you have?
- How much does on-call impact your work?
- Are you often paged outside of work hours?
MoM: Link to Google Docs
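The availability and MTTD/MTTR questions in the agenda above come down to simple arithmetic over incident timestamps. Here is a minimal Python sketch of one way to compute them. Everything in it is an assumption for illustration: the `Incident` record, its field names, and the sample timestamps are hypothetical, MTTR is measured from the start of the incident to resolution (some teams measure from detection instead), and availability treats every incident as full downtime with no overlapping incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    # Hypothetical record; the field names are illustrative only.
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when an alert or a person noticed it
    resolved_at: datetime  # when the service was restored


def mean_duration(durations: Sequence[timedelta]) -> timedelta:
    """Average a non-empty sequence of durations."""
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to detect: from incident start to detection."""
    return mean_duration([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to resolve: from incident start to resolution (assumed definition)."""
    return mean_duration([i.resolved_at - i.started_at for i in incidents])


def availability_percent(incidents: Sequence[Incident], period: timedelta) -> float:
    """Percentage of the period not spent in an incident.

    Assumes incidents do not overlap and each one counts as full downtime.
    """
    downtime = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return 100.0 * ((period - downtime) / period)


# Made-up sample data for one two-week window.
incidents = [
    Incident(datetime(2020, 6, 1, 10, 0), datetime(2020, 6, 1, 10, 10), datetime(2020, 6, 1, 11, 0)),
    Incident(datetime(2020, 6, 9, 22, 0), datetime(2020, 6, 9, 22, 20), datetime(2020, 6, 9, 22, 40)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
print(f"Availability: {availability_percent(incidents, timedelta(days=14)):.3f}%")
```

Comparing these numbers from one sync to the next is what lets a team answer "did our availability go up?" with data rather than a feeling.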
I want to give a shout-out to all my teams. Since we adopted this meeting, we’ve been looking at ways to measure our availability and our services, and questioning why we still have alerts that were set up two years ago. Backend engineers are beginning to understand what information they need from the frontend and where to look for it, and vice versa. We’re creating lines of communication and thought that did not exist a few months ago.
Lastly, I want to give you an overview of how we handle our backend on-call engineering handover while working remotely during the pandemic. We noticed that engineers are often given ad-hoc tasks by product, customer support, or other teams (especially if you are an enabler team used by other teams).
What I set out to do was create a brief on-call log. This idea actually came from Anshul, for some of the other teams in the fintech domain. But I didn’t want to complicate my team with a formal process.
In my Rewards backend engineering team, we have a weekly on-call round-robin schedule. This ensures that ad-hoc tasks don’t overburden the engineers and that they also have time to investigate and fix critical issues.
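As a rough illustration only, a weekly round-robin like this can be generated in a few lines of Python. The engineer names, the start date, and the `weekly_rotation` helper are all hypothetical; this is a sketch of the idea, not the tool we actually use.

```python
from datetime import date, timedelta
from itertools import cycle
from typing import List, Sequence, Tuple


def weekly_rotation(
    engineers: Sequence[str], first_week_start: date, weeks: int
) -> List[Tuple[date, str]]:
    """Pair each week's start date with the next engineer in round-robin order."""
    if not engineers:
        raise ValueError("need at least one engineer")
    week_starts = (first_week_start + timedelta(weeks=i) for i in range(weeks))
    # zip stops when the week starts run out, so the endless cycle is safe here.
    return list(zip(week_starts, cycle(engineers)))


# Placeholder names and an arbitrary start date.
schedule = weekly_rotation(["engineer-a", "engineer-b", "engineer-c"], date(2020, 6, 1), 5)
for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
```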
The on-call log is a really informal document that lists what each engineer has done during their on-call week. For example, did you get four ad-hoc tasks from product admins to inject or query some data? The more we can see these patterns, the more we can look at automating these tasks and justifying the opportunity cost to product.
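To show how such patterns could be surfaced, here is a small Python sketch that tallies log entries by category. The entry fields (`week`, `engineer`, `category`, `minutes`), the categories, and the numbers are all made up for the example; the real log is just an informal document, so treat this as one possible way to structure it if we ever want to analyse it.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical on-call log entries; every value here is invented.
on_call_log: List[Dict] = [
    {"week": "2020-W23", "engineer": "engineer-a", "category": "data query", "minutes": 30},
    {"week": "2020-W23", "engineer": "engineer-a", "category": "data injection", "minutes": 60},
    {"week": "2020-W24", "engineer": "engineer-b", "category": "data query", "minutes": 45},
    {"week": "2020-W24", "engineer": "engineer-b", "category": "incident investigation", "minutes": 120},
    {"week": "2020-W25", "engineer": "engineer-c", "category": "data query", "minutes": 20},
    {"week": "2020-W25", "engineer": "engineer-c", "category": "data injection", "minutes": 40},
]


def summarise(entries: List[Dict]) -> List[Tuple[str, int, int]]:
    """Return (category, task count, total minutes), most frequent category first."""
    counts: Counter = Counter()
    minutes: Counter = Counter()
    for entry in entries:
        counts[entry["category"]] += 1
        minutes[entry["category"]] += entry["minutes"]
    return [(category, count, minutes[category]) for category, count in counts.most_common()]


for category, count, total_minutes in summarise(on_call_log):
    print(f"{category}: {count} tasks, {total_minutes} minutes")
```

Categories that keep showing up near the top of this summary are the natural candidates for automation, and the total minutes give a rough figure for the opportunity-cost conversation with product.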
We don’t yet have a formal handover or review process for the logs, but I imagine once a month, when we have some more data, we’ll go over them with the team.
Engineering is so much more than programming. It’s about finding the little things we can improve and keeping an eye on our systems. It’s about technical and operational excellence in what we do. It’s about learning and sharing that knowledge with our team.
By choosing to practice, we bring out the inner engineers in all of us.