How to Promote an Engineer

To understand how to promote we need to understand why we have titles, and what they are for.

Most modern tech companies (Amazon, etc.) have the concept of an IC (Individual Contributor) level track, so I will use this as a basis. It works in roughly these levels:

  • IC1 – Associate Engineer
  • IC2 – Engineer
  • IC3 – Senior Engineer
  • IC4 – Lead
  • Etc

Titles are important for recognizing people’s achievements and for setting targets that drive personal improvement. They also help reconcile compensation so that people are paid what they deserve, but I don’t personally believe compensation is their primary purpose.

They can also have a negative cultural impact if used in the wrong way. Sometimes people use their title to boss or lord it over others, but my advice is that this is not a problem with the title, it’s a problem with the person. It is toxic behavior, and if they can’t fix it, show them the door.

Titles are also used implicitly to set the expectations of others. When people work on a team with someone of a higher title than them, they should hopefully be inspired to be a better engineer, and in turn help drive their own career progression.

Usually titles come with a defined list of qualities that can be quite subjective and high level, like “Practices the Latest in CI/CD Technology”. While useful as a guide, these aren’t actionable or objective things that you can give an Engineer to do.

Many Managers I talk to about career progression tend to look at goal setting as a method for Engineers to prove themselves. I like goal setting and I do it a lot, but when it comes to career progression I think it’s a bit flawed, and I’ll explain why.

Some examples people used with me for goals for Engineers were:

  • Do a blog post
  • Lead a project to completion
  • Do a tech talk

Goals like this are fine, but if they are used for career progression you effectively create a checklist for a promotion: once the Engineer has done X, Y and Z on the list, we promote him. That doesn’t mean he will keep doing these things after being promoted. Take the example of an Engineer who is set the above three goals, does them in a quarter or two, gets promoted to Senior Engineer, then goes back to doing the same thing he did before. He’s not likely to inspire those around him who are doing the same job now but don’t have the title. In fact, it may even have a negative impact on the team.

And when someone asks you “why is he senior?”, is “He did a design review and a tech talk 2 years ago” going to be a good answer?

So when are goals ok?

In my opinion goals are good for the short term. They push someone out of their comfort zone to give them a taste of something, or to defeat fear. It’s a bit like young children and swimming: children are usually worried about getting wet and will cry and complain, but once you finally get them in the water it’s hard to get them out. Goal setting also gives people a taste of something new that they otherwise would never have tried, for example cross training, or opens a conversation about new career paths.

But back to career progression

If we want people to be doing the above things, they should be self-motivated to do them, not doing them because they are led by the promotion carrot. So what we are after is a change in mindset rather than a “To Do” list, so that these things become part of their day-to-day thinking.

How to change or measure people’s mindset?

I’ve found you can’t measure mindset directly, but the best proxy I’ve found is the behavior people exhibit. The advantage of using behavior is that it is a day-to-day thing. The way people conduct themselves when dealing with others in specific engineering scenarios, day to day, is something you can set goals around, or more so, expectations.

Setting expectations of behavior is something that Ben Horowitz talks about. He wrote a piece a long time ago called “Good PM, Bad PM”, applying this to Product Managers in the 90s and 00s.

If we promote people based on the day-to-day behavior they exhibit, they are likely to continue this behavior as it is part of their routine. They are unlikely to degrade their behavior over time, and if more goals are set around improving behavior then they will most likely progress.

Taking the example of “Inspiration from the Senior Engineer on my team”: if we assume that behavior is consistent over time, then the question “why is he senior?” becomes easier to answer, because he acts in fashion A, B, C on a day-to-day basis.

Here is an example of what I set, to try to explain the method:

  • A Senior Engineer identifies and helps with skill gaps in his immediate area, escalating when they are too much for him to handle. He is the guy that says in a stand up “hey Bob, you haven’t had much experience in system X how about you pick up that task today.” He encourages continuous improvement of the team in the day to day.

The above is an expectation around collaboration and system ownership, taken from my Senior Engineer expectations. You can see how it’s worded as a day-to-day behavior expectation around being a positive influence in the team.

The thing missing from this that’s present in the Horowitz article, though, is the “Bad PM”. Horowitz calls out explicit negatives in behavior as well. This is very useful for naming common bad behavior that people pick up within the organization (or in the industry in general), and it helps to correct it.

Here’s an example from my basic Engineer Expectations:

  • An Engineer Tests. They employ automation to do so and understand when to use Unit vs System vs Integration Testing. An Engineer does not have a “Tester” on their team whose responsibility it is to do the testing.

This is a common pitfall in the industry, especially for older engineers who used to work on teams where they did have “testers”. Engineers like this who have any form of Quality role attached to their team think they have a “tester”, and this is very bad not only for cross-functional teams but also for the correct use of automation. By calling out this negative behavior in the expectations, we help to correct it.

Be careful though: the expectations I have here are specific to my context. Not everyone should have the same expectations, and there will be things unique to your company, team, etc. that you should change. From the example above, maybe you do have “Testers” on your team, and that is ok for you.

In closing, I would recommend trying to set “Behavior Expectations” around your career levels as a method to drive the right change in your staff for promotions.

Build System integration with Environment Variables

Different CI systems expose an array of information in environment variables for you to access, for example commit hash, branch, etc., which is handy if you are writing CI tooling. Some of them even seek to standardize these conventions.

This post is primarily about collating that info into a single source for lookup. Ideally, if you are writing tooling that you want a lot of people to use, you should support multiple CI systems to increase adoption.

As we look at each one, the first thing we need to do is tell which system is running. Each CI platform has a convention that allows you to do this, which we’ll talk about first.

Below is a table of each major build system and example bash for detecting that the process is running in it, as well as a link to documentation on the env vars that the system exposes.

Build system      Detection
Jenkins           "$JENKINS_URL" != ""
Travis            "$CI" = "true"
                  "$TRAVIS" = "true"
AWS Codebuild     "$CODEBUILD_CI" = "true"
Teamcity          "$TEAMCITY_VERSION" != ""
Circle CI         "$CI" = "true"
                  "$CIRCLECI" = "true"
Semaphore CI      "$CI" = "true"
                  "$SEMAPHORE" = "true"
Drone CI          "$CI" = "drone"
                  "$DRONE" = "true"
Heroku            "$CI" = "true"
Appveyor CI       "$CI" = "true" || "$CI" = "True"
                  "$APPVEYOR" = "true" || "$APPVEYOR" = "True"
Gitlab CI         "$GITLAB_CI" != ""
Github Actions    "$GITHUB_ACTIONS" != ""
Bitbucket         "$CI" = "true"
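
The same checks carry over to other languages if your tooling isn’t written in bash. Below is a minimal Python sketch of the detection logic, using only the variables listed above. The function name `detect_ci` and the returned labels are my own invention, and checking the system-specific variables before the shared `CI` flag is a design choice, since several systems set `CI=true`.

```python
import os
from typing import Mapping, Optional

# (label, variable, expected value); None means "any non-empty value".
# Labels are hypothetical names chosen for this sketch.
_CHECKS = [
    ("jenkins", "JENKINS_URL", None),
    ("travis", "TRAVIS", "true"),
    ("codebuild", "CODEBUILD_CI", "true"),
    ("teamcity", "TEAMCITY_VERSION", None),
    ("circleci", "CIRCLECI", "true"),
    ("semaphore", "SEMAPHORE", "true"),
    ("drone", "DRONE", "true"),
    ("appveyor", "APPVEYOR", "true"),
    ("gitlab", "GITLAB_CI", None),
    ("github-actions", "GITHUB_ACTIONS", None),
]


def detect_ci(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a label for the CI system found in env, or None if there is none."""
    for label, var, expected in _CHECKS:
        value = env.get(var, "")
        if expected is None and value != "":
            return label
        # The table lists both "true" and "True" for Appveyor, so this sketch
        # compares case-insensitively for every system (a simplification).
        if expected is not None and value.lower() == expected:
            return label
    # Heroku and Bitbucket only appear in the table with the generic CI flag.
    if env.get("CI", "").lower() == "true":
        return "unknown-ci"
    return None
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.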

Below are 4 commonly used parameters as an example. There are many more available, but as you can see from this list there is a lot of commonality.

Build system      Branch        Commit        PR #                Build ID
Github Actions    GITHUB_REF    GITHUB_SHA    can get from Ref    GITHUB_RUN_ID
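
“Can get from Ref” deserves a short illustration. On GitHub Actions, `GITHUB_REF` looks like `refs/heads/<branch>` for branch builds and `refs/pull/<number>/merge` for pull request builds, so both the branch and the PR number can be parsed out of it. A rough sketch, where the `BuildInfo` type and the function name are hypothetical:

```python
import re
from typing import Mapping, NamedTuple, Optional


class BuildInfo(NamedTuple):
    branch: Optional[str]
    commit: Optional[str]
    pr_number: Optional[int]
    build_id: Optional[str]


def github_actions_build_info(env: Mapping[str, str]) -> BuildInfo:
    """Collect the four common parameters from GitHub Actions variables."""
    ref = env.get("GITHUB_REF", "")
    branch = None
    pr_number = None
    pr_match = re.fullmatch(r"refs/pull/(\d+)/merge", ref)
    if pr_match:
        # Pull request build: the number is embedded in the ref.
        pr_number = int(pr_match.group(1))
    elif ref.startswith("refs/heads/"):
        # Branch build: everything after the prefix is the branch name.
        branch = ref[len("refs/heads/"):]
    return BuildInfo(branch, env.get("GITHUB_SHA"), pr_number, env.get("GITHUB_RUN_ID"))
```

Tag builds (`refs/tags/...`) are left out here, so both `branch` and `pr_number` come back as `None` for them.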

For Teamcity, a common workaround for its lack of env vars is to place a set of parameters at the root level that will be inherited by every config on the server:


env.TEAMCITY_BUILD_URL = %teamcity.serverUrl%/viewLog.html?

The `set-env` command is disabled in GitHub Actions

A recent security update in GitHub Actions disabled the `set-env` command for setting environment variables, but there is a pretty easy workaround that I am going to show.

Here’s an example command where we set a version number (APP_VERSION in this case is the name of the env variable) that will be used in various subsequent steps

run: echo "::set-env name=APP_VERSION::${{ env.MAJOR_MINOR_VERSION }}${{ github.run_number }}"        

We can instead append a string like this to the file that the GITHUB_ENV variable points to:

run: echo "APP_VERSION=${{ env.MAJOR_MINOR_VERSION }}${{ github.run_number }}" >> $GITHUB_ENV
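
`GITHUB_ENV` holds the path of a file that the runner reads after each step, so tooling written in other languages can do the same thing by appending a `NAME=value` line to that file. A minimal Python sketch, where the helper name `export_env` is made up and only single-line values are handled:

```python
import os


def export_env(name: str, value: str, env_file: str = "") -> None:
    """Append NAME=value to the file GitHub Actions reads for later steps."""
    # Fall back to the path the runner provides when none is passed in.
    path = env_file or os.environ["GITHUB_ENV"]
    if "\n" in value:
        # Multi-line values need GitHub's delimiter syntax, not covered here.
        raise ValueError("multi-line values are not handled by this sketch")
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")
```

The optional `env_file` argument is only there so the helper can be exercised outside of a runner.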

Later we can use it the same way we used environment variables before. As seen below, we build a docker container using the version number in a subsequent step:

run: docker build -t mydockerreg.internal/${{ env.APP_NAME }}:$APP_VERSION -f ${{ env.DOCKER_FILE_PATH }} .

It’s that easy.

If you are running on prem it’s tempting to just re-enable `set-env`, but if you do and one day you want to scale your build workload into GitHub’s cloud then you’ll have problems. Better to be compliant imo.

The full details about this from GitHub are here

Building dotnet in Containers

A lot of people use the dotnet CLI to build their projects, and it’s a very handy tool, but there are a lot of versions of dotnet out there these days, and maintaining build agents in your CI with all the right versions can be troublesome.

That’s where containers can help, by making sure you are building with the right SDK.

I’ll use a recent example from a project we did in GitHub Actions.

In this build we run:

dotnet restore mySolution.sln
dotnet build mySolution.sln --configuration Debug --no-restore
dotnet test --no-build

The project in question uses the dotnet 3.1 SDK.

So to run these in docker we can simply use a docker run with the SDK container:

docker run -v ${PWD}:/src --workdir /src dotnet build mySolution.sln

This will mount the current working directory into the container and use the sdk inside to run the build.

This isn’t the best approach though: the nuget cache for the container will be lost at the end of the build, as it’s stored in user scope inside the container.

A better approach is to use a multi-stage Dockerfile, which takes advantage of Docker’s layer cache. You can add Dockerfiles from the Visual Studio IDE that come preconfigured for good performance.

Adding a docker file using visual studio

The basic docker file looks like the below

FROM AS base

FROM AS builder
WORKDIR /src
COPY *.sln ./
COPY ["Agoda.Api.WebApi/Agoda.Api.WebApi.csproj", "Agoda.Api.WebApi/"]
COPY ["Agoda.Api.Core/Agoda.Api.Core.csproj", "Agoda.Api.Core/"]
COPY docker-compose.dcproj ./
RUN dotnet restore "Agoda.Api.WebApi/Agoda.Api.WebApi.csproj"
COPY . .
WORKDIR "/src/Agoda.Api.WebApi"
RUN dotnet build "Agoda.Api.WebApi.csproj" -c Release -o /app

FROM builder AS publish
RUN dotnet publish "Agoda.Api.WebApi.csproj" -c Release -o /app

FROM base AS final
COPY --from=publish /app .
ENTRYPOINT ["dotnet", "Agoda.Api.WebApi.dll"]

There are a couple of points to note here.

Firstly, you can see the files are copied into the container to build, project files first, and the restore is done separately. The reason is that this creates a layer (chunk) containing just the csproj files and the restored packages, and that layer only rebuilds when the project files change (generally when you update your dependencies). It stays cached on your local machine or on the CI agent until you update a csproj.

The second thing to note is that there are multiple FROM statements. This is because it is a multi-stage Dockerfile: it has a large builder container that starts from the SDK, and a smaller base container that the output of the build is copied into, in order to have a small output container for production.

So to build this one we simply now run:

docker build . -f Agoda.Api.WebApi/Dockerfile -t agodaapi:1.0

This should be run from the root dir of your solution, even though Visual Studio puts the Dockerfile in the project sub folder, because the COPY paths are relative to the solution root.

You can then run a docker push to publish your newly built container.

You can use a similar approach with a Dockerfile for your unit test project, but in that case you don’t need a multi-stage build. You just need a normal Dockerfile that uses the SDK image and runs “dotnet test” as the entrypoint.

My Arcade Machine

So people have been seeing this in my video feed on calls the last few days so I thought I’d write a post about it.

A few years ago I found a guy in Thailand through Facebook who builds replica arcade machines, and ordered my favorite game from the 80s, Gauntlet. This was one of the first 4 player machines, and it’s infamous because it has infinite levels.


Internally, though, it’s not like the original. Just the outside is the same, including the descriptions and text of the characters you can play in the game.

Apparently a lot of the artwork from these old games is available online, and he was able to download it, print vinyl stickers and make it look just like the original, but with a few additions: the white buttons were not on the original, and the original didn’t have 6 buttons either. The extra buttons are there to support other games.

Also the coin slots are different from the original, which had 4 coin slots (one dedicated to each player), as described in the instructions seen printed below.


The 3rd and 4th players only have 4 buttons because there are no 3 or 4 player arcade games that use 6 buttons.

The original screen was a 19″ CRT, but the screen inside is a 24″ LCD that he was able to fit while still maintaining the original dimensions.

Which leads me to what’s on the inside.


Inside there is a PC with MAME and a few other emulators for old consoles, with about 20,000 games in total, all legal ones that are out of copyright of course 🙂



If you want to know how to contact this guy to build you one, unfortunately he has disappeared: his Facebook page is deleted and his phone disconnected. I haven’t been to his house to check because I’ve lost the address too. But restoring these old things is a common profession, so do some searching and I’m sure you’ll turn up someone.


From Code Standards to Automation with Analyzers and Code Fixes

We started to talk about standards between teams and system owners at Agoda a couple of years ago. We started out with C#. The idea was to come up with a list of recommendations for developers, for a few reasons.

One was that we follow polyglot programming here, and developers more familiar with Java, JavaScript and other languages would sometimes work on the C# projects and would often misuse or get lost in some of the features (e.g. when JavaScript developers find dynamics in C#, or Java developers get confused by extension methods).

Beyond this we want to encourage a level of good practice. Drawing on the knowledge of the people we have, we could advise on the potential pitfalls and drawbacks of certain paths. In short, the standards should not be prescriptive, as in “you must do it this way”. They should be more “Don’t do this”, but teach at the same time, as in “Don’t do this, here’s why, and here’s some example code”. They also include some guidance, as in “We recommend this way, or this, or this, or even this, depending on your context”, but we avoid “do this”.


The output was the standards repository that we’ve now open sourced

It’s primarily markdown documents that allow us to easily document, and also use pull requests and issues to start conversation around changes and evolve.

But we had an “If you build it they will come” problem. We had standards, but people either couldn’t find them or didn’t read them, and even if they did, they’d probably forget half of them within a week.

So the question was how do you go about implementing standards amongst 150 core developers and hundreds more casual contributors in the organisation?

We turned to static code analysis first. The Roslyn API in C# is one of the most mature language APIs on the market imo, and we were able to write rules for most of the standards (especially the “Don’t do this” ones).

This gave birth to a cross department effort that resulted in a Code fix library here that we like to call Agoda Analyzers.


Initially we were able to add them into the project via the published nuget package and have them present in the IDE, and they are available here.


But like most linting systems, they tend to just “error” at the developer without much information, which we don’t find very helpful, so we quickly moved to Sonarqube with its GitHub integration.

This allows a good experience for code reviewers and contributors. The contributor gets inline comments on their pull request from our bot.


This allows the user time to fix issues before they get to the code reviewers, so most common mistakes are fixed prior to code review.

Also, the link (the 3 dots at the end of the comment) deep links into Sonarqube’s web UI, to documentation that we write on each rule.


This allows for not just “Don’t do this”, but also to achieve “Don’t do this, here’s why, and here’s some example code”.

Static code analysis is not a silver bullet though. Things like design patterns are hard to encompass, but we find that you can catch most of the crazy stuff with it. That leaves the code review to be more about design and less about naming conventions and “wtf” moments, like the time a node developer finds dynamics in C# for the first time and decides to have some fun.

We are also trying the same approach with other languages internally, such as TypeScript and Scala, our two other main languages, so stay tuned for more on this.




The Transitive Dependency Dilemma

It sounds like the title of a Big Bang Theory episode, but it’s not. Instead it’s an all too common problem, one that breaks the single responsibility rule, and I see it regularly.

And I’ve seen first hand how much butt hurt this can cause. A friend of mine (Tomz) spent 6 weeks trying to update a library in a website, because the library had a breaking change.

The root cause of this issue comes from the following scenario: My Project depends on My Library 1 and 2, and both libraries depend on the same 3rd party library (or it could be an internal one too).


Now let’s take the example where each library refers to a different version.


Now which version do we use? You can solve this issue in dotnet with assembly binding redirects, and nuget even does this for you (other languages have similar approaches too). However, when there is a major version bump with breaking changes, it doesn’t work.

If you also take the example of having multiple libraries (in my friend Tomz’s case there were about 30 libraries that depended on a logging library that had a major version bump), this can get real messy real fast. A logging library is the most common scenario, so let’s use this as the example.



So what can we change to handle this better?

I point you to the single responsibility principle. In the above scenario, MyLibrary1 and 2 are now technically responsible for logging because they have a direct dependency on the logging library. This is where the problem lies. They should only be responsible for “My thing 1” and “My thing 2”, and leave the logging to another library.

There are two approaches to solve this that I can recommend, each with their own flaws.

The first is exposing an interface from the library that you implement in MyProject.


This also helps if you want to reuse your library in other projects, as you won’t be dictating to the project how to log.

The problem with this approach, though, is that you end up implementing a lot of ILoggers in the parent.
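
To make the first approach concrete, here is a rough sketch in Python rather than dotnet. The library declares the small logging interface it needs (the analogue of an ILogger), and only the project knows about the real logging package. All class names are hypothetical, and the standard library `logging` module stands in for the 3rd party logging library.

```python
import logging
from typing import Protocol


# --- inside MyLibrary1: no dependency on any logging package ---
class Logger(Protocol):
    """The only logging contract the library asks its host to satisfy."""

    def info(self, message: str) -> None: ...


class Thing1:
    def __init__(self, logger: Logger) -> None:
        self._logger = logger

    def run(self) -> str:
        self._logger.info("thing 1 ran")
        return "done"


# --- inside MyProject: the only place that references the real logger ---
class StdlibLogger:
    """Adapter from the library's interface to the project's logging choice."""

    def __init__(self, name: str) -> None:
        self._log = logging.getLogger(name)

    def info(self, message: str) -> None:
        self._log.info(message)
```

With this shape a major version bump in the logging package only touches the adapter in the project. The cost is the one described above: each library that declares its own interface needs its own adapter.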

The other approach is to use a shared common Interfaces library.


The problem with this approach however is when you need to update the Common interfaces library with a breaking change it becomes near impossible, because you end up with most of your company depending on a single library. So I prefer the former approach.

Your Load Balancer Will Kill You

Let’s start by talking about the traditional way people scale applications.

You have a startup, you have a new idea, so you throw something out there fast. Maybe it’s a Rails app with a mongoDB backend if you’re unlucky, or something like that.

Now things are going pretty well. Maybe you have time to re-write into something sensible as your business grows and you get more devs, so now you’re on a nice react website with a dotnet or node backend or something. But you’re going slow due to too many users, so you start to scale. The first thing people do is this: a load balancer in front, and horizontally scale the web layer.

Now that doesn’t seem too bad. With a few cache optimizations you’re probably handling a few thousand simultaneous users and feeling happy. But you keep growing and now you need to handle tens of thousands, so the architecture starts to break out vertically.

So let’s imagine something like the below. If we’ve been good little vegemites we have a good separation of domains, so we are able to scale the db by separating the domains out into microservices on the backend, each with their own independent data.

Our web site then ends up looking a bit like a BFF (Backend For Frontend), and we are able to scale up to tens or even hundreds of thousands of users. And if you are using AWS especially, you are going to have these lovely elastically scaling load balancers everywhere.

Now when everything is working it’s fine, but let’s look at a failure scenario.

One of the API B servers goes offline, like dead, total hardware failure. What happens in the seconds that follow?

To start, let’s look at load balancer redundancy methods. LBs use a health-check endpoint: an aggressive setting would be to ping it every 2 seconds, then after 2 consecutive failures take the node offline.

Let’s also take the example that we are getting 1,000 requests per second from our BFF, spread evenly over 3 API B nodes.

Second 1
Lose 333 Requests

Second 2
Lose 333 Requests
Health check fails first time

Second 3
Lose 333 Requests

Second 4
Lose 333 Requests
Health check fails second time and LB stops sending traffic to node

So in this scenario we’ve lost about 1,300 requests (4 seconds at roughly 333 requests per second), but we’ve recovered.
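
The arithmetic behind that number is simple enough to write down. A small sketch, assuming (as the 333 figure implies) that traffic is spread evenly over 3 nodes and that the node dies just after a passing health check; the function and parameter names are mine:

```python
def requests_lost(rps: int, nodes: int, check_interval_s: int, failures_needed: int) -> int:
    """Requests sent to a dead node before the load balancer removes it."""
    # Each node receives an even share of the traffic.
    per_node_rps = rps // nodes
    # The node keeps receiving traffic until enough checks have failed in a row.
    seconds_until_removed = check_interval_s * failures_needed
    return per_node_rps * seconds_until_removed


print(requests_lost(rps=1000, nodes=3, check_interval_s=2, failures_needed=2))  # 1332
```

Halving the check interval halves the loss, which is why the next question is always about more aggressive health checks.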

Now you say, but how about we get more aggressive with the health check? This only goes so far.

At scale, the more common outages are not ones where things go totally offline (although this does happen), they are ones where things go “slow”.

Now imagine we have aggressive LB health checks (the ones above are already very aggressive, so you usually can’t get much more), and things are going “slow” to the point that health checks are randomly timing out. You’ll start to see nodes pop on and offline randomly, and your load will get unevenly distributed, usually to the point where you may even have periods with no nodes online. 503s FTW! I’ve witnessed this first hand, it happens with aggressive health checks 🙂

Next is, what happens if your load balancer goes offline? While load balancers are generally very reliable, config updates and firmware updates are the times when they most commonly fail, and beyond that they can still succumb to hardware failure.

If you are running in e-commerce like I have been for the last 15 odd years then traffic is money. Every bit of traffic you lose can potentially be costing you money.

Also, when you start to get into very large scale, the natural entropy of hardware means hardware failure becomes more common. For example, if you have say 5,000 physical servers in your cloud, how often will you have a failure that takes applications offline?
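
A back-of-the-envelope way to answer that question is to multiply the fleet size by a per-server failure rate. The sketch below does just that. The 2% annual failure rate is purely an illustrative assumption, not a measured figure, so plug in your own number.

```python
def days_between_failures(servers: int, annual_failure_rate: float) -> float:
    """Average number of days between server failures across the whole fleet."""
    failures_per_year = servers * annual_failure_rate
    return 365 / failures_per_year


# Assumed 2% annual failure rate per server; substitute your own figure.
print(round(days_between_failures(5000, 0.02), 2))
```

Even with a rate that sounds small per server, a fleet of that size sees a failure every few days under this assumption.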

And it doesn’t matter if you are running on AWS, kubernetes, etc., hardware failure still takes things offline. Your VMs and containers may restart with little to no data loss, but they still go offline for periods.

How do we deal with this then? How about Client-side Weighted round-robin?

WTF is that? I hear you say. Good Question!

It’s where we move the load balancing mechanism to the client that is calling the backend. There are several advantages to doing this.

This is usually coupled with a service discovery system. We use consul these days, but there are lots out there.

The basic concept is that the client gets a list of all available nodes for a given service. It then maintains its own in-memory list of them and round robins through them, similar to a load balancer.

This removes infrastructure (i.e. cost and complexity).

The big difference, though, is that the client can retry the request on a different node. You can implement retries when you have a load balancer in front of you, but you are in effect rolling the dice. Having knowledge of the endpoints on the client side means that retries can be targeted at a different server from the one that errored or timed out.

What’s the Weighting part though?

Each client maintains its own list of failures. For example, if a client gets a 500 or a timeout from a node, it weights that node down and starts to call it less. This causes a node-specific back-off, which is extremely important in the more common outage at scale where things go “slow”: if a particular node has been smashed a bit too much by something and is overloaded, the clients will slowly back off that node.
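
To show how the pieces fit together, here is a minimal sketch of a client that does smooth weighted round-robin, retries on a different node, and halves a node’s weight on failure. It is not our production client: the class name, the halving and doubling factors and the `min_weight` floor are all assumptions chosen to keep the example short, and a real client would also refresh its node list from service discovery.

```python
from typing import Callable, Dict, List


class WeightedRoundRobinClient:
    """Client-side balancer with per-node weights and targeted retries."""

    def __init__(self, nodes: List[str], min_weight: float = 0.01) -> None:
        self.weight: Dict[str, float] = {node: 1.0 for node in nodes}
        self.current: Dict[str, float] = {node: 0.0 for node in nodes}
        self.min_weight = min_weight

    def _next_node(self, exclude: List[str]) -> str:
        # Smooth weighted round-robin: every candidate earns credit equal to its
        # weight, the one with most credit is picked and pays back the total.
        candidates = [n for n in self.weight if n not in exclude]
        for node in candidates:
            self.current[node] += self.weight[node]
        chosen = max(candidates, key=lambda n: self.current[n])
        self.current[chosen] -= sum(self.weight[n] for n in candidates)
        return chosen

    def call(self, send: Callable[[str], str]) -> str:
        tried: List[str] = []
        while len(tried) < len(self.weight):
            node = self._next_node(tried)
            tried.append(node)
            try:
                result = send(node)
            except Exception:
                # Node-specific back-off: weight the failing node down and
                # retry on a node that has not been tried for this request.
                self.weight[node] = max(self.min_weight, self.weight[node] / 2)
                continue
            # Recover gradually once the node answers again.
            self.weight[node] = min(1.0, self.weight[node] * 2)
            return result
        raise RuntimeError("all nodes failed")
```

Because the weight never drops to zero, a dead node still gets the occasional probe request, which is how the client notices when it comes back.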

Let’s look at the same scenario as before with API B and a node going offline. We usually set our timeouts to 500ms to 1 second for our API requests, so let’s say 1 second. As the requests start to fail they will retry on the next node in the list, and weight down the offline server in the local client’s list of servers/weightings. So here’s what it looks like:

Second 1
220 Requests take 1 second longer

Second 2
60 Requests take 1 second longer

Second 3
3 Requests take 1 second longer

Second 4
3 Requests take 1 second longer

Second 5

The round robin weighting kicks in at the first failure. As we only have 3 web servers in this scenario and they are high volume, the back-off isn’t decremented in periods of seconds, it’s decremented in number of requests.

Eventually we get to the point where each web server is trying the API node with a request once every few seconds until it comes back online, or until the service discovery system kicks in and tells us it’s dead (which usually takes under 10 seconds).

But the end result is 0 lost requests.

And this is why I don’t use load balancers any more 🙂

Importing Custom TypeScript tslint rules into Sonarqube

I’ll be the first to say I am not a fan of sonarqube, but it is the only tool out there that can do the job we need. Getting TypeScript working with it was royal butt hurt, but we got there in the end, so I wanted to share our journey.

The best way we found to work with it was to store our rules in our tslint config in source control, with our own settings, and use that. This is good because it helps keep the sonarqube server rules in sync with the developers.

The problem we ran into is that the rules need to exist on the server, so if you, for example, add the react-tslint rules to your project, they also need to be defined in the sonarqube server here



Once they are there, sonar understands the rules but will not process them. Rather than setting up the processing on the server, we decided to use our build.

So what we do is:

  1. Import ALL rules to sonar server (once off)
  2. run tslint and export failed rules to file
  3. import failed rules using sonar runner (instead of letting runner do analysis)

The server is aware of ALL rules, but it’s our tslint output that tells it which ones have failed, so you can disable rules in your tslint config that the server is aware of and it won’t report them.
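
To see which rules the build will actually report, it can help to summarise the tslint output before handing it to the sonar runner. A small sketch, assuming the output was produced with tslint’s JSON formatter, where each failure is an object carrying a `ruleName` field. The function name is mine.

```python
import json
from collections import Counter


def failed_rule_counts(tslint_json: str) -> Counter:
    """Count failures per rule from tslint JSON formatter output."""
    failures = json.loads(tslint_json)
    # Each entry describes one failure; only the rule name is needed here.
    return Counter(failure["ruleName"] for failure in failures)
```

Any rule that is disabled in the tslint config simply never shows up in the counts, which matches how the server ends up not reporting it.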

This means that the local developer experience and the sonarqube report should be a lot more in sync than if we had to maintain the server processing, and it makes it easier to run multiple projects on the one server with disparate rule sets.

The hard part here, though, is the import of rules.

For our initial import we did the following rule sets:

  1. react-tslint
  2. tslint-eslint-rules
  3. tslint-consistent-codestyle
  4. tslint-microsoft-contrib

And I have created some powershell scripts that generate the format that is needed from the rules’ git repos.

To use this clone each of the above repos, then run its corresponding script to generate the output file, then copy and paste this into the section in the sonarqube admin page (it’s ok, this is a once off step).


You should create one record for each of the four imports, then paste the output from each powershell script into the boxes on the right, as seen below.


Once this is done you need to restart the sonarqube server for the rules to get picked up.

WARNING: check for duplicate rule names. There are some (I forget which ones, sorry), and they prevent the sonarqube server from starting, in which case you will need to edit the SQL database to fix it.
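
Since I can’t tell you which names clash, a quick check before pasting is worth the effort. The sketch below takes the rule names from each generated output and reports any name that appears in more than one rule set. How you extract the names from the script output is up to you, and the function name is made up.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_rules(rule_sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each rule name found in more than one rule set to those sets."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for set_name, rules in rule_sets.items():
        # set() so a name repeated inside one rule set is only counted once.
        for rule in set(rules):
            owners[rule].append(set_name)
    return {rule: sorted(sets) for rule, sets in owners.items() if len(sets) > 1}
```

An empty result means there are no cross-set clashes among the names you fed in.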

Then browse to your rule set and activate the rules in it. I recommend just creating a single rule set and putting everything in it. Like I said, you can control the rules from your tslint run, and just add all rules to all projects on the sonarqube server side.


After this, run your sonarqube analysis build (see here if you haven’t built it yet) and you are away.