Troubleshooting the right way

Change your mindset; it’s a challenge

Instead of thinking:

  • Why do I need to fix this issue?
  • I wish this issue wasn’t time-critical for the Developers’ team
  • I hate issues

Think:

  • This challenge tickles my brain; I’m glad I was assigned to do it
  • The more time-critical the task is, the greater the challenge, and I love being “the one who solves things quickly”
  • I love challenges


  1. Define the challenge in simple words
  2. Decide how you will test that it works: write the test down and don’t adjust it to fit your solution. Defining the test and its expected result beforehand mitigates the risk of being biased towards a specific solution
  3. Map the components: The least you can do is write (type) down the components that should be analyzed. A better option is to sketch the components and the flow of the process that needs to be analyzed. The best option is to draw a proper diagram, but keep in mind that it’s time-consuming and only necessary for very complicated challenges
  4. Prioritize the solutions from best to worst according to:
  • Reliability: A permanent solution needs to be reliable (stable) and requires a good design. Applying an ad-hoc solution “just to make it work” will make troubleshooting harder for your colleagues or future you
  • Time to first results: If a solution fails quickly, it’s easier to move on to the next one. Don’t start with the “longest” solution; you’ll be exhausted by the time you get to the others on the list
  • Effort to explain the solution: If you aim for a permanent solution, then your colleagues, or future you, should also understand the logic behind it. If it takes 5 hours to explain the solution, it should be prioritized very low or probably dropped
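The prioritization step can be sketched as a small sort. This is only an illustration: the candidate names and every number below are hypothetical, and the sort order (time to first results, then reliability, then effort to explain) is one possible weighting, not a rule.

```shell
#!/usr/bin/env bash

# Hypothetical candidates, one per line:
#   name | minutes to first result | reliability (1-5, higher is better) | minutes to explain
candidates='Kubernetes Network Policy|60|5|30
Host firewall rule|15|3|10
Cloud firewall rule|5|4|5'

# Rank by time to first results (ascending), then reliability (descending),
# then effort to explain (ascending).
rank_candidates() {
  sort -t'|' -k2,2n -k3,3nr -k4,4n
}

# Print only the names, best candidate first.
printf '%s\n' "$candidates" | rank_candidates | cut -d'|' -f1
```

Writing the scores down before trying anything keeps the ranking honest in the same way as writing the test down first.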

Real-life technical challenge


  1. Cloud-provider firewall (AWS Security Group): Remove the Allow outbound rule (Difficulty: Simple via AWS Console)
  2. Server’s (EC2 instance) firewall (ufw): Add a deny-all-outbound rule to the EC2 instance firewall. A quick Google search got me to a Stack Overflow answer, which provided a fairly easy solution (Difficulty: Okayish by SSH to the EC2 instance)
  3. Application: Add a Kubernetes Network Policy that blocks the outbound access. This requires writing a YAML file and dealing with the Kubernetes ecosystem (Difficulty: Overkill, do it when all else fails)
  4. Subnet’s Network Access Control List (NACL): Add an outbound deny rule to the subnet’s NACL. Dropped this one because it might affect other resources in the same subnet, which in turn can disturb other developers’ work (Difficulty: Simple via AWS Console)
  5. Subnet’s Route Table: Remove the route from the route table. Dropped for the same reason I dropped NACLs (Difficulty: Simple via AWS Console)
  1. Time to first results
  2. Reliability
  3. Effort to explain
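For reference, option 3 (the Kubernetes Network Policy) might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the `app: my-app` label, the policy name, and the `203.0.113.0/24` range (a documentation-only range standing in for the real destination). Network Policies are allow-lists, so blocking one destination is expressed as "allow all egress except this range", and they only take effect if the cluster's network plugin enforces them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values; replace with your own pod label and destination range.
APP_LABEL="my-app"              # assumed pod label, not taken from the post
BLOCKED_CIDR="203.0.113.0/24"   # placeholder range (RFC 5737), not the real destination

# Write the manifest to a file so it can be reviewed before applying.
cat > deny-egress.yaml <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-egress-to-destination
spec:
  podSelector:
    matchLabels:
      app: ${APP_LABEL}
  policyTypes:
    - Egress
  egress:
    # Allow all outbound traffic except the blocked range.
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - ${BLOCKED_CIDR}
EOF

# Review, then apply:
#   kubectl apply --dry-run=client -f deny-egress.yaml
#   kubectl apply -f deny-egress.yaml
```

Even as a sketch, this shows why the option was ranked last: it needs a manifest, the right labels, and a cluster that enforces the policy.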

Iterating over the solutions

Security Group Rule

  1. AWS Console > EC2 > Security Groups > Edit Security Group
  2. Outbound rules > Remove
  3. NewRelic > Check for new data > No good, data is still coming
  4. Added the Allow outbound rule back

Server’s firewall (ufw)

  1. SSH to EC2 instance
  2. Execute sudo ufw default deny outgoing
  3. NewRelic > Check for new data > No good, data is still coming
# -v = verbose
# -w 3 = time out after 3 seconds
# <host> = placeholder for the destination host
# 443 = check this port, in our case it's HTTPS
$ nc -v -w 3 <host> 443
nc: connect to <host> port 443 (tcp) timed out: Operation now in progress
# This is good. It means that all access to <host> is blocked

# Example of a successful response
# Keep in mind that we DON'T want it to succeed
$ nc -v -w 3 <host> 443
Connection to <host> 443 port [tcp/https] succeeded
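The manual `nc` check above can be wrapped in a small helper so the same test is repeated identically after every attempted fix. This is a minimal sketch: the function name `is_unreachable` is made up for this example, and it treats any failure to connect (including a DNS error or a missing `nc` binary) as "unreachable", so it is a quick signal rather than proof that a firewall rule is working.

```shell
#!/usr/bin/env bash

# is_unreachable HOST PORT
# Exit status 0 when a TCP connection cannot be opened within 3 seconds
# (the outcome we want after blocking), 1 when the connection succeeds.
is_unreachable() {
  local host="$1" port="$2"
  # -z = only probe the port, send no data; -w 3 = give up after 3 seconds
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Usage, with a hypothetical destination:
#   if is_unreachable example.com 443; then
#     echo "blocked"
#   else
#     echo "still reachable"
#   fi
```

Note that this only tests new connections from the machine where it runs; it says nothing about connections that were already open elsewhere.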

Is it me? Or NewRelic?

The Epiphany

$ kubectl scale --replicas=0 deployment/prometheus
deployment.apps/prometheus scaled

# Modifying the Security Group Rule
# 1. AWS Console > EC2 > Security Groups > Edit Security Group
# 2. Outbound rules > Remove ``
$ kubectl scale --replicas=1 deployment/prometheus
deployment.apps/prometheus scaled
# 3. NewRelic > Check for new data > Tada! No new data!
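The stop, change, start sequence above can be captured in a reusable helper. One plausible explanation for why it worked (an assumption, not something the steps above state) is that connections opened before a firewall change can stay alive afterwards, so the workload has to be restarted before the new rule is actually exercised. In the sketch below, the `KUBECTL` override and the `scale_deployment` function are conveniences invented for this example; only `kubectl scale --replicas=N deployment/NAME` comes from the transcript.

```shell
#!/usr/bin/env bash
set -euo pipefail

# KUBECTL can be overridden (for example KUBECTL=echo) to preview the
# commands without touching a cluster. This override is a convenience
# added for the sketch.
KUBECTL="${KUBECTL:-kubectl}"

# scale_deployment NAME REPLICAS
scale_deployment() {
  local name="$1" replicas="$2"
  "$KUBECTL" scale --replicas="$replicas" "deployment/$name"
}

# Intended order of operations:
#   scale_deployment prometheus 0   # 1. stop the workload, closing its open connections
#   (change the Security Group rule in the AWS Console)
#   scale_deployment prometheus 1   # 2. start it again; new connections meet the new rule
```

Running with `KUBECTL=echo` first is a cheap way to confirm the deployment name and replica counts before doing it for real.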


