Troubleshooting the right way

Change your mindset; it’s a challenge

Instead of thinking:

  • Why do I need to fix this issue?
  • I wish this issue wasn’t time-critical for the Developers’ team
  • I hate issues

Tell yourself:

  • This challenge tickles my brain; I’m glad I was assigned to do it
  • The more time-critical the task is, the greater the challenge, and I love being “the one who solves things quickly”
  • I love challenges

Cheatsheet

  1. Define the challenge in simple words
  2. Decide how you will test that it works: write the test down and don’t adjust it to fit your solution. Defining the test and its expected result beforehand mitigates the risk of being biased towards a specific solution
  3. Map the components: At the very least, write (type) down the components that should be analyzed. Better still, sketch the components and the flow of the process that needs to be analyzed. Best of all, draw a diagram in draw.io, but keep in mind that it’s time-consuming and only necessary for very complicated challenges
  4. Prioritize the solutions from best to worst according to:
  • Reliability: A permanent solution needs to be reliable (stable) and requires a good design. Applying an ad-hoc solution “just to make it work” will make troubleshooting harder for your colleagues or future you
  • Time to first results: If a solution fails quickly, it’s easier to move on to the next one. Don’t start with the “longest” solution; you’ll be exhausted by the time you get to the others on the list
  • Effort to explain the solution: If you aim for a permanent solution, then your colleagues, or future you, should also understand the logic behind it. If it takes 5 hours to explain the solution, then it should be prioritized very low or probably dropped

Real-life technical challenge

Analyzing

  1. Cloud-provider firewall (AWS Security Group): Remove the rule that allows outbound traffic to 0.0.0.0/0 (Difficulty: Simple via AWS console)
  2. Server’s (EC2 instance) firewall (ufw): Add a rule that denies all outbound traffic to 0.0.0.0/0 to the EC2 instance firewall. A quick Google search got me to this Stack Overflow answer, which provided a fairly easy solution (Difficulty: Okayish, requires SSH to the EC2 instance)
  3. Application: Add a Kubernetes Network Policy that blocks egress to 0.0.0.0/0. Requires writing a YAML file and dealing with the Kubernetes ecosystem (Difficulty: Overkill, do it when all else fails)
  4. Subnet’s Network Access Control List (NACL): Add an outbound deny rule for 0.0.0.0/0 to the subnet’s NACL. Dropped this one because it might affect other resources in the same subnet, which in turn can disturb other developers’ work (Difficulty: Simple via AWS console)
  5. Subnet’s Route Table: Remove the route to 0.0.0.0/0 from the route table. Dropped for the same reason I dropped NACLs (Difficulty: Simple via AWS console)

Prioritization criteria, in the order applied:

  1. Time to first results
  2. Reliability
  3. Effort to explain
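
The ranking can also be sketched as a small script. This is only an illustration: the numeric scores (1 = best) are assumed values loosely based on the difficulty notes above, and `rank_solutions` is a hypothetical helper, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical scores (1 = best) for the candidates that were not dropped.
# Columns: name | time to first results | reliability | effort to explain
candidates='Kubernetes Network Policy|3|1|3
Security Group rule|1|1|1
Instance firewall (ufw)|2|2|2'

# rank_solutions: read "name|time|reliability|effort" lines from stdin and
# print the names sorted by time first, then reliability, then effort.
rank_solutions() {
  sort -t'|' -k2,2n -k3,3n -k4,4n | cut -d'|' -f1
}

ranked="$(printf '%s\n' "$candidates" | rank_solutions)"
printf '%s\n' "$ranked"
```

Writing the scores down before trying anything keeps the order honest: the list is fixed up front instead of being rearranged to favor whichever solution already looks promising.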

Iterating over the solutions

Security Group Rule

  1. AWS Console > EC2 > Security Groups > Edit Security Group
  2. Outbound rules > Remove 0.0.0.0/0
  3. New Relic > Check for new data > No good, data is still coming
  4. Added back the rule to allow outbound to 0.0.0.0/0

Server’s firewall (ufw)

  1. SSH to EC2 instance
  2. Execute sudo ufw default deny outgoing
  3. New Relic > Check for new data > No good, data is still coming

# -v = verbose
# -w = timeout after 3 seconds
# 443 = check this port, in our case it's HTTPS
$ nc -v -w 3 metric-api.newrelic.com 443
nc: connect to metric-api.newrelic.com port 443 (tcp) timed out: Operation now in progress
# This is good. It means that all access to 0.0.0.0/0 is blocked

# Example for a successful response
# Keep in mind that we DON'T want it to succeed
$ nc -v -w 3 metric-api.newrelic.com 443
Connection to metric-api.newrelic.com 443 port [tcp/https] succeeded
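
The manual check above can be wrapped in a small helper so it is easy to repeat after every attempted fix. This is a minimal sketch, assuming a netcat build that supports `-z` and `-w`; `egress_status` is a hypothetical name. Note that it reports "blocked" for any failed connection, including a DNS failure, so it cannot tell a firewall drop from other network problems.

```shell
#!/usr/bin/env bash
# egress_status HOST PORT [TIMEOUT_SECONDS]
# Prints "reachable" if a new TCP connection can be opened, "blocked" otherwise.
egress_status() {
  local host="$1" port="$2" timeout="${3:-3}"
  # -z: only probe the port without sending data
  # -w: give up after $timeout seconds
  if nc -z -w "$timeout" "$host" "$port" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "blocked"
  fi
}

# Usage (we WANT this to print "blocked"):
#   egress_status metric-api.newrelic.com 443
```

Like the `nc` command it wraps, the helper only tests whether a new connection can be opened; it says nothing about connections that were established before the rule change.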

Is it me? Or New Relic?

The Epiphany

$ kubectl scale --replicas=0 deployment/prometheus
deployment.apps/prometheus scaled

# Modifying the Security Group Rule
# 1. AWS Console > EC2 > Security Groups > Edit Security Group
# 2. Outbound rules > Remove `0.0.0.0/0`

$ kubectl scale --replicas=1 deployment/prometheus
deployment.apps/prometheus scaled

# 3. New Relic > Check for new data > Tada! No new data!
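
The sequence above suggests that connections opened before the rule change kept delivering data, and that only connections opened afterwards were blocked. Here is a minimal sketch of scripting the same three steps, assuming the AWS CLI and kubectl are installed and configured. `block_egress_with_restart`, `run`, and the `DRY_RUN` variable are hypothetical names, and the Security Group ID in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# block_egress_with_restart SECURITY_GROUP_ID DEPLOYMENT
block_egress_with_restart() {
  local sg_id="$1" deployment="$2"
  # 1. Stop the pods so that no connection opened earlier stays alive
  run kubectl scale --replicas=0 "deployment/${deployment}"
  # 2. Remove the allow-all outbound rule from the Security Group
  run aws ec2 revoke-security-group-egress --group-id "$sg_id" \
    --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
  # 3. Start the pods again; they now have to open new connections
  run kubectl scale --replicas=1 "deployment/${deployment}"
}

# Usage (prints the commands without executing them):
#   DRY_RUN=1 block_egress_with_restart sg-0123456789abcdef0 prometheus
```

The dry-run switch is there so the commands can be reviewed before they touch a shared environment.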

References

Final words
