This is the first of what I hope will be many posts about my learnings with Kubernetes (k8s).
This link can provide some context on the specific application this applies to.
First, a little bit about the simulation service, which I'll be calling AtriarchSimc. Note: this isn't a brand name, but it helps me differentiate between SimulationCraft, RaidBots, and my own homebrewed application.
Each deployed instance of the AtriarchSimc app engine runs SimulationCraft internally. This matters because one of the reasons RaidBots is so helpful is that typical home PCs may not have enough RAM to support very large simulations. I personally wonder whether this is a drawback of SimulationCraft itself; if I get the time, I'd like to see whether batching simulation results to disk could reduce RAM usage. However, a lot of good people contribute to SimC, so I'm not optimistic it will be that easy.

But I digress. Because of this high RAM usage, my service limits simulation size to prevent cluster nodes from being overburdened. This limit is only a half fix, though: RAM usage is not immediate, and utilization builds up over time as the simulation runs. If multiple simulations are started at the same time and the app engine scales up, k8s can place multiple instances of the engine on the same node. If the simulations are sufficiently large, they can run the node out of memory. It's also possible that k8s will evict one of the engines, stopping the simulation, requiring the system to hand the simulation to the next available engine and losing any progress.
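For context, here's a rough sketch of how memory requests and limits could be set on the engine container. The container name, image, and values below are placeholders of mine, not taken from the actual deployment; the point is that requests and limits help the scheduler and cap a single container, but they don't by themselves stop two engines from landing on the same node when requests are set below a simulation's peak usage.

  # Illustrative fragment of the engine pod spec; names and values are placeholders.
  # Requests reserve RAM for scheduling and limits cap a single container,
  # but neither prevents two engines from being co-scheduled if requests sit
  # below the simulation's peak memory usage.
  containers:
  - name: engine
    image: atriarch-simc-engine:latest   # placeholder image
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "8Gi"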
Enter the Kubernetes documentation, and as good as it is… it can be a lot to digest for someone who is still learning. So, in this post I'm hoping to answer the narrow question: "How do I ensure that I never have more than one copy of the same pod on the same node?"
One solution I came across was Pod Topology Spread Constraints. This is a pretty useful tool for enforcing relatively even distribution of pods across nodes; however, for my use case it couldn't be used to force no more than one instance of a specific pod onto any node.
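For reference, here's roughly what that configuration looks like on the engine's pod spec (a sketch based on the documented topologySpreadConstraints fields). With maxSkew: 1 the spread stays even, but once there are more replicas than nodes it will still happily place a second engine on a node, which is exactly what I was trying to avoid.

      # Sketch only: spreads engine pods evenly across nodes, but does not
      # hard-cap them at one per node once replicas outnumber nodes.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: atriarch-simc-engine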
TLDR:
At the end of my searching I found pod affinities and anti-affinities. This was exactly what I was looking for, and I finally settled on the following deployment YAML. (Note: this YAML has irrelevant sections removed for brevity.)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atriarch-simc-engine
spec:
  selector:
    matchLabels:
      app: atriarch-simc-engine
  template:
    metadata:
      labels:
        app: atriarch-simc-engine
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - atriarch-simc-engine
            topologyKey: "kubernetes.io/hostname"
      # containers and other sections omitted for brevity
Using pod anti-affinity, I was able to tell the default scheduler that if a node already has a pod with the label app: atriarch-simc-engine scheduled on it, it should not schedule another pod with the same label there. The node is identified by the topology key kubernetes.io/hostname, so no single node (hostname) ever ends up hosting two copies of the same pod. This lets my engine pods scale up to the number of worker nodes in the cluster and spread the RAM load while multiple engines run simulations (with the "required" form of the rule, any replicas beyond the node count simply stay Pending until a node frees up).
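A quick way to sanity-check the behavior is to look at which node each engine landed on; the label below matches the deployment above, and the NODE column should show a distinct node per pod.

kubectl get pods -l app=atriarch-simc-engine -o wide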

