Stop looking at CPU usage; start looking at Linux PSI

parth21shah 4 hours ago

I wrote this because alerting on CPU% / loadavg as a machine health indicator has burned me a few times. The simple split I use now is:

- CPU% = how busy the cores are
- PSI = how much time tasks are stalled (CPU / memory / IO)

In an eBPF agent I am working on (Linnix), I ended up looking at CPU and PSI together. High CPU + high PSI is interesting. High CPU + low PSI is usually just “busy”.

This obviously doesn’t replace latency/SLO alerts at the app level. It’s only about which host metric to look at.
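
For anyone who wants to try the CPU + PSI split by hand, here is a minimal Python sketch. It is not how Linnix does it. It just parses the kernel's `/proc/pressure/<resource>` files (available on Linux 4.20+ with PSI enabled) and combines the `some avg10` value with a CPU% figure you supply. The function names, the labels, and both thresholds (80% CPU, 10% PSI) are made up for illustration; the "stalled" label for low CPU + high PSI is also my own addition and would need tuning per workload.

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/<resource> file.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns e.g. {"some": {"avg10": 0.0, ..., "total": 0.0}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_psi(resource="cpu"):
    """Read PSI for "cpu", "memory", or "io" from procfs (Linux only)."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())


def classify(cpu_percent, psi_some_avg10, cpu_threshold=80.0, psi_threshold=10.0):
    """Combine CPU% and PSI into a rough label.

    Both thresholds are illustrative assumptions, not recommendations.
    """
    busy = cpu_percent >= cpu_threshold
    stalled = psi_some_avg10 >= psi_threshold
    if busy and stalled:
        return "contended"  # high CPU + high PSI: the interesting case
    if busy:
        return "busy"       # high CPU + low PSI: usually just busy
    if stalled:
        return "stalled"    # low CPU + high PSI: tasks waiting anyway
    return "idle"
```

On a host with PSI enabled, something like `classify(cpu_percent, read_psi("cpu")["some"]["avg10"])` gives the label, where `cpu_percent` comes from whatever CPU metric you already collect.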