Elixir v1.8 Release Notes
https://elixir-lang.org/blog/2019/01/14/elixir-v1-8-0-released/
- Custom struct inspections
defmodule User do
  @derive {Inspect, only: [:id, :name, :age]}
  defstruct [:id, :name, :age, :email, :encrypted_password]
end
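A quick usage sketch (field values are made up) of what the derived inspection hides:

user = %User{id: 1, name: "Jane", age: 33,
             email: "jane@example.com", encrypted_password: "secret"}
IO.inspect(user)
# prints roughly: #User<id: 1, name: "Jane", age: 33, ...>
# (:email and :encrypted_password are hidden; the "..." marks the omission)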
- Calendar.TimeZoneDatabase behaviour, Date.day_of_year/1, Date.quarter_of_year/1, etc.
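The new Date helpers in a quick iex session (date chosen arbitrarily):

iex> Date.day_of_year(~D[2019-01-14])
14
iex> Date.quarter_of_year(~D[2019-01-14])
1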
- Faster compilation, more efficient code in some cases
- Improved instrumentation and ownership with $callers
[your code] -- calls --> [supervisor] ---- spawns --> [task]

[your code]              [supervisor] <-- ancestor -- [task]
     ^                                                  |
     |--------------------- caller ---------------------|
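A minimal sketch of reading $callers from inside a task; the supervisor name MyApp.TaskSupervisor is an assumption:

# Assumes a Task.Supervisor is already running as MyApp.TaskSupervisor.
# Since v1.8, tasks record the chain of caller pids in their process
# dictionary under the "$callers" key.
MyApp.TaskSupervisor
|> Task.Supervisor.async(fn ->
  Process.get(:"$callers") |> IO.inspect(label: "callers")
end)
|> Task.await()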
- Started working on mix release
On Infrastructure at Scale: A Cascading Failure of Distributed Systems
https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df
- Target Application Platform (TAP) - a toolchain to manage hosting/deployment/capacity/…
- Many core services - centralized systems for the entire enterprise (db, elk, messaging)
- They heavily leverage Kafka, including for logs and metrics (via TAP). Kafka runs on OpenStack in the DC
- A lot of K8s clusters under TAP + large k8s cluster for dev envs
- sidecars: logs, metrics + consul agent
- Outage:
- OpenStack network upgrade -> several hours of outage -> Kafka is not available
- When Kafka came back online, the log shippers in all the dev envs woke up and sent their logs -> CPU spike
- High CPU spike in the Docker daemon -> k8s nodes “are unhealthy”
- k8s scheduler moved load from “unhealthy” to healthy nodes
- some of the nodes that received the moved load could withstand it, but some couldn’t -> “unhealthy” -> move load to the “healthy” ones -> cycle
- new pods are up -> consul agent registers in gossip -> high consul load
- 41k new k8s pods during this event -> high Consul latency (gossip mesh + high cpu load)
- the gossip mesh increased the CPU load on all the Consul agents -> even higher CPU load
- Consul is “down” (too slow) -> Vault couldn’t work, TAP LB generation couldn’t work
- After DAYS of investigation they fixed Consul+Vault by enabling “gossip encryption” -> the previous (poisoned) gossip messages were rejected
- Their Conclusions:
- Several small k8s clusters are better than one large one
- A shared Docker daemon is a failure point
IMHO:
- What exactly is the “gossip poisoning”?
- Why didn’t it calm down after k8s was fixed, even though the real nodes were never restarted?
- Why could the loggers overload the Docker daemon so badly that nodes became “unhealthy”? Incorrect configuration -> the real cause of the outage, imho
- Good example of a system that looked well-configured but nevertheless broke.
- When building such complex systems, try to predict the unknowns, even though that is very difficult
- Are DaemonSets better? Would they not raise the CPU load as much? In any case, a sidecar shouldn’t be able to break the main container