Elixir v1.8 Release Notes
https://elixir-lang.org/blog/2019/01/14/elixir-v1-8-0-released/
- Custom struct inspections
defmodule User do
  @derive {Inspect, only: [:id, :name, :age]}
  defstruct [:id, :name, :age, :email, :encrypted_password]
end
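A quick usage sketch (field values are made up) of what the derived inspection hides:

user = %User{id: 1, name: "Jane", age: 33,
             email: "jane@example.com", encrypted_password: "secret"}
IO.inspect(user)
# prints roughly: #User<id: 1, name: "Jane", age: 33, ...>
# (:email and :encrypted_password are hidden; the "..." marks the omission)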
- Calendar.TimeZoneDatabase behaviour, Date.day_of_year/1, Date.quarter_of_year/1, etc.
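The new Date helpers in a quick iex session (date chosen arbitrarily):

iex> Date.day_of_year(~D[2019-01-14])
14
iex> Date.quarter_of_year(~D[2019-01-14])
1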
- Faster compilation, more efficient code in some cases
- Improved instrumentation and ownership with $callers
[your code] -- calls --> [supervisor] ---- spawns --> [task]

[your code]              [supervisor] <-- ancestor -- [task]
     ^                                                  |
     |--------------------- caller ---------------------|
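A minimal sketch of reading $callers from inside a task; the supervisor name MyApp.TaskSupervisor is an assumption:

# Assumes a Task.Supervisor is already running as MyApp.TaskSupervisor.
# Since v1.8, tasks record the chain of caller pids in their process
# dictionary under the "$callers" key.
MyApp.TaskSupervisor
|> Task.Supervisor.async(fn ->
  Process.get(:"$callers") |> IO.inspect(label: "callers")
end)
|> Task.await()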
- Started working on mix release
On Infrastructure at Scale: A Cascading Failure of Distributed Systems
https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df
- Target Application Platform (TAP) - a toolchain to manage hosting/deployment/capacity/…
- Many core services - centralized systems for the entire enterprise (db, elk, messaging)
- They heavily leverage Kafka, including for logs and metrics (via TAP). Kafka runs on OpenStack in the DC
- A lot of K8s clusters under TAP + large k8s cluster for dev envs
- sidecars: logs, metrics + consul agent
- Outage:
- OpenStack network upgrade -> several hours of outage -> Kafka is not available
- When Kafka came back online, the log shippers in all the dev envs woke up and sent their logs -> CPU spike
- High CPU spike in the Docker daemon -> k8s nodes “are unhealthy”
- k8s scheduler moved load from “unhealthy” to healthy nodes
- some of the nodes that received the moved load could withstand it, but some couldn’t -> “unhealthy” -> move load to the “healthy” ones -> cycle
- new pods are up -> consul agent registers in gossip -> high consul load
- 41k new k8s pods during this event -> high Consul latency (gossip mesh + high cpu load)
- the gossip mesh increased the CPU load on all the Consul agents -> even higher CPU load
- Consul is “down” (too slow) -> Vault couldn’t work, TAP LB generation couldn’t work
- After DAYS of investigation they fixed Consul+Vault by enabling “gossip encryption” -> the previous (poisoned) gossip messages were rejected
- Their Conclusions:
- Several small k8s clusters are better than one large one
- A shared Docker daemon is a failure point
IMHO:
- What exactly is the “gossip poisoning”?
- Why didn’t it calm down after k8s was fixed, even though the real nodes were never restarted?
- Why could the loggers overload the Docker daemon so badly that nodes became “unhealthy”? Incorrect configuration -> the real cause of the outage, imho
- Good example of a system that looked well-configured but nevertheless broke.
- When building such complex systems, try to predict the unknowns, even though that is very difficult
- Are DaemonSets better? Would they not raise the CPU load as much? In any case, a sidecar shouldn’t be able to break the main container