He is system architect, and always explain system design in OTP terms (even if he’s talking not about Erlang/Elixir)
Where bugs come from: from “unknown”: spectre/meltdown, thundering herd.
Main target: move bugs from “unknown” to the “known” space. How?
Difficult: Increase knowledge of developers.
Simple: Hire more seniors (temptation: send seniors alone to the difficult parts. They’ll leave some day). Culture of teaching, mentorship, training, knowledge sharing.
Have more diverse team. The same background -> the same “known circle”. If you hire a team of people who love bitcoin -> they’ll come up with a solution that includes a blockchain, no matter what the problem was.
Effective approach: exploratory testing. Pros: will discover plenty of bugs. Cons: difficult to find the tester
Fuzzing. AFL (American Fuzzy Lop), назван в честь кролика
In Erlang/Elixir world, Property based testing - is a must have. To find corner cases.
When your team prove itself wrong - they hopefully reduce the gap between “you think you know” and “you know”
Next general approach - do not go to the “unknown” area, put safeguards to prevent we do odd or unpredictable things:
write less code
formal methods (TLA+, exhaustive model checking, automated proof). Everything in your system is proven to work.
strict assertions, type signatures, type analysis
For erlang/elixir:
use dialyzer to prevent unexpected type errors and invalid states.
use code formatters
let it crash. do not try to massage unusable data into acceptable data.
If things go wrong, they won’t go wrong in the places you planned (no logging, few metrics -> 0 knowledge). So learn how to introspect the VM, how to use tracing, dig information (Erlang in Anger)
Make “unknown” irrelevant: Group bugs and handle them the same way
Redundancy: one node breaks, send traffic to another nodes. No matter what the bug is -> overcome it.
Supervisors theory: types, how they start their children (depth-first)
how system works - is an easy stuff. how do you want the system to fail?
Supervisors theory: how to restart/fail
Return error, let those, who has context, to decide.
Drop messages, if you can’t handle them or use dead letter queue. (new message type, that is not handled yet, …)