Looking into OpenStack Eris - とんちゃんといっしょ

While researching Chaos Engineering, I heard that "OpenStack also has a project that tests OpenStack itself through fault injection," so I looked into it.

OpenStack Eris

docs.openstack.org

A project to build a testing framework and test suites that put OpenStack under various kinds of stress, with the goal of improving its performance and resilience.

It is also (planned to be) able to do:

Load Injection
Fault Injection

Information about OpenStack Eris

I was told there were two forum sessions on it at the OpenStack Summit Sydney.

Extreme/Destructive Testing

www.openstack.org

LCOO = Large Contributing OpenStack Operators

LCOO - OpenStack

Topics discussed at the forum

What sort of tests do you run today that are destructive in nature?
What is your desired workflow for issues that come up?
How do you determine success/failure of your testing scenarios?
Do you publicly publish your results (if so where)?
What KPI do you evaluate today?
What workloads do you run and against what architectures?

The session was split into three parts:

References: Extreme testing is non-deterministic. Such testing generally is valid with the following reference.
Test Suite: What test scenarios are we trying to get done?
Frameworks & Tools: How do we enable the test suite?

The etherpad is here: https://etherpad.openstack.org/p/SYD-extreme-testing

The forum apparently focused on what kinds of tests should be run. The log also records a Q&A on how Eris differs from Rally:

- What is the difference between Eris and Rally? (From spec it looks the same) (boris-42)
   (samP) Discussion is here http://lists.openstack.org/pipermail/openstack-dev/2017-November/124156.html
   (gautamdivgi) 
We are looking for a solution/mechanism where there the capacity for
- Load generation (Control + Data Plane)
- Failure injection (all over the cloud)
- Data capture (all touch points - not just API requests)
- KPI calculation (all touch points - not just API requests)
- Most importantly - have one part "influence" the other (E.g. run failures or a sequence of failures based on system criteria like netstat session counts or free memory remaining)
Rally does load generation and there are failure injection hooks - but cannot participate in more complex feedback, monitoring & computation mechanisms.
Rally is used today to generate load. Definitely looking to contribute some plugins for load gen & metrics collections enhancements.
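The "influence" requirement above (triggering failures based on system criteria such as netstat session counts or free memory) is essentially a feedback loop between monitoring and fault injection. Below is a minimal Python sketch of that idea, assuming a monitoring agent already supplies metric snapshots. The `Metrics`/`Thresholds` fields, the `choose_fault` and `run_loop` helpers, and the fault names are all hypothetical illustrations, not part of Eris, Rally, or os-faults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    # Hypothetical snapshot of node state sampled by a monitoring agent
    established_sessions: int  # e.g. a count derived from netstat output
    free_memory_mb: int


@dataclass
class Thresholds:
    # Hypothetical trigger points; real values depend on the cloud under test
    min_sessions: int   # only inject once at least this much traffic exists
    low_memory_mb: int  # free memory at or below this counts as "starved"


def choose_fault(m: Metrics, t: Thresholds) -> Optional[str]:
    """Return a (hypothetical) fault name to inject, or None to keep waiting.

    The decision depends on what monitoring reports, so one part of the
    test ("data capture") influences another ("failure injection").
    """
    if m.established_sessions < t.min_sessions:
        return None  # not enough load yet; a failure now would say little
    if m.free_memory_mb <= t.low_memory_mb:
        return "kill_service"  # already memory-starved: add a process failure
    return "network_partition"  # busy but healthy: cut a network link


def run_loop(samples: list, t: Thresholds) -> list:
    """Walk metric samples in order and record the fault each would trigger."""
    return [choose_fault(m, t) for m in samples]
```

For example, with `Thresholds(min_sessions=100, low_memory_mb=512)`, a sample of 50 sessions triggers nothing, while 200 sessions with 256 MB free selects `"kill_service"`. A real framework would call an actual injector at that point; keeping the decision as a pure function just makes the feedback rule easy to test on its own.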

LCOO-Extreme Testing-QA-ERIS

www.openstack.org

Apparently Eris was demoed in this session, but since it was a forum session there seems to be no video.

Instructions for the Eris demo

openstack-lcoo.atlassian.net

The etherpad is here: https://etherpad.openstack.org/p/LCOO-Extreme_Testing-QA-ERIS

The log records a Q&A on whether tools such as os-faults would be used, and the suggestion was to support them as plugins of the framework:

Will Eris support os-faults in the future? https://github.com/openstack/os-faults
- Need for extensible pluggable framework for fault injection, e.g.
    - Define levels of redundancy within infrastructure and inject faults up to the maximum which can be tolerated given that amount of redundancy (e.g. 2 node failures in a 5-node cluster)
    - Use os-faults to test failure scenarios defined by Self-Healing SIG and ensure that self-healing works
- Preference for reusing existing projects if possible
- If not possible, include gaps analysis in the spec
Maybe split spec into smaller pieces?
Can some example results / demo be published so that people can start looking at Eris and get a feel for what it can do?
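The "2 node failures in a 5-node cluster" example above matches the arithmetic of a majority quorum: a cluster of n nodes needs floor(n/2) + 1 nodes to keep quorum, so it can lose at most n - (floor(n/2) + 1) = (n - 1) // 2 nodes. Here is a small Python sketch of that calculation, assuming the service under test uses simple majority quorum (other redundancy schemes would need a different rule). The `fault_plan` escalation helper is hypothetical and not part of Eris or os-faults.

```python
def max_tolerable_failures(cluster_size: int) -> int:
    """Node failures a majority-quorum cluster can absorb and stay available.

    Quorum is floor(n/2) + 1 nodes; every node beyond that may fail,
    which works out to (n - 1) // 2.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    quorum = cluster_size // 2 + 1
    return cluster_size - quorum


def fault_plan(cluster_size: int) -> list[int]:
    """Hypothetical escalation: fail 1, 2, ... nodes up to the tolerable maximum."""
    return list(range(1, max_tolerable_failures(cluster_size) + 1))
```

For the 5-node case, `max_tolerable_failures(5)` returns 2 and `fault_plan(5)` returns `[1, 2]`, so injecting a third failure would be expected to break quorum rather than test redundancy.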

Summary

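I could only find information from about two years ago, so there does not seem to have been much progress since. If you want to apply Chaos Engineering to OpenStack, it may be better to use a different tool.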