Releases: litmuschaos/litmus

1.8.0

15 Sep 17:18
b8b4ade

New Features & Enhancements

  • Introduces the alpha-0 version of Litmus Portal. The portal helps you to execute & visualize chaos workflows, amongst many other things. Learn more about it here

  • Extends Litmus Probes with a “Continuous” mode to validate the hypothesis around application behavior throughout chaos execution, rather than only at specific points/phases (start & end of chaos)

  • Adds Node & Pod level I/O stress chaos experiments with the ability to tune worker threads and filesystem usage, to the generic experiment suite.

  • Supports network chaos on Containerd & CRI-O runtimes, in addition to Docker.

  • Supports network chaos between distinct microservices (in addition to total interface level egress traffic chaos) specified by their IPs or hostnames/service FQDNs

  • Enhances the ChaosSchedule schema for repeat mode by adding IncludedHours & IncludedDays. The StartTime/EndTime definitions have been made optional to allow flexibility in being able to run from the point of creation of schedule CR or indefinitely until removal.

  • Migrates Cassandra ring disruption experiment to go-based chaoslib

  • Adds the ability to specify a target pod (env: TARGET_POD) or node (env: APP_NODE) as the application/resource under test, apart from randomized selections based on labels.

  • Enables the definition of blast radius for an application as a percentage value (PODS_AFFECTED_PERCENTAGE), by which an appropriate number of replicas undergo the specified chaos in parallel.

  • Improves the litmus chaoslib to take container fs & runtime socket file paths as tunables to support different Kubernetes platforms

  • Includes an additional pumba-based chaoslib for cpu/memory stress that uses external chaos containers (non-pod exec mode)

  • Adds chaos command tunables (for chaos injection & revert) for cpu/memory chaoslib (in pod exec mode) - in order to cover different base images & distros.

  • Supports broader filtering of pods within a namespace when no application labels are provided in .spec.appInfo. Users can also choose to skip the specification of application namespace explicitly, in which case the target pods are selected randomly from the ChaosEngine resource namespace.

  • Modifies the litmus chaos containers (operator, runner) to run with non-root users

  • Allows the definition of an INSTANCE_ID in the ChaosEngine to provide additional context or metadata to an experiment run. This also aids the creation of newer ChaosResult resources instead of patching/overwriting existing ones in case of repeated executions.

  • Improves the experiment code standards by fixing the issues listed in the GoGitOps report card for the litmus-go repository.

  • Generates events against the ChaosResult resource to indicate the experiment verdict (Pass, Fail, Stopped). These are useful in annotating monitoring dashboards with experiment results.

  • Enhances the Chaos Exporter to push chaos metrics to AWS CloudWatch

  • Improves the kubernetes-chaos helm chart by including options in the values.yaml to selectively install experiments via a whitelist/blacklist. Also maps the experiment names to reflect those on the ChaosHub.

  • Enhances the litmus-e2e with increased reporting around component-tests, the addition of e2e tests for new experiments, and Docker-based Gitlab runner for litmus-portal pipelines

  • Provides additional documentation based on experiment enhancements. Updates the get started documentation for general Kubernetes/OpenShift/Rancher platforms.

  • Enhances the litmus-demo scripts to generate a pdf report for the chaos experiments executed

  • Operationalizes the Litmus community Special Interest Groups (SIGs) for Documentation, Observability & Integrations.
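
The blast radius tunable (PODS_AFFECTED_PERCENTAGE) described in the list above can be read as a simple calculation over the candidate pods. A minimal sketch, assuming the count is rounded down and never drops below one pod; the function name and the rounding rule are illustrative assumptions and are not taken from the Litmus source:

```python
import random


def select_target_pods(pods, affected_percentage, rng=None):
    """Pick the pods that undergo chaos for a given blast radius.

    Assumption: the percentage is rounded down and at least one pod is
    always selected. The real chaoslib may round differently.
    """
    if not pods:
        return []
    if not 0 <= affected_percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # Integer count of pods that fall within the requested percentage.
    count = max(1, int(len(pods) * affected_percentage // 100))
    rng = rng or random.Random()
    # Sample without replacement so no pod is targeted twice.
    return rng.sample(list(pods), count)
```

With ten replicas and a value of 50, five distinct pods would be returned for parallel chaos under these assumptions.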

Major Bug Fixes

  • Constructs ChaosResult name using experiment names passed from the ChaosExperiment resource instead of hardcoded experiment names

  • Fixes the chaos verification (whether chaos injection has occurred) steps in the container-kill experiment & retains the helper containers in case of errors for further debugging

  • Fixes the chaos event messages to be meaningful & include probe information only when the probes are defined

  • Removes the need for privileged containers to execute disk-fill chaos experiment

  • Handles the case where cpu/memory hog chaos processes are terminated or the target containers are OOM-Killed (this typically occurs when the memory hog/injection value exceeds resource limits set against the pods/containers). The error code 137 is handled appropriately with warning logs and the experiment proceeds with verification steps instead of erroring out/failing (the OOM-Kill is an expected behavior based on inputs provided)

  • Fixes the behavior in node-memory hog experiments where the provided input (percentage of node memory) is measured against the available memory instead of the total system memory

  • Propagates the custom chaos experiment annotations provided in the ChaosExperiment to the helper pods, if any. This is especially useful in cases where annotations decide scheduling or are mapped to certain IAM roles/accounts, etc.
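
The error code 137 mentioned in the fixes above follows the common convention of 128 plus the signal number, with SIGKILL being signal 9 on Linux. A hedged sketch of how a caller might treat that code as a warning rather than a failure; the function name and the returned labels are hypothetical and only illustrate the decision:

```python
# Signal number of SIGKILL on Linux, which the kernel OOM killer sends.
SIGKILL_NUMBER = 9
# Conventional exit status of a process terminated by SIGKILL (128 + 9).
SIGKILL_EXIT_CODE = 128 + SIGKILL_NUMBER


def classify_chaos_exit(exit_code):
    """Map a chaos process exit code to a handling decision (illustrative)."""
    if exit_code == 0:
        return "ok"
    if exit_code == SIGKILL_EXIT_CODE:
        # Expected when the hog exceeds the container's resource limits:
        # log a warning and continue with the verification steps.
        return "warn-and-continue"
    return "fail"
```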

Deprecations & Breaking Changes

  • The instance count (.spec.schedule.instanceCount) property on the ChaosSchedule has been deprecated in favor of maintaining just the minChaosInterval as a means of defining chaos cadence.

Major Known Issues & Limitations

Issue

  • The network chaos experiments (especially on the docker runtime, using the litmus pumba lib) can end up with a Failed ChaosResult and the app stuck in a CrashLoopBackOff state when the application deployment is configured with liveness probes that access health/service endpoints. Typically, this lib injects the tc netem rule against the interface by running a “chaos container” that attaches to the network namespace of the target container via the target’s container ID. The same ID is used by a subsequent container launched to revert the rule/chaos. However, with liveness probes, the container is restarted several times over the chaos duration, causing the ID to change. The revert then fails and the network rule persists (courtesy of the Kubernetes pause container for this app pod), leading to the app entering a CrashLoopBackOff state.

Current Workaround

  • Delete/reschedule the target pod manually to recreate the pause container/network namespace.
  • Use Target IPs or Hosts to inject the chaos between specific microservices while keeping the probe alive.

Note: This is expected to be fixed in a 1.8.x patch release

Issue

  • The kubelet-service-kill experiment currently uses systemctl to stop/start the service. Running this experiment without an external LIB_IMAGE (that is, leveraging the experiment image itself) can throw the error “Failed to connect to bus: No data available”, as the experiment runs with a non-root user.

Current Workaround

  • A standard Ubuntu image that runs as root can be used in a “helper” pod that injects this chaos. However, user discretion is advised in terms of providing this access.

Issue

  • The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) with the default lib can fail even though chaos is injected successfully. This is due to the unavailability, in the target’s image, of certain default utils that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Current Workaround

  • Users can identify the necessary commands to derive and kill the chaos PIDs and pass them to the experiment via env variable CHAOS_KILL_COMMAND

  • Alternatively, they can make use of the chaos lib that uses external containers with SYS_ADMIN docker capability to inject/revert the chaos, while mounting the runtime socket file. Note that this is supported only on docker at this point.

Note: This is expected to be fixed in a 1.8.x patch release

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.8.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos
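
The CRD check above can also be scripted. A minimal sketch that scans the text printed by `kubectl get crds` for a set of expected names; the three CRD names listed are an assumption about what the operator manifest installs, and the function name is hypothetical:

```python
# Assumed CRD names; adjust to whatever the installed manifest defines.
EXPECTED_CRDS = frozenset({
    "chaosengines.litmuschaos.io",
    "chaosexperiments.litmuschaos.io",
    "chaosresults.litmuschaos.io",
})


def missing_crds(kubectl_output, expected=EXPECTED_CRDS):
    """Return the expected CRD names absent from `kubectl get crds` output."""
    found = set()
    for line in kubectl_output.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated column holds the CRD name.
            found.add(fields[0])
    return sorted(set(expected) - found)
```

Feeding it the captured output of the command returns an empty list when every expected CRD is present, and the sorted missing names otherwise.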

For more details refer to the documentation at Docs

1.8.0-RC2

15 Sep 04:21
3c34d21
Pre-release
Merge pull request #2071 from rajdas98/cherry-pick-1.8.0-rc2

Cherry pick 1.8.0 rc2

1.8.0-RC1

10 Sep 09:49
8623ee3
Pre-release
chore: (litmus-portal) Refactoring and bug fixes (#2027)

This commit has the following changes:
- folder structure change for models and useEffect fixes
- user redux fixed
- graphql documents re-organised

Signed-off-by: arkajyotiMukherjee

1.7.0

15 Aug 16:19
ae0b913

New Features & Enhancements

  • Introduces experiment probes to enable declarative specification of entry/exit (success) criteria via the chaosengine. This release supports the Command, Kubernetes & HTTP probe types that can be configured in SoT (Start of Test), EoT (End of Test) & Edge execution modes. With this, users can reuse generic experiments to test a variety of app-specific/context-specific chaos scenarios.

  • Enhances the chaosresult status schema to include the ProbeSuccessPercentage score that gives an overview of the app/infra resilience to a specific chaos experiment run

  • Refines operational modes of litmus: Introduces namespaced operator support in helm charts to support multi-developer/shared cluster use-case with dedicated namespaces, such as in the Okteto Cloud, while updating the admin & standard mode functionality to watch engine resources in litmus & across namespaces respectively

  • Adds functionality to look for target applications in the chaosengine resource namespace if the target namespace is not explicitly specified.

  • Validates/prevents malformed application labels in the chaosengine

  • Improves the ChaosEngine status schema to hold more info (experiment pod names, runner names) that can aid other tools/abstractions running the experiment to derive/parse useful info for further reuse (logs extraction, for ex.)

  • Adds Microsoft Azure Kubernetes Service (AKS) as a supported platform for the generic experiment suite.

  • Adds a new chaos experiment to scale pods/test node autoscale functionality

  • Adds the libraries for the execution of AWS chaos using chaostoolkit, orchestrated by Litmus.

  • Adds support for the specification of host file mounts in chaos experiments

  • Allows setting polling intervals and timeouts for status checks via chaosengine to aid tuning execution for slower environments

  • Removes dependencies on multiple experiment “helper” (auxiliary) images and makes the litmus go-runner self-sufficient in handling the required chaos business logic. This eases maintenance, especially in the case of air-gapped environments / downstream projects that build the litmus components in their respective CI/CD pipelines.

  • Enhances the experiment to “fail fast” upon failed app checks in cases where containers are terminated

  • Upgrades the ansible-runner to use python3

  • Enhances the developer experience for litmus chaos experiments by using Okteto CLI to develop & test experiment business logic in-cluster over repeating image-build-job-run cycles

  • Updates the scaffold utils to generate the experiment bootstrap code based on the latest developments in the experiment structure.

  • Adds chaos-instrumented grafana dashboards for the sock-shop application along with details on setting up monitoring for chaos experiment runs.

  • Adds pre-defined/usable workflows for repeatable execution of node resource chaos in the chaos-charts repo

  • Pushes the technical preview / pre-alpha version of the litmus-portal (available on the master branch).

  • Refactors the litmus-e2e repo/code-structure to simplify the addition of new BDD tests (modularization, removal of bash utils, formatted errors, klog usage, scenario coverage parameters)

  • Adds additional stages in litmus-e2e GitLab pipelines to execute both the go-based & ansible-based chaos experiments

  • Improves github-actions based comment-triggered e2e runs with log details

  • Features a completely revamped & improved ChaosHub

  • Improves the project wiki with more information for users and developers (architecture docs, video tutorials, charters for the Litmus Special Interest Groups)
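
The ProbeSuccessPercentage score mentioned in the list above can be pictured as the share of probes that passed in a run. A hedged sketch, assuming every probe counts equally and that a run without probes scores 100; the notes do not state the exact scoring rule, so both the formula and the function name are assumptions:

```python
def probe_success_percentage(probe_verdicts):
    """Percentage of probes that passed, as an integer from 0 to 100.

    `probe_verdicts` maps a probe name to True (passed) or False (failed).
    Assumption: equal weighting, and 100 when no probes are defined.
    """
    if not probe_verdicts:
        return 100
    passed = sum(1 for ok in probe_verdicts.values() if ok)
    # Integer division keeps the score a whole number.
    return passed * 100 // len(probe_verdicts)
```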

Major Bug Fixes

  • Patches the chaosengine with the right status (‘stopped’) and fixes the event to provide the right reason in cases where app filtering is unsuccessful. This will allow a re-apply of the engine to re-trigger the application.

  • Adds a check to factor-in cordoned (SchedulingDisabled) status of nodes in kubelet & docker-service kill experiments.

  • Provides the tc_image used in network chaos experiments as an experiment tunable over hardcoding in order to support users with internal image registries

  • Decides experiment termination based on chaos container status over that of chaos pod objects to support operations in a service-mesh environment (istio, linkerd) where all pods (including chaos resources) are injected with sidecars. Without this, the experiment runs forever due to the proxy sidecars.

  • Sets the restart policy of the experiments jobs to Never over OnFailure to prevent repeated re-execution for certain experiment failure conditions.

  • Fixes the incorrect eventType for chaos events in cases of failures & skipped executions.

  • Fixes the go-based pod-cpu-hog & pod-memory-hog experiments to execute the chaos processes (commands) in the target container by passing them as args to a shell instance (/bin/sh -c) to account for targets which may run with different entrypoints.

  • Fixes permission issues on the infra helm chart resulting in failed metrics collection
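
The /bin/sh -c fix in the list above amounts to handing the whole chaos command to a shell as a single argument, so it does not depend on the target container's entrypoint. A small sketch of that wrapping, run locally with `subprocess` as a stand-in for a pod exec call; both function names are hypothetical:

```python
import subprocess


def shell_exec_args(command):
    """Wrap a command string so that it is interpreted by /bin/sh."""
    return ["/bin/sh", "-c", command]


def run_in_local_shell(command):
    """Run the wrapped command locally, standing in for a pod exec call."""
    result = subprocess.run(
        shell_exec_args(command), capture_output=True, text=True, check=False
    )
    return result.returncode, result.stdout
```

Because the command travels as one string, shell features such as pipes and `&&` chains keep working regardless of how the container image was started.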

Breaking Changes

  • Stops support for the ansible-runner/executor (EoL) (Not to be confused with the ansible-based chaos experiments)

  • Removes the following repositories:

    • litmuschaos/pages: The operator manifests are available over gh-pages sourced out of litmuschaos/litmus

    • litmuschaos/chaos-helm: The experiments helm chart is now also in the litmus-helm repo.

    • litmuschaos/community: The demo procedures & community info are now available within the litmus-demo & the litmus repo respectively.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.7.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.6.0

15 Jul 15:13
38b701c

New Features and Enhancements

  • Specification of pod and container security context for the experiment resources via chaosexperiment spec
  • Introduces pod scheduling policy support via NodeSelector specification on the chaosengine (instance-specific attribute)
  • Ability to override experiment images from the chaosengine
  • Pushes an experiment execution summary event on the chaosresult resource
  • Adds the network chaos experiment to induce packet duplication
  • Adds node chaos experiment to force pod evictions via taints
  • Adds service chaos experiment to kill docker service on the node
  • Extends the golang chaoslib support for all existing chaos experiments in the generic suite
  • Validation webhook enhancements to verify if application labels specified in the chaosengine are propagated to pod templates of the applications under test (AUT)
  • Adds examples to illustrate litmus chaos-workflows: an nginx benchmark run with the Apache benchmark tool alongside parallel pod-kills
  • Migrates the ansible-based chaos experiments to the litmus-ansible repo from litmuschaos/litmus in line with the litmus-go, litmus-python repo structure
  • Improves the unit-test based coverage for chaos operator by 30%
  • Extends the capability to trigger on-demand e2e runs for PRs via GitHub comments to the chaos operator
  • Adds framework to determine e2e coverage percentage based on comparison of executed tests in the pipeline versus test plan
  • Introduces an e2e portal to view e2e pipeline data and coverage
  • Improves the Travis-based CI pipeline of the test-tools repo to build images only if the respective Dockerfile or scripts are modified, instead of building all images irrespective of the nature of the commit.
  • Increases sources for (helm-based) litmus installation to include helm hub & jfrog chartcenter artifact repositories
  • Adds betterci integration to charthub to obtain UI/UX previews for PRs
  • Enhances individual experiment documentation with abort procedure & troubleshooting references
  • Enhances the experiment failure and uninstall troubleshooting sections to include more conditions
  • Includes steps to run chaos experiments on rancher platform
  • Includes missing video links/examples for chaos experiments in the generic suite
  • Updates all the litmuschaos websites (docs, charthub, project website) based on CNCF guidelines
  • Enhances the release guidelines doc with an enhanced release checklist
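
The e2e coverage percentage mentioned in the list above compares the tests executed in the pipeline against the test plan. A minimal sketch of that comparison, assuming tests are identified by name and that executed tests outside the plan are ignored; the function name and the rounding are assumptions:

```python
def e2e_coverage(planned_tests, executed_tests):
    """Percentage of the test plan covered by executed pipeline tests."""
    planned = set(planned_tests)
    if not planned:
        return 0.0
    # Only executed tests that appear in the plan count as coverage.
    covered = planned & set(executed_tests)
    return round(100.0 * len(covered) / len(planned), 1)
```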

Major Bug Fixes

  • Fixes invalid Jinja template for chaos injection (helper) pod in the kubelet-service-kill experiment
  • Specifies an upper limit for the memory hog experiment docs based on the current resource exhaustion approach via dd
  • Adds instructions in infra (node) chaos experiments to cordon the AUT before the execution of chaos to prevent the restart of litmus pods
  • Fixes a race condition in the pod-delete experiment where the verdict is flagged as fail despite successful execution
  • Fixes Kafka experiment failure while trying to derive leader broker for the test topic (partition) due to missing ns and improper regex
  • Fixes the coredns experiment regression (caused by the introduction of helper pods logic for the pod-delete experiment) due to a missing lib_image in the experiment CR

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.6.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.5.1

09 Jul 05:16
8ba20bc
[Cherry Pick to 1.5.1] Inhibit experiment image creation from branche…

1.5.0

15 Jun 17:41
993d2e7

New Features and Enhancements

  • Features a revamped chaos charthub with a more resilient design and improved user experience

  • Introduces ability (github workflows) to trigger individual/multiple e2e tests or complete e2e test-suite for litmus PRs via GitHub comments

  • Adds a new repo litmuschaos/litmus-demo to provide a fully packaged demo environment to run chaos under 10 min

  • Adds node service kill chaos libraries (& kubelet kill chaos experiment on specified nodes)

  • Improves the pod cpu hog experiment by adding go chaoslib to support containerd/crio runtime

  • Introduces chaoslib pattern to choose blast radius / percentage (target) pods and abort chaos on target containers

  • Improves the chaos-scheduler controller to halt/resume chaos

  • Enhances the chaos-schedule CR schema to provide dedicated attributes for the schedule modes (now, once, repeat) over mutually-exclusive fields with enhanced OpenAPI schema validation

  • Introduces ImagePullPolicy as a chaosexperiment CR attribute (.spec.definition.imagePullPolicy) to support use cases where experiments need to run with locally built images, as with PR-triggered e2e

  • Enhances the container-kill experiment to repeat the chaos per an interval over a total duration with support for containerd/crio runtime.

  • Adds go-based helper pods for pod-delete and container-kill chaos libraries

  • Improves the litmus-go scaffold tool to use lighter base images & improved default events

  • Improves the validating webhook-based admission controller to call out missed annotations on target applications

  • Improves unit-test coverage for chaos-operator

  • Enhances the getting started (chaosengine construction) & troubleshooting docs (uninstallation steps)
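
Repeating chaos at an interval over a total duration, as described for the container-kill experiment in the list above, can be pictured as a list of injection offsets. A sketch under the assumption that the first kill happens at the start and that no kill begins at or after the end of the duration; the function and parameter names are hypothetical:

```python
def injection_offsets(total_duration, interval):
    """Offsets in seconds at which chaos is injected within the duration.

    Assumption: the first injection is at t=0 and none is started at or
    after `total_duration`.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(0, total_duration, interval))
```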

Major Bug Fixes

  • Fixes the missing/clustered event generation on litmus-go chaos experiment

  • Fixes operator behavior of triggering chaos disregarding annotation status on the target application

  • Fixes the cluster level running experiment count metric from chaos-exporter

  • Adds concurrent updates of the event counter for each iteration of chaos injection

  • Fixes chaos experiment failures (securitycontext additions) on OpenShift 4.3

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.5.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.4.1

03 Jun 05:42
961c7fa
[Cherry-Pick for 1.4.1]  (#1535)

* (chore)roadmap: update roadmap status (#1530)

Signed-off-by: ksatchit

* update(helper-pod): Wait till the helper pod come into running state (#1533)

Signed-off-by: shubhamchaudhary

Co-authored-by: Shubham Chaudhary

1.4.1-RC1

29 May 20:11
af7ce00
Pre-release
fix(pod-delete): Fixing pod-delete chaolib (#1526) (#1528)

Signed-off-by: Udit Gaurav

Co-authored-by: UDIT GAURAV

1.4.0

15 May 12:41
80c61b1

New Features and Enhancements

  • Introduces the ChaosSchedule CRD & Controller to execute background chaos jobs with a variety of scheduling policies: Immediate, at specific timestamp or between a defined start & end time. Supports both randomized as well as strictly scheduled execution of chaos.

  • Introduces argo-based Chaos Workflows as a means to help users construct complex scenarios around chaos experiments such as ability to parallelize benchmark runs with chaos operations. The initial commits include workflows to gauge impact of pod failures on the performance of a multi-replica nginx deployment.

  • Introduces litmus-go - a repo to hold experiments and chaoslib written in golang, with an alpha litmus-go SDK that has the ability to scaffold go experiments, complete with all artefacts, including the chaosexperiment custom resources. Also introduces litmus-python, which primarily holds chaostoolkit-based chaos experiments.

  • Introduces an alpha Validation Webhook for Litmus to offload experiment dependency validation checks from chaos-operator & chaos-runner components.

  • Adds support for chaos on DeploymentConfig resources on OpenShift

  • Introduces ability to insert user-defined annotations into chaos resources (chaos-runner, experiment pods) via chaosengine

  • Adds support for instance specific metadata (id) definition by users to specify the purpose/track chaos experiment and lend uniqueness to the chaosresult via chaosengine environment variable

  • Refactors the chaos exporter metrics to provide aggregated cluster level chaos metrics with improved naming convention.

  • Introduces a suite of standard observability resources to aid with visualization & monitoring of chaos experiments - including events (heptio eventrouter-prometheus-grafana, metricbeat-elasticsearch-kibana), metrics (chaos-exporter-prometheus-grafana) & logs (promtail-loki-grafana).

  • Homogenizes chaos experiments to use LIB model to invoke chaos injection functions

  • Improves the litmus helm chart to support admin mode installation. Also includes optional install of chaos-exporter.

  • Switches the chaos libraries from stress to stress-ng to broaden the chaos support

  • Adds helm chart testing in CI for litmus-helm repo

  • Updates the litmus-e2e gitlab job scripts to function on on-prem Kubernetes clusters over NAT

  • Shifts to Go Modules for dependency management across litmus components

  • Improves general & troubleshooting FAQs on litmus-docs around failed chaos experiment execution.

Major Bug Fixes

  • Fixes inability to run litmus experiment containers in OpenShift due to “AnsibleError: Unable to create local directories” by generating resource manifests from jinja templates into /tmp.

  • Fixes disk-fill experiment execution on Gravity Kubernetes cluster via dynamic container data path.

  • Fixes exceptions seen in chaos-operator due to lack of resource permissions for replicasets

  • Fixes “unable to update resource” / “operation cannot be fulfilled” transient errors on chaos-operator

  • Fixes broken BDD tests in chaos-runner, chaos-operator CI pipelines

  • Enforces hard stop of pod-delete chaos experiment at total_chaos_duration via chaos timestamp comparisons

  • Fixes algolia-based search functionality in litmus-docs

  • Fixes the analytics count round off issue for operator installation & experiment run count in the charthub
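
The hard stop at total_chaos_duration via timestamp comparisons, noted in the fixes above, can be sketched as a loop that re-checks the elapsed time against a start timestamp before every injection. This is an illustration under assumptions, not the experiment's actual code; `inject`, `interval`, `clock` and `sleep` are hypothetical parameters, with the last two injectable so the loop can be exercised without waiting:

```python
import time


def run_until_deadline(inject, total_chaos_duration, interval,
                       clock=time.monotonic, sleep=time.sleep):
    """Call `inject` every `interval` seconds, stopping at the deadline.

    Elapsed time is compared with the start timestamp before each
    iteration, and the final wait is shortened so the loop does not
    sleep past the total duration.
    """
    start = clock()
    iterations = 0
    while clock() - start < total_chaos_duration:
        inject()
        iterations += 1
        remaining = total_chaos_duration - (clock() - start)
        if remaining <= 0:
            break
        # Never sleep beyond the end of the chaos window.
        sleep(min(interval, remaining))
    return iterations
```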

Getting Started

Prerequisites to install

  • Make sure you have a healthy Kubernetes Cluster.
  • Kubernetes 1.12+ is installed

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.4.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs