Releases: litmuschaos/litmus

1.13.3

15 Apr 23:20
dc086b3

New Features & Enhancements

  • For updates on 2.0.0-Beta releases, refer to the notes for Litmus 2.0.0-Beta3.

  • Enhances the EC2 termination experiments to filter targets by tags (apart from IDs), along with support for list- and percentage-based selection of instances and for serial and parallel failure modes

  • Supports collection of chaos metrics for all ChaosEngine resources by default instead of selective monitoring controlled via spec attributes

  • Supports the definition of ‘context’ (metadata) for an experiment via a Kubernetes label on the ChaosEngine that translates to a metric label value on Prometheus. This can be used to group experiment results via context/reason or derive useful insights from metrics.
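
    For illustration, a minimal ChaosEngine sketch carrying such a label (the label key/value shown is hypothetical; the exact key consumed by the chaos exporter is described in the chaos metrics documentation):

      apiVersion: litmuschaos.io/v1alpha1
      kind: ChaosEngine
      metadata:
        name: nginx-chaos
        namespace: default
        labels:
          context: pre-upgrade-check     # hypothetical key/value; surfaces as a metric label value in Prometheus
      spec:
        engineState: active
        appinfo:
          appns: default
          applabel: app=nginx
          appkind: deployment
        chaosServiceAccount: pod-delete-sa
        experiments:
          - name: pod-delete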

  • Introduces a new chaos metric litmuschaos_experiment_verdict that provides an instance-specific run result (instead of cumulative result stats) and can be used alongside the litmuschaos_awaited_experiments metric to build improved chaos-interleaved dashboards.

  • Adds documentation around supported chaos metrics and their utility.

  • Allows users to specify the terminationGracePeriodSeconds for the chaos experiment and helper pods to allow abort routines to go through (useful in clusters with high API traffic or under group chaos execution on multiple apps at once)
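
    A minimal sketch of this tunable, assuming it is set at the ChaosEngine spec level (check the ChaosEngine schema documentation for the exact placement):

      apiVersion: litmuschaos.io/v1alpha1
      kind: ChaosEngine
      metadata:
        name: nginx-chaos
      spec:
        # allow the experiment & helper pods up to 100s to complete their abort/revert routines
        terminationGracePeriodSeconds: 100
        engineState: active
        chaosServiceAccount: pod-delete-sa
        experiments:
          - name: pod-delete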

  • Provides new environment variables (translating to stress-ng flags) for node resource chaos experiments to ensure the granular definition of the load/stress profile.
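
    As an illustration, such tunables are passed as ENV overrides per experiment in the ChaosEngine; the variable names below are illustrative only, so refer to the node-cpu-hog/node-memory-hog experiment docs for the exact ENVs introduced in this release:

      spec:
        experiments:
          - name: node-cpu-hog
            spec:
              components:
                env:
                  - name: NODE_CPU_CORE     # illustrative: number of cores to stress per node
                    value: "2"
                  - name: CPU_LOAD          # illustrative: % load per stressor, mapping to a stress-ng flag
                    value: "80"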

  • Adds abort routines for infra/node and autoscaler experiments and optimizes the same for pod experiments in which they are already defined.

  • Introduces a randomness factor in the pod-delete experiment to ensure that the delete operations occur at random intervals (the random periods being picked within a time range defined by lower-upper bounds).

  • Enhances the pumba chaoslib for stress experiments by providing an additional ENV var for defining the stress image (that is pulled at runtime on the target pod’s node to inject the stressor). This is useful for folks running experiments with images from their private registries.

  • Introduces a tech-preview of a DNS-chaos experiment (available in litmuschaos/go-runner:ci image) that can cause dns errors/failure in target containers

  • Updates the Chaos Github Actions used in the PR/commit-based e2e suite on the litmus-go repository.

  • Improves the e2e dashboard to represent the experiment e2e coverage in a clearer way.

  • Begins the migration of specific e2e pipelines from GitLab to GitHub Actions to aid the definition of multiple component/feature-based workflows from within a single branch

  • Adds a new utility (nsutil) to execute commands in the target container’s namespaces, with potential usage in multiple pod-level chaos experiments

Major Bug Fixes

  • Fixes repeated scheduling of experiment pods upon helper failure/ungraceful exits (error state) - the pods now enter the Completed state upon the first error.

  • Appends missing CRD validation schema for image pull policy for experiments

  • Upgrades all litmus artifacts containing CRD spec to use version v1 from v1beta1 to support newer Kubernetes platforms

  • Adds checks to validate the definition of app labels when annotation checks are set to false on the ChaosEngine (and fail fast with appropriate error).

  • Fixes the behavior where multiple “downstream” probes defined in the same phase (pre/post/on chaos) fail if the first probe evaluates to failure.

  • Fixes an issue that is seen when running chaos on multiple application replicas/targets at once, where chaos injection against the last replica/target alone is considered for the success of the experiment.

  • Adds retries to factor in the pending status of helper pods in populated/dense clusters where it takes time for the pod to be scheduled.

  • Adds logs to the Kafka liveness/load pod launched during the Kafka broker failure experiments to verify successful service discovery & topic creation success/failure.

Major Known Issues & Limitations

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail - in spite of chaos being injected successfully - due to the unavailability of certain default utilities in the target image that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Workaround:

Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
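
For illustration, a hedged sketch of the first workaround, passing the kill command as an ENV in the ChaosEngine (the command shown is only an example for an md5sum-based CPU stressor and must be adapted to the utilities actually present in the target image):

    spec:
      experiments:
        - name: pod-cpu-hog
          spec:
            components:
              env:
                - name: CHAOS_KILL_COMMAND
                  # example only; must match the stress process and the tools available in the target image
                  value: >-
                    kill -9 $(ps afx | grep "[md5sum] /dev/zero" | awk '{print $1}' | tr '\n' ' ')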

Fix:

This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.3.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.13.2

15 Mar 18:58
dc086b3

New Features & Enhancements

  • Litmus Portal progress will be tracked in the fortnightly 2.0.0-beta releases henceforth as we build towards Litmus 2.0 GA. Check out the release notes for Litmus 2.0.0-Beta0 & Litmus 2.0.0-Beta1

  • Enhances the comparator logic for cmdProbe & promProbe to include more operations (OneOf, Range, etc.) against a list of int, float & string values.
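
    A sketch of a cmdProbe using the comparator, assuming the 1.x probe schema shown below (see the probe documentation for the exact set of supported criteria, including the list-based ones such as OneOf):

      spec:
        experiments:
          - name: pod-delete
            spec:
              probe:
                - name: check-error-count
                  type: cmdProbe
                  mode: Edge
                  cmdProbe/inputs:
                    command: "<command returning a numeric value>"   # hypothetical command
                    source: inline
                    comparator:
                      type: int
                      criteria: "<="       # list-based criteria such as OneOf are also accepted per this release
                      value: "10"
                  runProperties:
                    probeTimeout: 5
                    interval: 2
                    retry: 1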

  • Streamlines the application health status check process by considering only the pods of annotated parent workloads (when annotationCheck is true). Also provides support to verify the readiness for a specific container (specified by ChaosExperiment/ChaosEngine env var TARGET_CONTAINER) within a pod.

  • Simplifies the disk-fill generic chaos experiment to work on applications that don’t have ephemeral storage limits defined explicitly, with env var EPHEMERAL_STORAGE_MEBIBYTES.

  • Improves the chaoslib & health-checks used in the AWS ec2-terminate experiment to work with managed node groups (ex: KOPS, EKS)

  • Extends support for termination signal type SIGKILL (as env var) in the container-kill experiment for Containerd, CRI-O runtimes.

  • Removes the need to change permissions of the container runtime socket-files on the node (mounted into the experiment helpers and permissions updated via init-containers) when using Litmus LIB for container-kill & network-chaos experiments

  • Extends OnChaos mode of probe operation to all native Litmus experiments

  • Improves the K8sProbe schema to be more meaningful, by replacing the “command” field with GVR (group-version-resource) fields in the probe inputs. Also adds K8s CRUD operation as a valid probe input.
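
    A sketch of the revised k8sProbe inputs, assuming the field names below (verify against the probe schema documentation):

      spec:
        experiments:
          - name: pod-delete
            spec:
              probe:
                - name: check-app-pod-present
                  type: k8sProbe
                  mode: SOT
                  k8sProbe/inputs:
                    group: ""               # core API group
                    version: v1
                    resource: pods
                    namespace: default
                    labelSelector: app=nginx
                    operation: present      # CRUD-style operations (create/delete/present/absent)
                  runProperties:
                    probeTimeout: 5
                    interval: 2
                    retry: 1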

  • Speeds up the abort routine (impact observed at scale) of network chaos experiments by shifting the chaosresult and event generation (kube-api calls) steps from helper pods to experiment pod.

  • Provides support for defining ResponseTimeouts for API calls in the httpProbe.

  • Adds flexibility in the definition of applabel in ChaosEngine (recommended kubectl patterns such as =, ==, !=, in, notin, exists) with annotationCheck enabled/disabled.

  • Introduces a new chaos category for Azure Cloud with the Azure VM instance kill experiment (available in tech preview mode with image: litmuschaos/go-runner:azure)

  • Adds additional labels (ChaosEngine name) in the Chaos Exporter metrics for improved tracking purposes in monitoring systems.

  • Adds new e2e tests (for validation of chaos execution in serial/parallel mode when pod_affected_percentage > 0) for PRs on chaos-operator

  • Increases unit-test coverage (+35%) on the chaos-runner component

  • Includes the litmus-portal pipeline coverage details & execution results to the litmus-e2e dashboard.

Major Bug Fixes

  • Removes the force flag (terminationGracePeriodSeconds: 0) from the abort function to allow the experiment & helper pods to successfully execute the chaos revert/rollback and notification (via event) routines. The intent is for the chaos rollback to occur in a guaranteed manner rather than forcing immediate removal of the chaos pods/resources.

  • Ensures guaranteed rollback/revert of chaos process (tc rule) on target pods upon abort when the chaos injection occurs in parallel at scale (100-150 replicas). This is enabled via a change in the ordering of tasks in the abort routine and preventing further execution in the inject routine once SIGTERM is received.

  • Skips the AUT (Application-Under-Test) status checks in the chaos namespace when the .spec.appInfo is not specified within the ChaosEngine (with portal driven execution, the chaos namespace may contain completed experiment/argo pods that are not in “Running” status).

  • Handles invalid DESTINATION_HOSTS better in the network chaos experiments. The inability to resolve specified hostnames to valid IPs is logged, with only valid hosts injected with chaos instead of applying total egress chaos on the interface.

  • Fixes the OpenAPI schema validation error for httpProbe that resulted in a failed evaluation of the probe.

  • Fixes the result notification error for failure cases in the Kafka Broker Pod Failure experiment and makes the partition leader identification process more resilient.

  • Includes the missing step to propagate the ImagePullPolicy of experiment images to the helper pod (via ENV vars)

  • Includes the missing step to propagate the imagePullSecrets to the helper pods (via spec attribute)

Major Known Issues & Limitations

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail - in spite of chaos being injected successfully - due to the unavailability of certain default utilities in the target image that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Workaround:

Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Fix:

This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.2.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.13.0

15 Feb 19:23
dc086b3

New Features & Enhancements

  • Moves the Litmus Portal to beta-2 phase with the following improvements:

    • Ability to disable workflow schedules
    • Support for configuration of private Git repositories as a source for experiments & predefined workflows (private MyHub)
    • Allows the full set of CRUD operations on the embedded ChaosHub/MyHub
    • Improves the chaos visualization via horizontal/vertical workflow views and proper formatting of logs for the workflow nodes.
  • Enhances the ChaosExperiment CRD to take HostPath Volume Type input.

  • Removes the limitation that only a single workload (amongst those sharing the labels) can be annotated for chaos.

  • Enhances the httpProbe to perform POST operations with payload described in the ChaosEngine or via a file mounted as a configmap.
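
    A sketch of an httpProbe performing a POST, assuming the method/post fields shown below (the endpoint and body are hypothetical; the payload can alternatively be supplied via a configmap-mounted file):

      spec:
        experiments:
          - name: pod-delete
            spec:
              probe:
                - name: check-order-create
                  type: httpProbe
                  mode: Continuous
                  httpProbe/inputs:
                    url: http://orders.default.svc:8080/api/create   # hypothetical service endpoint
                    method:
                      post:
                        contentType: "application/json"
                        body: '{"item": "sample"}'                   # hypothetical payload
                        criteria: "=="
                        responseCode: "200"
                  runProperties:
                    probeTimeout: 5
                    interval: 2
                    retry: 1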

  • Simplifies node resource chaos experiments to accept resources in units (mebibytes) along with relative percentage inputs.

  • Makes the termination mode configurable for the container-kill experiment (defaults to SIGKILL)

  • Adds more details to experiment logs around annotated workloads & filtered pod targets

  • Improves the disk-fill chaos experiment to use the helper pod approach for injection instead of running a dummy pod with a sleep command into which multiple exec operations occur.

  • Adds additional unit tests in the chaos-operator & chaos-runner repos.

  • Improves e2e tests (PRs/Commits) (pod chaos with combinations of pods_affected_perc & sequence env, annotation on multiple workloads etc.,) in the litmus-go repo

  • Updates the litmus-sdk based on recent changes to experiment templates

Major Bug Fixes

  • Ensures that different helper pods within an experiment instance are labeled with unique values (for fixed keys) in order to query them for status. Without this, these helper pods were being filtered by common labels resulting in incorrect validation. This is more so when multiple instances of the same experiment are executed in parallel.

  • Reflects the correct verdict of the experiment upon failure and abort, along with improved events in the Kafka & Cassandra chaos experiments.

  • Ensures smooth re-run of network chaos on a target with residual tc rule from the previous instance of chaos injection (RTNETLINK answers: File exists)

  • Fixes the console spamming log messages on chaos-exporter which were seen until the ChaosResult/Engine resources were created.

Major Known Issues & Limitations

Issue:

Forced removal of the experiment helper pods (where applicable: notably network chaos experiments), either manually or due to Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or not occur at all. This will cause the application under test (AUT) to continue being subjected to chaos unless manually recovered.

Workaround:

The experiment pod logs indicate whether the helper operations have failed. In that case, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (applicable only to applications deployed via a higher-level controller such as a deployment/statefulset/daemonset).

Fix:

This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail - in spite of chaos being injected successfully - due to the unavailability of certain default utilities in the target image that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Workaround:

Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Fix:

This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.12.0

15 Jan 19:05
390d134

New Features & Enhancements

  • Moves the Litmus Portal to beta-1 phase with the following improvements:

    • Supports edit of (cron) schedule in chaos workflows
    • Ability to suspend/disable schedules
    • Improved chaos workflow diagrams with appropriate log representation for different stages/steps
    • Increased (K8s) validation in the workflow construction wizard
    • Adds the infra changes necessary to support private repositories for MyHub (UI support to come in 1.13.0)
  • Introduces a revamped chaos-exporter that removes the current dependency on the heptio event-router for the experiment execution state, which was being used to build chaos-interleaved application dashboards. The chaos exporter now pushes an increased set of metrics on chaos start/end times, status, success percentage per run, experiment specific cumulative pass/fail counts, etc., and has options to operate in both cluster-wide as well as the namespaced modes.

  • Enhances the httpProbe with options to skip certificate checks via the insecureSkipVerify flag in the ChaosEngine schema

  • Enhances the pod-autoscaler experiment with the ability to scale multiple applications (type: deployments, statefulsets) based on an APP_AFFECTED_PERC environment variable, with the apps being filtered via label selectors. Also adds support for OnChaos probes for the experiment.

  • Supports random selection of EC2 instances/Kubernetes nodes for the ec2-terminate experiment in cases where the target instance is not explicitly specified.

  • Improves error handling logic in the node-drain experiment and also adds a timeout (equal to the chaos duration period) flag to the drain operation to prevent indefinite execution (ex: to honor pod disruption budgets, stuck evictions)

  • Extends the ImagePullPolicy configuration to external probe pods (in cases where the cmdProbe is configured to run on “source” images other than the litmus go-runner).

  • Homogenizes the experiment pod logs for target pod information prior to chaos injection

  • Promotes the non-root go-runner from tech-preview to a release image. Accompanied by changes to experiments where applicable (commands, paths & file permissions)

  • Introduces a tech-preview of enhanced chaos rollback/revert logic (used initially for network chaos experiments executed in “serial” sequence) to achieve guaranteed chaos rollback/revert under failure conditions (helper pod eviction, unexpected chaos process termination, deletion/removal, etc.) (litmuschaos/go-runner:1.12.0-revert)

  • Enhances the ChaosResult schema to hold cumulative success/failure count information of the different run instances for a given experiment.

  • Introduces a new scaffolded chaoslib template in the litmus SDK that allows injection and revert of chaos via the CHAOS_INJECT_COMMAND & CHAOS_KILL_COMMAND environment variables, thereby giving users flexibility in creating preview experiments.
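
    For illustration, a ChaosEngine snippet for such a scaffolded experiment (the experiment name and the commands are hypothetical):

      spec:
        experiments:
          - name: sample-preview-chaos          # hypothetical experiment generated via the SDK scaffold
            spec:
              components:
                env:
                  - name: CHAOS_INJECT_COMMAND
                    value: "md5sum /dev/zero"   # illustrative stressor command
                  - name: CHAOS_KILL_COMMAND
                    value: "pkill md5sum"       # illustrative revert command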

  • Releases the v0.3.1 of the chaos-ci-lib with fixes and enhancements to the chaos BDD library, and updates the e2e suites to use it.

  • Migrates to GitHub Actions (with parallel workflows for lint, security scan, e2e & build/push operations) from TravisCI (where applicable) in light of reduced support for OSS projects on the latter.

  • Enhances the litmus-e2e suite with new tests for verification of annotation-enabled & disabled chaos execution, ec2-terminate experiment & pumba-based chaoslib functionality. Adds the feature coverage tracker with an initial set of testcases for litmus-portal e2e pipelines

  • Enhances the litmus-helm chart testing workflows as per the latest K8s/Helm standards

  • Improves the node-restart experiment documentation and adds node-poweroff experiment documentation, with steps to obtain the SSH keys & set up the secrets for execution.

  • Simplifies the experiment pages UX on the ChaosHub with explanation/steps to use the chaos artifacts

Major Bug Fixes

  • Fixes spurious events received on ChaosEngines installed with engineState set to stop (for deferred execution purposes). Also ensures that the ChaosInitialization is recorded once finalizers have been applied on the CR

  • Prevents a false positive with probe execution (in cases where probes were defined without the RunProperties specification) by mandating the latter using CRD validation.

  • Fixes failed/timed-out helper pod checks in the node-restart and node-poweroff experiments with an enhanced status-check logic that looks for a set of desired pod states (such as Succeeded, Running, etc.) instead of just “Running”

  • Fixes the failure to kill target docker containers using the “litmus” LIB due to the missing “host” flag pointing to the correct daemon socket path

  • Fixes a regression on the pod-cpu-hog experiment that caused only a single md5sum process to be launched on the target pods irrespective of the CPU_CORES (number of cores) input to the experiment.

  • Fixes a regression (panic) on the chaos-runner caused upon secret volumes definition in the ChaosExperiment/ChaosEngine

  • Synchronizes event messages (from the experiment pod as well as chaos-runner pod sources) with the latest experiment status/verdict in case of repeated execution (caused by frequent abort/restart operations) instead of holding stale info.

  • Replaces hardcoded socket paths in experiment helper configurations with values derived from the SOCKET_PATH environment variable

  • Fixes failed application status checks on infra-chaos experiments where the .spec.appinfo.applabel is not specified/skipped. In this case, the health of all pods in the chaos namespace is verified.

  • Fixes the documentation with the correct kubectl command to patch the ChaosEngine for abort/restart.

Major Known Issues & Limitations

Issue:

Forced removal of the experiment helper pods (where applicable: notably network chaos experiments), either manually or due to Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or not occur at all. This will cause the application under test (AUT) to continue being subjected to chaos unless manually recovered.

Workaround:

The experiment pod logs indicate whether the helper operations have failed. In that case, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (applicable only to applications deployed via a higher-level controller such as a deployment/statefulset/daemonset).

Fix:

This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail - in spite of chaos being injected successfully - due to the unavailability of certain default utilities in the target image that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Workaround:

Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Fix:

This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.12.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.11.0

15 Dec 19:52
d2fdadb

New Features & Enhancements

  • Moves the Litmus Portal to beta-0 phase with first-cut API documentation, view-only users, install/operation support in air-gapped environments, non-root/non-privileged containers etc.,

  • Introduces the Prometheus Probe to facilitate metrics based SLO validation during experiment runs

  • Enhances the litmus probes by adding regex support for output comparison, OpenAPI v3 based CRD validation for probe schema, error handling & probe logging improvements

  • Adds the node-restart & node power-off experiments for Kubevirt based Linux VMs

  • Supports adding ENV variable values from ConfigMaps and Secrets in the ChaosEngine. This is especially useful in the case of platform-specific (Ex: AWS) chaos experiments.
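
    A sketch of the pattern, assuming the experiment ENV entries in the ChaosEngine follow the standard Kubernetes EnvVar schema (the ConfigMap/Secret names and keys below are hypothetical):

      spec:
        experiments:
          - name: ec2-terminate
            spec:
              components:
                env:
                  - name: EC2_INSTANCE_ID           # illustrative: value sourced from a ConfigMap
                    valueFrom:
                      configMapKeyRef:
                        name: chaos-targets         # hypothetical ConfigMap
                        key: instance-id
                  - name: AWS_SECRET_ACCESS_KEY     # illustrative: sensitive value sourced from a Secret
                    valueFrom:
                      secretKeyRef:
                        name: cloud-secret          # hypothetical Secret
                        key: secret-key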

  • Allows chaos annotations for more than one application workload that shares the same labels (controls chaos for a set of apps)

  • Supports the definition of resource requests/limits for chaos-runner & helper pods
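
    For illustration, a sketch of these tunables under the ChaosEngine components (the exact placement for the experiment/helper pod resources is per the schema documented for this release):

      spec:
        components:
          runner:
            resources:
              requests:
                cpu: 50m
                memory: 64Mi
              limits:
                cpu: 100m
                memory: 128Mi
        experiments:
          - name: pod-delete
            spec:
              components:
                resources:            # assumed to apply to the experiment/helper pods
                  requests:
                    cpu: 50m
                    memory: 64Mi
                  limits:
                    cpu: 100m
                    memory: 128Mi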

  • Extends the native litmus chaoslib for network chaos on docker runtime while continuing to support pumba lib. This is expected to help users that do not want additional images (defined by the TC_IMAGE env in the network chaos experiments) pulled during the course of the experiment.

  • Cleans up the failed/orphaned helper pods based on the jobCleanupPolicy specified in ChaosEngine.

  • Propagates the ImagePullPolicy of the experiment resource to the helper pods

  • Refactors the chaos-runner to avoid unnecessary experiment dependency checks (for configmaps, secrets) where applicable and alters the flow to fail faster in case of issues such as missing experiment CRs, etc.

  • Removes the dependency on (availability of) crictl.yaml on the Kubernetes nodes for the execution of experiments on the containerd/crio runtime (especially useful for K3s, MicroK8s platforms)

  • Reduces the image sizes for the chaos-operator & chaos-runner pods while significantly reducing vulnerabilities with a new base image

  • Adds non-root experiment (go-runner) images in the tech-preview stage for beta testing.

  • Introduces a recommended PodSecurityPolicy configuration for LitmusChaos experiments for use in restricted environments

  • Improves the experiment bootstrap experience with a simple scaffold CLI/SDK

  • Simplifies the ChaosEngine sample specs on the ChaosHub by removing redundant attributes, renaming the ENVs referring to remote services/hosts in the network chaos experiments, synchronizing runtime & socket-path vars, etc.,

  • Adds integration tests as a PR check (triggered on each commit unless skipped via tag) on a containerd based cluster (KIND) for the chaos-operator, chaos-runner, litmus-go & litmus-helm repos

  • Improves the litmus-e2e with dedicated pipelines on AWS cloud for pod level, infra (node) level experiment tests & control plane functionality tests with schedules setup for nightly builds on the ci tag. This aids in faster and easier on-demand execution.

  • Adds a first-cut visualization of the e2e metrics based on a coverage tracker

  • Includes a helm chart (with an entry/release item on the helmfile) for the litmus-portal

  • Provides an option to execute the Litmus Demo from a container and adds EKS as a test platform.

Major Bug Fixes

  • Fixes issues in the chaos-runner & experiment logic which led to failed event generation when the experiment is restarted post an abort operation

  • Adds the pods/exec resource to the experiment RBAC to support the source mode of operation of cmdProbe wherein the probe command is executed from within a dedicated pod whose source image has been specified. Without this change, probe execution is unsuccessful.

  • Fixes the behavior where application pods configured with liveness probes enter the CrashLoopBackOff state post network chaos injection in the case of the containerd runtime. This was caused by an unsuccessful revert of chaos owing to a change in the container PID that litmus used to inject the netem rules. The fix involves injecting the rules on the corresponding sandbox container instead of the app container itself, thereby facilitating a successful chaos revert. With this, the app pods are expected to recover without manual intervention, depending upon the existing backoff delay/number of restarts during the desired chaos duration.

  • Fixes the developer flow with Okteto-based dev containers/environments: executing the experiment code from within the litmus-experiment test deployment was seen to fail due to failed probe initialization (even though the chaosengine is not defined at all at this stage). This has been fixed to ensure that probe initialization occurs only if the experiment is triggered by the chaosengine & probes are defined.

  • Removes “auxiliaryAppInfo” as an attribute in non-infra experiments (w/o cluster-wide rolebinding). Providing this attribute in pod-level experiments caused failed entry/exit application status checks due to lack of permissions.

  • Cleans up the permissions on the chaos operator cluster role to avoid listing of unrelated resources under API groups

  • Fixes version comparison on the ChaosHub server to reflect the latest chaos-charts release on the website

  • Fixes the chaos-exporter deployment crash upon startup with appropriate entrypoint script

  • Propagates the docker socket file path to the pumba helper pod for network chaos experiments instead of the hardcoded /var/run/docker.sock

Major Known Issues & Limitations

Issue

Forced removal of the experiment helper pods (where applicable: notably network chaos experiments), either manually or due to Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or not occur at all. This will cause the application under test (AUT) to continue being subjected to chaos unless manually recovered.

  • Workaround

    The experiment pod logs indicate whether the helper operations have failed. In that case, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (applicable only to applications deployed via a higher-level controller such as a deployment/statefulset/daemonset).

  • Fix

    This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.

Issue

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail - in spite of chaos being injected successfully - due to the unavailability of certain default utilities in the target image that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

  • Workaround

    • Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND
    • Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
  • Fix:

    • This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.11.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.10.0

15 Nov 20:44
4cd5ed9

New Features & Enhancements

  • Introduces the alpha-2 version of Litmus Portal with:

    • Ability to configure custom chaos charts (experiment custom resources) source, a.k.a., “MyHub” to a project
    • Support for full CRUD operations on chaos (argo) workflows
    • Support for graceful removal of connected cluster targets
    • Optimizes the workflow for self-cluster connect, i.e., ability to add the cluster hosting the portal itself as a target.
    • Enhanced event handling for chaos workflows
    • Improves resiliency of the portal front-end
  • Adds support for resource filtering and chaos injection on pods managed by Argo Rollout resources, facilitating validation of blue-green & canary deployments

  • Promotes multiarch (amd64, arm64) docker images for all major litmus infra components: chaos-operator, chaos-runner, go-runner, chaos-exporter

  • Introduces a newer probe mode “OnChaos” for verification of steady-state only during the chaos injection period. This is specifically useful for “negative-test” scenarios where the result of steady-state checks are dependent/tied to the unavailability of certain services.

  • Extends the scope of the cmdProbe by supporting complex criteria against different output types: integer/float (equal to, less than/less than equal to, greater than/greater than equal to) and strings (substring, string match)

  • Paves the way for increased application filtering and resource-specific status checks via propagation of the application kind to the experiment job.

  • Supports definition of taint tolerations in the chaos-runner & experiment pods via ChaosEngine to enable scheduling of chaos resources on nodes specifically tainted for this purpose.

  • Supports the specification of NodeSelector in chaos-runner pods via ChaosEngine for guaranteed-schedule on dedicated nodes.
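
    A sketch combining the two scheduling tunables above for the chaos-runner (the node name and taint are hypothetical):

      spec:
        components:
          runner:
            nodeSelector:
              kubernetes.io/hostname: chaos-node-1    # hypothetical dedicated node
            tolerations:
              - key: "chaos"                          # hypothetical taint reserved for chaos resources
                operator: "Equal"
                value: "true"
                effect: "NoSchedule"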

  • Includes experiments to induce chaos on platform resources (AWS) as part of the kube-aws experiment suite:

    • Terminates EC2 instances (cluster nodes) using a native litmus chaoslib that leverages the AWS Go SDK
    • Induces disk loss via detachment of EBS volumes/disks attached to the specified instance

  • Introduces an SSH-based node restart experiment to the generic experiment suite (tech preview)

  • Lists use-cases for testing resiliency of Kubernetes system and add-on components (kube-proxy, kiam, calico, etc.,) based on pod-delete chaos under the kube-components suite

  • Provides an option to specify blast-radius (NODES_AFFECTED_PERCENTAGE) for node-level resource chaos experiments

  • Allows specification of a comma-separated list of target pods or nodes in cases where a known set of objects need to be targeted.
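
    For example (the pod names are hypothetical; the analogous ENV for node-level experiments is described in the respective experiment docs):

      spec:
        experiments:
          - name: pod-delete
            spec:
              components:
                env:
                  - name: TARGET_PODS
                    value: "nginx-7db9f-abcde,nginx-7db9f-fghij"   # hypothetical pod names to target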

  • Adds specification of an optional VOLUME_MOUNT_PATH env variable to the pod-level IO stress experiment, thereby allowing capacity/stress chaos against both ephemeral and persistent storage volumes.

  • Enhances the pod-autoscaler experiment to:

    • Act on statefulsets, apart from deployments.
    • Aborting the experiment results in an immediate rollback to the initial replica count
    • Adds chaos-duration as the upper-limit for pod scale
  • Enhances the default pre-chaos criteria on the respective infra-level experiments to check infra components health (nodes, disk) apart from just the applications under test / auxiliary applications

  • Homogenizes the environment variable naming patterns across experiments for pod and node details and improves probe logs to be more descriptive of the status and errors.

  • Adds more validation capability to the admission controller (presence of application namespace) along with increasing unit-test coverage

  • Improves the experiment e2e suite with tests for all the newly included enhancements, adding validation (chaos-execution checks) for network & resource chaos experiments

  • Provides a new helm chart for Litmus Portal with the ability to control mode of portal operation (namespaced v/s cluster scope) amongst other tunables

  • Enhances the litmus documentation with steps for helm based install, references to learning resources (tutorials, arch slides), docs for the newly added experiments & improved contributing guide.

  • Dockerizes the litmus-demo script to ease demo steps

  • The period of this release also saw the SIG-Orchestration being operationalized. Refer to the meeting notes here

Major Bug Fixes

  • Prevents attempts to generate call-home metrics when the ANALYTICS environment variable is set to false on the chaos operator deployment. Multiple failed attempts to send the g.analytics events in air-gapped environments were seen to result in additional time taken to launch the experiment jobs (nearly 10-12s)

  • Reduces the time taken between successive events on the chaos-runner and also fixes the behavior of missed events

  • Optimizes the time taken to gauge successful experiment pod schedule and completion via reduced polling intervals

  • Fixes the behavior where the chaos events are overridden when more than one experiment is listed in the ChaosEngine

  • Fixes issues with the CI scripts in the chaos-charts repo that led to repetition/duplication of experiments in the suite/category-wise concatenated experiments.yaml

  • Fixes incorrect schema in probe examples in the documentation

Major Known Issues & Limitations

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail - in spite of chaos being injected successfully - due to the unavailability of certain default utilities in the target image that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Workaround:

Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND. Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Issue:

Experiments requiring a mount of the runtime socket file may fail on MicroK8s or K3s environments with the error Failed to load config file: read /etc/crictl.yaml: is a directory.

Workaround/Fix:

This is being investigated

Issue

The pod-cpu-hog experiment using the pumba chaoslib can end ungracefully (after successfully injecting chaos for the specified duration) with the error \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed, at random, on some platforms such as EKS. In this case, the experiment verdict can show up as Fail due to the chaoslib pod entering a failed state, despite the chaos being injected.

Workaround/Fix:

This is being investigated

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.10.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.10.0-RC2

15 Nov 19:53
4cd5ed9
Pre-release
Fixing circleci config (#2355) (#2356)

Signed-off-by: Raj Babu Das <[email protected]>

1.10.0-RC1

14 Nov 06:44
5f3cf27
Pre-release
Merge pull request #2349 from rajdas98/cherry-pick-1.10.x-v1

Cherry picking from master to 1.10.x

1.9.0

15 Oct 11:24
de14444

New Features & Enhancements

  • Introduces the alpha-1 version of the Litmus Portal. Adds support for scheduled workflows, chaos workflows on external agents, namespaced mode of operation, workflow analytics comparison. Also includes additional pre-defined workflows, and enhanced UX around user management.

  • Enhances the K8s probe to support full CRUD operations against native/custom resources. This is especially useful during chaos on “control-plane” components where provisioning/de-provisioning abilities can be tested. Also adds more filters to the K8s probe (labelSelectors)

  • Supports ordered execution of probes with the ability to reuse probe (result) artifacts in “downstream” probes, thereby enabling the creation of complex exit checks in standard experiments. The probe artifacts are referenced via standard templates in the ChaosEngine schema.

  • Supports configmaps & secrets definition for the chaos-runner pod. One emerging use case that makes use of this feature is to achieve cross-cluster chaos, wherein the chaos-runner executes the experiment on a different cluster to the one where the chaos operator/runner (litmus control plane) resides.

  • Allows resource request/limits specification for chaos resources (chaos-runner, experiment pods) in the ChaosEngines. Aids operations in multi-tenant environments where the experiments are being executed simultaneously across several namespaces, leading to a large set of chaos pods.

  • Adds support for ImagePullSecrets for chaos resources in the ChaosEngine to enable operations in cases where private image registries are used.

  • Provides golang chaoslib for Kafka chaos with enhancements to dynamically retrieve "current" partition leaders for each iteration of the broker kill.

  • Supports network chaos between desired microservices (specified via service IP or hostname filters) on containerd & CRIO runtime

  • Introduces different modes of chaos execution - serial and parallel - defined via a SEQUENCE env var for cases where the experiment blast radius is higher. This allows chaos to be executed sequentially or in parallel on the replicas of the application under test (AUT)
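
    For illustration, the sequence tunable passed as an experiment ENV in the ChaosEngine (the companion percentage tunable shown is illustrative):

      spec:
        experiments:
          - name: pod-delete
            spec:
              components:
                env:
                  - name: SEQUENCE
                    value: "parallel"            # acts on all selected replicas at once; "serial" targets them one by one
                  - name: PODS_AFFECTED_PERC     # illustrative companion tunable: % of AUT replicas to target
                    value: "50"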

  • Supports abort operation for all node & pod-level chaos experiments (except kubelet/docker service kill), including those running chaos processes in the target container’s network/process namespace. Also handles probe status for abort scenarios.

  • Minimizes the permissions/scope of the clusterroles used in the chaos operator and admin-mode serviceaccount to better comply with standard security constraints.

  • Optimizes the code structure in the litmus-go repo to ensure a single experiment binary is built (which takes individual experiment names as args) instead of building binaries for each experiment, resulting in an experiment image with a much-reduced size footprint.

  • Releases a set of multi-arch (arm64, amd64) images with the tag multiarch-1.9.0 for technical preview & feedback (built via docker buildx). These will eventually be assimilated into the standard release images.

  • Improves build process via docker security checks, linting & formatting checks in missing components/repos.

  • Adds the recommended Kubernetes labels for all chaos resources to enable group-management by external tools.

  • Propagates the labels & identifiers of the chaos experiment pod (defined in the ChaosExperiment CRs) to the ChaosResults to allow segregation/management.

  • Improves error handling & logging (structured logs with logrus) in the chaos-runner & experiments.

  • Improves the scaffolding tool to bootstrap experiment artifacts with the latest schema enhancements (probe support, abort support, etc.,)

  • Improves the (validation webhook) admission-controller to verify availability of configmap & secret resources specified for a chaos experiment.

  • Introduces a helmfile for Litmus to package the infra (operator, CRDs) & the experiment helm charts as part of a single (litmus stack) installation.

  • Introduces on-demand e2e test (triggered via /run-e2e commands) for Pull Requests on litmus-go repository via github actions using KIND clusters

  • Improves the e2e coverage for chaos experiments (pod-io-stress, node-io-stress, pod-autoscaler, abort support, target specification) via new tests in the pipeline based on the new additions/enhancements. The existing tests are improved with increased validation to test the success of the chaos injection procedures.

  • Adds a new GitLab pipeline with an initial set of e2e tests for Litmus Portal functions

  • Enhances the litmus-demo scripts to set up the EKS environment & execute the generic chaos suite (KIND & GKE are the other supported platforms)

  • Introduces documentation standards (and a consequent update/refactor) around naming conventions for resource names, attribute names, and contribution guidelines as part of the SIG-Documentation deliberations.

  • Adds new content to litmus-docs - chaos monitoring, chaos CR schema explanations, probe enhancements, troubleshooting faq additions, etc.,

Major Bug Fixes

  • Fixes the bug wherein applications configured with liveness probes are stuck in CrashLoopBackOff state upon being subjected to network chaos (docker runtime) with revert chaos being unsuccessful. The network chaoslib now uses the container ID of the Kubernetes pause container associated with the target pod to inject the tc rules in the network namespace instead of the target app containers themselves (as they are prone to restart via liveness probes).

  • Fixes the Failed to connect to bus: No data available error on kubelet-service-kill chaoslib pod

  • Fixes the regex patterns used in the CRD validation schema to support non-specification of .spec.appinfo in the ChaosEngine (either in case of node-level/infra experiments or for broader, randomized selection of pods in the pod-level experiments)

  • Adds logic to exclude the chaos-resource pods (operator, runner, experiment & helper pods) from the target list in cases where the .spec.appinfo is not specified.

  • Fixes the behavior where the chaos-runner runs forever without terminating the experiment, in cases where the experiment job is not successfully started (ImagePullBackOff, Pending etc.,). The chaos-runner is now configured to use StatusCheckTimeout defined in the ChaosEngine (defaults to 180s) to terminate the experiment.

  • Fixes the inability to inject network-chaos when the ChaosExperiment CR is created with a different name (other than the default names on the chaoshub). The logic to select the netem params based on the fixed experiment names has been altered with dedicated functions for each variant of network chaos (latency, loss, duplication, corruption).

  • Fixes improper entrypoint/command to the containerd/crio container-kill & node-io-stress chaoslib (helper) pods

  • Fixes inability to revert (downscale replicas) the pod-autoscaler chaos in cases where the application namespace and chaos namespace are different (as with admin mode execution).

Major Known Issues & Limitations

Issue:

  • The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) using the default lib can fail - in spite of chaos being injected successfully - due to the unavailability of certain default utilities in the target image that are used to detect the chaos processes and kill them/revert chaos at the end of the chaos duration.

Workaround:

  • Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND. Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Note: Expected to be fixed in a subsequent patch/minor release

Issue:

  • The pod-cpu-hog experiment using the pumba chaoslib can end ungracefully (after successfully injecting chaos for the specified duration) with the error \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed, at random, on some platforms such as EKS. In this case, the experiment verdict can show up as Fail due to the chaoslib pod entering a failed state, despite the chaos being injected.

Workaround:

  • This is being investigated

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.9.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.9.0-RC1

13 Oct 14:15
183ff31
Pre-release
adding create configmap permission in subscriber manifest and few ref…