How do I stop an Argo workflow when a pod encounters an error?

Answer and Explanation

Stopping an Argo workflow when a pod encounters an error keeps a failed run from consuming more resources and from triggering further failures downstream. Argo Workflows provides several mechanisms to handle pod errors and terminate workflows accordingly. Here's how you can achieve this:

1. Using `failFast` in a DAG Template:

- `failFast` is a field of a `dag` template, not of the workflow `spec`, and it defaults to `true`. When any task in the DAG fails, Argo stops scheduling new tasks, waits for tasks that are already running to finish, and then marks the DAG, and with it the workflow, as failed.

- Example in YAML:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: fail-fast-example-
    spec:
      entrypoint: main
      templates:
        - name: main
          dag:
            failFast: true  # The default, shown here explicitly
            tasks:
              - name: step1
                template: task1
              - name: wait
                template: pause
              - name: step2
                template: task2
                dependencies: [wait]
        - name: task1
          container:
            image: alpine:latest
            command: ["sh", "-c"]
            args: ["exit 1"]  # This will cause a failure
        - name: pause
          container:
            image: alpine:latest
            command: ["sh", "-c"]
            args: ["sleep 30"]
        - name: task2
          container:
            image: alpine:latest
            command: ["sh", "-c"]
            args: ["echo 'Task 2 executed'"]

- In this example, `step1` fails immediately. `wait` has already started and is allowed to finish, but `step2` is never scheduled, and the workflow ends as failed.
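For a `steps` template, a failed step already prevents later step groups from starting, so no extra flag is needed. The sketch below assumes default settings (in particular, no `continueOn` on the failing step); the workflow name is hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-stop-example-  # hypothetical name
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: step1          # first step group
            template: task1
        - - name: step2          # second step group; never starts because step1 failed
            template: task2
    - name: task1
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["exit 1"]         # non-zero exit code fails the step
    - name: task2
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["echo 'Task 2 executed'"]
```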

2. Using `onExit` Handlers:

- An `onExit` handler names a template that runs when a workflow or a step finishes, whether it succeeded or failed. It is suited to cleanup tasks or sending notifications. It does not stop the workflow by itself, but it lets you handle failures gracefully.

- Example in YAML:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: on-exit-example-
    spec:
      entrypoint: main
      templates:
        - name: main
          steps:
            - - name: step1
                template: task1
                onExit: cleanup
        - name: task1
          container:
            image: alpine:latest
            command: ["sh", "-c"]
            args: ["exit 1"]  # This will cause a failure
        - name: cleanup
          container:
            image: alpine:latest
            command: ["sh", "-c"]
            args: ["echo 'Cleanup task executed'"]

- In this case, `cleanup` runs as soon as `step1` finishes. Because `task1` exits with code 1, it runs after a failure here, but it would also run if `task1` had succeeded.
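An exit handler can also be attached to the whole workflow through `spec.onExit`, and a `when` condition on `{{workflow.status}}` can restrict an action to failed runs. The following is a minimal sketch of that pattern; the workflow and template names (`exit-handler`, `notify`) are hypothetical, and the `echo` stands in for a real notification.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-handler-example-  # hypothetical name
spec:
  entrypoint: main
  onExit: exit-handler          # runs once the workflow finishes, success or failure
  templates:
    - name: main
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["exit 1"]        # forces the workflow to fail
    - name: exit-handler
      steps:
        - - name: notify-failure
            template: notify
            # only run the notification when the workflow did not succeed
            when: "{{workflow.status}} != Succeeded"
    - name: notify
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["echo 'Workflow ended with status {{workflow.status}}'"]
```

Keeping the condition in the handler rather than in the main template means the notification logic stays in one place, however many steps the workflow has.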

3. Using Exit Codes in a `container` Template:

- Argo decides whether a container step failed from the exit code of its main container: 0 means success, and any non-zero code means failure. The `container` template has no `exitCode` field to change this; `exitCode` is only exposed as an output of a finished step (for example `{{steps.step1.exitCode}}`). To make only certain codes fail the step, translate them inside the command.

- Example in YAML:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: exit-code-example-
    spec:
      entrypoint: main
      templates:
        - name: main
          steps:
            - - name: step1
                template: task1
        - name: task1
          container:
            image: alpine:latest
            command: ["sh", "-c"]
            args:
              - |
                sh -c 'exit 2'  # Stand-in for the real command
                rc=$?
                if [ "$rc" -eq 2 ]; then exit 2; fi  # Only exit code 2 is considered a failure
                exit 0

- In this example, only an exit code of 2 is passed through as a failure, so the workflow stops if the command exits with code 2. Any other exit code is reported as success.
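Another way to react to a specific exit code is a `retryStrategy` with an `expression`, which is evaluated after each failed attempt and allows a retry only when it is true. The sketch below assumes Argo Workflows v3.2 or later and treats exit code 2 as fatal: the step is not retried and the workflow stops at once, while other non-zero codes are retried. The workflow name is hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-code-retry-example-  # hypothetical name
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: step1
            template: task1
    - name: task1
      retryStrategy:
        limit: "3"
        # retry only when the last attempt did NOT exit with code 2
        expression: "asInt(lastRetry.exitCode) != 2"
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["exit 2"]         # exit code 2: no retry, the step fails immediately
```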

4. Using `retryStrategy`:

- While not directly stopping the workflow on the first error, you can use `retryStrategy` to limit the number of retries. If a pod fails repeatedly, the workflow will eventually fail and stop.

- Example in YAML:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: retry-example-
    spec:
      entrypoint: main
      templates:
        - name: main
          steps:
            - - name: step1
                template: task1
        - name: task1
          container:
            image: alpine:latest
            command: ["sh", "-c"]
            args: ["exit 1"]
          retryStrategy:
            limit: 2  # Retry only twice

- In this example, if `task1` fails, it is retried up to two more times (three attempts in total). If every attempt fails, the step is marked as failed and the workflow stops.
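A `retryStrategy` can also state which kinds of problems are retried and how long to wait between attempts. The sketch below is illustrative, not a recommendation: the durations are arbitrary and the workflow name is hypothetical. `retryPolicy: "Always"` covers both failures (a non-zero exit code) and errors (for example, a pod that was deleted), whereas the default policy retries failures only.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-backoff-example-  # hypothetical name
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: step1
            template: task1
    - name: task1
      retryStrategy:
        limit: "2"               # at most two retries after the first attempt
        retryPolicy: "Always"    # retry on failures and on errors
        backoff:
          duration: "10s"        # wait before the first retry (illustrative value)
          factor: "2"            # double the wait for each further retry
          maxDuration: "1m"      # stop retrying after this much total time
      container:
        image: alpine:latest
        command: ["sh", "-c"]
        args: ["exit 1"]
```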

By combining these techniques, you can effectively manage errors in your Argo workflows and ensure that they stop when a pod encounters an error, preventing unnecessary resource consumption and simplifying debugging.
