Step Functions

AWS Step Functions provides serverless workflow orchestration for coordinating distributed applications. In the MBC CQRS Serverless framework, Step Functions are used for:

Long-running workflow orchestration
Saga pattern implementation for distributed transactions
Parallel batch processing with Distributed Map
Asynchronous task coordination with callback patterns

Architecture Overview

State Machines

The framework provides three pre-configured state machines:

Command State Machine

Handles data synchronization workflows with version control and parallel processing.

Key features:

Version checking: Ensures command ordering and prevents conflicts
Async callback: Waits for previous commands using task tokens
Parallel sync: Uses Map state to sync data across multiple targets
TTL management: Automatically sets expiration on records

Task State Machine

Executes parallel sub-tasks with controlled concurrency.

Key features:

Controlled concurrency: Limits parallel executions (default: 2)
Status tracking: Real-time task status updates
Error handling: Automatic failure detection and reporting

Import CSV State Machine

Processes large CSV files using AWS Distributed Map for massive parallelism.

Key features:

S3 native integration: Reads CSV directly from S3
Batch processing: Groups rows for efficient processing
High concurrency: Supports up to 50 concurrent batch processors
EXPRESS execution: Uses express workflows for child state machines
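
To make the batch-processing feature concrete, the following is a minimal sketch in plain TypeScript of how grouping rows reduces the number of handler invocations. The `chunk` helper and the batch size of 10 are illustrative assumptions, not framework APIs; in the deployed state machine the grouping is performed by the Distributed Map itself.

```typescript
// Hypothetical helper: split rows into fixed-size batches, similar to how
// grouped CSV rows are handed to one child execution at a time.
function chunk<T>(rows: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < rows.length; start += batchSize) {
    batches.push(rows.slice(start, start + batchSize));
  }
  return batches;
}

// 25 rows with an assumed batch size of 10 give three batches of 10, 10 and 5 rows.
const csvRows = Array.from({ length: 25 }, (_, i) => ({ line: i + 1 }));
const batchSizes = chunk(csvRows, 10).map((batch) => batch.length);
```

With batching, 25 rows need three handler invocations instead of 25, which is why larger batches reduce state transitions at the cost of more work per invocation.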

System Configuration Example

The following diagram shows how Step Functions integrate with other AWS services in a typical production environment:

Data Flow Example

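A typical command execution passes through the states in this order: check_version, check_version_result, set_ttl_command (after wait_prev_command when an earlier version is still pending), history_copy, transform_data, sync_data_all, and finish.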

CDK Implementation Examples

Complete Command State Machine

The following CDK code shows how to create a complete command handler state machine:

import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

export class CommandStateMachineConstruct extends Construct {
  public readonly stateMachine: sfn.StateMachine;

  constructor(scope: Construct, id: string, props: { lambdaFunction: lambda.IFunction }) {
    super(scope, id);

    const { lambdaFunction } = props;

    // Helper function to create Lambda invoke tasks
    const createLambdaTask = (
      stateName: string,
      integrationPattern: sfn.IntegrationPattern = sfn.IntegrationPattern.REQUEST_RESPONSE
    ) => {
      const payload: Record<string, any> = {
        'source': 'step-function',
        'context.$': '$$',
        'input.$': '$',
      };

      // Add task token for callback pattern
      if (integrationPattern === sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN) {
        payload['taskToken'] = sfn.JsonPath.taskToken;
      }

      return new tasks.LambdaInvoke(this, stateName, {
        lambdaFunction,
        payload: sfn.TaskInput.fromObject(payload),
        stateName,
        outputPath: '$.Payload[0][0]',
        integrationPattern,
        retryOnServiceExceptions: true,
      });
    };

    // Define states
    const fail = new sfn.Fail(this, 'fail', {
      stateName: 'fail',
      causePath: '$.cause',
      errorPath: '$.error',
    });

    const success = new sfn.Succeed(this, 'success', {
      stateName: 'success',
    });

    // Create task states
    const finish = createLambdaTask('finish').next(success);

    const syncData = createLambdaTask('sync_data');

    // Map state for parallel data sync
    const syncDataAll = new sfn.Map(this, 'sync_data_all', {
      stateName: 'sync_data_all',
      maxConcurrency: 0, // Unlimited concurrency
      itemsPath: sfn.JsonPath.stringAt('$'),
    })
      .itemProcessor(syncData)
      .next(finish);

    const transformData = createLambdaTask('transform_data').next(syncDataAll);
    const historyCopy = createLambdaTask('history_copy').next(transformData);
    const setTtlCommand = createLambdaTask('set_ttl_command').next(historyCopy);

    // Callback pattern for waiting on previous command
    const waitPrevCommand = createLambdaTask(
      'wait_prev_command',
      sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN
    ).next(setTtlCommand);

    // Choice state for version checking
    const checkVersionResult = new sfn.Choice(this, 'check_version_result', {
      stateName: 'check_version_result',
    })
      .when(sfn.Condition.numberEquals('$.result', 0), setTtlCommand)
      .when(sfn.Condition.numberEquals('$.result', 1), waitPrevCommand)
      .when(sfn.Condition.numberEquals('$.result', -1), fail)
      .otherwise(waitPrevCommand);

    const checkVersion = createLambdaTask('check_version').next(checkVersionResult);

    // Create log group
    const logGroup = new logs.LogGroup(this, 'StateMachineLogGroup', {
      logGroupName: '/aws/vendedlogs/states/command-handler-logs',
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      retention: logs.RetentionDays.SIX_MONTHS,
    });

    // Create state machine
    this.stateMachine = new sfn.StateMachine(this, 'CommandHandlerStateMachine', {
      stateMachineName: 'command-handler',
      comment: 'Handles command stream processing with version control',
      definitionBody: sfn.DefinitionBody.fromChainable(checkVersion),
      tracingEnabled: true,
      logs: {
        destination: logGroup,
        level: sfn.LogLevel.ALL,
      },
    });
  }
}
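
The Choice state above routes on a numeric `result`: 0 continues to set_ttl_command, 1 waits for the previous command, and -1 fails. The following sketch shows one way a check_version handler might derive that value. The comparison rule (a command may run only when its version directly follows the latest synchronized version) is an assumption made for illustration, and `checkVersion` and `nextState` are hypothetical names; the framework's actual rule may differ.

```typescript
// Result codes consumed by the check_version_result Choice state.
const PROCEED = 0; // command is next in line
const WAIT = 1; // an earlier version has not finished yet
const REJECT = -1; // version is stale or was already processed

type VersionCheck = typeof PROCEED | typeof WAIT | typeof REJECT;

// Assumed rule: a command may run only when its version directly follows
// the latest version that has already been synchronized.
function checkVersion(latestSyncedVersion: number, commandVersion: number): VersionCheck {
  const expected = latestSyncedVersion + 1;
  if (commandVersion === expected) return PROCEED;
  if (commandVersion > expected) return WAIT;
  return REJECT;
}

// Maps a result code to the next state name used in the CDK definition.
function nextState(result: VersionCheck): string {
  switch (result) {
    case PROCEED:
      return 'set_ttl_command';
    case REJECT:
      return 'fail';
    default:
      return 'wait_prev_command'; // mirrors the Choice state's otherwise() branch
  }
}
```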

Task State Machine with Controlled Concurrency

export class TaskStateMachineConstruct extends Construct {
  public readonly stateMachine: sfn.StateMachine;

  constructor(scope: Construct, id: string, props: { lambdaFunction: lambda.IFunction }) {
    super(scope, id);

    const { lambdaFunction } = props;

    // Iterator task for each item
    const iteratorTask = new tasks.LambdaInvoke(this, 'iterator', {
      lambdaFunction,
      payload: sfn.TaskInput.fromObject({
        'source': 'step-function',
        'context.$': '$$',
        'input.$': '$',
      }),
      stateName: 'iterator',
      outputPath: '$.Payload[0][0]',
    });

    // Map state with concurrency limit
    const mapState = new sfn.Map(this, 'TaskMapState', {
      stateName: 'map_state',
      maxConcurrency: 2, // Process 2 items at a time
      inputPath: '$',
      itemsPath: sfn.JsonPath.stringAt('$'),
    }).itemProcessor(iteratorTask);

    // Create log group
    const logGroup = new logs.LogGroup(this, 'TaskLogGroup', {
      logGroupName: '/aws/vendedlogs/states/task-handler-logs',
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      retention: logs.RetentionDays.SIX_MONTHS,
    });

    // Create state machine
    this.stateMachine = new sfn.StateMachine(this, 'TaskHandlerStateMachine', {
      stateMachineName: 'task-handler',
      comment: 'Handles parallel task execution with concurrency control',
      definitionBody: sfn.DefinitionBody.fromChainable(mapState),
      timeout: cdk.Duration.minutes(15),
      tracingEnabled: true,
      logs: {
        destination: logGroup,
        level: sfn.LogLevel.ALL,
      },
    });
  }
}
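
A `maxConcurrency` of 2 means that at most two iterations are in flight at any moment. The sketch below emulates that behaviour with plain promises so the effect can be observed outside Step Functions. `mapWithConcurrency` is a hypothetical helper, not part of the framework or the CDK, and it treats 0 as "no limit" in the same way the Map state does.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `maxConcurrency`
// calls in flight. As with the Map state, 0 is treated as "no limit".
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  maxConcurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  const lanes = maxConcurrency > 0 ? Math.min(maxConcurrency, items.length) : items.length;
  let nextIndex = 0;

  // Each lane claims the next unprocessed index until the list is exhausted.
  const runLane = async (): Promise<void> => {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index], index);
    }
  };

  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results;
}

// Usage: three items, two at a time, like the map_state definition above.
const doubled = mapWithConcurrency([1, 2, 3], 2, async (n) => n * 2);
```

Results keep the order of the input items even though iterations may finish out of order.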

Distributed Map for CSV Import

For processing large CSV files, use Distributed Map, which provides native S3 integration:

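import { Map as SfnMap, ProcessorMode, ProcessorConfig, IChainable } from 'aws-cdk-lib/aws-stepfunctions';

// DistributedMapItemReader and DistributedMapItemBatcher are assumed to be plain
// object types declared elsewhere that mirror the ItemReader and ItemBatcher fields.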

// Custom Distributed Map class for S3 CSV processing
export class DistributedMap extends SfnMap {
  public itemReader?: DistributedMapItemReader;
  public itemBatcher?: DistributedMapItemBatcher;
  public label?: string;

  public override toStateJson(): object {
    const mapStateJson = super.toStateJson();
    return {
      ...mapStateJson,
      ItemReader: this.itemReader,
      ItemBatcher: this.itemBatcher,
      Label: this.label,
    };
  }

  public itemProcessor(processor: IChainable, config: ProcessorConfig = {}): DistributedMap {
    super.itemProcessor(processor, {
      ...config,
      mode: ProcessorMode.DISTRIBUTED,
    });
    return this;
  }

  public setItemReader(itemReader: DistributedMapItemReader): DistributedMap {
    this.itemReader = itemReader;
    return this;
  }

  public setItemBatcher(itemBatcher: DistributedMapItemBatcher): DistributedMap {
    this.itemBatcher = itemBatcher;
    return this;
  }

  public setLabel(label: string): DistributedMap {
    this.label = label;
    return this;
  }
}

// Usage in your stack
const csvRowsHandler = new tasks.LambdaInvoke(this, 'csv_rows_handler', {
  lambdaFunction,
  payload: sfn.TaskInput.fromObject({
    'source': 'step-function',
    'context.$': '$$',
    'input.$': '$',
  }),
  stateName: 'csv_rows_handler',
});

const importCsvDefinition = new DistributedMap(this, 'import-csv', {
  maxConcurrency: 50, // Process up to 50 batches in parallel
})
  .setLabel('import-csv')
  .setItemReader({
    Resource: 'arn:aws:states:::s3:getObject',
    ReaderConfig: {
      InputType: 'CSV',
      CSVHeaderLocation: 'FIRST_ROW',
    },
    Parameters: {
      'Bucket.$': '$.bucket',
      'Key.$': '$.key',
    },
  })
  .setItemBatcher({
    MaxItemsPerBatch: 10, // Group up to 10 rows per batch
    BatchInput: {
      'Attributes.$': '$',
    },
  })
  .itemProcessor(csvRowsHandler, {
    executionType: sfn.ProcessorType.EXPRESS, // Use EXPRESS for child executions
  });

const importCsvStateMachine = new sfn.StateMachine(this, 'ImportCsvStateMachine', {
  stateMachineName: 'import-csv',
  comment: 'Processes large CSV files with distributed batch processing',
  definitionBody: sfn.DefinitionBody.fromChainable(importCsvDefinition),
  tracingEnabled: true,
});

Event Source Configuration

Configure DynamoDB Streams and SQS to trigger Step Functions:

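import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as lambdaEventSources from 'aws-cdk-lib/aws-lambda-event-sources';
import * as sqs from 'aws-cdk-lib/aws-sqs';

// region, account, and prefix are assumed to be defined by the enclosing stack.

// DynamoDB Stream event source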
const tableNames = ['tasks', 'commands', 'import_tmp'];

for (const tableName of tableNames) {
  const table = dynamodb.Table.fromTableAttributes(this, `${tableName}-table`, {
    tableArn: `arn:aws:dynamodb:${region}:${account}:table/${prefix}${tableName}`,
    tableStreamArn: `arn:aws:dynamodb:${region}:${account}:table/${prefix}${tableName}/stream/*`,
  });

  lambdaFunction.addEventSource(
    new lambdaEventSources.DynamoEventSource(table, {
      startingPosition: lambda.StartingPosition.TRIM_HORIZON,
      batchSize: 1,
      filters: [
        lambda.FilterCriteria.filter({
          eventName: lambda.FilterRule.isEqual('INSERT'),
        }),
      ],
    })
  );
}

// SQS event sources
const queues = ['task-action-queue', 'notification-queue', 'import-action-queue'];

for (const queueName of queues) {
  const queue = sqs.Queue.fromQueueArn(
    this,
    queueName,
    `arn:aws:sqs:${region}:${account}:${prefix}${queueName}`
  );

  lambdaFunction.addEventSource(
    new lambdaEventSources.SqsEventSource(queue, {
      batchSize: 1,
    })
  );
}

Implementation Guide

Step 1: Infrastructure Setup

The framework automatically provisions Step Functions infrastructure using AWS CDK. Key resources include:

// State machine definition in CDK
const commandStateMachine = new sfn.StateMachine(this, 'CommandHandler', {
  stateMachineName: 'command',
  definitionBody: sfn.DefinitionBody.fromChainable(definition),
  timeout: cdk.Duration.minutes(15),
  tracingEnabled: true,
  logs: {
    destination: logGroup,
    level: sfn.LogLevel.ALL,
  },
});

Step 2: Define Step Function Events

Create event classes that implement the IEvent interface and carry the Step Functions payload:

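import { IEvent, StepFunctionsContext } from '@mbc-cqrs-serverless/core';

export class CustomWorkflowEvent implements IEvent {
  source: string;
  context: StepFunctionsContext;
  input?: WorkflowInput;
  taskToken?: string;

  // Copy the raw Step Functions payload onto the event; the event factory
  // in Step 4 constructs this class with `new CustomWorkflowEvent(event)`.
  constructor(event?: Partial<CustomWorkflowEvent>) {
    Object.assign(this, event);
  }
}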

Step 3: Implement Event Handlers

Create handlers that process Step Function events:

import { EventHandler, IEventHandler } from '@mbc-cqrs-serverless/core';
import { Logger } from '@nestjs/common';

@EventHandler(CustomWorkflowEvent)
export class CustomWorkflowHandler implements IEventHandler<CustomWorkflowEvent> {
  private readonly logger = new Logger(CustomWorkflowHandler.name);

  async execute(event: CustomWorkflowEvent): Promise<StepStateOutput> {
    const stateName = event.context.State.Name;

    switch (stateName) {
      case 'initialize':
        return this.handleInitialize(event);
      case 'process':
        return this.handleProcess(event);
      case 'finalize':
        return this.handleFinalize(event);
      default:
        throw new Error(`Unknown state: ${stateName}`);
    }
  }

  private async handleInitialize(event: CustomWorkflowEvent) {
    // Initialization logic
    return { status: 'initialized', data: event.input };
  }

  private async handleProcess(event: CustomWorkflowEvent) {
    // Processing logic
    return { status: 'processed' };
  }

  private async handleFinalize(event: CustomWorkflowEvent) {
    // Finalization logic
    return { status: 'completed' };
  }
}

Step 4: Configure Event Factory

import { EventFactory, IEvent, StepFunctionsEvent } from '@mbc-cqrs-serverless/core';

@EventFactory()
export class CustomEventFactory {
  async transformStepFunction(event: StepFunctionsEvent<any>): Promise<IEvent[]> {
    const stateMachineName = event.context.StateMachine.Name;

    if (stateMachineName.includes('custom-workflow')) {
      return [new CustomWorkflowEvent(event)];
    }

    return [];
  }
}

Step 5: Trigger State Machine Execution

Start a state machine execution from your service:

import { StepFunctionService } from '@mbc-cqrs-serverless/core';
import { Injectable } from '@nestjs/common';

@Injectable()
export class WorkflowService {
  constructor(private readonly sfnService: StepFunctionService) {}

  async startWorkflow(input: WorkflowInput): Promise<string | undefined> {
    // startExecution(arn, input, name) serializes the input itself
    const result = await this.sfnService.startExecution(
      process.env.WORKFLOW_STATE_MACHINE_ARN,
      input,
      `workflow-${Date.now()}`,
    );

    return result.executionArn;
  }
}

Use Cases

Use Case 1: Data Synchronization

Synchronize data across multiple tables with version control and conflict resolution.

Scenario: When a command is created, sync the data to multiple read models.

// Trigger: DynamoDB Stream INSERT event
// Flow: check_version -> set_ttl -> history_copy -> transform -> sync_all -> finish

await this.commandService.publishAsync(
  {
    pk: 'TENANT#tenant1',
    sk: 'ORDER#order123',
    id: 'order-uuid',
    code: 'order123',
    name: 'Order',
    type: 'ORDER',
    version: 1,
    tenantCode: 'tenant1',
    attributes: { status: 'confirmed', total: 1000 },
  },
  { invokeContext },
);
// This triggers the command state machine automatically

Use Case 2: Batch Task Processing

Execute multiple related tasks in parallel with controlled concurrency.

Scenario: Process multiple items in a batch job with status tracking.

// Create tasks that will be processed by the task state machine
const items = [
  { itemId: 'item1', action: 'process' },
  { itemId: 'item2', action: 'process' },
  { itemId: 'item3', action: 'process' },
];

await this.taskService.createStepFunctionTask({
  input: items,
  taskType: 'batch-processor',
  tenantCode: 'tenant1',
}, { invokeContext });

Use Case 3: Large-Scale CSV Import

Import millions of rows from CSV files with distributed processing.

Scenario: Import a large CSV file from S3 with validation and transformation.

// Trigger CSV import via API or direct invocation
await this.importService.createCsvImport({
  s3Bucket: 'my-bucket',
  s3Key: 'imports/data.csv',
  tableName: 'products',
  processingMode: ProcessingMode.STEP_FUNCTION,
});

// The import-csv state machine will:
// 1. Read CSV from S3
// 2. Batch rows (default: 10 per batch)
// 3. Process up to 50 batches concurrently
// 4. Transform and validate each row
// 5. Create import commands

Use Case 4: Async Callback Pattern

Wait for external events using task tokens.

Scenario: Wait for approval before proceeding with a workflow.

// In your state machine definition
{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "${LambdaFunction}",
      "Payload": {
        "taskToken.$": "$$.Task.Token",
        "requestId.$": "$.requestId"
      }
    },
    "Next": "ProcessApproval"
  }
}

// In your handler, store the task token
async handleWaitForApproval(event: ApprovalEvent) {
  await this.approvalService.createApprovalRequest({
    requestId: event.input.requestId,
    taskToken: event.taskToken, // Store for later callback
  });
}

// When approval is received, resume the workflow
async approveRequest(requestId: string) {
  const request = await this.approvalService.getRequest(requestId);

  // resumeExecution wraps the output and sends SendTaskSuccess for the stored token
  await this.sfnService.resumeExecution(request.taskToken, { approved: true });
}

Callback Patterns with Task Tokens

The framework implements callback patterns using AWS Step Functions task tokens for coordinating long-running workflows and waiting for external events.

How Callback Patterns Work

When a Step Function state uses the WAIT_FOR_TASK_TOKEN integration pattern, the execution pauses until an external process sends a success or failure response with the task token.
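
The pause-and-resume cycle can be illustrated without AWS. The sketch below is an in-memory stand-in, not the Step Functions API: `PausedExecutions` and its method names are hypothetical, and real task tokens are issued by Step Functions rather than chosen by the caller. It shows the two properties that matter when storing tokens: a token resumes exactly one waiting state, and it cannot be reused.

```typescript
// Outcome delivered to a paused state.
type Resolution =
  | { kind: 'success'; output: unknown }
  | { kind: 'failure'; error: string; cause: string };

// In-memory stand-in for executions paused on a task token.
class PausedExecutions {
  private readonly waiting = new Map<string, (resolution: Resolution) => void>();

  // Called when a state starts waiting: remember how to continue it.
  pause(taskToken: string, onResume: (resolution: Resolution) => void): void {
    this.waiting.set(taskToken, onResume);
  }

  // Counterpart of a success callback: resume once, then forget the token.
  sendTaskSuccess(taskToken: string, output: unknown): boolean {
    return this.settle(taskToken, { kind: 'success', output });
  }

  // Counterpart of a failure callback.
  sendTaskFailure(taskToken: string, error: string, cause: string): boolean {
    return this.settle(taskToken, { kind: 'failure', error, cause });
  }

  private settle(taskToken: string, resolution: Resolution): boolean {
    const onResume = this.waiting.get(taskToken);
    if (!onResume) return false; // unknown or already-used token
    this.waiting.delete(taskToken);
    onResume(resolution);
    return true;
  }
}

const executions = new PausedExecutions();
const resumed: Resolution[] = [];
executions.pause('token-1', (resolution) => resumed.push(resolution));
const accepted = executions.sendTaskSuccess('token-1', { approved: true }); // true
const repeated = executions.sendTaskSuccess('token-1', { approved: true }); // false: token already used
```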

StepFunctionService Implementation

The StepFunctionService provides methods for starting executions and resuming paused workflows:

import {
  SFNClient,
  SendTaskSuccessCommand,
  StartExecutionCommand,
} from '@aws-sdk/client-sfn';
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

@Injectable()
export class StepFunctionService {
  private readonly client: SFNClient;

  constructor(private readonly config: ConfigService) {
    this.client = new SFNClient({
      endpoint: config.get<string>('SFN_ENDPOINT'),
      region: config.get<string>('SFN_REGION'),
    });
  }

  // Start a new state machine execution
  startExecution(arn: string, input: any, name?: string) {
    return this.client.send(
      new StartExecutionCommand({
        stateMachineArn: arn,
        name: name && name.length <= 80 ? name : undefined,
        input: JSON.stringify(input),
      }),
    );
  }

  // Resume a paused execution using task token
  async resumeExecution(taskToken: string, output: any = {}) {
    // Wrap output in the expected format for Lambda integration
    const wrappedOutput = {
      Payload: [[output]],
    };

    return await this.client.send(
      new SendTaskSuccessCommand({
        taskToken: taskToken,
        output: JSON.stringify(wrappedOutput),
      }),
    );
  }
}
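
The startExecution method above passes `name` only when it is at most 80 characters long; otherwise the name is omitted and Step Functions assigns one. When a predictable name is wanted, it can be shortened before the call. The sketch below is a hypothetical helper, not part of StepFunctionService, and restricting names to letters, digits, hyphens, and underscores is a conservative assumption rather than the exact rule set.

```typescript
const MAX_EXECUTION_NAME_LENGTH = 80;

// Hypothetical helper: build an execution name that fits within 80 characters
// while keeping the timestamp suffix that makes the name unique.
function buildExecutionName(prefix: string, id: string, timestamp: number): string {
  const suffix = `-${timestamp}`;
  // Replace anything outside the assumed safe character set.
  const base = `${prefix}-${id}`.replace(/[^A-Za-z0-9_-]/g, '_');
  const room = Math.max(MAX_EXECUTION_NAME_LENGTH - suffix.length, 0);
  return base.slice(0, room) + suffix;
}

const executionName = buildExecutionName(
  'workflow',
  'TENANT#tenant1/ORDER#order123',
  1700000000000,
);
```

Truncating the base rather than the suffix keeps names unique even when two long identifiers share a prefix.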

Version-Based Command Chaining

The command state machine uses callback patterns to ensure commands are processed in version order:

// Wait for previous command to complete using task token
protected async waitConfirmToken(
  event: DataSyncCommandSfnEvent,
): Promise<StepFunctionStateInput> {
  // Store task token in DynamoDB for later callback
  await this.commandService.updateTaskToken(event.commandKey, event.taskToken);
  return {
    result: {
      token: event.taskToken,
    },
  };
}

// When a command finishes, check if next version is waiting
protected async checkNextToken(
  event: DataSyncCommandSfnEvent,
): Promise<StepFunctionStateInput> {
  const nextCommand = await this.commandService.getNextCommand(
    event.commandKey,
  );

  if (!nextCommand) {
    return null; // No next command, chain ends
  }

  if (nextCommand.taskToken) {
    // Resume the waiting command
    try {
      await this.sfnService.resumeExecution(nextCommand.taskToken, {
        result: 'resumed_by_prev_version',
        prevVersion: event.commandRecord.version,
      });
    } catch (e) {
      this.logger.warn(
        `Could not resume command v${nextCommand.version}: ${e.message}`,
      );
    }
  }

  return null;
}

CDK Configuration for Callback Pattern

Configure the state to wait for task token in your CDK stack:

// Create a state that waits for callback
const waitPrevCommand = new tasks.LambdaInvoke(this, 'wait_prev_command', {
  lambdaFunction,
  payload: sfn.TaskInput.fromObject({
    'input.$': '$',
    'context.$': '$$',
    'taskToken': sfn.JsonPath.taskToken, // Include task token in payload
  }),
  stateName: 'wait_prev_command',
  outputPath: '$.Payload[0][0]',
  // Use WAIT_FOR_TASK_TOKEN integration pattern
  integrationPattern: sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
});
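
Because this state selects `$.Payload[0][0]` as its output, the value sent back with the task token has to be wrapped as `{ Payload: [[output]] }`, which is what resumeExecution does. The sketch below demonstrates the relationship with a deliberately simplified path resolver. `selectPath` and `wrapForLambdaOutputPath` are hypothetical names, and the resolver handles only dotted field names and numeric indexes, not full JSONPath.

```typescript
// Simplified resolver for paths such as '$.Payload[0][0]'.
// Only dotted field names and numeric indexes are supported.
function selectPath(document: unknown, path: string): unknown {
  if (!path.startsWith('$')) {
    throw new Error(`Path must start with "$": ${path}`);
  }
  const tokens = path.slice(1).match(/\.[A-Za-z_][A-Za-z0-9_]*|\[\d+\]/g) ?? [];
  let current: unknown = document;
  for (const token of tokens) {
    if (current === null || typeof current !== 'object') return undefined;
    const key = token.startsWith('.') ? token.slice(1) : Number(token.slice(1, -1));
    current = (current as Record<string | number, unknown>)[key];
  }
  return current;
}

// Same wrapping that resumeExecution applies before sending the callback.
function wrapForLambdaOutputPath(output: unknown): { Payload: unknown[][] } {
  return { Payload: [[output]] };
}

const callbackOutput = { result: 'resumed_by_prev_version', prevVersion: 1 };
const selected = selectPath(wrapForLambdaOutputPath(callbackOutput), '$.Payload[0][0]');
const unwrapped = selectPath(callbackOutput, '$.Payload[0][0]'); // nothing at that path
```

In this sketch the wrapped output resolves back to the original object, while the unwrapped one has nothing at the selected path.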

Long-Running Workflow Strategies

The framework provides several strategies for handling long-running workflows:

ZIP Import Orchestration

For complex multi-file imports, the framework uses a hierarchical orchestration pattern:

Task Token Propagation for Child Workflows

When triggering child workflows, the parent stores the task token for later callback:

// Trigger a child CSV job and wait for completion
private async triggerSingleCsvJob(event: ZipImportSfnEvent) {
  const s3Key = event.input?.s3Key || event.input;
  const { taskToken } = event; // Task token from parent workflow
  const { masterJobKey, parameters } = event.context.Execution.Input;

  // Create CSV job with stored task token
  await this.importService.createCsvJobWithTaskToken(
    {
      processingMode: ProcessingMode.STEP_FUNCTION,
      bucket: parameters.bucket,
      key: s3Key,
      tenantCode: parameters.tenantCode,
      tableName: tableName,
    },
    taskToken, // Store for callback when CSV processing completes
    masterJobKey,
  );
}

Workflow Timeout Configuration

Set appropriate timeouts for long-running workflows:

const taskStateMachine = new sfn.StateMachine(this, 'task-handler', {
  stateMachineName: 'task-handler',
  definitionBody: sfn.DefinitionBody.fromChainable(sfnTaskMapState),
  timeout: cdk.Duration.minutes(15), // Overall workflow timeout
  tracingEnabled: true,
  logs: {
    destination: logGroup,
    level: sfn.LogLevel.ALL,
  },
});

Integration with Import/Export Patterns

The framework integrates Step Functions with the import module for scalable data processing:

CSV Import Flow

The CSV import uses a two-phase approach with Step Functions:

// Phase 1: Create import job and trigger Step Function
async handleCsvImport(
  dto: CreateCsvImportDto,
  options: ICommandOptions,
): Promise<ImportEntity[] | ImportEntity> {
  if (dto.processingMode === 'DIRECT') {
    // Process directly in Lambda (for small files)
    return this._processCsvDirectly(dto, options);
  } else {
    // Create job and let Step Function handle processing
    return this.createCsvJob(dto, options);
  }
}

// Phase 2: Step Function handler processes rows
@EventHandler(CsvImportSfnEvent)
export class CsvImportSfnEventHandler {
  async handleStepState(event: CsvImportSfnEvent): Promise<any> {
    if (event.context.State.Name === 'csv_loader') {
      // Count total rows and initialize job
      const totalRows = await this.countCsvRows(input);
      await this.importService.updateImportJob(parentKey, {
        set: { totalRows },
      });
      return this.loadCsv(input);
    }

    if (event.context.State.Name === 'finalize_parent_job') {
      return this.finalizeParentJob(event);
    }

    // Process batch of rows
    const items = event.input.Items;
    for (const item of items) {
      const transformedData = await strategy.transform(item);
      await strategy.validate(transformedData);
      await this.importService.createImport(createImportDto, options);
    }
  }
}

Progress Tracking with Atomic Counters

The import service uses atomic DynamoDB counters for accurate progress tracking:

// Atomically increment progress counters
async incrementParentJobCounters(
  parentKey: DetailKey,
  childSucceeded: boolean,
): Promise<ImportEntity> {
  const countersToIncrement: { [key: string]: number } = {
    processedRows: 1,
  };
  if (childSucceeded) {
    countersToIncrement.succeededRows = 1;
  } else {
    countersToIncrement.failedRows = 1;
  }

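  // Build one atomic SET clause per counter so that succeededRows and
  // failedRows are incremented together with processedRows
  const counterNames = Object.keys(countersToIncrement);
  const command = new UpdateItemCommand({
    TableName: this.tableName,
    Key: marshall(parentKey),
    UpdateExpression:
      'SET ' +
      counterNames
        .map((name) => `#${name} = if_not_exists(#${name}, :start) + :inc`)
        .join(', '),
    ExpressionAttributeNames: Object.fromEntries(
      counterNames.map((name) => [`#${name}`, name]),
    ),
    ExpressionAttributeValues: marshall({ ':start': 0, ':inc': 1 }),
    ReturnValues: 'ALL_NEW',
  });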

  const response = await this.dynamoDbService.client.send(command);
  const updatedEntity = unmarshall(response.Attributes) as ImportEntity;

  // Check if job is complete and update final status
  if (updatedEntity.totalRows > 0 && updatedEntity.processedRows >= updatedEntity.totalRows) {
    const finalStatus = updatedEntity.failedRows > 0
      ? ImportStatusEnum.FAILED
      : ImportStatusEnum.COMPLETED;
    await this.updateStatus(parentKey, finalStatus);
  }

  return updatedEntity;
}
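
The completion check at the end of incrementParentJobCounters can be isolated as a pure function, which makes the rule easy to test. The sketch below assumes string statuses as stand-ins for ImportStatusEnum, and `resolveJobStatus` is a hypothetical name; it mirrors the condition shown above rather than the framework's actual code.

```typescript
// Stand-ins for ImportStatusEnum values (assumed for this sketch).
type ImportStatus = 'PROCESSING' | 'COMPLETED' | 'FAILED';

interface JobCounters {
  totalRows: number;
  processedRows: number;
  failedRows: number;
}

// A job is finished once every row has been processed; it is FAILED if any
// row failed. A totalRows of 0 means the total is not known yet.
function resolveJobStatus(counters: JobCounters): ImportStatus {
  const finished =
    counters.totalRows > 0 && counters.processedRows >= counters.totalRows;
  if (!finished) return 'PROCESSING';
  return counters.failedRows > 0 ? 'FAILED' : 'COMPLETED';
}

const inProgress = resolveJobStatus({ totalRows: 100, processedRows: 40, failedRows: 0 });
const allSucceeded = resolveJobStatus({ totalRows: 100, processedRows: 100, failedRows: 0 });
const someFailed = resolveJobStatus({ totalRows: 100, processedRows: 100, failedRows: 3 });
```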

Processing Mode Selection

Choose the appropriate processing mode based on data size:

| Processing Mode | Use Case | Max Rows | Concurrency |
|---|---|---|---|
| `DIRECT` | Small files, immediate feedback | ~1,000 | Single Lambda |
| `STEP_FUNCTION` | Large files, background processing | Millions | Up to 50 |

// Example: Selecting processing mode based on file size
const processingMode = estimatedRows > 1000
  ? ProcessingMode.STEP_FUNCTION
  : ProcessingMode.DIRECT;

await importService.handleCsvImport({
  bucket: 'my-bucket',
  key: 'data/large-file.csv',
  tableName: 'products',
  tenantCode: 'tenant1',
  processingMode,
}, { invokeContext });

Step Functions Context

Every Step Function event includes context information about the execution:

interface StepFunctionsContext {
  Execution: {
    Id: string;        // Execution ARN
    Input: object;     // Original input
    Name: string;      // Execution name
    RoleArn: string;   // IAM role
    StartTime: string; // ISO timestamp
  };
  State: {
    EnteredTime: string; // When this state started
    Name: string;        // Current state name
    RetryCount: number;  // Retry attempt number
  };
  StateMachine: {
    Id: string;   // State machine ARN
    Name: string; // State machine name
  };
}

Error Handling

Implement robust error handling in your state machines:

Handler-Level Error Handling

The framework provides built-in error handling patterns for Step Function handlers:

// Command event handler with status tracking and error handling
@Injectable()
export class CommandEventHandler {
  async execute(
    event: DataSyncCommandSfnEvent,
  ): Promise<StepFunctionStateInput | StepFunctionStateInput[]> {
    // Update status to STARTED before processing
    await this.commandService.updateStatus(
      event.commandKey,
      getCommandStatus(event.stepStateName, CommandStatus.STATUS_STARTED),
      event.commandRecord.requestId,
    );

    try {
      const ret = await this.handleStepState(event);
      // Update status to FINISHED on success
      await this.commandService.updateStatus(
        event.commandKey,
        getCommandStatus(event.stepStateName, CommandStatus.STATUS_FINISHED),
        event.commandRecord.requestId,
      );
      return ret;
    } catch (error) {
      // Update status to FAILED and publish alarm on error
      await this.commandService.updateStatus(
        event.commandKey,
        getCommandStatus(event.stepStateName, CommandStatus.STATUS_FAILED),
        event.commandRecord.requestId,
      );
      await this.publishAlarm(event, (error as Error).stack);
      throw error;
    }
  }
}

Task Error Handling with Continuation

For task handlers, the framework supports continuing execution even after errors:

// Task handler with error handling that allows workflow continuation
@EventHandler(StepFunctionTaskEvent)
export class TaskSfnEventHandler implements IEventHandler<StepFunctionTaskEvent> {
  async execute(event: StepFunctionTaskEvent): Promise<any> {
    const taskKey = event.taskKey;

    try {
      await this.taskService.updateSubTaskStatus(taskKey, TaskStatusEnum.PROCESSING);
      const events = await this.eventFactory.transformStepFunctionTask(event);
      const result = await Promise.all(
        events.map((event) => this.eventBus.execute(event)),
      );
      // Update status to COMPLETED on success
      await this.taskService.updateSubTaskStatus(taskKey, TaskStatusEnum.COMPLETED, {
        result,
      });
    } catch (error) {
      // Update status to FAILED and publish alarm, but don't throw
      this.logger.error(error);
      await Promise.all([
        this.taskService.updateSubTaskStatus(taskKey, TaskStatusEnum.FAILED, {
          error: (error as Error).stack,
        }),
        this.taskService.publishAlarm(event, (error as Error).stack),
      ]);
      // Note: Error is not re-thrown to allow Step Function to continue
      // throw error // Uncomment to fail the entire workflow on error
    }
  }
}

Alarm Publishing

The framework publishes alarms to SNS for monitoring and alerting:

// Publish alarm notification to SNS topic
async publishAlarm(
  event: DataSyncCommandSfnEvent,
  errorDetails: any,
): Promise<void> {
  const alarm: INotification = {
    action: 'sfn-alarm',
    id: `${event.commandKey.pk}#${event.commandKey.sk}`,
    table: this.options.tableName,
    pk: event.commandKey.pk,
    sk: event.commandKey.sk,
    tenantCode: event.commandKey.pk.substring(
      event.commandKey.pk.indexOf('#') + 1,
    ),
    content: {
      errorMessage: errorDetails,
      sfnId: event.context.Execution.Id,
    },
  };
  await this.snsService.publish<INotification>(alarm, this.alarmTopicArn);
}
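
The alarm derives `tenantCode` from the partition key by taking everything after the first `#`. The sketch below isolates that expression so its edge cases are visible; `tenantCodeFromPk` is a hypothetical name used only for this example.

```typescript
// Same expression as in publishAlarm: everything after the first '#'.
function tenantCodeFromPk(pk: string): string {
  return pk.substring(pk.indexOf('#') + 1);
}

const typical = tenantCodeFromPk('TENANT#tenant1'); // 'tenant1'
const noSeparator = tenantCodeFromPk('tenant1'); // 'tenant1': indexOf returns -1, so the whole key is kept
const nested = tenantCodeFromPk('TENANT#tenant1#extra'); // 'tenant1#extra': only the first '#' is used
```

If partition keys can contain more than one `#`, the extracted value includes the remaining segments, so callers may need to split further.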

State machine error handling configuration:

{
  "ProcessStep": {
    "Type": "Task",
    "Resource": "${LambdaArn}",
    "Retry": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "Next": "HandleError",
        "ResultPath": "$.error"
      }
    ],
    "Next": "NextStep"
  }
}
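
With this Retry block, Step Functions waits IntervalSeconds before the first retry and multiplies the wait by BackoffRate for each further attempt. The sketch below computes the resulting schedule; `retryDelays` is a hypothetical helper, and it ignores optional settings such as a maximum delay or jitter because the configuration above does not set them.

```typescript
interface RetryPolicy {
  intervalSeconds: number;
  maxAttempts: number;
  backoffRate: number;
}

// Wait before retry n (1-based) is intervalSeconds * backoffRate^(n - 1).
function retryDelays(policy: RetryPolicy): number[] {
  return Array.from(
    { length: policy.maxAttempts },
    (_, attempt) => policy.intervalSeconds * Math.pow(policy.backoffRate, attempt),
  );
}

// Values from the ProcessStep configuration above.
const delays = retryDelays({ intervalSeconds: 2, maxAttempts: 3, backoffRate: 2 }); // [2, 4, 8]
const totalWaitSeconds = delays.reduce((sum, delay) => sum + delay, 0); // 14
```

The three retries therefore add up to 14 seconds of waiting before the Catch block routes the error to HandleError, which is worth accounting for when choosing the state machine timeout.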

Best Practices

Design Principles

Idempotency: Design each state to be safely retryable
Single Responsibility: Each state should do one thing well
Timeout Configuration: Set appropriate timeouts for each state
Logging: Enable comprehensive logging for debugging
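
The idempotency principle can be sketched as a wrapper that remembers completed work by key, so a retried state returns the earlier result instead of repeating side effects. `makeIdempotent` is a hypothetical helper, and the in-memory Map is an assumption made to keep the example self-contained; a Lambda-based handler would need a durable store instead, since memory is not shared between invocations.

```typescript
// Hypothetical wrapper: run `handler` at most once per key and replay the
// stored result for repeated inputs.
function makeIdempotent<I, O>(
  keyOf: (input: I) => string,
  handler: (input: I) => O,
): (input: I) => O {
  const completed = new Map<string, O>();
  return (input: I): O => {
    const key = keyOf(input);
    if (completed.has(key)) {
      return completed.get(key) as O; // retry: reuse the earlier result
    }
    const output = handler(input);
    completed.set(key, output);
    return output;
  };
}

// Usage: the side effect (counting) happens once per command key.
let sideEffects = 0;
const syncCommand = makeIdempotent(
  (command: { pk: string; sk: string }) => `${command.pk}#${command.sk}`,
  (command) => {
    sideEffects++;
    return { synced: command.sk };
  },
);

syncCommand({ pk: 'TENANT#tenant1', sk: 'ORDER#order123' });
syncCommand({ pk: 'TENANT#tenant1', sk: 'ORDER#order123' }); // retried state
```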

Performance Optimization

Use Express Workflows: For high-volume, short-duration workflows
Batch Processing: Group items to reduce state transitions
Concurrency Limits: Set appropriate limits to prevent throttling
S3 Integration: Use native S3 integration for large data processing

Monitoring

CloudWatch Metrics: Monitor execution counts, failures, and duration
X-Ray Tracing: Enable distributed tracing for debugging
CloudWatch Logs: Capture detailed execution logs
Alarms: Set up alerts for failure rates and execution times

Related Documentation

Task Module - Task management with Step Functions
Import/Export Patterns - CSV import with Distributed Map
Event Sourcing - Event-driven architecture
CQRS Flow - Command and query separation
