Skip to main content

Monitoring and Logging

This guide covers setting up comprehensive monitoring and logging for MBC CQRS Serverless applications on AWS.

Overview

A complete observability strategy includes:

  • Logging: Capturing application events and errors
  • Metrics: Measuring performance and usage
  • Tracing: Following requests across services
  • Alerting: Notifying on issues

CloudWatch Logs

Lambda Logging

Lambda functions automatically log to CloudWatch. Enhance with structured logging:

import { Logger, Injectable } from '@nestjs/common';

@Injectable()
export class TodoService {
private readonly logger = new Logger(TodoService.name);

async create(dto: CreateTodoDto): Promise<Todo> {
this.logger.log({
action: 'create_todo',
input: dto,
userId: context.userId,
});

try {
const result = await this.save(dto);
this.logger.log({
action: 'todo_created',
todoId: result.id,
duration: Date.now() - startTime,
});
return result;
} catch (error) {
this.logger.error({
action: 'create_todo_failed',
error: error.message,
stack: error.stack,
});
throw error;
}
}
}

Log Retention

Configure log retention in CDK:

import * as logs from 'aws-cdk-lib/aws-logs';

const logGroup = new logs.LogGroup(this, 'AppLogGroup', {
logGroupName: `/aws/lambda/${props.appName}-${props.envName}`,
retention: logs.RetentionDays.ONE_MONTH,
removalPolicy: cdk.RemovalPolicy.DESTROY,
});

Log Format

Recommended log format for easy querying:

interface LogEntry {
timestamp: string;
level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';
message: string;
context: string;
correlationId?: string;
userId?: string;
duration?: number;
error?: {
name: string;
message: string;
stack: string;
};
metadata?: Record<string, any>;
}

CloudWatch Metrics

Custom Metrics

Publish custom metrics from your application:

import { CloudWatch } from 'aws-sdk';

const cloudwatch = new CloudWatch();

async function publishMetric(
metricName: string,
value: number,
unit: string = 'Count',
): Promise<void> {
await cloudwatch.putMetricData({
Namespace: 'YourApp/Custom',
MetricData: [
{
MetricName: metricName,
Value: value,
Unit: unit,
Dimensions: [
{ Name: 'Environment', Value: process.env.ENVIRONMENT },
],
},
],
}).promise();
}

// Usage
await publishMetric('TodosCreated', 1);
await publishMetric('ProcessingTime', 150, 'Milliseconds');

Key Metrics to Monitor

CategoryMetricDescription
LambdaInvocationsNumber of function calls
LambdaDurationExecution time
LambdaErrorsFailed invocations
LambdaConcurrentExecutionsParallel executions
API GatewayCountRequest count
API GatewayLatencyResponse time
API Gateway4XXErrorClient errors
API Gateway5XXErrorServer errors
DynamoDBConsumedReadCapacityRead usage
DynamoDBConsumedWriteCapacityWrite usage
DynamoDBThrottledRequestsThrottled operations

CDK Dashboard

Create a CloudWatch dashboard:

import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

const dashboard = new cloudwatch.Dashboard(this, 'AppDashboard', {
dashboardName: `${props.appName}-${props.envName}`,
});

dashboard.addWidgets(
new cloudwatch.GraphWidget({
title: 'Lambda Invocations',
left: [handler.metricInvocations()],
}),
new cloudwatch.GraphWidget({
title: 'Lambda Duration',
left: [handler.metricDuration()],
}),
new cloudwatch.GraphWidget({
title: 'Lambda Errors',
left: [handler.metricErrors()],
}),
new cloudwatch.GraphWidget({
title: 'API Gateway Requests',
left: [
api.metricCount(),
api.metricClientError(),
api.metricServerError(),
],
}),
);

CloudWatch Alarms

Create Alarms

Set up alarms for critical metrics:

import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';

// SNS topic for notifications
const alertTopic = new sns.Topic(this, 'AlertTopic');

// Lambda error alarm
const errorAlarm = new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
metric: handler.metricErrors(),
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'Lambda function errors exceed threshold',
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
errorAlarm.addAlarmAction(new actions.SnsAction(alertTopic));

// High latency alarm
const latencyAlarm = new cloudwatch.Alarm(this, 'LatencyAlarm', {
metric: handler.metricDuration({
statistic: 'p99',
}),
threshold: 5000, // 5 seconds
evaluationPeriods: 3,
alarmDescription: 'P99 latency exceeds 5 seconds',
});
latencyAlarm.addAlarmAction(new actions.SnsAction(alertTopic));

// DynamoDB throttling alarm
const throttleAlarm = new cloudwatch.Alarm(this, 'ThrottleAlarm', {
metric: table.metricThrottledRequests(),
threshold: 1,
evaluationPeriods: 1,
alarmDescription: 'DynamoDB throttling detected',
});
throttleAlarm.addAlarmAction(new actions.SnsAction(alertTopic));
AlarmMetricThresholdDescription
High Error RateLambda Errors> 5 per minuteIndicates application issues
High LatencyLambda Duration p99> 5 secondsSlow response times
DynamoDB ThrottlingThrottledRequests> 0Capacity issues
5XX ErrorsAPI Gateway 5XXError> 1%Server errors
Dead Letter QueueSQS ApproximateNumberOfMessages> 0Failed messages

AWS X-Ray Tracing

Enable X-Ray

Enable X-Ray tracing in CDK:

const handler = new lambda.Function(this, 'Handler', {
// ... other config
tracing: lambda.Tracing.ACTIVE,
});

const api = new apigateway.HttpApi(this, 'Api', {
// ... other config
});

// Enable X-Ray for API Gateway
const stage = api.defaultStage?.node.defaultChild as apigateway.CfnStage;
stage.addPropertyOverride('TracingEnabled', true);

Instrument Application

import * as AWSXRay from 'aws-xray-sdk';

// Instrument AWS SDK
const AWS = AWSXRay.captureAWS(require('aws-sdk'));

// Instrument HTTP calls
AWSXRay.captureHTTPsGlobal(require('http'));
AWSXRay.captureHTTPsGlobal(require('https'));

// Add custom annotations
const segment = AWSXRay.getSegment();
const subsegment = segment?.addNewSubsegment('CustomOperation');
subsegment?.addAnnotation('userId', userId);
subsegment?.addMetadata('request', requestData);
// ... perform operation
subsegment?.close();

Centralized Logging

Log Aggregation Pattern

For complex applications, aggregate logs:

// Create a centralized log group
const centralLogGroup = new logs.LogGroup(this, 'CentralLogs', {
logGroupName: `/app/${props.appName}/central`,
retention: logs.RetentionDays.THREE_MONTHS,
});

// Subscribe Lambda logs
new logs.SubscriptionFilter(this, 'LambdaLogSubscription', {
logGroup: lambdaLogGroup,
destination: new destinations.LambdaDestination(logProcessorFunction),
filterPattern: logs.FilterPattern.allEvents(),
});

Log Insights Queries

Useful CloudWatch Logs Insights queries:

# Error analysis
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) by bin(1h)

# Slow requests
fields @timestamp, @duration, @requestId
| filter @duration > 3000
| sort @duration desc
| limit 20

# Request volume by endpoint
fields @timestamp
| filter @message like /HTTP/
| parse @message /(?<method>\w+) (?<path>\/\S+)/
| stats count(*) by path, method
| sort count desc

# Cold starts
fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>[\d.]+) ms/
| stats avg(initDuration), max(initDuration), count(*)

Application Performance Monitoring

Performance Metrics

Track key performance indicators:

class PerformanceMonitor {
private metrics: Map<string, number[]> = new Map();

startTimer(operation: string): () => void {
const start = Date.now();
return () => {
const duration = Date.now() - start;
this.recordMetric(operation, duration);
};
}

private recordMetric(name: string, value: number): void {
if (!this.metrics.has(name)) {
this.metrics.set(name, []);
}
this.metrics.get(name)!.push(value);
}

async flush(): Promise<void> {
for (const [name, values] of this.metrics.entries()) {
await publishMetric(name, average(values), 'Milliseconds');
}
this.metrics.clear();
}
}

Health Checks

Implement health check endpoints:

@Controller('health')
export class HealthController {
constructor(
private readonly prisma: PrismaService,
private readonly dynamodb: DynamoDBService,
) {}

@Get()
async check(): Promise<HealthStatus> {
const checks = await Promise.allSettled([
this.checkDatabase(),
this.checkDynamoDB(),
]);

return {
status: checks.every(c => c.status === 'fulfilled') ? 'healthy' : 'degraded',
timestamp: new Date().toISOString(),
checks: {
database: checks[0].status === 'fulfilled' ? 'ok' : 'failed',
dynamodb: checks[1].status === 'fulfilled' ? 'ok' : 'failed',
},
};
}
}

Best Practices

Logging

  • Use structured JSON logging
  • Include correlation IDs
  • Log at appropriate levels
  • Avoid logging sensitive data
  • Set appropriate retention periods

Metrics

  • Focus on actionable metrics
  • Use percentiles (p50, p95, p99) for latency
  • Set meaningful thresholds
  • Create dashboards for different audiences

Alerting

  • Avoid alert fatigue
  • Set appropriate thresholds
  • Include runbooks in alert descriptions
  • Use escalation policies

Next Steps