Serverless Data Processing with AWS Lambda and Go

A practical guide to building scalable data processing with AWS Lambda and Go. Learn how to handle millions of DynamoDB records efficiently using event-driven architecture.

Why Serverless for Data Processing?

Processing millions of records daily doesn't mean you need servers running 24/7. AWS Lambda lets you run code only when needed, scale automatically, and pay only for what you use. Here's how to build a data processing pipeline that actually works.

The Architecture

Simple flow that scales:

DynamoDB Streams → EventBridge → Lambda (Go) → S3/DynamoDB

Why this stack:

Lambda for zero-ops compute
Go for fast execution and low memory
DynamoDB Streams for change capture
EventBridge for event routing
S3 for cheap storage

Step 1: Set Up Your Project

Start with a clean Go project structure:

mkdir serverless-processor && cd serverless-processor
go mod init serverless-processor
 
mkdir -p cmd/processor internal/handler pkg/models

Project layout:

.
├── cmd/
│   └── processor/      # Lambda entry point
├── internal/
│   └── handler/        # Business logic
├── pkg/
│   └── models/         # Data structures
└── terraform/          # Infrastructure

Step 2: Write Your Lambda Handler

Keep it simple. Here's a basic Lambda handler in Go:

// cmd/processor/main.go
package main
 
import (
    "context"
    "encoding/json"
    "log"
    
    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)
 
// Initialize AWS clients outside handler for reuse
var s3Client *s3.Client
 
func init() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }
    s3Client = s3.NewFromConfig(cfg)
}
 
func handler(ctx context.Context, event events.DynamoDBEvent) error {
    log.Printf("Processing %d records", len(event.Records))
    
    for _, record := range event.Records {
        // Process each record
        if err := processRecord(record); err != nil {
            log.Printf("Error processing record: %v", err)
            return err
        }
    }
    
    return nil
}
 
func processRecord(record events.DynamoDBEventRecord) error {
    // Extract data from DynamoDB stream
    newImage := record.Change.NewImage
    
    // Transform and process
    data := extractData(newImage)
    
    // Save to S3 or write back to DynamoDB
    return saveData(data)
}
 
func main() {
    lambda.Start(handler)
}

Key points:

Initialize AWS clients in init() for connection reuse
Handle errors explicitly
Log what matters
Keep functions small

Step 3: Connect to DynamoDB Streams

Enable streams on your DynamoDB table, then connect Lambda:

# Enable streams
aws dynamodb update-table \
  --table-name my-table \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

Your Lambda automatically gets triggered when data changes. Configure the event source:

// Lambda gets batches of records
// Process them in parallel with goroutines
func handler(ctx context.Context, event events.DynamoDBEvent) error {
    results := make(chan error, len(event.Records))
    
    for _, record := range event.Records {
        go func(r events.DynamoDBEventRecord) {
            results <- processRecord(r)
        }(record)
    }
    
    // Collect results
    for range event.Records {
        if err := <-results; err != nil {
            return err
        }
    }
    
    return nil
}

Tips:

Batch size of 100 works well
Add a dead letter queue for failed events
Use goroutines for parallel processing
Set appropriate timeout (30-60s)

Step 4: Add EventBridge for Flexibility

EventBridge lets you trigger Lambda from multiple sources:

// Handle different event types
func router(ctx context.Context, event json.RawMessage) error {
    var eventType struct {
        Source string `json:"source"`
    }
    
    json.Unmarshal(event, &eventType)
    
    switch eventType.Source {
    case "aws.dynamodb":
        return handleDynamoDB(event)
    case "aws.s3":
        return handleS3(event)
    case "custom.scheduled":
        return handleScheduled(event)
    default:
        return fmt.Errorf("unknown source: %s", eventType.Source)
    }
}

Schedule periodic processing:

# Run every hour
aws events put-rule \
  --name hourly-processing \
  --schedule-expression "rate(1 hour)"

Step 5: Save Processed Data

Write to S3 for long-term storage:

func saveToS3(data []byte, key string) error {
    _, err := s3Client.PutObject(context.TODO(), &s3.PutObjectInput{
        Bucket: aws.String("my-processed-data"),
        Key:    aws.String(key),
        Body:   bytes.NewReader(data),
    })
    return err
}

Or back to DynamoDB for real-time queries:

func saveToDynamoDB(item map[string]types.AttributeValue) error {
    _, err := dynamoClient.PutItem(context.TODO(), &dynamodb.PutItemInput{
        TableName: aws.String("processed-data"),
        Item:      item,
    })
    return err
}

Storage tips:

S3: partition by date, compress with gzip
DynamoDB: use batch writes, set TTL for cleanup
Add retry logic for transient failures

Step 6: Deploy with Terraform

Automate infrastructure with Terraform:

# terraform/main.tf
resource "aws_lambda_function" "processor" {
  filename         = "function.zip"
  function_name    = "data-processor"
  role            = aws_iam_role.lambda_role.arn
  handler         = "main"
  runtime         = "go1.x"
  timeout         = 60
  memory_size     = 1024
 
  environment {
    variables = {
      BUCKET_NAME = aws_s3_bucket.data.id
    }
  }
}
 
resource "aws_lambda_event_source_mapping" "dynamodb" {
  event_source_arn  = aws_dynamodb_table.source.stream_arn
  function_name     = aws_lambda_function.processor.arn
  starting_position = "LATEST"
  batch_size        = 100
}
 
resource "aws_s3_bucket" "data" {
  bucket = "processed-data-${var.environment}"
}

Deploy:

terraform init
terraform plan
terraform apply

Step 7: Build and Deploy

Build your Go binary for Lambda:

# Build for Lambda (Linux)
GOOS=linux GOARCH=amd64 go build -o main cmd/processor/main.go
 
# Zip it
zip function.zip main
 
# Deploy with Terraform
terraform apply

Test it:

# Trigger by writing to DynamoDB
aws dynamodb put-item \
  --table-name my-table \
  --item '{"id":{"S":"123"},"data":{"S":"test"}}'
 
# Check logs
aws logs tail /aws/lambda/data-processor --follow

Step 8: Monitor and Optimize

Add basic CloudWatch alarms:

resource "aws_cloudwatch_metric_alarm" "errors" {
  alarm_name          = "lambda-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name        = "Errors"
  namespace          = "AWS/Lambda"
  period             = 60
  statistic          = "Sum"
  threshold          = 5
  alarm_description  = "Lambda error rate too high"
}

What to monitor:

Error rate
Duration (optimize if >80% of timeout)
Throttles
DLQ messages
Cost

Optimization tips:

Test different memory sizes (more memory = faster CPU)
Keep binary size small
Reuse connections
Use goroutines for parallel work

Real-World Results

Before (EC2-based):

Servers running 24/7
Manual scaling
High operational overhead
~$800/month

After (Lambda + Go):

Pay only for execution
Auto-scales to any load
Zero maintenance
~$320/month (60% cost reduction)

Processing performance:

Handles millions of records daily
Sub-second latency per batch
99.9% success rate
Fast cold starts (~100ms with Go)

Why Go Works Great Here

Performance benefits:

Compiled binary = fast startup
Low memory footprint
Built-in concurrency with goroutines
Single binary deployment (no dependencies)

vs Python/Node.js:

10x faster cold starts
50% less memory usage
Type safety catches errors early
Better for CPU-intensive work

Common Issues and Fixes

Cold starts taking too long?

Minimize binary size
Initialize clients in init()
Use provisioned concurrency for critical paths

Memory errors?

Test with different memory sizes (128MB to 3GB)
Monitor CloudWatch metrics
More memory = faster CPU too

DynamoDB throttling?

Adjust batch size
Add exponential backoff
Check table capacity

Costs higher than expected?

Right-size memory allocation
Set appropriate timeouts
Use S3 lifecycle policies
Monitor with AWS Cost Explorer

Key Takeaways

What works:

Go for Lambda = excellent performance
Event-driven architecture scales effortlessly
Terraform makes infrastructure predictable
Start simple, add complexity when needed

What to remember:

Always add a dead letter queue
Monitor from day one
Test with production-like data volumes
Version your infrastructure

Cost optimization:

Pay only for what you use
Right-size everything
Use batch processing
Archive old data to Glacier

Quick Start Checklist

Start here, then iterate based on your needs.