Automating Snowflake Data Ingestion from AWS S3 Using Event-Driven Lambda Triggers

Dara Bindara


1. Executive Summary

Modern data platforms increasingly require near real-time ingestion pipelines that can automatically process incoming data without manual orchestration. Traditional batch pipelines that rely on scheduled jobs introduce latency, operational complexity, and unnecessary compute costs.

Recommended approach / pattern

Implement an event-driven ingestion architecture where new files arriving in AWS S3 automatically trigger an AWS Lambda function, which then loads the data into Snowflake using the COPY INTO command.

Where it fits (best use cases)

This architecture is particularly effective for:

Data platforms receiving frequent file uploads into S3
Systems requiring near real-time ingestion
Organizations wanting to eliminate manual ingestion orchestration
Workloads where ingestion must scale automatically with file arrival volume

Key outcomes

Automated Snowflake ingestion without polling or scheduled jobs
Near real-time data availability in Snowflake
Reduced operational overhead
Scalable ingestion architecture using serverless infrastructure

What the reader can implement

After reading this article, data engineers can implement:

An event-driven ingestion pipeline
Automated Snowflake loading from S3
Lambda-based ingestion orchestration
A scalable serverless ingestion architecture

2. Background

Enterprise data pipelines have evolved significantly with the adoption of cloud-native data platforms. Traditional ingestion pipelines often follow a scheduled ETL pattern, where jobs run every hour or every day to check for new files and process them. While this approach works, it introduces several inefficiencies:

Compute resources run even when no data arrives
Data freshness depends on schedule frequency
Operations teams must maintain orchestration workflows

With cloud infrastructure, event-driven architectures provide a more efficient model. Instead of continuously checking for new files, systems react only when events occur.

In AWS environments, S3 events can trigger Lambda functions automatically whenever a new file is uploaded. When integrated with Snowflake, this enables a powerful pattern:

A file arrives in S3
S3 generates an event notification
AWS Lambda is triggered
Lambda executes Snowflake loading commands
Data becomes available in Snowflake as soon as the load completes

This architecture eliminates the need for traditional schedulers and enables low-latency data ingestion.

3. Problem

3.1 Symptoms

Organizations that rely on scheduled ingestion pipelines typically experience several recurring challenges.

Symptom 1 — Delayed Data Availability

Batch pipelines introduce delays between when data arrives and when it becomes available in Snowflake.

Symptom 2 — Operational Complexity

Teams maintain complex orchestration frameworks such as Airflow DAGs, cron jobs, or custom ingestion scripts.

Symptom 3 — Inefficient Resource Usage

Scheduled jobs often run even when no new data is available, wasting compute resources.

3.2 Impact

These limitations create both technical and business challenges:

Slower analytics due to delayed data ingestion
Increased operational overhead for data teams
Higher infrastructure costs
Reduced agility for real-time analytics use cases

Event-driven ingestion addresses these issues by automating ingestion only when new data arrives.

4. Requirements & Assumptions

4.1 Data Characteristics & Operational Context

Typical ingestion environments using this architecture exhibit the following characteristics:

Data scale

Files ranging from MBs to several GBs
Thousands of files arriving daily

Refresh frequency

Data may arrive continuously or in bursts

Environment structure

Most organizations deploy separate environments:

Development
UAT
Production

Each environment may use separate S3 buckets and Snowflake databases.

4.2 Security & Access Control

Security considerations include:

AWS IAM roles controlling Lambda permissions
Snowflake roles managing data access
Secure credential storage using AWS Secrets Manager

Lambda functions should authenticate to Snowflake using secure credentials rather than hardcoded passwords.
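As a minimal sketch of that recommendation, the function below reads Snowflake connection settings from AWS Secrets Manager. It assumes the secret is stored as a JSON document; the key names (account, user, password, warehouse, database, schema) and the function name are illustrative assumptions, not a required layout. The client is passed in so the function can be exercised without AWS access.

```python
import json


def load_snowflake_credentials(secrets_client, secret_id):
    """Fetch Snowflake connection settings from AWS Secrets Manager.

    secrets_client is a boto3 Secrets Manager client, for example
    boto3.client("secretsmanager"). The JSON key names below are an
    assumed layout for the secret, not something Snowflake or AWS mandates.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    required = ("account", "user", "password", "warehouse", "database", "schema")
    missing = [key for key in required if key not in secret]
    if missing:
        raise KeyError(f"secret {secret_id!r} is missing keys: {missing}")
    # Return only the expected keys so stray fields never reach the connector.
    return {key: secret[key] for key in required}
```

Inside Lambda, the returned dictionary could be passed as keyword arguments to the Snowflake connector's connect function. Fetching the secret once per container, outside the handler, avoids a Secrets Manager call on every invocation.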

4.3 Tooling & Constraints

This architecture leverages several AWS and Snowflake services.

Key technologies include:

AWS S3 for file storage
AWS Lambda for serverless event processing
Snowflake Cloud Data Platform
Snowflake External Stage
Snowflake COPY INTO command

This combination enables a fully automated ingestion pipeline.

Several practical constraints must be considered.

Lambda execution limits

Maximum execution time: 15 minutes
Memory: configurable from 128 MB to 10,240 MB per function

File size considerations

Very large files may require batching or chunked ingestion.
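One way to act on the execution limit and the file size concern is a small guard that decides whether a file is safe to load inside the current invocation. This is only a sketch: the 1 GiB size ceiling and the 60 second time reserve are placeholder assumptions that should be tuned against real load times.

```python
def should_load_inline(object_size_bytes, remaining_millis,
                       max_inline_bytes=1024 ** 3,
                       min_remaining_millis=60_000):
    """Return True when the file looks safe to load in this invocation.

    Both thresholds are assumed defaults, not AWS or Snowflake limits.
    """
    return (object_size_bytes <= max_inline_bytes
            and remaining_millis >= min_remaining_millis)
```

In a handler, the second argument would come from context.get_remaining_time_in_millis(). Files that fail the check can be routed to a queue or a separate batch path instead of risking a timeout.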

Snowflake warehouse availability

A virtual warehouse must be available to process ingestion commands.

5. Recommended Architecture

5.1 High-Level Flow

The event-driven ingestion pipeline follows this workflow:

A file is uploaded to an AWS S3 bucket
S3 generates an event notification
The event triggers an AWS Lambda function
Lambda retrieves file metadata
Lambda connects to Snowflake
Lambda executes a COPY INTO command
Snowflake loads the data into the target table

This approach starts ingestion as soon as the S3 event is delivered, typically within seconds of file arrival.
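The "Lambda retrieves file metadata" step can be sketched as follows. The function reads the bucket, key, and size from the standard S3 event notification structure; the function name is an illustrative choice.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return a (bucket, key, size) tuple for each record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Keys arrive URL-encoded in the event (a space becomes '+'),
        # so decode them before using them as paths.
        key = unquote_plus(s3["object"]["key"])
        size = s3["object"].get("size", 0)
        objects.append((bucket, key, size))
    return objects
```

A single event can carry more than one record, so the handler should loop over the result rather than assume one file per invocation.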

5.2 Architecture Diagram

5.3 Options

Option A — Scheduled Ingestion

Many pipelines use schedulers such as Airflow to periodically ingest files.

Advantages

Easy to implement
Widely used

Disadvantages

Higher latency
Unnecessary compute usage
Operational overhead

Option B — Event-Driven Ingestion (Recommended)

S3 events trigger Lambda functions automatically.

Advantages

Near real-time ingestion
Reduced operational overhead
Scales automatically with data arrival

Selection Guide

Organizations requiring real-time or near real-time ingestion should strongly prefer event-driven pipelines.

6. Implementation

6.1 Setup

Core resources required:

AWS components

S3 bucket
Lambda function
IAM role
S3 event notifications

Snowflake components

Database and schema
Target table
External stage
Virtual warehouse

6.2 Core Build Steps

Step 1 — Create S3 Bucket
Create an S3 bucket to store incoming data files.

Step 2 — Create Snowflake External Stage
Define an external stage pointing to the S3 bucket.
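A sketch of the statements behind this step, generated from Python so they can be run through the same connection the Lambda uses. It assumes a storage integration already exists and that the incoming files are CSV with a header row; every object name passed in is hypothetical.

```python
import re

# Unquoted Snowflake identifier, optionally qualified as db.schema.name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")


def build_stage_statements(stage, bucket, prefix, integration, file_format):
    """Return DDL for a CSV file format and an external stage that uses it."""
    for name in (stage, integration, file_format):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if "'" in bucket or "'" in prefix:
        raise ValueError("bucket and prefix must not contain quotes")
    clean_prefix = prefix.strip("/")
    url = f"s3://{bucket}/{clean_prefix}/" if clean_prefix else f"s3://{bucket}/"
    return [
        # Assumed format: CSV, one header row, optional double-quoted fields.
        f"CREATE FILE FORMAT IF NOT EXISTS {file_format} "
        "TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '\"'",
        f"CREATE STAGE IF NOT EXISTS {stage} "
        f"URL = '{url}' STORAGE_INTEGRATION = {integration} "
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}')",
    ]
```

Using a storage integration keeps AWS keys out of the stage definition. The file format is created first because the stage refers to it by name.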

Step 3 — Configure Target Table
Create a Snowflake table to store ingested data.

Step 4 — Create AWS Lambda Function
The Lambda function will:

Receive S3 event notifications
Extract file path
Connect to Snowflake
Execute COPY command
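The last two actions in the list above might look like the sketch below. It assumes a DB-API style connection such as the one returned by the Snowflake Python connector, and it assumes relative_path is the object key with the stage's own prefix removed. Table and stage names are hypothetical.

```python
import re

# Unquoted Snowflake identifier, optionally qualified as db.schema.name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")
_ON_ERROR = {"CONTINUE", "SKIP_FILE", "ABORT_STATEMENT"}


def build_copy_statement(table, stage, relative_path, on_error="CONTINUE"):
    """Build a COPY INTO statement that loads exactly one staged file."""
    for name in (table, stage):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if on_error not in _ON_ERROR:
        raise ValueError(f"unsupported ON_ERROR value: {on_error!r}")
    # Reject rather than escape characters that could break the string literal.
    if "'" in relative_path or "\\" in relative_path:
        raise ValueError(f"unsafe path: {relative_path!r}")
    return (
        f"COPY INTO {table} FROM @{stage} "
        f"FILES = ('{relative_path}') ON_ERROR = {on_error}"
    )


def load_file(connection, table, stage, relative_path):
    """Run the COPY statement and return Snowflake's per-file result rows."""
    statement = build_copy_statement(table, stage, relative_path)
    cursor = connection.cursor()
    try:
        cursor.execute(statement)
        return cursor.fetchall()
    finally:
        cursor.close()
```

Naming the file in FILES keeps each invocation scoped to the object that triggered it, instead of scanning the whole stage on every event.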

Step 5 — Configure S3 Event Notification
Configure the S3 bucket to trigger Lambda when a new file is uploaded.
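This step can also be scripted. The sketch below builds the notification configuration that boto3's put_bucket_notification_configuration accepts; the prefix and suffix filters are optional assumptions that limit which uploads invoke the function.

```python
def build_notification_config(lambda_arn, prefix="", suffix=""):
    """Build an S3 notification configuration that invokes a Lambda function."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})
    config = {
        "LambdaFunctionArn": lambda_arn,
        # Fire on every kind of object creation (PUT, POST, copy, multipart).
        "Events": ["s3:ObjectCreated:*"],
    }
    if rules:
        config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [config]}
```

The result would be passed as NotificationConfiguration together with the Bucket name. Two details are easy to miss: the call replaces the bucket's entire existing notification configuration, and the Lambda function needs a resource-based permission that allows s3.amazonaws.com to invoke it.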

6.3 Configuration Defaults

Recommended defaults include:

File format definition

Define file formats explicitly in Snowflake.

Error handling

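Use COPY options such as ON_ERROR = 'CONTINUE', which skips rows that fail to parse and loads the rest. Review the load results, because skipped rows do not fail the statement.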
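Because a COPY with ON_ERROR = 'CONTINUE' can succeed while dropping rows, it helps to summarize the result set after each load. The sketch below assumes each result row has been converted to a dictionary with the keys file, status, rows_parsed, and rows_loaded; that shape is an assumption about how the caller reads the cursor, and any status other than LOADED is treated as a problem.

```python
def summarize_copy_results(rows):
    """Aggregate per-file COPY results and flag files that need attention."""
    summary = {"files": 0, "rows_parsed": 0, "rows_loaded": 0, "problem_files": []}
    for row in rows:
        summary["files"] += 1
        summary["rows_parsed"] += row["rows_parsed"]
        summary["rows_loaded"] += row["rows_loaded"]
        if row["status"] != "LOADED":
            summary["problem_files"].append(row["file"])
    # Rows that were parsed but not loaded were skipped by ON_ERROR handling.
    summary["rows_rejected"] = summary["rows_parsed"] - summary["rows_loaded"]
    return summary
```

A non-zero rows_rejected value or a non-empty problem_files list is a reasonable trigger for an alert or for moving the file to a quarantine prefix.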

Logging

Lambda should log ingestion events for monitoring.
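A minimal sketch of structured logging for this purpose follows. Emitting one JSON object per ingestion event makes the logs easy to filter in CloudWatch Logs; the field names are illustrative assumptions.

```python
import json
import logging

logger = logging.getLogger("ingestion")
logger.setLevel(logging.INFO)


def format_ingestion_event(bucket, key, status, rows_loaded=None, error=None):
    """Render one ingestion event as a single-line JSON string."""
    record = {"bucket": bucket, "key": key, "status": status}
    if rows_loaded is not None:
        record["rows_loaded"] = rows_loaded
    if error is not None:
        record["error"] = str(error)
    return json.dumps(record, sort_keys=True)


def log_ingestion_event(bucket, key, status, rows_loaded=None, error=None):
    """Write the event to the logger; Lambda forwards it to CloudWatch Logs."""
    logger.info(format_ingestion_event(bucket, key, status, rows_loaded, error))
```

Logging the bucket and key on every event, including failures, makes it possible to trace a missing file from the S3 upload through to the Snowflake load.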

7. Validation & Testing

Testing ensures ingestion works reliably and safely.

Validation focuses on:

Data ingestion correctness
File detection reliability
Snowflake load success

7.1 Ingestion Validation

Test cases include:

Upload a file to S3
Verify Lambda execution
Verify Snowflake table ingestion

7.2 Data Validation

Validate:

Row counts
Column mappings
Data format consistency
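For the row count check, one simple approach is to count the data rows in the source file and compare that number with what Snowflake reports as loaded. The sketch below assumes CSV input with an optional header row and a file small enough to read in a test.

```python
import csv
import io


def count_csv_data_rows(text, has_header=True):
    """Count non-empty CSV rows, excluding the header when present."""
    reader = csv.reader(io.StringIO(text))
    rows = sum(1 for row in reader if row)
    return max(rows - 1, 0) if has_header else rows


def row_counts_match(source_text, loaded_rows, has_header=True):
    """Return True when the source row count equals the loaded row count."""
    return count_csv_data_rows(source_text, has_header) == loaded_rows
```

Using the csv module rather than counting newlines keeps the count correct when quoted fields contain line breaks.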

7.3 Failure Testing

Test failure scenarios such as:

Invalid file formats
Missing columns
Snowflake connection failures

8. Security & Access

Required permissions include:

AWS permissions

Lambda execution role
S3 read permissions

Snowflake permissions

USAGE on database and schema
USAGE on stage
USAGE on the virtual warehouse
INSERT privileges on target table
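The grants for a dedicated loader role can be generated as shown in this sketch. It covers the database, stage, and table privileges, plus USAGE on the schema and the warehouse, which the role also needs in order to run COPY INTO. All object and role names are hypothetical.

```python
import re

# Unquoted Snowflake identifier, optionally qualified as db.schema.name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")


def build_loader_grants(role, database, schema, stage, table, warehouse):
    """Return GRANT statements for a role that only loads data.

    schema, stage, and table are expected as fully qualified names.
    """
    for name in (role, database, schema, stage, table, warehouse):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role}",
        f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role}",
        f"GRANT USAGE ON STAGE {stage} TO ROLE {role}",
        f"GRANT INSERT ON TABLE {table} TO ROLE {role}",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role}",
    ]
```

Keeping the role limited to these privileges means a leaked Lambda credential can append rows to one table but cannot read or alter other data.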

9. Performance & Cost

9.1 Performance Considerations

Performance depends on:

Snowflake warehouse size
File size
Number of concurrent files

Best practices include:

Use compressed files
Batch small files
Use multi-cluster (auto-scaling) warehouses when many files load concurrently
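Batching small files can be sketched as grouping object keys so that one COPY statement loads many files at once. The FILES option of COPY INTO accepts at most 1,000 file names, so the helper below splits a key list on that boundary; how the keys are collected (for example from a queue) is left as an assumption.

```python
def chunk_keys(keys, max_files=1000):
    """Split keys into de-duplicated groups of at most max_files each."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    # dict.fromkeys removes duplicates while preserving arrival order.
    unique = list(dict.fromkeys(keys))
    return [unique[i:i + max_files] for i in range(0, len(unique), max_files)]
```

Each group then becomes one COPY statement, which reduces per-statement overhead and warehouse wake-ups compared with one statement per tiny file.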

9.2 Cost Drivers

Primary cost components include:

Compute
Snowflake virtual warehouse usage

Serverless compute
AWS Lambda execution time

Storage
S3 file storage

9.3 Cost Controls

Recommended controls include:

Warehouse auto-suspend
Lambda memory optimization
File batching strategies

10. Operations & Monitoring

10.1 What to Monitor

Key operational metrics include:

Lambda execution failures
Snowflake load errors
Data ingestion latency

10.2 Alerting

Recommended alerts include:

Lambda failure notifications
Snowflake COPY errors
S3 event delivery failures
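As one possible starting point for the Lambda failure alert, the sketch below builds the arguments for boto3's CloudWatch put_metric_alarm call. It alarms when the function reports at least one error in a five minute window; the alarm name, period, and SNS topic are assumptions to adapt.

```python
def build_lambda_error_alarm(function_name, sns_topic_arn, period_seconds=300):
    """Build keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No invocations means no data points; do not alarm on silence.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

The dictionary would be unpacked into the call, as in cloudwatch.put_metric_alarm(**kwargs). This alarm only catches invocations that raise, so COPY errors that the handler swallows still need their own check against the load results.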

10.3 Runbook (Top Issues)

Issue: Lambda fails to connect to Snowflake
Fix: Verify credentials and network configuration

Issue: Data not loading
Fix: Check stage configuration and file format

Issue: Duplicate ingestion
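Fix: Implement idempotent load logic. S3 can deliver the same event more than once. By default COPY INTO skips files already recorded in its load metadata (retained for 64 days), so avoid FORCE = TRUE and track processed files yourself for anything older.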
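A sketch of the duplicate check follows. It derives a key from the bucket, object key, and eTag carried in the S3 event record, then claims that key before loading. The in-memory set is only for illustration: it does not survive across Lambda containers, so a real deployment would need a durable store, which is an assumption left to the reader.

```python
def idempotency_key(record):
    """Build a stable identifier for one version of one S3 object."""
    s3 = record["s3"]
    obj = s3["object"]
    # Including the eTag lets a re-upload with new content be loaded again.
    return f'{s3["bucket"]["name"]}/{obj["key"]}#{obj.get("eTag", "")}'


class SeenFiles:
    """Illustrative in-memory tracker of files that were already claimed."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        """Return True the first time a key is seen, False on repeats."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The handler would call claim before running COPY and skip the record when it returns False.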

11. Common Pitfalls

Pitfall 1
Triggering Lambda for every tiny file.

Pitfall 2
Not handling duplicate file ingestion.

Pitfall 3
Using oversized Lambda functions.

Pitfall 4
Ignoring Snowflake warehouse scaling.

Pitfall 5
Not validating file formats before ingestion.

12. Variations / Use Cases

Variation 1 — Snowpipe Integration

Use Snowpipe with S3 notifications for fully managed ingestion.
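For comparison with the Lambda approach, a Snowpipe definition with auto-ingest might be generated as in the sketch below. The pipe, table, and stage names are hypothetical, and the stage is assumed to already carry its file format.

```python
import re

# Unquoted Snowflake identifier, optionally qualified as db.schema.name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")


def build_pipe_statement(pipe, table, stage):
    """Return DDL for a pipe that loads new staged files automatically."""
    for name in (pipe, table, stage):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE PIPE IF NOT EXISTS {pipe} AUTO_INGEST = TRUE "
        f"AS COPY INTO {table} FROM @{stage}"
    )
```

With this variation the S3 bucket notifies the queue that Snowflake exposes for the pipe (shown as notification_channel in SHOW PIPES) instead of invoking Lambda, and loads run on Snowflake-managed compute rather than a user warehouse.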

Variation 2 — Streaming Pipelines

Combine with Kafka or Kinesis for real-time event streaming.

Variation 3 — Metadata Tracking

Maintain ingestion logs in Snowflake for auditability.
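A minimal sketch of such an audit write follows. It assumes a log table named ingestion_log with the columns shown, both of which are hypothetical, and a DB-API connection whose driver accepts %s placeholders, which is the Snowflake Python connector's default parameter style.

```python
from datetime import datetime, timezone

# Hypothetical audit table; create it with matching columns beforehand.
INSERT_LOG_SQL = (
    "INSERT INTO ingestion_log (bucket, object_key, status, rows_loaded, logged_at) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def record_ingestion(connection, bucket, key, status, rows_loaded, now=None):
    """Insert one audit row and return the parameters that were bound."""
    logged_at = (now or datetime.now(timezone.utc)).isoformat()
    params = (bucket, key, status, rows_loaded, logged_at)
    cursor = connection.cursor()
    try:
        # Values are bound as parameters, never formatted into the SQL text.
        cursor.execute(INSERT_LOG_SQL, params)
    finally:
        cursor.close()
    return params
```

Writing a row for failures as well as successes lets the same table answer both "what loaded" and "what never arrived".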

Variation 4 — Data Quality Integration

Add validation frameworks like Great Expectations before ingestion.

Dara Bindara

Associate Data Engineer

Boolean Data Systems

Dara Bindara is an Associate Data Engineer specializing in building and optimizing cloud-based data pipelines. Experienced in Python, SQL, PySpark, Snowflake Cortex, and AI/ML workflows, with a focus on ETL automation, large-scale data transformation, and scalable data warehousing.
