Automating Snowflake Data Ingestion from AWS S3 Using Event-Driven Lambda Triggers

Dara Bindara


1. Executive Summary

Modern data platforms increasingly require near real-time ingestion pipelines that can automatically process incoming data without manual orchestration. Traditional batch pipelines that rely on scheduled jobs introduce latency, operational complexity, and unnecessary compute costs.

Recommended approach / pattern

Implement an event-driven ingestion architecture where new files arriving in AWS S3 automatically trigger an AWS Lambda function, which then loads the data into Snowflake using the COPY INTO command.

Where it fits (best use cases)

This architecture is particularly effective for:

  • Data platforms receiving frequent file uploads into S3
  • Systems requiring near real-time ingestion
  • Organizations wanting to eliminate manual ingestion orchestration
  • Workloads where ingestion must scale automatically with file arrival volume

Key outcomes

  • Automated Snowflake ingestion without polling or scheduled jobs
  • Near real-time data availability in Snowflake
  • Reduced operational overhead
  • Scalable ingestion architecture using serverless infrastructure

What the reader can implement

After reading this article, data engineers can implement:

  • An event-driven ingestion pipeline
  • Automated Snowflake loading from S3
  • Lambda-based ingestion orchestration
  • A scalable serverless ingestion architecture

2. Background

Enterprise data pipelines have evolved significantly with the adoption of cloud-native data platforms. Traditional ingestion pipelines often follow a scheduled ETL pattern, where jobs run every hour or every day to check for new files and process them. While this approach works, it introduces several inefficiencies:

  • Compute resources run even when no data arrives
  • Data freshness depends on schedule frequency
  • Operations teams must maintain orchestration workflows

With cloud infrastructure, event-driven architectures provide a more efficient model. Instead of continuously checking for new files, systems react only when events occur.

In AWS environments, S3 events can trigger Lambda functions automatically whenever a new file is uploaded. When integrated with Snowflake, this enables a powerful pattern:

  • A file arrives in S3
  • S3 generates an event notification
  • AWS Lambda is triggered
  • Lambda executes Snowflake loading commands
  • Data becomes available in Snowflake as soon as the load completes

This architecture eliminates the need for traditional schedulers and enables low-latency data ingestion.
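As a minimal sketch of the first half of that flow, the function below pulls the bucket and object key out of an S3 event payload. The function name, bucket name, and sample key are hypothetical; the record layout follows the standard S3 notification structure, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for each object-created record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-created notification.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-landing-bucket"},
                "object": {"key": "orders/2024/daily+orders.csv", "size": 1024},
            },
        }
    ]
}

print(extract_s3_objects(sample_event))
# [('example-landing-bucket', 'orders/2024/daily orders.csv')]
```

Decoding the key matters in practice: a file named with spaces or special characters would otherwise not match the path Snowflake sees in the stage.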

3. Problem

3.1 Symptoms

Organizations that rely on scheduled ingestion pipelines typically experience several recurring challenges.

Symptom 1 — Delayed Data Availability

Batch pipelines introduce delays between when data arrives and when it becomes available in Snowflake.

Symptom 2 — Operational Complexity

Teams maintain complex orchestration frameworks such as Airflow DAGs, cron jobs, or custom ingestion scripts.

Symptom 3 — Inefficient Resource Usage

Scheduled jobs often run even when no new data is available, wasting compute resources.

3.2 Impact

These limitations create both technical and business challenges:

  • Slower analytics due to delayed data ingestion
  • Increased operational overhead for data teams
  • Higher infrastructure costs
  • Reduced agility for real-time analytics use cases

Event-driven ingestion addresses these issues by triggering a load only when new data arrives.

4. Requirements & Assumptions

4.1 Data Characteristics & Operational Context

Typical ingestion environments using this architecture exhibit the following characteristics:

Data scale

  • Files ranging from MBs to several GBs
  • Thousands of files arriving daily

Refresh frequency

  • Data may arrive continuously or in bursts

Environment structure

Most organizations deploy separate environments:

  • Development
  • UAT
  • Production

Each environment may use separate S3 buckets and Snowflake databases.

4.2 Security & Access Control

Security considerations include:

  • AWS IAM roles controlling Lambda permissions
  • Snowflake roles managing data access
  • Secure credential storage using AWS Secrets Manager

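Lambda functions should authenticate to Snowflake with credentials retrieved at runtime from a secrets store such as AWS Secrets Manager, never with passwords hardcoded in the function code or configuration.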
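One way this might look is sketched below. It assumes the secret is stored as a JSON string whose keys match the connection settings the function needs; the secret name `snowflake/ingest`, the function name, and the stub client are hypothetical. `get_secret_value` is the boto3 Secrets Manager call; the client is injectable so the logic can be exercised without AWS access.

```python
import json


def get_snowflake_credentials(secret_id, client=None):
    """Fetch Snowflake connection settings stored as a JSON secret."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3
        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret was stored as a JSON string, not as binary.
    return json.loads(response["SecretString"])


class StubSecretsClient:
    """Stand-in for the boto3 client, for local testing only."""

    def get_secret_value(self, SecretId):
        payload = {"account": "example_account", "user": "LOADER_SVC"}
        return {"SecretString": json.dumps(payload)}


creds = get_snowflake_credentials("snowflake/ingest", client=StubSecretsClient())
print(sorted(creds))  # ['account', 'user']
```

The Lambda execution role would also need permission to read that specific secret.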

4.3 Tooling & Constraints

This architecture leverages several AWS and Snowflake services.

Key technologies include:

  • AWS S3 for file storage
  • AWS Lambda for serverless event processing
  • Snowflake Cloud Data Platform
  • Snowflake External Stage
  • Snowflake COPY INTO command

This combination enables a fully automated ingestion pipeline.

Several practical constraints must be considered.

Lambda execution limits

  • Maximum execution time: 15 minutes
  • Memory: configurable per function, from 128 MB to 10,240 MB

File size considerations

Very large files may require batching or chunked ingestion.
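A simple guard, sketched here under assumptions, is to separate files that can be loaded inline from those that should be handed to a different path. The 1 GiB threshold and the function name are illustrative only; the `size` field is part of the standard S3 event record.

```python
def split_by_size(records, max_inline_bytes=1024 ** 3):
    """Separate S3 event records into those loaded inline and those deferred."""
    inline, deferred = [], []
    for record in records:
        size = record["s3"]["object"].get("size", 0)
        if size <= max_inline_bytes:
            inline.append(record)
        else:
            # Too large for this invocation; hand off to another path
            # (for example a queue or a scheduled bulk load).
            deferred.append(record)
    return inline, deferred


# Hypothetical records, trimmed to the fields used above.
sample_records = [
    {"s3": {"object": {"key": "small.csv", "size": 10 * 1024 ** 2}}},
    {"s3": {"object": {"key": "huge.csv", "size": 5 * 1024 ** 3}}},
]
inline, deferred = split_by_size(sample_records)
print(len(inline), len(deferred))  # 1 1
```

The right threshold depends on warehouse size and file format, so it is best derived from measured load times rather than fixed in advance.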

Snowflake warehouse availability

A virtual warehouse must be available to process ingestion commands.

5. Recommended Architecture

5.1 High-Level Flow

The event-driven ingestion pipeline follows this workflow:

  1. A file is uploaded to an AWS S3 bucket
  2. S3 generates an event notification
  3. The event triggers an AWS Lambda function
  4. Lambda retrieves file metadata
  5. Lambda connects to Snowflake
  6. Lambda executes a COPY INTO command
  7. Snowflake loads the data into the target table

This approach ensures ingestion occurs immediately after file arrival.

5.2 Architecture Diagram

[Figure: architecture diagram]

5.3 Options

Option A — Scheduled Ingestion

Many pipelines use schedulers such as Airflow to periodically ingest files.

Advantages

  • Easy to implement
  • Widely used

Disadvantages

  • Higher latency
  • Unnecessary compute usage
  • Operational overhead

Option B — Event-Driven Ingestion (Recommended)

S3 events trigger Lambda functions automatically.

Advantages

  • Near real-time ingestion
  • Reduced operational overhead
  • Scales automatically with data arrival

Selection Guide

Organizations requiring real-time or near real-time ingestion should strongly prefer event-driven pipelines.

6. Implementation

6.1 Setup

Core resources required:

AWS components

  • S3 bucket
  • Lambda function
  • IAM role
  • S3 event notifications

Snowflake components

  • Database and schema
  • Target table
  • External stage
  • Virtual warehouse

6.2 Core Build Steps

Step 1 — Create S3 Bucket
Create an S3 bucket to store incoming data files.

Step 2 — Create Snowflake External Stage
Define an external stage pointing to the S3 bucket.

Step 3 — Configure Target Table
Create a Snowflake table to store ingested data.
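Steps 2 and 3 could be scripted along the following lines. This is a sketch, not a prescribed schema: the object names, the CSV settings, the table columns, and the use of a storage integration for S3 access are all assumptions chosen for illustration. The statements are built as strings so they can be reviewed before being run through any Snowflake client.

```python
def build_setup_statements(bucket, prefix, integration,
                           stage="RAW_STAGE",
                           file_format="CSV_FORMAT",
                           table="RAW_ORDERS"):
    """Return the DDL for a file format, an external stage, and a target table."""
    return [
        # Explicit file format, so parsing rules are not left to defaults.
        f"CREATE FILE FORMAT IF NOT EXISTS {file_format} "
        "TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1",
        # External stage pointing at the S3 location.
        f"CREATE STAGE IF NOT EXISTS {stage} "
        f"URL = 's3://{bucket}/{prefix}/' "
        f"STORAGE_INTEGRATION = {integration} "
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}')",
        # Hypothetical target table; replace the columns with the real schema.
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "order_id NUMBER, customer_id NUMBER, "
        "amount NUMBER(12,2), order_ts TIMESTAMP_NTZ)",
    ]


statements = build_setup_statements(
    "example-landing-bucket", "orders", "S3_INGEST_INT"
)
for statement in statements:
    print(statement + ";")
```

Attaching the file format to the stage means later COPY statements do not have to repeat it.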

Step 4 — Create AWS Lambda Function
The Lambda function will:

  • Receive S3 event notifications
  • Extract file path
  • Connect to Snowflake
  • Execute COPY command
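The four responsibilities above might fit together as in the sketch below. It is written under several assumptions: the environment variable names and defaults are hypothetical, the stage is assumed to carry its own file format, the stage URL is assumed to end at `STAGE_PREFIX`, and the secret is assumed to hold keyword arguments accepted by `snowflake.connector.connect` (account, user, password, warehouse, database, schema). The Snowflake connector and boto3 are imported inside the handler, so the helper functions can be tested without them.

```python
import json
import os
import re
from urllib.parse import unquote_plus

# Hypothetical configuration; names and defaults are assumptions.
STAGE = os.environ.get("SNOWFLAKE_STAGE", "RAW_STAGE")
TABLE = os.environ.get("SNOWFLAKE_TABLE", "RAW_ORDERS")
STAGE_PREFIX = os.environ.get("STAGE_PREFIX", "orders/")

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$.]*$")


def build_copy_statement(key, table=TABLE, stage=STAGE, stage_prefix=STAGE_PREFIX):
    """Build a COPY INTO statement that loads exactly one staged file."""
    for name in (table, stage):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # FILES takes paths relative to the stage location.
    relative = key[len(stage_prefix):] if key.startswith(stage_prefix) else key
    if "'" in relative or "\\" in relative:
        raise ValueError(f"unsupported character in key: {key!r}")
    return (
        f"COPY INTO {table} FROM @{stage} "
        f"FILES = ('{relative}') ON_ERROR = 'CONTINUE'"
    )


def load_files(event, conn):
    """Run one COPY INTO per S3 record and return the statements executed."""
    executed = []
    cursor = conn.cursor()
    try:
        for record in event.get("Records", []):
            key = unquote_plus(record["s3"]["object"]["key"])
            sql = build_copy_statement(key)
            cursor.execute(sql)
            executed.append(sql)
    finally:
        cursor.close()
    return executed


def lambda_handler(event, context):
    """Lambda entry point: fetch credentials, connect, and load the new files."""
    import boto3
    import snowflake.connector  # from the snowflake-connector-python package

    secret = boto3.client("secretsmanager").get_secret_value(
        SecretId=os.environ["SNOWFLAKE_SECRET_ID"]
    )
    conn = snowflake.connector.connect(**json.loads(secret["SecretString"]))
    try:
        return {"statements": load_files(event, conn)}
    finally:
        conn.close()
```

Restricting each COPY to the file named in the event keeps one invocation from scanning the whole stage, and validating identifiers and keys avoids building SQL from untrusted input.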

Step 5 — Configure S3 Event Notification
Configure the S3 bucket to trigger Lambda when a new file is uploaded.
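The notification can be set in the console or through the API. As an illustration, the sketch below builds the configuration and applies it with boto3's `put_bucket_notification_configuration`. The function names, ARN, prefix, and suffix are hypothetical, and the S3 client is injectable so the configuration can be inspected without calling AWS.

```python
def build_notification_config(lambda_arn, prefix="", suffix=""):
    """Build an S3 notification configuration that invokes a Lambda function."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})
    config = {
        "LambdaFunctionArn": lambda_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if rules:
        # Only notify for keys matching the prefix and/or suffix.
        config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [config]}


def apply_notification(bucket, lambda_arn, prefix="", suffix="", client=None):
    """Apply the configuration to a bucket (replaces the existing one)."""
    if client is None:
        import boto3
        client = boto3.client("s3")
    client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(
            lambda_arn, prefix, suffix
        ),
    )


example = build_notification_config(
    "arn:aws:lambda:us-east-1:123456789012:function:example-ingest",
    prefix="orders/",
    suffix=".csv",
)
print(example["LambdaFunctionConfigurations"][0]["Events"])
# ['s3:ObjectCreated:*']
```

Two details are easy to miss: this call replaces the bucket's entire notification configuration, and the Lambda function needs a resource-based permission that allows S3 to invoke it.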

6.3 Configuration Defaults

Recommended defaults include:

File format definition

Define file formats explicitly in Snowflake.

Error handling

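Set the COPY option ON_ERROR = 'CONTINUE' so that rows with errors are skipped and the rest of the file still loads. Review the rejected rows afterward, because this setting does not fail the load.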

Logging

Lambda should log ingestion events for monitoring.
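One possible shape for those log entries is a single JSON object per file, which makes them easy to filter in CloudWatch Logs. The field names and the logger name below are assumptions, not a required schema.

```python
import json
import logging

logger = logging.getLogger("ingestion")
logger.setLevel(logging.INFO)


def format_ingestion_event(bucket, key, status, rows_loaded=None, error=None):
    """Serialize one ingestion outcome as a JSON string."""
    return json.dumps(
        {
            "bucket": bucket,
            "key": key,
            "status": status,
            "rows_loaded": rows_loaded,
            "error": error,
        },
        sort_keys=True,
    )


def log_ingestion_event(bucket, key, status, rows_loaded=None, error=None):
    """Write the outcome at INFO, or at ERROR when an error is present."""
    message = format_ingestion_event(bucket, key, status, rows_loaded, error)
    if error:
        logger.error(message)
    else:
        logger.info(message)
    return message


print(log_ingestion_event("example-landing-bucket", "orders/a.csv", "LOADED", 120))
```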

7. Validation & Testing

Testing ensures ingestion works reliably and safely.

Validation focuses on:

  • Data ingestion correctness
  • File detection reliability
  • Snowflake load success

7.1 Ingestion Validation

Test cases include:

  • Upload a file to S3
  • Verify Lambda execution
  • Verify Snowflake table ingestion

7.2 Data Validation

Validate:

  • Row counts
  • Column mappings
  • Data format consistency
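Row counts can be checked from the result set that COPY INTO returns. The sketch below assumes each result row starts with file, status, rows_parsed, and rows_loaded, in that order; confirm the column layout against the Snowflake documentation for the client in use. The function name and sample rows are hypothetical.

```python
# Assumed positions of the leading columns in a COPY INTO result row.
FILE, STATUS, ROWS_PARSED, ROWS_LOADED = 0, 1, 2, 3


def summarize_copy_result(rows, expected_rows=None):
    """Aggregate COPY INTO result rows into counts that are easy to assert on."""
    parsed = sum(row[ROWS_PARSED] for row in rows)
    loaded = sum(row[ROWS_LOADED] for row in rows)
    return {
        "rows_parsed": parsed,
        "rows_loaded": loaded,
        "rows_rejected": parsed - loaded,
        # Files whose status is anything other than a clean load.
        "files_not_fully_loaded": [
            row[FILE] for row in rows if row[STATUS] != "LOADED"
        ],
        "matches_expected": expected_rows is None or loaded == expected_rows,
    }


# Hypothetical result rows, trimmed to the first four columns.
sample_rows = [
    ("orders/a.csv", "LOADED", 100, 100),
    ("orders/b.csv", "PARTIALLY_LOADED", 50, 48),
]
print(summarize_copy_result(sample_rows, expected_rows=150))
```

Here the summary would report 148 rows loaded against 150 expected, which is the kind of mismatch that ON_ERROR = 'CONTINUE' can otherwise hide.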

7.3 Failure Testing

Test failure scenarios such as:

  • Invalid file formats
  • Missing columns
  • Snowflake connection failures

8. Security & Access

Required permissions include:

AWS permissions

  • Lambda execution role
  • S3 read permissions

Snowflake permissions

  • USAGE on the database and schema
  • USAGE on the warehouse that runs the load
  • USAGE on the external stage
  • INSERT on the target table
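The Snowflake permissions above can be expressed as GRANT statements for a dedicated loading role. The sketch also grants USAGE on the schema and on the warehouse, which a loading role generally needs in order to resolve objects and run COPY. All object and role names are hypothetical.

```python
def build_grants(role, database, schema, warehouse, stage, table):
    """Return the GRANT statements for a least-privilege loading role."""
    fq_schema = f"{database}.{schema}"
    return [
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role}",
        f"GRANT USAGE ON SCHEMA {fq_schema} TO ROLE {role}",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role}",
        f"GRANT USAGE ON STAGE {fq_schema}.{stage} TO ROLE {role}",
        f"GRANT INSERT ON TABLE {fq_schema}.{table} TO ROLE {role}",
    ]


grants = build_grants(
    role="INGEST_ROLE",
    database="RAW_DB",
    schema="LANDING",
    warehouse="INGEST_WH",
    stage="RAW_STAGE",
    table="RAW_ORDERS",
)
for grant in grants:
    print(grant + ";")
```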

9. Performance & Cost

9.1 Performance Considerations

Performance depends on:

  • Snowflake warehouse size
  • File size
  • Number of concurrent files

Best practices include:

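  • Use compressed files
  • Batch small files into larger ones (Snowflake's guidance is roughly 100 to 250 MB compressed per file)
  • Use multi-cluster (auto-scaling) warehouses when many files load concurrently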

9.2 Cost Drivers

Primary cost components include:

Compute
Snowflake virtual warehouse usage

Serverless compute
AWS Lambda execution time

Storage
S3 file storage

9.3 Cost Controls

Recommended controls include:

  • Warehouse auto-suspend
  • Lambda memory optimization
  • File batching strategies
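Two of these controls can be illustrated briefly. The sketch below builds an ALTER WAREHOUSE statement that enables auto-suspend, and computes Lambda usage in GB-seconds (the unit Lambda compute is billed in) so that memory settings can be compared. The warehouse name, the 60-second idle period, and the sample figures are assumptions.

```python
def build_auto_suspend_statement(warehouse, idle_seconds=60):
    """Return SQL that suspends an idle warehouse and resumes it on demand."""
    if idle_seconds < 0:
        raise ValueError("idle_seconds must be non-negative")
    return (
        f"ALTER WAREHOUSE {warehouse} "
        f"SET AUTO_SUSPEND = {int(idle_seconds)} AUTO_RESUME = TRUE"
    )


def lambda_gb_seconds(memory_mb, avg_duration_seconds, invocations):
    """Compute Lambda usage in GB-seconds for a given memory setting."""
    return (memory_mb / 1024) * avg_duration_seconds * invocations


print(build_auto_suspend_statement("INGEST_WH"))
# ALTER WAREHOUSE INGEST_WH SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE

# Hypothetical workload: 1,000 invocations averaging 2 seconds at 512 MB.
print(lambda_gb_seconds(512, 2.0, 1000))  # 1000.0
```

Because the function mostly waits on Snowflake while COPY runs, a lower memory setting often costs less without slowing the load, but that should be confirmed by measurement.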

10. Operations & Monitoring

10.1 What to Monitor

Key operational metrics include:

  • Lambda execution failures
  • Snowflake load errors
  • Data ingestion latency

10.2 Alerting

Recommended alerts include:

  • Lambda failure notifications
  • Snowflake COPY errors
  • S3 event delivery failures
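For the first of these alerts, a CloudWatch alarm on the Lambda `Errors` metric is one common option. The sketch below builds the keyword arguments for boto3's `put_metric_alarm`; the alarm name pattern, the five-minute period, the threshold, and the ARNs are assumptions.

```python
def build_lambda_error_alarm(function_name, sns_topic_arn):
    """Build put_metric_alarm arguments that fire on any Lambda error."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No invocations means no data points; do not alarm on that.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


alarm = build_lambda_error_alarm(
    "example-ingest",
    "arn:aws:sns:us-east-1:123456789012:example-ingest-alerts",
)
print(alarm["AlarmName"])  # example-ingest-errors

# To create the alarm (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```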

10.3 Runbook (Top Issues)

Issue: Lambda fails to connect to Snowflake
Fix: Verify credentials and network configuration

Issue: Data not loading
Fix: Check stage configuration and file format

Issue: Duplicate ingestion
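Fix: Implement idempotent load logic. COPY INTO already skips files recorded in its load metadata (retained for 64 days) unless FORCE = TRUE is set, so avoid FORCE and track processed files explicitly if older files may be redelivered.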
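Duplicates are worth planning for because S3 event notifications are delivered at least once. A minimal sketch of idempotent handling follows: it derives a key from the bucket, object key, and ETag, and skips records already seen. The in-memory set is a stand-in for a durable store such as an ingestion log table, and the function names are hypothetical.

```python
import hashlib


def idempotency_key(record):
    """Derive a stable key from the bucket, object key, and object ETag."""
    s3 = record["s3"]
    raw = "|".join(
        [s3["bucket"]["name"], s3["object"]["key"], s3["object"].get("eTag", "")]
    )
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def filter_new_records(records, seen_keys):
    """Return only records not processed before, recording them as seen."""
    fresh = []
    for record in records:
        key = idempotency_key(record)
        if key in seen_keys:
            continue  # duplicate delivery of the same object version
        seen_keys.add(key)
        fresh.append(record)
    return fresh


# Hypothetical record delivered twice.
record = {
    "s3": {
        "bucket": {"name": "example-landing-bucket"},
        "object": {"key": "orders/a.csv", "eTag": "abc123"},
    }
}
seen = set()
print(len(filter_new_records([record, record], seen)))  # 1
print(len(filter_new_records([record], seen)))          # 0
```

Including the ETag means a file that is overwritten with new content is treated as new, while a repeated notification for the same content is ignored.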

11. Common Pitfalls

Pitfall 1
Triggering Lambda for every tiny file.

Pitfall 2
Not handling duplicate file ingestion.

Pitfall 3
Over-provisioning Lambda memory and timeout settings.

Pitfall 4
Ignoring Snowflake warehouse scaling.

Pitfall 5
Not validating file formats before ingestion.

12. Variations / Use Cases

Variation 1 — Snowpipe Integration

Use Snowpipe with S3 notifications for fully managed ingestion.

Variation 2 — Streaming Pipelines

Combine with Kafka or Kinesis for real-time event streaming.

Variation 3 — Metadata Tracking

Maintain ingestion logs in Snowflake for auditability.

Variation 4 — Data Quality Integration

Add validation frameworks like Great Expectations before ingestion.

Dara Bindara

Associate Data Engineer

Boolean Data Systems




Dara Bindara is an Associate Data Engineer specializing in building and optimizing cloud-based data pipelines. Experienced in Python, SQL, PySpark, Snowflake Cortex, and AI/ML workflows, with a focus on ETL automation, large-scale data transformation, and scalable data warehousing.

About Boolean Data Systems

Boolean Data Systems is a Snowflake Premier Partner that implements solutions on cloud platforms. We help enterprises make better business decisions with data and solve real-world business analytics and data challenges.
