Cost Optimization in Snowflake: From Small Pipelines to Enterprise Scale

Srikar Mandava

Snowflake RBAC Management with Streamlit

1. Executive Summary

API ingestion pipelines often start simple but become unreliable as systems scale.

What initially works for small datasets begins to fail when exposed to pagination issues, retries, schema changes, and partial failures.

The real challenge is not ingestion—it is designing pipelines that can recover, replay, and maintain consistency over time.

A production-ready pipeline must clearly answer:

What data was processed?
What failed, and where?
Can we safely rerun without duplication?
Can we recover without starting over?

This requires moving beyond simple extraction and building stateful, resilient ingestion systems.

2. Background

APIs are not designed for bulk data extraction. They introduce pagination, rate limits, inconsistent ordering, and schema variability.

Snowflake, however, is optimized for structured, reliable data processing.

Because of this mismatch, the ingestion layer must act as a stabilization layer, ensuring that incoming data is complete, deduplicated, and traceable before it reaches downstream systems.

3. Problem

3.1 Symptoms

As pipelines scale, issues emerge gradually. Teams begin to observe missing data, duplicate records, inconsistent reports, and pipelines that cannot recover from mid-run failures.

These issues are rarely caused by one big failure, but rather by small design gaps—pagination logic, retry handling, and incremental processing assumptions.

3.2 Impact

Over time, these issues reduce trust in the data platform.

Engineering teams spend more effort debugging pipelines than building new capabilities.

Costs increase due to inefficient reprocessing, and decision-making slows due to unreliable data.

4. Requirements & Assumptions

A production-grade API ingestion system must:

Maintain execution state (watermark, batch ID, pagination token)
Support idempotent execution across retries
Enable replay and backfill without corruption
Enforce data quality checks
Provide observability and auditability

5. Recommended Architecture

5.1 High-Level Flow

A scalable API ingestion pipeline typically follows this architecture:

API to Snowflake Ingestion High-Level Flow

This layered approach ensures that each component handles a specific responsibility.

5.2 Architecture Principle

A robust ingestion system requires both layered architecture and a Control & Metadata Layer.

This layer acts as the system of record for pipeline execution and ensures recovery, replay, and auditability.

CREATE TABLE ingestion_control (
    object_name STRING,
    batch_id STRING,
    last_successful_watermark TIMESTAMP,
    last_page_token STRING,
    status STRING,
    record_count NUMBER,
    created_at TIMESTAMP
);

Standardizing this pattern enables scaling ingestion across multiple APIs consistently.

5.3 Idempotent Pipeline Design

Idempotency must be enforced across three levels to ensure safe retries and consistent results.

File-Level (S3)
Each file is immutable and uniquely identifiable:

s3://bucket/object/batch_id/page_token.json.gz

Batch-Level (Control Layer)
Each run uses a deterministic batch ID:

batch_id = object_name + extraction_window

Record-Level (Snowflake) — Production Safe

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE
WHEN NOT MATCHED THEN INSERT;

This approach prevents missed updates and safely handles late-arriving data.

6. Implementation

6.1 Core Flow

The pipeline extracts paginated data, stores raw files in S3, and loads them into Snowflake.

What makes this production-ready is not the flow itself, but the ability to track state, recover from failures, and rerun safely.

6.2 Checkpointing Strategy

Checkpointing determines how the pipeline resumes after failure.

What is Stored

last_successful_watermark
last_page_token
current_batch_id
status

Granularity
Checkpointing should occur per page or micro-batch, not per run.

Lifecycle

Update after each successful page
Commit watermark after batch completion
Preserve last successful state on failure

6.3 Replay & Backfill Strategy

Replay is essential for correcting data issues.

Supported approaches include:

Reprocessing raw data from S3
Re-running a specific batch using batch_id
Reprocessing a defined time range

Replay must always use a new batch ID and remain fully auditable to avoid corrupting existing data.

6.4 Processing Semantics

API ingestion pipelines typically operate under at-least-once delivery semantics.

To approximate exactly-once behavior:

Use idempotent design across file, batch, and record levels
Deduplicate using MERGE logic
Ensure replay does not introduce duplicates

This removes ambiguity and ensures predictable behavior under retries.

6.5 Concurrency Control

In production systems, overlapping pipeline runs can corrupt state and introduce inconsistencies.

To prevent this:

Allow only one active batch per object
Use control table status (running) to block new executions
Apply atomic updates to avoid race conditions

Optionally, a lock flag or version column can be introduced for stronger control.

6.6 Failure Scenarios

Scenario	Behaviour	Recovery
API timeout	Retry with backoff	Resume from checkpoint
Partial pagination	Detect via count mismatch	Re-fetch pages
Duplicate API pages	Detect via keys	Deduplicate via MERGE
Late-arriving updates	Overlap window	Reprocess window
Schema change	Land raw JSON	Adjust transform
Partial S3 upload	Detect via size/hash	Re-upload
Inconsistent API ordering	Causes gaps/duplicates	Enforce ordering or reprocess window
Corrupt / invalid records	Detect during processing	Move to DLQ/quarantine

6.7 Schema Evolution

APIs evolve frequently, and pipelines must handle changes gracefully.

Storing raw data in JSON (VARIANT) and applying schema-on-read ensures flexibility and prevents upstream changes from breaking pipelines.

7. Validation & Testing

Data Quality Layer
Data quality must be enforced—not just monitored.

Checks should include:

Source vs target reconciliation
Freshness validation
Null and anomaly detection

Failures should trigger alerts, pipeline failure, or data quarantine.

Data Completeness Strategy
APIs cannot always be trusted to return complete data.

To ensure correctness:

Perform periodic full reconciliation (daily or weekly)
Compare historical trends and expected volumes
Use sampling when full counts are unavailable
Track expected vs actual record counts per batch

This ensures high data reliability even when APIs are imperfect.

8. Security & Access

Security should be enforced across all layers:

Store API credentials securely using Secrets Manager
Use IAM roles for accessing S3
Restrict Snowflake access using RBAC
Enable auditing and monitoring for data access

9. Performance & Cost

9.1 Performance

Efficient pipelines minimize redundant work through incremental extraction, parallel processing, and compression.

9.2 Snowpipe vs COPY

Snowpipe is event-driven and suited for low-latency ingestion, while COPY is batch-oriented and more cost-efficient for large volumes.

At scale, Snowpipe can become expensive due to per-file overhead, whereas COPY provides better cost control.

9.3 Cost Optimization

Optimize file size (100–250 MB per file) for efficient ingestion
Avoid excessive small files
Right-size warehouses based on workload
Reduce unnecessary API calls

10. Operations & Monitoring

A production-grade pipeline is incomplete without proper monitoring and alerting. Observability ensures that issues are detected early and resolved quickly.

10.1 Monitoring

To maintain pipeline reliability, you should continuously track:

API success and failure rates: Helps identify instability in external systems or connectivity issues.
Data freshness and latency: Ensures that data is arriving within expected SLA timelines.
Record counts and anomalies: Detects missing or unexpected spikes in data volume.

10.2 Alerting

Monitoring alone is not enough alerts ensure timely action.

You should configure alerts for:

API failures: So that issues with external systems are addressed immediately.
Pipeline execution failures: To prevent delays in downstream processing.
SLA breaches: To ensure data availability aligns with business expectations.

10.3 Operational Runbook

Having a clear recovery plan reduces downtime and confusion during failures:

Failures → Retry from last checkpoint: Use watermarking or checkpoints to resume processing without data loss.
Duplicates → Apply deduplication logic: Use business keys and merge strategies to clean duplicate records.
Delays → Adjust extraction frequency or scale compute: Increase resources or optimize scheduling to meet performance requirements.

11. Common Pitfalls

Even well-designed pipelines can fail if these common mistakes are not addressed:

Incorrect pagination logic: Leads to incomplete data without obvious errors.
Missing incremental processing: Causes inefficiency and unnecessary API load.
Ignoring API rate limits: Results in unstable pipelines and frequent failures.
Lack of deduplication strategy: Introduces inconsistencies in downstream data.
Weak error handling: Makes recovery difficult when failures occur.
No monitoring or alerting: Allows issues to go unnoticed until they impact business users.

12. Variations / Use Cases

API ingestion pipelines are widely used across different domains, depending on the source systems and use case:

SaaS integrations (HR, Finance, CRM systems): Common for systems like payroll, accounting, and customer management.
Third-party API ingestion: Used for integrating external partner or vendor data.
Partner data integration: Enables data sharing between organizations.
Event-driven API pipelines: Supports near real-time ingestion for time-sensitive use cases.

13. Next Steps

To move towards a more mature and scalable ingestion framework, consider the following steps:

Standardize API ingestion frameworks: Build reusable patterns instead of creating pipelines from scratch each time.
Implement reusable extraction patterns: Common logic like pagination, retries, and watermarking should be modularized.
Introduce monitoring and alerting early: Observability should be part of the design, not an afterthought.
Optimize pipelines for incremental processing: Focus on efficiency to reduce both cost and runtime.

14. Conclusion

API ingestion is a foundational component of modern data platforms, but it requires careful design to ensure reliability and scalability.

By addressing common pitfalls such as pagination, incremental processing, and rate limits, organizations can build pipelines that are both efficient and trustworthy.

A well-architected API ingestion pipeline not only improves performance and cost efficiency but also ensures that downstream analytics are based on accurate and complete data.

Srikar Mandava

Associate Data Engineer

Boolean Data Systems

Associate Data Engineer focused on designing scalable cloud data pipelines and modern data platforms. Skilled in Python, SQL, and Snowflake, with experience in ETL automation, large-scale data transformation, and building reliable data ingestion frameworks.

About Boolean Data
Systems

Boolean Data Systems is a Snowflake Premier Partner that implements solutions on cloud platforms. We help enterprises make better business decisions with data and solve real-world business analytics and data challenges.

Services and
Offerings

Solutions &
Accelerators

Snowflake Cost Estimator

Data Pipeline

QA Framework

Logistics Industry AI
Retail Industry AI
Predictive Maintenance

Fraud Prediction AI

Health Check Accelerator