Cost Optimization in Snowflake: From Small Pipelines to Enterprise Scale
Srikar Mandava
1. Executive Summary
API ingestion pipelines often start simple but become unreliable as systems scale.
What initially works for small datasets begins to fail when exposed to pagination issues, retries, schema changes, and partial failures.
The real challenge is not ingestion—it is designing pipelines that can recover, replay, and maintain consistency over time.
A production-ready pipeline must clearly answer:
- What data was processed?
- What failed, and where?
- Can we safely rerun without duplication?
- Can we recover without starting over?
This requires moving beyond simple extraction and building stateful, resilient ingestion systems.
2. Background
APIs are not designed for bulk data extraction. They introduce pagination, rate limits, inconsistent ordering, and schema variability.
Snowflake, however, is optimized for structured, reliable data processing.
Because of this mismatch, the ingestion layer must act as a stabilization layer, ensuring that incoming data is complete, deduplicated, and traceable before it reaches downstream systems.
3. Problem
3.1 Symptoms
As pipelines scale, issues emerge gradually. Teams begin to observe missing data, duplicate records, inconsistent reports, and pipelines that cannot recover from mid-run failures.
These issues are rarely caused by one big failure, but rather by small design gaps—pagination logic, retry handling, and incremental processing assumptions.
3.2 Impact
Over time, these issues reduce trust in the data platform.
Engineering teams spend more effort debugging pipelines than building new capabilities.
Costs increase due to inefficient reprocessing, and decision-making slows due to unreliable data.
4. Requirements & Assumptions
A production-grade API ingestion system must:
- Maintain execution state (watermark, batch ID, pagination token)
- Support idempotent execution across retries
- Enable replay and backfill without corruption
- Enforce data quality checks
- Provide observability and auditability
5. Recommended Architecture
5.1 High-Level Flow
A scalable API ingestion pipeline typically follows this architecture:
This layered approach ensures that each component handles a specific responsibility.
5.2 Architecture Principle
A robust ingestion system requires both layered architecture and a Control & Metadata Layer.
This layer acts as the system of record for pipeline execution and ensures recovery, replay, and auditability.
CREATE TABLE ingestion_control (
object_name STRING,
batch_id STRING,
last_successful_watermark TIMESTAMP,
last_page_token STRING,
status STRING,
record_count NUMBER,
created_at TIMESTAMP
);
Standardizing this pattern enables scaling ingestion across multiple APIs consistently.
5.3 Idempotent Pipeline Design
Idempotency must be enforced across three levels to ensure safe retries and consistent results.
File-Level (S3)
Each file is immutable and uniquely identifiable:
s3://bucket/object/batch_id/page_token.json.gz
Batch-Level (Control Layer)
Each run uses a deterministic batch ID:
batch_id = object_name + extraction_window
Record-Level (Snowflake) — Production Safe
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE
WHEN NOT MATCHED THEN INSERT;
This approach prevents missed updates and safely handles late-arriving data.
6. Implementation
6.1 Core Flow
The pipeline extracts paginated data, stores raw files in S3, and loads them into Snowflake.
What makes this production-ready is not the flow itself, but the ability to track state, recover from failures, and rerun safely.
6.2 Checkpointing Strategy
Checkpointing determines how the pipeline resumes after failure.
What is Stored
- last_successful_watermark
- last_page_token
- current_batch_id
- status
Granularity
Checkpointing should occur per page or micro-batch, not per run.
Lifecycle
- Update after each successful page
- Commit watermark after batch completion
- Preserve last successful state on failure
6.3 Replay & Backfill Strategy
Replay is essential for correcting data issues.
Supported approaches include:
- Reprocessing raw data from S3
- Re-running a specific batch using batch_id
- Reprocessing a defined time range
Replay must always use a new batch ID and remain fully auditable to avoid corrupting existing data.
6.4 Processing Semantics
API ingestion pipelines typically operate under at-least-once delivery semantics.
To approximate exactly-once behavior:
- Use idempotent design across file, batch, and record levels
- Deduplicate using MERGE logic
- Ensure replay does not introduce duplicates
This removes ambiguity and ensures predictable behavior under retries.
6.5 Concurrency Control
In production systems, overlapping pipeline runs can corrupt state and introduce inconsistencies.
To prevent this:
- Allow only one active batch per object
- Use control table status (running) to block new executions
- Apply atomic updates to avoid race conditions
Optionally, a lock flag or version column can be introduced for stronger control.
6.6 Failure Scenarios
| Scenario | Behaviour | Recovery |
|---|---|---|
| API timeout | Retry with backoff | Resume from checkpoint |
| Partial pagination | Detect via count mismatch | Re-fetch pages |
| Duplicate API pages | Detect via keys | Deduplicate via MERGE |
| Late-arriving updates | Overlap window | Reprocess window |
| Schema change | Land raw JSON | Adjust transform |
| Partial S3 upload | Detect via size/hash | Re-upload |
| Inconsistent API ordering | Causes gaps/duplicates | Enforce ordering or reprocess window |
| Corrupt / invalid records | Detect during processing | Move to DLQ/quarantine |
6.7 Schema Evolution
APIs evolve frequently, and pipelines must handle changes gracefully.
Storing raw data in JSON (VARIANT) and applying schema-on-read ensures flexibility and prevents upstream changes from breaking pipelines.
7. Validation & Testing
Data Quality Layer
Data quality must be enforced—not just monitored.
Checks should include:
- Source vs target reconciliation
- Freshness validation
- Null and anomaly detection
Failures should trigger alerts, pipeline failure, or data quarantine.
Data Completeness Strategy
APIs cannot always be trusted to return complete data.
To ensure correctness:
- Perform periodic full reconciliation (daily or weekly)
- Compare historical trends and expected volumes
- Use sampling when full counts are unavailable
- Track expected vs actual record counts per batch
This ensures high data reliability even when APIs are imperfect.
8. Security & Access
Security should be enforced across all layers:
- Store API credentials securely using Secrets Manager
- Use IAM roles for accessing S3
- Restrict Snowflake access using RBAC
- Enable auditing and monitoring for data access
9. Performance & Cost
9.1 Performance
Efficient pipelines minimize redundant work through incremental extraction, parallel processing, and compression.
9.2 Snowpipe vs COPY
Snowpipe is event-driven and suited for low-latency ingestion, while COPY is batch-oriented and more cost-efficient for large volumes.
At scale, Snowpipe can become expensive due to per-file overhead, whereas COPY provides better cost control.
9.3 Cost Optimization
- Optimize file size (100–250 MB per file) for efficient ingestion
- Avoid excessive small files
- Right-size warehouses based on workload
- Reduce unnecessary API calls
10. Operations & Monitoring
A production-grade pipeline is incomplete without proper monitoring and alerting. Observability ensures that issues are detected early and resolved quickly.
10.1 Monitoring
To maintain pipeline reliability, you should continuously track:
- API success and failure rates: Helps identify instability in external systems or connectivity issues.
- Data freshness and latency: Ensures that data is arriving within expected SLA timelines.
- Record counts and anomalies: Detects missing or unexpected spikes in data volume.
10.2 Alerting
Monitoring alone is not enough alerts ensure timely action.
You should configure alerts for:
- API failures: So that issues with external systems are addressed immediately.
- Pipeline execution failures: To prevent delays in downstream processing.
- SLA breaches: To ensure data availability aligns with business expectations.
10.3 Operational Runbook
Having a clear recovery plan reduces downtime and confusion during failures:
- Failures → Retry from last checkpoint: Use watermarking or checkpoints to resume processing without data loss.
- Duplicates → Apply deduplication logic: Use business keys and merge strategies to clean duplicate records.
- Delays → Adjust extraction frequency or scale compute: Increase resources or optimize scheduling to meet performance requirements.
11. Common Pitfalls
Even well-designed pipelines can fail if these common mistakes are not addressed:
- Incorrect pagination logic: Leads to incomplete data without obvious errors.
- Missing incremental processing: Causes inefficiency and unnecessary API load.
- Ignoring API rate limits: Results in unstable pipelines and frequent failures.
- Lack of deduplication strategy: Introduces inconsistencies in downstream data.
- Weak error handling: Makes recovery difficult when failures occur.
- No monitoring or alerting: Allows issues to go unnoticed until they impact business users.
12. Variations / Use Cases
API ingestion pipelines are widely used across different domains, depending on the source systems and use case:
- SaaS integrations (HR, Finance, CRM systems): Common for systems like payroll, accounting, and customer management.
- Third-party API ingestion: Used for integrating external partner or vendor data.
- Partner data integration: Enables data sharing between organizations.
- Event-driven API pipelines: Supports near real-time ingestion for time-sensitive use cases.
13. Next Steps
To move towards a more mature and scalable ingestion framework, consider the following steps:
- Standardize API ingestion frameworks: Build reusable patterns instead of creating pipelines from scratch each time.
- Implement reusable extraction patterns: Common logic like pagination, retries, and watermarking should be modularized.
- Introduce monitoring and alerting early: Observability should be part of the design, not an afterthought.
- Optimize pipelines for incremental processing: Focus on efficiency to reduce both cost and runtime.
14. Conclusion
API ingestion is a foundational component of modern data platforms, but it requires careful design to ensure reliability and scalability.
By addressing common pitfalls such as pagination, incremental processing, and rate limits, organizations can build pipelines that are both efficient and trustworthy.
A well-architected API ingestion pipeline not only improves performance and cost efficiency but also ensures that downstream analytics are based on accurate and complete data.

Srikar Mandava
Associate Data Engineer
Boolean Data Systems

Associate Data Engineer focused on designing scalable cloud data pipelines and modern data platforms. Skilled in Python, SQL, and Snowflake, with experience in ETL automation, large-scale data transformation, and building reliable data ingestion frameworks.
About Boolean Data
Systems
Boolean Data Systems is a Snowflake Premier Partner that implements solutions on cloud platforms. We help enterprises make better business decisions with data and solve real-world business analytics and data challenges.
Services and
Offerings
Solutions &
Accelerators
Global
Head Quarters
USA - Atlanta
3970 Old Milton Parkway,
Suite #200, Alpharetta, GA 30005
Ph. : 770-410-7770
Fax : 855-414-2865