File Chunking
Split large files into smaller addressable pieces so upload, download, deduplication, retry, and sync can work without moving the whole file every time.
After this, you will understand
- Where file chunking appears in production systems, what problem forces it, and how to reason about the tradeoffs.
- Why treating the idea as a definition to memorize is not enough.
- How real systems force the idea to handle chunk boundaries, content hashes, and resumable uploads.
- How to use the concept to decide what the system guarantees, what it risks, and what it costs to operate.
Think before reading: Where would file chunking appear in a real production system, and what failure or bottleneck would it help you reason about?
Concepts Covered
- Fixed-size chunks
- Content-defined chunks
- Content hashes
- Resumable upload
- Partial download
- Deduplication
- Chunk manifests
- Retry boundaries
Definition
File chunking is the practice of splitting a large file into smaller pieces that can be uploaded, downloaded, retried, hashed, stored, and reused independently.
A file sync system should not treat every edit as:
upload the entire file again
That works for small files and stable networks. It breaks when users edit large videos, design files, source archives, or documents on unreliable connections.
Chunking gives the system smaller units of work.
The Pain That Forces This Concept
Imagine a user edits one paragraph inside a 500 MB file.
A naive sync client uploads the whole file again. If the network drops at 490 MB, the client may restart from zero. If several devices upload similar files, the service stores duplicate bytes. If thousands of users sync after reconnecting, the upload path carries far more data than the actual changes require.
The pain is not just bandwidth. Whole-file sync also increases:
- retry cost
- storage cost
- mobile battery usage
- queue pressure
- conflict recovery cost
- time until other devices see the update
Chunking moves the system from whole-file work to piece-level work.
Mental Model
A file becomes a manifest plus chunks.
file_version v7
  chunk_a hash=8f12 size=4MB
  chunk_b hash=91cc size=4MB
  chunk_c hash=aa02 size=2MB
The manifest describes which chunks form a version of the file. The chunk store holds the bytes.
If a new version reuses most chunks, the system can upload and store only the changed chunks, then create a new manifest.
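The manifest idea can be made concrete with a small sketch. The class names `ChunkRef` and `Manifest`, the helper `chunks_to_upload`, and the hash `c7d0` for the edited chunk are all hypothetical; the v7 values are the ones from the example above.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class ChunkRef:
    hash: str  # content hash of the chunk bytes (shortened here for readability)
    size: int  # chunk size in bytes


@dataclass(frozen=True)
class Manifest:
    version: int
    chunks: tuple  # ordered ChunkRef entries; concatenating them rebuilds the file


v7 = Manifest(7, (ChunkRef("8f12", 4 * MB), ChunkRef("91cc", 4 * MB), ChunkRef("aa02", 2 * MB)))
# Hypothetical v8: only the middle chunk changed, so it has a new hash.
v8 = Manifest(8, (ChunkRef("8f12", 4 * MB), ChunkRef("c7d0", 4 * MB), ChunkRef("aa02", 2 * MB)))


def chunks_to_upload(old: Manifest, new: Manifest) -> list:
    """Return the chunks of `new` whose hashes do not appear in `old`."""
    known = {c.hash for c in old.chunks}
    return [c for c in new.chunks if c.hash not in known]


missing = chunks_to_upload(v7, v8)
upload_bytes = sum(c.size for c in missing)
total_bytes = sum(c.size for c in v8.chunks)
print(f"upload {upload_bytes // MB} MB of {total_bytes // MB} MB")
```

In this sketch the new version costs one 4 MB chunk upload plus a new manifest, instead of the full 10 MB.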
How It Works
A common flow:
1. Client reads the file locally.
2. Client splits it into chunks.
3. Client hashes each chunk.
4. Client asks the server which hashes already exist.
5. Client uploads missing chunks.
6. Server verifies chunk hashes and stores them.
7. Client commits a file version that references the chunk manifest.
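The seven steps above can be sketched end to end. This is a minimal illustration, not a real sync protocol: `InMemoryServer`, `split_fixed`, and `sync` are hypothetical names, SHA-256 is an assumed hash choice, and the 4-byte chunk size is only there to keep the example readable.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; real systems typically use far larger chunks


class InMemoryServer:
    """Hypothetical stand-in for a chunk store plus a manifest store."""

    def __init__(self):
        self.chunks = {}     # hash -> bytes
        self.manifests = {}  # (file_id, version) -> ordered list of hashes

    def missing(self, hashes):
        # Step 4: report which hashes the server does not have yet.
        return [h for h in hashes if h not in self.chunks]

    def put_chunk(self, claimed_hash, data):
        # Step 6: verify the bytes against the claimed hash before storing.
        if hashlib.sha256(data).hexdigest() != claimed_hash:
            raise ValueError("chunk hash mismatch")
        self.chunks[claimed_hash] = data

    def commit(self, file_id, version, hashes):
        # Step 7: refuse a manifest that references chunks that were never stored.
        absent = self.missing(hashes)
        if absent:
            raise ValueError(f"manifest references missing chunks: {absent}")
        self.manifests[(file_id, version)] = list(hashes)


def split_fixed(data, size=CHUNK_SIZE):
    # Step 2: fixed-size split.
    return [data[i:i + size] for i in range(0, len(data), size)]


def sync(server, file_id, version, data):
    """Upload only the chunks the server lacks, then commit. Returns chunks sent."""
    chunks = split_fixed(data)
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]  # step 3
    by_hash = dict(zip(hashes, chunks))
    to_send = set(server.missing(hashes))  # a set, in case a chunk repeats in the file
    for h in to_send:
        server.put_chunk(h, by_hash[h])    # step 5
    server.commit(file_id, version, hashes)
    return len(to_send)


server = InMemoryServer()
sent_v1 = sync(server, "doc", 1, b"aaaabbbbcccc")
sent_v2 = sync(server, "doc", 2, b"aaaaXXXXcccc")  # only the middle chunk differs
print(sent_v1, sent_v2)
```

Because chunks are keyed by their hash, retrying `put_chunk` for the same bytes overwrites the same key rather than creating a second copy, which is one simple way to keep retries idempotent.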
Chunk size is a tradeoff. Smaller chunks improve reuse and retry precision, but create more metadata. Larger chunks reduce metadata overhead, but make small edits more expensive.
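A rough calculation makes the tradeoff visible. The sketch below assumes fixed-size chunks and a single in-place edit that falls entirely inside one chunk; `chunk_cost` is a hypothetical helper, and the 500 MB file size is borrowed from the earlier example.

```python
MB = 1024 * 1024


def chunk_cost(file_size: int, chunk_size: int):
    """Return (manifest entries, bytes re-uploaded for one small in-place edit)."""
    entries = (file_size + chunk_size - 1) // chunk_size  # ceiling division
    # Assumption: the edit touches exactly one chunk, so that whole chunk is re-sent.
    reupload = min(chunk_size, file_size)
    return entries, reupload


for size_mb in (1, 4, 16, 64):
    entries, reupload = chunk_cost(500 * MB, size_mb * MB)
    print(f"{size_mb:>2} MB chunks: {entries:>3} manifest entries, "
          f"{reupload // MB:>2} MB re-uploaded per small edit")
```

Under these assumptions, going from 1 MB to 64 MB chunks shrinks the manifest from 500 entries to 8, while the cost of a one-paragraph edit grows from 1 MB to 64 MB. An edit that straddles a chunk boundary would touch two chunks, which this model ignores.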
Some systems use fixed-size chunks. Others use content-defined chunking, where boundaries are chosen based on the content so inserted bytes do not shift every later chunk boundary.
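The difference shows up when a single byte is inserted at the front of a file. The sketch below compares a fixed-size splitter with a deliberately simplified content-defined splitter that cuts wherever the sum of the last `window` bytes is divisible by `divisor`. That windowed sum is an assumption chosen for readability; production systems generally use stronger rolling hashes (Rabin fingerprints or Gear-style hashes, for example) and also enforce minimum and maximum chunk sizes, which this sketch omits.

```python
import random


def fixed_chunks(data: bytes, size: int = 64) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, divisor: int = 64) -> list:
    """Cut after byte i when the sum of the last `window` bytes is divisible by `divisor`."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that just left the window
        # The decision depends only on nearby content, never on the absolute offset.
        if i + 1 >= window and rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def reused(old: list, new: list) -> int:
    """Count chunks of the old version that reappear unchanged in the new version."""
    new_set = set(new)
    return sum(1 for c in old if c in new_set)


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = b"!" + original  # one byte inserted at the very front

fixed_old, fixed_new = fixed_chunks(original), fixed_chunks(edited)
cdc_old, cdc_new = cdc_chunks(original), cdc_chunks(edited)
print("fixed-size reuse:     ", reused(fixed_old, fixed_new), "of", len(fixed_old))
print("content-defined reuse:", reused(cdc_old, cdc_new), "of", len(cdc_old))
```

Because the boundary test looks only at the last `window` bytes, every boundary after the insertion reappears at the same content, so every original chunk except the first is found again in the edited file. The fixed-size splitter shifts every chunk by one byte, so on non-repetitive data it should be expected to reuse few or none.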
Tradeoffs
| Choice | Benefit | Cost |
|---|---|---|
| Fixed-size chunks | Simple implementation | Insertions can shift later chunks |
| Content-defined chunks | Better reuse after insertions | More CPU and complexity |
| Small chunks | Precise retries and dedupe | More manifest metadata |
| Large chunks | Lower metadata overhead | More upload work per edit |
| Content hashes | Deduplication and integrity checks | Requires a careful choice of hash function and a collision policy |
Chunking does not remove the need for metadata consistency. The server still needs to know which manifest is the current version, who can read it, and whether all referenced chunks exist.
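One way to picture the consistency requirement is a commit step that checks two things before a manifest becomes current: the client based its edit on the latest version, and every referenced chunk is already stored. The names below (`ManifestStore`, `StaleVersionError`, `MissingChunkError`) are hypothetical, and access control is left out entirely.

```python
class StaleVersionError(Exception):
    """The client edited a version that is no longer current."""


class MissingChunkError(Exception):
    """The manifest references a chunk the store does not hold."""


class ManifestStore:
    def __init__(self, stored_chunks):
        self.stored_chunks = set(stored_chunks)  # hashes known to be durably stored
        self.current = {}                        # file_id -> (version, [hashes])

    def commit(self, file_id, base_version, hashes):
        """Make a new manifest current only if it is complete and not stale."""
        current_version = self.current.get(file_id, (0, []))[0]
        if base_version != current_version:
            raise StaleVersionError(f"expected base {current_version}, got {base_version}")
        absent = [h for h in hashes if h not in self.stored_chunks]
        if absent:
            raise MissingChunkError(f"missing chunks: {absent}")
        new_version = current_version + 1
        self.current[file_id] = (new_version, list(hashes))
        return new_version


store = ManifestStore(stored_chunks={"8f12", "91cc", "aa02"})
print(store.commit("report", 0, ["8f12", "91cc", "aa02"]))
```

In a real service the version check and the update would need to happen atomically, for example inside a database transaction; the in-memory dictionary here only shows the order of the checks.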
Operational Reality
Operators should watch:
- chunk upload failure rate
- average chunk size
- manifest commit failures
- orphaned chunks
- chunk verification failures
- storage deduplication ratio
- upload session age
- object storage latency
- clients repeatedly retrying the same chunk
Failure modes:
- A manifest references a chunk that was never committed.
- A client retries an upload and creates duplicate chunks.
- Chunk verification is skipped and corrupted bytes become durable.
- Chunk metadata grows faster than expected.
- Deduplication links private files through shared storage without a safe access model.
- Garbage collection removes chunks still referenced by an old version.
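The last failure mode is easier to avoid when garbage collection computes the live set from every retained manifest, not only the newest one. A minimal sketch, assuming a hypothetical `find_orphans` helper and reusing the shortened hashes from the earlier manifest example:

```python
def find_orphans(stored_hashes, retained_manifests):
    """Return stored chunk hashes that no retained manifest references."""
    live = set()
    for hashes in retained_manifests.values():
        live.update(hashes)  # a chunk is live if any retained version uses it
    return set(stored_hashes) - live


stored = {"8f12", "91cc", "aa02", "c7d0", "dead"}
manifests = {
    7: ["8f12", "91cc", "aa02"],
    8: ["8f12", "c7d0", "aa02"],  # hypothetical newer version
}

print(sorted(find_orphans(stored, manifests)))          # both versions retained
print(sorted(find_orphans(stored, {8: manifests[8]})))  # only after v7 is dropped
```

Chunk `91cc` becomes collectable only once version 7 is no longer retained. A real collector would also need to protect chunks that were uploaded but whose manifest has not been committed yet, for instance with a grace period; that is an assumption this sketch does not model.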