Google Drive / Dropbox File Sync System
Design a cloud file sync system that handles chunked uploads, metadata sync, version history, conflict resolution, delta transfer, offline clients, and repair.
After this, you will understand why file sync is not just uploading files, but reconciling chunks, metadata, versions, conflicts, offline clients, and repair after partial failure.
Naive approach: upload the whole file on every save and let each device poll for the latest folder state.
Why it breaks: files are large, clients disconnect, edits overlap, renames and deletes need ordering, and polling full folder state is too slow and wasteful to serve as the recovery path.
Better design: separate chunked content storage from metadata sync, use cursors and version history, detect conflicts, and run reconciliation for drift.
Think before reading: if a laptop edits a file offline while another device renames it, which part of the system should decide what changed?
Concepts Covered
- File chunking
- Metadata sync
- Delta transfer
- Version history
- Conflict resolution
- Cursor-based sync
- Offline-first clients
- Tombstones and deletes
- Chunk manifests
- Sync journals
- Reconciliation and repair
1. Introduction
A Google Drive or Dropbox-style file sync system keeps files available across laptops, phones, browsers, shared folders, and offline clients.
The visible product behavior looks simple: save a file on one device, see it on another device.
The backend problem is harder because file sync is not one problem. It is several problems that happen to look like one product:
- moving large bytes efficiently
- keeping folder metadata coherent
- handling offline edits
- preserving older versions
- resolving conflicts without losing work
- syncing many devices after network gaps
- enforcing permissions while clients cache data locally
- repairing drift after partial failures
This module uses "Google Drive / Dropbox File Sync" as a familiar product shape, not as a claim about Google Drive, Dropbox, or any private implementation.
At small scale, a client can upload a whole file and the server can store the latest copy.
At production scale, that naive model breaks because files are large, clients disconnect, edits overlap, deletes need tombstones, metadata changes outnumber byte changes, and users expect recovery when sync goes wrong.
2. Product Requirements
Functional Requirements
- Users can create, upload, edit, rename, move, delete, and restore files.
- Users can organize files into folders.
- Clients can sync changes across multiple devices.
- Clients can continue working while offline and sync later.
- Large files can upload and download reliably.
- The product keeps version history for recovery.
- Concurrent edits should not silently lose user data.
- Shared folders and permissions should affect visibility.
- Clients can resume partial uploads and downloads.
- The system can repair metadata or chunk drift.
Non-Functional Requirements
- File bytes must be durable once a version is committed.
- Metadata reads and writes should feel low latency.
- Sync should avoid re-uploading unchanged bytes.
- Offline clients should converge when they reconnect.
- Conflict handling should preserve user work.
- The system should tolerate object storage, worker, and network failures.
- Sync storms should not overload the metadata or blob plane.
- Permission changes should propagate quickly enough to avoid unsafe access.
- Operators should be able to audit, reconcile, and restore state.
3. Core Engineering Challenges
The core challenge is that file sync has two very different planes.
The blob plane handles bytes:
chunks -> manifests -> object storage -> download
The metadata plane handles meaning:
file name -> folder -> current version -> permissions -> delete state
Treating these as the same path makes the system slow and fragile. Treating them as completely independent creates drift.
The hard parts are:
- A file version should not become visible before its chunks are durable.
- A delete must still reach devices that were offline when it happened.
- A rename should not look like a delete plus unrelated create unless the product chooses that model.
- A stale device should not overwrite newer work.
- A new device should be able to bootstrap from a snapshot and then catch up with a journal.
- A reconnect storm should not cause every client to scan and download everything at once.
- Permissions must apply to metadata and bytes, including cached or old versions.
4. High-Level Architecture
A practical design separates client sync, metadata, blob storage, and async repair.
Client Sync Engine
  -> Metadata API
  -> Upload Session API
  -> Chunk Upload Service
  -> Sync Feed API
Metadata API
  -> File Metadata Store
  -> Version Store
  -> Sync Event Journal
  -> Permission Store
Chunk Upload Service
  -> Chunk Store / Object Storage
  -> Manifest Store
  -> Integrity Verifier
Async Workers
  -> Notification Fan-Out
  -> Garbage Collection
  -> Reconciliation
  -> Malware / Policy Scanning
  -> Search / Preview Indexing
The metadata service is the source of truth for file identity, folder placement, current version, tombstones, and permissions.
The chunk store is the durable storage layer for bytes. It should not decide which file version is current. It stores chunks and manifests that metadata records reference.
The sync journal lets devices ask:
Give me all metadata changes after cursor 84211.
That journal is what makes offline recovery deterministic.
5. Core Components
Client Sync Engine
The client sync engine watches local file changes, computes chunks, uploads missing data, pulls metadata changes, applies server events, and manages local cache state.
It should treat the server as authoritative for committed metadata while still allowing local optimistic work.
Metadata API
The Metadata API handles file and folder operations:
- create file
- rename file
- move file
- delete file
- commit new version
- restore old version
- list folder
- fetch sync changes
It validates permissions, applies version checks, writes metadata records, and appends sync events.
File Metadata Store
The metadata store keeps the file tree:
- stable file IDs
- parent folder IDs
- names
- owner and workspace
- current version pointer
- delete state
- metadata version
This store should be optimized for folder listing, file lookup, and mutation safety.
Chunk Upload Service
The chunk service accepts large byte uploads in smaller units.
It verifies hashes, stores chunks, supports resume, and avoids forcing large binary payloads through the metadata service.
Manifest Store
A manifest describes the chunks that make up a file version.
version v17 -> [chunk_a, chunk_b, chunk_c]
This lets multiple file versions reuse unchanged chunks.
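As a minimal sketch of how a manifest enables chunk reuse, the example below splits bytes into fixed-size chunks, hashes each one with SHA-256, and compares two manifests. The names build_manifest and chunks_to_upload, the tiny chunk size, and the choice of fixed-size chunking are assumptions made for illustration. Fixed-size chunking only reuses chunks when edits do not shift later bytes, so a real system may prefer content-defined boundaries.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny for the example; real chunk sizes are far larger


def build_manifest(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return their SHA-256 digests in order."""
    return [
        hashlib.sha256(data[start:start + chunk_size]).hexdigest()
        for start in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(previous: list[str], current: list[str]) -> list[str]:
    """Return hashes in `current` that `previous` does not reference, without duplicates."""
    already_stored = set(previous)
    needed: list[str] = []
    for chunk_hash in current:
        if chunk_hash not in already_stored and chunk_hash not in needed:
            needed.append(chunk_hash)
    return needed


# Two versions of the same file that differ only in the middle chunk.
v16 = build_manifest(b"aaaabbbbcccc")
v17 = build_manifest(b"aaaaXXXXcccc")
```

Here chunks_to_upload(v16, v17) contains only the hash of the changed middle chunk, so the first and last chunks are shared between the two versions.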
Version Store
The version store tracks every committed file version, its manifest, its parent version, who created it, and when.
Version history is a recovery feature, not just a storage detail.
Sync Event Journal
The sync journal records metadata changes in an ordered stream.
Clients use it to recover after disconnection and to avoid expensive full scans.
Conflict Resolver
The conflict resolver decides what happens when a client tries to commit a version based on stale metadata.
For normal files, preserving both versions as a conflict copy is often safer than silently overwriting user work.
Reconciliation Workers
Reconciliation workers compare metadata, manifests, chunks, sync events, and local indexes to detect drift.
They repair safe inconsistencies and alert on unsafe ones.
6. Data Modeling
File Node
file_node
- file_id
- workspace_id
- parent_id
- name
- type: file | folder
- owner_id
- current_version_id
- metadata_version
- deleted_at
- created_at
- updated_at
The file_id should remain stable across renames and moves.
File Version
file_version
- version_id
- file_id
- parent_version_id
- manifest_id
- created_by_user_id
- created_by_device_id
- base_version_id
- size_bytes
- content_hash
- created_at
- reason: upload | edit | restore | conflict
The base_version_id records which version the client edited. The server compares it against current_version_id at commit time, which is how it detects a stale write.
Chunk
chunk
- chunk_hash
- size_bytes
- storage_key
- verification_state
- reference_count
- created_at
Chunks can be content-addressed by hash, but access control should be enforced through file metadata and version references, not by exposing raw chunks as public objects.
Manifest
manifest
- manifest_id
- chunk_hashes
- total_size_bytes
- algorithm
- created_at
The manifest is the bridge between version metadata and stored bytes.
Sync Event
sync_event
- sequence
- workspace_id
- file_id
- event_type
- metadata_version
- payload
- created_at
Events should contain enough information for clients to update local state or fetch the needed records.
Device Sync State
device_sync_state
- device_id
- workspace_id
- last_sync_sequence
- last_successful_sync_at
- client_version
Per-device state matters because a phone, laptop, and tablet may all be at different sync positions.
Upload Session
upload_session
- session_id
- user_id
- device_id
- file_id
- base_version_id
- expected_size_bytes
- uploaded_chunks
- expires_at
- state
Upload sessions help resume large uploads and clean up abandoned work.
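A rough sketch of resume logic, assuming the session also knows the ordered chunk hashes it expects. The expected_chunks field and the method names are hypothetical additions for this example and are not part of the record listed above.

```python
from dataclasses import dataclass, field


@dataclass
class UploadSession:
    session_id: str
    expected_chunks: list[str]  # hypothetical: ordered chunk hashes for this upload
    expires_at: float           # seconds since epoch
    uploaded_chunks: set[str] = field(default_factory=set)

    def record(self, chunk_hash: str) -> None:
        """Mark a chunk as received; recording the same chunk twice is harmless."""
        if chunk_hash not in self.expected_chunks:
            raise ValueError(f"unexpected chunk {chunk_hash}")
        self.uploaded_chunks.add(chunk_hash)

    def remaining(self) -> list[str]:
        """Chunks the client still has to send, in manifest order, without duplicates."""
        pending: list[str] = []
        for chunk_hash in self.expected_chunks:
            if chunk_hash not in self.uploaded_chunks and chunk_hash not in pending:
                pending.append(chunk_hash)
        return pending

    def is_expired(self, now: float) -> bool:
        """Expired sessions are candidates for cleanup of abandoned work."""
        return now >= self.expires_at
```

After a disconnect, the client would ask for remaining() and send only those chunks instead of restarting the upload.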
7. Request Lifecycle
Uploading A New File
1. Client detects a new local file.
2. Client splits the file into chunks and computes hashes.
3. Client asks server which chunks already exist.
4. Client uploads missing chunks.
5. Server verifies chunk hashes and stores them.
6. Client commits a manifest and file version through Metadata API.
7. Metadata API creates file_node and file_version in a transaction.
8. Metadata API appends sync_event.
9. Other devices receive push hints or later pull the sync feed.
10. Other devices download only the chunks they need.
The key boundary is the metadata commit in steps 6 and 7. Uploaded chunks alone do not make a visible file. The metadata commit does.
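The upload steps above can be sketched with in-memory stand-ins. ChunkStore, MetadataService, and upload_file are hypothetical names, the 4-byte chunk size is only for the example, and a real deployment would put these behind network APIs and a transactional database.

```python
import hashlib


class ChunkStore:
    """In-memory stand-in for object storage, keyed by SHA-256 hex digest."""

    def __init__(self) -> None:
        self._chunks: dict[str, bytes] = {}

    def missing(self, hashes: list[str]) -> list[str]:
        return [h for h in hashes if h not in self._chunks]

    def put(self, chunk_hash: str, data: bytes) -> None:
        # Verify the hash before storing the chunk.
        if hashlib.sha256(data).hexdigest() != chunk_hash:
            raise ValueError("chunk hash mismatch")
        self._chunks[chunk_hash] = data


class MetadataService:
    def __init__(self, store: ChunkStore) -> None:
        self.store = store
        self.files: dict[str, list[str]] = {}  # file_id -> manifest of the visible version
        self.journal: list[dict] = []

    def commit(self, file_id: str, manifest: list[str]) -> None:
        # Refuse to expose a version whose chunks are not all stored.
        absent = self.store.missing(manifest)
        if absent:
            raise LookupError(f"{len(absent)} chunk(s) not stored")
        self.files[file_id] = list(manifest)
        self.journal.append({"sequence": len(self.journal) + 1, "file_id": file_id})


def upload_file(store: ChunkStore, meta: MetadataService, file_id: str,
                data: bytes, chunk_size: int = 4) -> int:
    """Chunk, upload only what the store lacks, then commit. Returns chunks sent."""
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    manifest = [hashlib.sha256(p).hexdigest() for p in pieces]
    by_hash = dict(zip(manifest, pieces))
    to_send = store.missing(list(by_hash))
    for chunk_hash in to_send:
        store.put(chunk_hash, by_hash[chunk_hash])
    meta.commit(file_id, manifest)
    return len(to_send)
```

Uploading b"aaaabbbb" and then b"aaaacccc" sends two chunks for the first file and one for the second, because the shared chunk is already stored. Neither file appears in meta.files until commit succeeds.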
Editing An Existing File
1. Client reads current server version v7.
2. User edits locally while online or offline.
3. Client computes new chunk manifest.
4. Client uploads missing chunks.
5. Client commits new version with base_version_id = v7.
6. Server checks whether current_version_id is still v7.
7. If yes, server advances current_version_id to v8.
8. If no, server applies conflict policy.
This prevents stale clients from silently overwriting newer versions.
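The base version check is a compare-and-set on the current version pointer. The sketch below uses a hypothetical FileRecord and commit_version and runs in a single process. A real metadata store would have to perform the comparison and the update atomically, for example inside a transaction or as a conditional write.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    current_version_id: str
    history: list[str] = field(default_factory=list)


def commit_version(record: FileRecord, base_version_id: str, new_version_id: str) -> str:
    """Advance the current version only if the client edited the latest version."""
    if record.current_version_id != base_version_id:
        return "conflict"  # the caller then applies the product's conflict policy
    record.history.append(record.current_version_id)
    record.current_version_id = new_version_id
    return "committed"


record = FileRecord(current_version_id="v7")
laptop_result = commit_version(record, base_version_id="v7", new_version_id="v8")
phone_result = commit_version(record, base_version_id="v7", new_version_id="v8-phone")
```

The laptop commit succeeds and the later phone commit is reported as a conflict, leaving v8 in place.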
Syncing A Reconnected Device
1. Device reconnects with last_sync_sequence = 84211.
2. Sync API returns metadata events after 84211.
3. Client applies creates, updates, moves, deletes, and permission changes.
4. Client decides which file versions need local bytes.
5. Client downloads missing chunks in the background.
6. Client advances cursor only after metadata events are applied safely.
Push can wake the device, but the durable sync journal should be the source of truth.
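A minimal sketch of the reconnect loop, assuming the journal is an ordered list of event dictionaries and local state is a map from file_id to name. The event shapes, the page size, and the function names are hypothetical.

```python
def apply_event(local_state: dict[str, str], event: dict) -> None:
    """Apply one metadata event to a local map of file_id -> name."""
    if event["event_type"] == "delete":
        local_state.pop(event["file_id"], None)
    else:  # create, rename, and move all reduce to an upsert in this sketch
        local_state[event["file_id"]] = event["name"]


def fetch_changes(journal: list[dict], cursor: int, limit: int) -> tuple[list[dict], bool]:
    """Return one page of events after the cursor, plus whether more remain."""
    pending = [e for e in journal if e["sequence"] > cursor]
    return pending[:limit], len(pending) > limit


def sync_device(journal: list[dict], local_state: dict[str, str],
                cursor: int, limit: int = 2) -> int:
    """Page through the journal and return the new cursor."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    while True:
        page, has_more = fetch_changes(journal, cursor, limit)
        for event in page:
            apply_event(local_state, event)
            cursor = event["sequence"]  # advance only after the event is applied
        if not has_more:
            return cursor
```

Because the cursor moves only after an event is applied, a crash in the middle of a page causes the device to re-request events from its last applied sequence rather than skip them. Applying the same upsert or delete twice leaves the state unchanged.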
Handling A Conflict
laptop commits version v8 based on v7
phone later commits version candidate based on v7
server sees current version is v8
server creates conflict version v9_conflict
server exposes both user-visible states
The product may create a conflict copy, ask the user to choose, or run a file-type-specific merge if it can do so safely.
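One way to sketch the conflict-copy option: when the base version is stale, the incoming version is stored under a sibling name instead of replacing the current file. The naming pattern and the functions conflict_copy_name and apply_commit are assumptions for illustration, not a documented convention of any product.

```python
import os


def conflict_copy_name(name: str, device_label: str) -> str:
    """Insert a conflict marker before the file extension."""
    stem, ext = os.path.splitext(name)
    return f"{stem} (conflict from {device_label}){ext}"


def apply_commit(folder: dict[str, str], name: str, base_version: str,
                 new_version: str, device_label: str) -> str:
    """folder maps file name -> current version id. Returns the name that was written."""
    if folder[name] == base_version:
        folder[name] = new_version
        return name
    # Stale base: keep the current file and store the incoming work beside it.
    candidate = conflict_copy_name(name, device_label)
    suffix = 2
    while candidate in folder:
        candidate = conflict_copy_name(name, f"{device_label} {suffix}")
        suffix += 1
    folder[candidate] = new_version
    return candidate
```

Both versions stay visible, so the user decides which one to keep rather than the server discarding either.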
8. Scaling Problems
Large File Uploads
Large uploads create long-lived connections, retry pressure, and storage load.
Chunking and resumable upload reduce wasted work, but they introduce metadata overhead and orphan cleanup.
Metadata Hotspots
Shared folders can become hot because many users read or update the same file tree.
The system may need folder-level partitioning, read replicas, caching, and rate limits around expensive listing or permission expansion.
Sync Fan-Out
One metadata change may need to reach many devices.
The system should avoid pushing full changes to every client synchronously. A push hint can tell devices to fetch from the sync journal when ready.
Reconnect Storms
After a network outage, client release, or regional recovery, many devices may reconnect and run sync at the same time.
Backpressure is important. Clients should use jitter, pagination, rate limits, and resumable sync.
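On the client side, jitter is often implemented as randomized exponential backoff. The sketch below picks a random delay between zero and a capped exponential ceiling. The base and cap values are illustrative assumptions, not recommendations.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Return a random delay in seconds for the given retry attempt (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap, base * (2 ** attempt))
    # Spreading delays across the whole interval keeps clients from retrying in lockstep.
    return rng.uniform(0, ceiling)
```

A client would sleep for backoff_delay(attempt) before each sync retry and reset the attempt counter after a successful sync.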
Version Storage Growth
Version history improves recovery, but it grows storage and metadata.
Chunk reuse helps, but retention, garbage collection, and legal deletion rules still need careful design.
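A small sketch of reference counting for shared chunks, assuming each retained version contributes one reference per distinct chunk in its manifest. The function names are hypothetical. A production collector would also need a grace period so that chunks from in-flight uploads are not deleted before their commit.

```python
from collections import Counter


def retain_version(ref_counts: Counter, manifest: list[str]) -> None:
    """Add one reference per distinct chunk used by a newly committed version."""
    for chunk_hash in set(manifest):
        ref_counts[chunk_hash] += 1


def release_version(ref_counts: Counter, manifest: list[str]) -> list[str]:
    """Drop a version's references and return chunks that nothing references anymore."""
    freed = []
    for chunk_hash in set(manifest):
        if chunk_hash not in ref_counts:
            continue  # unknown chunk: leave it for reconciliation to investigate
        ref_counts[chunk_hash] -= 1
        if ref_counts[chunk_hash] <= 0:
            del ref_counts[chunk_hash]
            freed.append(chunk_hash)
    return sorted(freed)
```

When a retention policy expires an old version, only chunks returned by release_version are safe candidates for deletion. Chunks still used by newer versions keep a positive count.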
Conflict Rate
Conflicts are not only an edge case. Shared folders, offline edits, and weak networks can make conflicts common.
Operators should be able to see conflict rate by file type, workspace, client version, and folder.
9. Distributed Systems Concepts
File Chunking
File chunking makes large files retryable and reusable. It moves the system away from whole-file transfer.
Metadata Sync
Metadata sync is the durable path that tells clients what changed. It usually matters more for correctness than raw byte movement.
Delta Transfer
Delta transfer avoids moving unchanged bytes. It saves bandwidth, but it does not decide correctness.
Version History
Version history makes bad overwrites recoverable. It gives users and operators a normal restore path.
Conflict Resolution
Conflict resolution protects user work when concurrent edits cannot be safely merged.
Cursor-Based Sync
Cursor-based sync lets devices recover missed changes after disconnecting.
Idempotency
Idempotency prevents client retries from creating duplicate files, duplicate versions, or repeated conflict copies.
Eventual Consistency
Eventual consistency appears because clients, metadata replicas, local indexes, previews, search, and file bytes may converge at different times.
Backpressure
Backpressure keeps sync recovery from becoming the incident during reconnect storms or storage slowness.
10. Reliability & Failure Handling
Upload Succeeds But Metadata Commit Fails
The uploaded chunks are now orphaned. They should expire through upload session cleanup or garbage collection.
The user should be able to retry the metadata commit if the client still has the manifest.
Metadata Commit Succeeds But Chunk Is Missing
This is more serious. A visible file version points at unavailable bytes.
The system should verify chunk existence before commit, monitor manifest integrity, and reconcile manifests against chunk storage.
Client Retries Commit After Timeout
The client may not know whether the server accepted the write.
Use a client mutation ID so retrying the same commit returns the same result instead of creating another version.
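A minimal sketch of that idea, using a hypothetical in-memory VersionService that remembers the result of each mutation ID. A real service would persist this mapping together with the version write so the two cannot diverge.

```python
class VersionService:
    def __init__(self) -> None:
        self.versions: list[tuple[str, str, str]] = []  # (version_id, file_id, manifest_id)
        self._results: dict[str, str] = {}              # mutation_id -> version_id

    def commit(self, mutation_id: str, file_id: str, manifest_id: str) -> str:
        """Create a version once per mutation ID; retries get the original result."""
        if mutation_id in self._results:
            return self._results[mutation_id]
        version_id = f"v{len(self.versions) + 1}"
        self.versions.append((version_id, file_id, manifest_id))
        self._results[mutation_id] = version_id
        return version_id


service = VersionService()
first = service.commit("m-1", "file-42", "manifest-a")
retry = service.commit("m-1", "file-42", "manifest-a")  # same mutation, e.g. after a timeout
```

The retry returns the same version ID as the first call, and only one version is stored.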
Delete Races With Upload
A user may delete a file while another device uploads a new version.
The metadata service must define whether the upload is rejected, restored as a new file, or converted into a conflict.
Permission Changes Lag
If a file is unshared, copies cached on clients and previously issued download URLs may still exist.
The system should use short-lived download authorization, permission-aware metadata sync, and server-side checks before issuing fresh access.
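Short-lived download authorization can be sketched as an HMAC-signed grant with an expiry time. The secret, the field layout, and the function names are assumptions for this example, and the sketch assumes identifiers never contain a colon. A signed grant cannot be revoked before it expires, which is why the lifetime should be short and permissions should be checked again whenever a new grant is issued.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical key; a real service would load this from a secret store


def sign_download(version_id: str, user_id: str, expires_at: int,
                  secret: bytes = SECRET) -> str:
    """Sign a grant that lets one user download one version until expires_at."""
    message = f"{version_id}:{user_id}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_download(version_id: str, user_id: str, expires_at: int,
                    signature: str, now: int, secret: bytes = SECRET) -> bool:
    """Reject expired grants and any grant whose fields were altered."""
    if now >= expires_at:
        return False
    expected = sign_download(version_id, user_id, expires_at, secret)
    return hmac.compare_digest(expected, signature)
```

Changing the user, the version, or the expiry invalidates the signature, so a client cannot extend its own access.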
Sync Journal Retention Expires
An old device may reconnect with a cursor older than the retained journal.
The device needs a snapshot sync path:
your cursor is too old, rebuild from current metadata snapshot
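A sketch of the fallback, assuming the server knows the oldest retained sequence and can produce a metadata snapshot tagged with the sequence it reflects. CursorTooOld, the parameter names, and the "upsert" and "delete" event types are hypothetical.

```python
class CursorTooOld(Exception):
    """Raised when the journal no longer holds the events a device needs."""


def events_after(journal: list[dict], cursor: int, oldest_retained: int) -> list[dict]:
    # The device needs event cursor + 1 next; if that event was trimmed, there is a gap.
    if cursor + 1 < oldest_retained:
        raise CursorTooOld(f"cursor {cursor} predates retained sequence {oldest_retained}")
    return [e for e in journal if e["sequence"] > cursor]


def apply_events(state: dict[str, str], events: list[dict], cursor: int) -> int:
    for event in events:
        if event["event_type"] == "delete":
            state.pop(event["file_id"], None)
        else:
            state[event["file_id"]] = event["name"]
        cursor = event["sequence"]
    return cursor


def resync(journal, oldest_retained, snapshot, snapshot_sequence, local_state, cursor):
    """Return (state, cursor, mode), where mode is 'incremental' or 'snapshot'."""
    try:
        events = events_after(journal, cursor, oldest_retained)
        state, mode = dict(local_state), "incremental"
    except CursorTooOld:
        # Discard stale local metadata, start from the snapshot, then catch up.
        state, cursor, mode = dict(snapshot), snapshot_sequence, "snapshot"
        events = events_after(journal, cursor, oldest_retained)
    cursor = apply_events(state, events, cursor)
    return state, cursor, mode
```

The snapshot path ends by replaying journal events after the snapshot sequence, so both paths converge on the same state and cursor.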
Reconciliation Finds Drift
Drift can happen between file nodes, versions, manifests, chunks, local search indexes, and sync events.
Repair should be careful. Some drift can be fixed automatically. Missing chunks for visible versions should page an operator or mark the version unavailable until repaired.
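A reconciliation pass can be sketched as a pure comparison that reports findings without repairing anything. The input shapes and the reconcile function are assumptions for this example. Orphans are reported only as candidates because a chunk may belong to an upload that has not committed yet.

```python
def reconcile(versions: dict[str, str], manifests: dict[str, list[str]],
              stored_chunks: set[str]) -> dict:
    """Compare retained versions against manifests and stored chunks.

    versions maps version_id -> manifest_id, manifests maps manifest_id -> chunk hashes.
    """
    referenced: set[str] = set()
    missing_manifests: list[str] = []
    missing_chunks: dict[str, list[str]] = {}
    for version_id, manifest_id in sorted(versions.items()):
        hashes = manifests.get(manifest_id)
        if hashes is None:
            missing_manifests.append(version_id)
            continue
        referenced.update(hashes)
        absent = sorted(set(hashes) - stored_chunks)
        if absent:
            missing_chunks[version_id] = absent  # unsafe drift: alert, do not auto-fix
    return {
        "missing_manifests": missing_manifests,
        "missing_chunks": missing_chunks,
        "orphan_candidates": sorted(stored_chunks - referenced),
    }
```

Entries under missing_chunks or missing_manifests correspond to the unsafe case described above, while orphan_candidates can feed a delayed garbage collection job.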
11. Real-World Company Approaches
Large cloud file products usually separate file metadata from blob storage because these workloads behave differently.
Common public architecture patterns include:
- chunked or resumable uploads for large files
- content hashes for integrity and deduplication
- metadata journals for client sync
- local client indexes for offline mode
- tombstones for deletes
- conflict copies when automatic merge is unsafe
- version history for restore
- async workers for previews, scanning, indexing, and cleanup
Do not assume every product uses the same exact model. The important lesson is where the pressure comes from: file bytes, metadata, sync state, and user-visible correctness each need separate handling.
12. Tradeoffs & Alternatives
Whole-File Upload vs Chunked Upload
Whole-file upload is simpler. Chunked upload is better for large files, weak networks, retries, and deduplication.
The cost is more metadata, manifest logic, and garbage collection.
Server-Authoritative Metadata vs Peer-Like Sync
A server-authoritative model is easier to reason about and repair.
Peer-like sync can feel more local-first, but conflict handling and trust boundaries get much harder.
Last-Writer-Wins vs Conflict Copies
Last-writer-wins keeps folders cleaner but can lose work.
Conflict copies preserve user data but create cleanup burden.
Push Sync vs Pull Sync
Push makes updates feel immediate.
Pull from a durable sync journal is the recovery path. A robust system usually uses push as a hint and pull as truth.
Long Version Retention vs Storage Cost
Long retention helps recovery and auditability.
It also increases storage cost, privacy complexity, and garbage collection difficulty.
Immediate Delete vs Tombstone
Immediate delete is simple but dangerous for offline sync.
Tombstones let offline clients learn about deletes, but they add retention and cleanup complexity.
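The difference shows up when an offline client reconnects and compares its local entry with the server's record. In this sketch, plan_action and the entry shapes are hypothetical: a local entry is a dictionary with a dirty flag, a server entry is a dictionary with deleted_at, and None means no record exists on that side.

```python
from typing import Optional


def plan_action(local: Optional[dict], server: Optional[dict]) -> str:
    """Decide what a reconnecting client should do for one file."""
    if server is None:
        # No server record at all: the file looks brand new to the server.
        return "upload" if local is not None else "noop"
    if server["deleted_at"] is not None:  # tombstone
        if local is None:
            return "noop"
        return "conflict" if local["dirty"] else "delete_local"
    if local is None:
        return "download"
    return "upload" if local["dirty"] else "noop"
```

With a tombstone, a clean local copy is removed and a locally edited copy becomes a conflict. With an immediate delete the server record is gone, so the same stale copy takes the "upload" branch and the deleted file reappears.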
13. Evolution Path
Stage 1: Simple Upload And Download
Start with direct uploads, a file metadata table, and latest-version download.
This works for small files and one device, but it breaks under offline edits and large files.
Stage 2: Chunked Uploads
Add upload sessions, chunk hashes, manifests, and resumable transfer.
Now large files can retry without starting from zero.
Stage 3: Metadata Journal
Add a sync event log and per-device cursors.
Clients can recover missed changes and avoid full scans.
Stage 4: Version History And Conflict Handling
Add base version checks, version history, restore, and conflict copies.
The product can protect user work during offline and concurrent edits.
Stage 5: Scale, Repair, And Policy
Add folder partitioning, backpressure, permission propagation, scanning, preview generation, garbage collection, and reconciliation.
The system becomes operationally mature instead of merely functional.
14. Key Engineering Lessons
- File sync is mostly a metadata consistency problem attached to expensive byte movement.
- Uploaded chunks should not become user-visible until metadata commits a version.
- Offline clients require a durable sync journal, not just push notifications.
- Conflict resolution should preserve user work when safety is uncertain.
- Version history turns overwrite bugs into recoverable events.
- Deletes need tombstones when offline devices exist.
- Reconnect storms can turn sync recovery into a production incident.
- Reconciliation is part of the product, because metadata, chunks, versions, and indexes can drift.