Short-Code Generation

How systems create compact, unique identifiers for resources such as shortened URLs while balancing length, collision risk, predictability, and coordination.

Concepts Covered

  • Identifier generation
  • Base62 encoding
  • Code space
  • Collision probability
  • Random token generation
  • Sequential ID encoding
  • Pre-generated code pools
  • Predictability tradeoffs

Definition

Short-code generation is the process of creating compact identifiers that map to longer resources.

In a URL shortener, the code in:

https://arc.fl/x7Kp9Q

is the short code. It identifies the destination URL stored by the system.

The short code looks small, but it sits at the center of the product. Every redirect depends on it being unique, stable, and routable.

The Pain That Forces Careful Generation

At small scale, generating a random string feels easy:

pick 6 random characters
save code -> destination_url

That works until the system has real users, real traffic, and many servers creating links at the same time.

Several things can go wrong:

  • two requests generate the same code
  • codes become predictable
  • the code space is too small
  • one generator service falls behind
  • different regions generate overlapping codes
  • duplicate inserts create broken mappings

A short code is not just decoration in the URL. It is the key used to retrieve the destination. If two links receive the same code, one user's link may overwrite or conflict with another user's link.

Mental Model

Short-code generation is a uniqueness problem under concurrency.

The system must answer:

How do many servers create small IDs without creating duplicates?

There are two broad approaches:

  1. Generate codes randomly and check uniqueness.
  2. Generate a unique numeric ID first, then encode it into a shorter representation.

Both are valid. They fail in different ways.

Base62 Encoding

Base62 uses 62 characters:

0-9, a-z, A-Z

Each character gives 62 possibilities. A 6-character code has:

62^6 = 56,800,235,584

possible values.

That number is large, but it does not mean random collisions are impossible. Random generation can still pick a code that already exists. As the number of used codes grows, collisions become more likely.
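To put rough numbers on this, here is a minimal sketch that uses the standard birthday approximation and assumes codes are drawn uniformly at random. The function names are illustrative and not part of any particular system.

```python
import math

ALPHABET_SIZE = 62


def code_space(length: int) -> int:
    """Number of distinct base62 codes of the given length."""
    return ALPHABET_SIZE ** length


def collision_probability(codes_issued: int, length: int) -> float:
    """Approximate chance that at least two of the issued codes are equal.

    Birthday approximation: 1 - exp(-n(n-1) / (2N)), with N the code space.
    """
    n = codes_issued
    space = code_space(length)
    return -math.expm1(-n * (n - 1) / (2 * space))


def next_insert_collision_probability(codes_in_use: int, length: int) -> float:
    """Chance that one fresh random code matches a code already in use."""
    return codes_in_use / code_space(length)


if __name__ == "__main__":
    for issued in (10_000, 100_000, 300_000, 1_000_000):
        print(issued, round(collision_probability(issued, 6), 4))
```

Under this approximation, a few hundred thousand random 6-character codes are already enough to make at least one collision more likely than not, even though the space holds tens of billions of values. That is why collision handling is needed long before the space looks full.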

Base62 is popular because it produces URL-friendly strings while keeping codes compact.

Strategy 1: Random Codes

Random generation:

candidate = random_base62(7)
try insert candidate
if collision, retry

Benefits:

  • simple to distribute across servers
  • codes are hard to guess
  • no central sequence generator is required

Costs:

  • collisions must be handled
  • randomness must come from a cryptographically secure source if codes need to be hard to guess
  • retry loops can increase latency as the code space fills
  • uniqueness still needs a database constraint

The database should enforce a unique index on short_code. Application checks alone are not enough because two servers can check at the same time and both believe the code is available.
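The insert-and-retry loop can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite table named links and hypothetical helpers open_store, random_base62, and create_link. A production system would use its own database and error types.

```python
import secrets
import sqlite3

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def open_store() -> sqlite3.Connection:
    """Create an in-memory table whose UNIQUE constraint enforces uniqueness."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links ("
        " short_code TEXT NOT NULL UNIQUE,"
        " destination_url TEXT NOT NULL)"
    )
    return conn


def random_base62(length: int) -> str:
    """Draw each character from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def create_link(conn, destination_url, length=7, max_attempts=5,
                generate=random_base62) -> str:
    """Insert first and let the database reject duplicates; retry on conflict."""
    for _ in range(max_attempts):
        candidate = generate(length)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO links (short_code, destination_url) VALUES (?, ?)",
                    (candidate, destination_url),
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # another link already holds this code; try a new one
    raise RuntimeError("could not allocate a unique short code")
```

The sketch never runs a separate "does this code exist" query. The insert itself is the check, so two servers racing on the same candidate cannot both succeed.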

Strategy 2: Sequential ID Plus Encoding

Another approach:

1. Generate unique numeric ID: 12500001
2. Encode it as base62: QrOV (with the alphabet order 0-9, a-z, A-Z)
3. Use encoded value as short code.
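The encoding step can be sketched like this. It is a minimal example, assuming the alphabet order 0-9, a-z, A-Z listed earlier. The exact output for a given ID depends on that order, and the function names are illustrative.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_base62(number: int) -> str:
    """Convert a non-negative integer to its base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(digits))


def decode_base62(code: str) -> int:
    """Convert a base62 string back to the integer it encodes."""
    value = 0
    for ch in code:
        value = value * BASE + INDEX[ch]  # KeyError on characters outside the alphabet
    return value


if __name__ == "__main__":
    print(encode_base62(12500001))
```

Because the mapping is a bijection between integers and strings, two different IDs can never produce the same code. Uniqueness is inherited entirely from the ID generator.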

Benefits:

  • no collisions as long as the ID generator never issues the same ID twice
  • compact codes
  • easy to reason about uniqueness

Costs:

  • codes may be predictable
  • a central ID generator can become a dependency
  • sequential IDs can reveal business volume
  • multi-region generation needs coordination

Predictability matters when links are meant to stay private. If users can guess nearby short codes, they can enumerate links that were never shared with them. Some systems salt or shuffle the IDs before encoding, or use random codes instead, to reduce enumeration risk.
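One simple way to shuffle IDs is a multiplicative permutation of the code space. The sketch below is an assumption for illustration, not a method the text prescribes: MULTIPLIER is a hypothetical constant, and the scheme only hides the sequence from casual inspection. It is not cryptographically secure, since anyone who learns the multiplier, or observes enough codes, can reverse it.

```python
import math

CODE_LENGTH = 6
SPACE = 62 ** CODE_LENGTH

# Hypothetical constant. It must share no factor with SPACE (2^6 * 31^6),
# otherwise the mapping stops being one-to-one.
MULTIPLIER = 2_654_435_761
assert math.gcd(MULTIPLIER, SPACE) == 1

# Modular inverse, available through pow() since Python 3.8.
INVERSE = pow(MULTIPLIER, -1, SPACE)


def scramble(sequential_id: int) -> int:
    """Map a sequential ID to a scattered position in the same code space."""
    if not 0 <= sequential_id < SPACE:
        raise ValueError("ID is outside the code space")
    return (sequential_id * MULTIPLIER) % SPACE


def unscramble(scrambled_id: int) -> int:
    """Recover the original sequential ID."""
    return (scrambled_id * INVERSE) % SPACE
```

The scrambled value would then be base62-encoded in place of the raw ID. Because the mapping is a permutation, uniqueness is preserved. Systems that need real resistance to enumeration generally rely on random codes or a keyed cipher instead.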

Strategy 3: Pre-Generated Pools

A pre-generated pool creates codes before users need them.

code_pool
- x7Kp9Q: available
- y8Mz2A: available
- q1Vb0L: reserved

At request time, the service reserves one available code and attaches it to the new link.

Benefits:

  • fast request-time assignment
  • collision handling happens ahead of time
  • generation can be isolated from user requests

Costs:

  • the pool can run low
  • reservation must be atomic
  • unused reserved codes need cleanup
  • operational monitoring becomes important

This can be useful at high scale, but it adds another inventory system.
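Atomic reservation can be sketched as follows. This is a minimal example, assuming a SQLite table named code_pool with a status column and hypothetical helpers open_pool, reserve_code, and pool_depth. Other databases offer their own primitives for the same job.

```python
import sqlite3


def open_pool(codes) -> sqlite3.Connection:
    """Create an in-memory pool; isolation_level=None leaves transactions to us."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE code_pool ("
        " code TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'available')"
    )
    conn.executemany("INSERT INTO code_pool (code) VALUES (?)", [(c,) for c in codes])
    return conn


def reserve_code(conn):
    """Reserve one available code, or return None when the pool is empty."""
    # BEGIN IMMEDIATE takes the write lock up front, so the SELECT and the
    # UPDATE below cannot interleave with another writer's reservation.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT code FROM code_pool WHERE status = 'available' LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return None
        conn.execute("UPDATE code_pool SET status = 'reserved' WHERE code = ?", (row[0],))
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


def pool_depth(conn) -> int:
    """Number of codes still available; a natural metric to alert on."""
    return conn.execute(
        "SELECT COUNT(*) FROM code_pool WHERE status = 'available'"
    ).fetchone()[0]
```

A background job would refill the pool whenever pool_depth falls below a threshold, and a cleanup job would release codes that were reserved but never attached to a link.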

Operational Reality

Important signals:

  • collision rate
  • code generation latency
  • insert failures from unique constraints
  • code pool depth
  • reserved-but-unused codes
  • generator service errors
  • distribution of code prefixes
  • abuse from code enumeration

Short-code generation is usually solved early, then forgotten. That is dangerous. A generator bug can cause duplicate links, broken redirects, or privacy leaks, and can force data migrations that are painful to carry out later.