Unique Name Ranker

Estimate how rare a person's name is against the current US adult population.

Names (one per line)

Paste names on the left and click Rank.
Format: First Last per line. Commas, middle names, and suffixes are tolerated.

How the ranking works

Self-contained, single-file HTML. All scoring runs in your browser against bundled census + birth-record tables. No server, no network, no data leaves the page.

1. The question we're answering

For a person whose name we know, how rare is that name among US adults alive today? Not "rare in 1985" or "rare among newborns", but rare against the pooled population of every American currently old enough to be in our investigation set.

Equivalently: if you picked a random US adult, what is the probability that their name is more common than this one? That probability is the score we report (high = rare).

2. Data sources

First names (Social Security Administration)

SSA publishes one file per birth year (yob1880.txt through yob2024.txt) with the count of every name given to ≥5 babies that year. We use all 145 yearly files.

Coverage: every US birth with a Social Security record since 1880
Caveat: minimum threshold of 5 births/year hides ultra-rare spellings
Caveat: misses people who immigrated as adults (no US birth record)
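
As a rough illustration of how one yearly file could be read, the sketch below assumes the usual name,sex,count row layout of the yobYYYY.txt files and pools both sexes under one uppercased key. The function name parse_yob and the decision to pool sexes are assumptions made for this example, not necessarily what build_data.py does.

```python
import csv
import io


def parse_yob(text):
    """Parse the text of one yobYYYY.txt file into {NAME: count}.

    Assumes rows look like "Mary,F,7065". Counts for the same name under
    both sexes are summed (an assumption for this sketch).
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        name, _sex, n = row
        key = name.upper()
        counts[key] = counts.get(key, 0) + int(n)
    return counts


# Illustrative rows only.
sample = "Mary,F,7065\nJohn,M,9655\nJohn,F,46\n"
print(parse_yob(sample))
```

Names given to fewer than 5 babies in a year never appear in these files, so they never enter the returned dictionary.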

Last names (US Decennial Census)

The Census Bureau publishes a list of every surname held by ≥100 people in each decennial census. We pool all four:

1990 (88,799 names) — given as percent-of-total, rescaled to count units to match the others
2000 (151,671 names)
2010 (162,254 names)
2020 (156,620 names)

Union after dedup: 167,464 unique surnames. Pooling smooths single-decade noise (Nguyen, Garcia trending up) and covers immigrants the SSA file misses.

3. Survivorship weighting (the key idea)

SSA births include people who died decades ago. To approximate living adults, each year's counts are multiplied by the probability a person born that year is still alive in 2026, derived from the SSA actuarial life table:

Born	Age in 2026	P(alive)	What this fixes
2010	16	0.99	Recent names (Aiden) at full weight
1985	41	0.97	Brittany peak, all alive
1955	71	0.71	Boomer names taper
1935	91	0.18	Mildred mostly gone
1920	106	~0	Old names zeroed out

Without this step, including the full 1880-2024 window would inflate Mildred/Eugene/Gladys with millions of Silent Generation births for people who have since died. With it, their weighted counts shrink to a realistic size.

Sanity: weighted total ≈ 270M, real US adult population ≈ 258M. Surnames are not survivorship-weighted (Census already snapshots living people).
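
The weighting step can be sketched as follows. This is a minimal example that interpolates linearly between only the five rows shown in the table above (the real table is denser), treats the "~0" entry as exactly 0, and clamps ages outside the listed range. The names SURVIVAL_POINTS, p_alive, and weighted_counts are hypothetical.

```python
from bisect import bisect_right

REFERENCE_YEAR = 2026

# (age in 2026, P(alive)) copied from the table above; "~0" is taken as 0.0.
SURVIVAL_POINTS = [(16, 0.99), (41, 0.97), (71, 0.71), (91, 0.18), (106, 0.0)]


def p_alive(birth_year, reference_year=REFERENCE_YEAR):
    """Linearly interpolate P(alive) for a birth year; clamp outside the table."""
    age = reference_year - birth_year
    ages = [a for a, _ in SURVIVAL_POINTS]
    if age <= ages[0]:
        return SURVIVAL_POINTS[0][1]
    if age >= ages[-1]:
        return SURVIVAL_POINTS[-1][1]
    i = bisect_right(ages, age)
    (a0, p0), (a1, p1) = SURVIVAL_POINTS[i - 1], SURVIVAL_POINTS[i]
    return p0 + (p1 - p0) * (age - a0) / (a1 - a0)


def weighted_counts(counts_by_year):
    """Turn {year: {name: births}} into {name: estimated living count}."""
    totals = {}
    for year, counts in counts_by_year.items():
        weight = p_alive(year)
        for name, births in counts.items():
            totals[name] = totals.get(name, 0.0) + weight * births
    return totals


# Made-up birth counts, only to show the effect of the weights.
print(weighted_counts({1985: {"BRITTANY": 100}, 1935: {"MILDRED": 100}}))
```

With equal raw counts, the 1935 name ends up with under a fifth of the weight of the 1985 name, which is the correction the table describes.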

4. Computing the score

1. Sort all names by weighted count, descending
2. Walk the sorted list, accumulating cumulative count
3. For each name N:
     own_share         = N.count / total
     more_common_share = (cumulative_at_N - N.count) / total
   Reported score (p) = more_common_share, range 0..1

So p=0.0 means "no name is more common" (James, Mary). p=0.99 means "99% of US adults have a name more common than this." First and last names are each scored separately.
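
The three steps above can be sketched in a few lines. This is a minimal version, assuming the input is a {name: weighted count} dictionary. The function name rarity_scores is hypothetical, and names with exactly equal counts are ordered arbitrarily here, so tied names can receive slightly different scores.

```python
def rarity_scores(weighted):
    """Map each name to the share of the population with a more common name."""
    total = sum(weighted.values())
    scores = {}
    cumulative = 0.0  # count held by all names already visited
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        # Same as (cumulative_at_N - N.count) / total, because the running
        # sum is read before this name's own count is added to it.
        scores[name] = cumulative / total
        cumulative += count
    return scores


# Toy population of 100 people, made up for illustration.
print(rarity_scores({"JAMES": 50, "MARY": 30, "ZEPHYR": 20}))
```

In the toy population, the most common name gets 0.0 and the rarest gets the combined share of everyone ranked above it.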

Combined score

combined = (first_p + last_p) / 2, a simple mean that treats first and last as equally informative. If only one half is found, the missing half is currently treated as p=1.0 (max rare). This is a known weakness: it inflates scores for misspelled or foreign names.

Verdict buckets

Combined p	Verdict
< 0.30	Common
0.30 to < 0.65	Uncommon
0.65 to < 0.90	Rare
≥ 0.90	Very Rare

Cutoffs are arbitrary and easy to tune. If both halves are missing, the verdict is "Unknown" (sorted to the top of the CSV output for review).
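
A minimal sketch of the combined score and the verdict buckets, assuming a missing half is represented as None and that each bucket includes its lower cutoff. The function names are hypothetical, and the first-name score of 0.16 in the usage line is an illustrative value, not a figure from the data.

```python
def combined_score(first_p, last_p):
    """Mean of the two scores; a single missing half falls back to 1.0."""
    if first_p is None and last_p is None:
        return None  # nothing found at all
    first = 1.0 if first_p is None else first_p
    last = 1.0 if last_p is None else last_p
    return (first + last) / 2


def verdict(combined):
    """Bucket a combined score using the cutoffs from the table above."""
    if combined is None:
        return "Unknown"
    if combined < 0.30:
        return "Common"
    if combined < 0.65:
        return "Uncommon"
    if combined < 0.90:
        return "Rare"
    return "Very Rare"


# A fairly common first name with an unmatched surname.
score = combined_score(0.16, None)
print(score, verdict(score))
```

The usage line shows the fallback at work: an unmatched surname alone lifts a common first name into the "Uncommon" bucket.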

5. Lookup logic

Names are normalized: uppercased and stripped of everything except A-Z, hyphens, and apostrophes. Compound surnames are tried multiple ways:

"Yeh Liu"  →  try "YEHLIU" (whole), "YEH", "LIU"
              pick the most-common hit (lowest rank)
"Clark-Smith" → try "CLARKSMITH", "CLARK", "SMITH"
"Jr.", "Sr.", "II", "III"... → stripped before parsing

This raised the LAPROB CSV last-name hit rate from 85% to 94%.
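
The candidate generation described above might look like the sketch below. It assumes surname parts are split on spaces and hyphens, that only the four suffixes named above are stripped (the original list continues), and that "most-common hit" means the candidate with the lowest score p. The names SUFFIXES, surname_candidates, and best_surname_hit are hypothetical.

```python
import re

# Only the suffixes named in the text; the real list is longer.
SUFFIXES = {"JR", "SR", "II", "III"}


def surname_candidates(raw):
    """Return lookup keys for a surname: the joined whole, then each part."""
    pieces = re.split(r"[\s\-]+", raw.upper())
    parts = [re.sub(r"[^A-Z']", "", piece) for piece in pieces]
    parts = [p for p in parts if p and p not in SUFFIXES]
    if not parts:
        return []
    candidates = ["".join(parts)]
    if len(parts) > 1:
        candidates.extend(parts)
    return list(dict.fromkeys(candidates))  # drop duplicates, keep order


def best_surname_hit(raw, last_scores):
    """Pick the matching candidate with the lowest p, or None if none match."""
    hits = [(last_scores[c], c) for c in surname_candidates(raw) if c in last_scores]
    return min(hits) if hits else None


print(surname_candidates("Yeh Liu"))
print(surname_candidates("Clark-Smith Jr."))
```

Choosing the most common hit is conservative: a compound surname is scored as no rarer than its most common component.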

6. Known limitations

Typos miss silently. "Stathom" → no match (would need fuzzy/Soundex fallback).
Half-missing inflates combined score. "Jason Stathom" scores ~58% combined despite Stathom being unknown, because we use 1.0 as the missing-side fallback.
Adult immigrants undercounted on first names. SSA only covers births in the US. A 60-year-old immigrant named Vladimir won't have his name reflected in the SSA pool.
Census surname threshold ≥100. Anyone with a surname held by <100 Americans is invisible — they're definitionally rare, but we have no count.
Survivorship table is unisex and bucketed. Real survival differs by sex (~5yr gap) and is smoother than 5-year buckets; we only interpolate linearly between bucket values.
Combined score is a flat mean. A common-rare combo and an uncommon-uncommon combo both score the same — a multiplicative score (joint rarity assuming independence) might rank better.
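
To make the flat-mean limitation concrete, the sketch below compares the current mean with the joint-rarity formula 1 - (1-p_first)(1-p_last) listed in section 8. The input scores are made-up values chosen so that both pairs share the same mean, and the function names are hypothetical.

```python
def mean_score(p_first, p_last):
    """Current combined score: simple mean of the two halves."""
    return (p_first + p_last) / 2


def joint_rarity(p_first, p_last):
    """Alternative from section 8, assuming the two halves are independent."""
    return 1 - (1 - p_first) * (1 - p_last)


pairs = {
    "common + rare": (0.1, 0.9),
    "uncommon + uncommon": (0.5, 0.5),
}
for label, (p_first, p_last) in pairs.items():
    mean = mean_score(p_first, p_last)
    joint = joint_rarity(p_first, p_last)
    print(f"{label}: mean={mean:.2f} joint={joint:.2f}")
```

Both pairs have a mean of 0.50, while the joint score separates them (about 0.91 versus 0.75). Whether that ordering is actually better for ranking would need to be checked against real data.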

7. Files in this project

build_data.py      Builds firstnames.json + lastnames.json from raw SSA + Census
build_standalone.py Inlines the JSONs into this single-file HTML
rank_csv.py        Headless CLI: reads a CSV, writes a sorted-ranked CSV
index.html         The web UI (served version, fetches JSONs)
unique-name-ranker-standalone.html   This file (everything bundled)

8. Things a reviewer might want to optimize

Add Levenshtein-1 / Soundex / Metaphone fallback for typos
Switch combined score from mean to 1 - (1-p_first)(1-p_last) (joint rarity)
Use sex-specific survivorship if a sex column is provided
Pull in 1900-1980 Census surname records (Ancestry has them, not free)
Bayesian smoothing for tail names (anything below SSA threshold)
Optional age-range input that re-weights survivorship toward a target birth window
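
As one possible shape for the Levenshtein-1 fallback in the list above, the sketch below checks whether two names differ by at most one insertion, deletion, or substitution, then scans the score table for near matches. It is a linear scan meant only to show the idea. The function names are hypothetical, preferring the most common near match is an assumption, and the scores in the usage line are made up.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insert, delete, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]  # one extra character in b at position i


def fuzzy_lookup(name, scores):
    """Exact hit if present, else the most common name within one edit."""
    if name in scores:
        return name, scores[name]
    near = [(p, cand) for cand, p in scores.items() if within_one_edit(name, cand)]
    if not near:
        return None
    p, cand = min(near)
    return cand, p


print(fuzzy_lookup("STATHOM", {"STATHAM": 0.7, "SMITH": 0.0}))
```

A fallback like this would trade silent misses for occasional wrong matches, so near-match results are probably worth flagging in the output rather than treating as exact hits.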