Key Concepts¶
This page explains the core ideas behind UVID encoding.
Linearized Genome Position¶
Traditional variant representations use a chromosome name and a position within that chromosome. UVID instead maps all positions into a single linearized coordinate space by concatenating chromosomes end-to-end.
Each assembly (GRCh37, GRCh38) has its own offset table. For example, in GRCh38:
- chr1 positions map to offsets 0 -- 248,956,421
- chr2 positions map to offsets 248,956,422 -- 491,149,950
- ...and so on through chrM
This linearized position occupies the top 32 bits of the UVID, which means UVIDs sort in natural genomic order when compared as unsigned 128-bit integers.
Bit 127 is not always zero
Linearized positions for later chromosomes (e.g. chr14 onwards in GRCh38) exceed 2^31, so the most significant bit of the UVID will be 1. UVIDs must be treated as unsigned 128-bit integers for correct sorting.
Assembly Field¶
Bits 95-94 encode the genome assembly as a 2-bit value:
| Value | Assembly |
|---|---|
| 0 | GRCh37 / hg19 |
| 1 | GRCh38 / hg38 |
| 2-3 | Reserved |
This means UVIDs for GRCh37 and GRCh38 sort into separate groups, with GRCh37 sorting before GRCh38 for the same chromosome and position.
Allele Encoding¶
Each allele (REF and ALT) gets a 46-bit field with an independent mode bit:
String Mode (mode = 0)¶
Used when the allele is 1-20 bases of pure A/C/G/T. Stores the exact sequence.
| Field | Bits | Description |
|---|---|---|
| Mode | 1 | Always 0 |
| Length | 5 | Number of bases (1-20) |
| DNA | 40 | 2-bit encoded bases (A=00, C=01, G=10, T=11) |
String mode alleles sort before length mode alleles at the same position, providing a natural ordering where short exact variants come first.
Length Mode (mode = 1)¶
Used for alleles longer than 20 bases, or alleles containing non-ACGT characters (e.g. N). Stores the length and a fingerprint.
| Field | Bits | Description |
|---|---|---|
| Mode | 1 | Always 1 |
| Length | 28 | Sequence length in bases (max 268,435,455) |
| Fingerprint | 17 | Rabin fingerprint of the sequence |
Decoding a length-mode allele returns N-repeats as a placeholder. The actual sequence can always be recovered from the reference genome (for REF alleles) or the original VCF.
Rabin Fingerprint¶
Length-mode alleles include a 17-bit Rabin fingerprint to reduce collisions between different sequences of the same length at the same locus.
Polynomial: x^17 + x^3 + 1 (0x20009 over GF(2))
Algorithm: DNA is treated as a 2-bit-per-base bitstream (A=00, C=01, G=10, T=11). For each base, the fingerprint register is shifted left by 2 and the 2-bit base encoding is OR'd in. After each shift, if the register exceeds 17 bits, the polynomial is XOR'd to reduce it.
N bases are treated as A (0b00) -- a zero value requiring no special case.
Collision resistance¶
A UVID collision requires two distinct variants to share all of: the same locus (chromosome + position), the same assembly, the same allele mode, and the same allele content encoding. String-mode alleles (up to 20 bases) are stored exactly, so they never collide. Collisions can only occur between length-mode alleles that happen to have the same sequence length and the same 17-bit Rabin fingerprint.
The 17-bit fingerprint has 131,072 possible values. For two specific same-length sequences at the same locus, the probability of a fingerprint collision is approximately 1 / 131,072 (~7.6 x 10^-6).
Using the birthday-bound approximation, the probability that any pair collides among n distinct same-length alleles at a single locus is roughly n^2 / (2 x 131,072).
131,072
This is just the cardinality of distinct integers that fit in 17 bits 2^17
In practice, the number of distinct same-length alleles at any single locus is very small:
| Same-length alleles at one locus | Collision probability |
|---|---|
| 2 | ~1 in 131,072 (0.0008%) |
| 5 | ~1 in 10,486 (0.010%) |
| 10 | ~1 in 2,621 (0.038%) |
| 50 | ~1 in 105 (0.95%) |
ClinVar validation
4,397,869 ClinVar records processed with zero collisions.
At genome scale, what matters is whether any locus across an entire dataset produces a collision. Without the fingerprint -- encoding only by length -- the ClinVar dataset (4.4 million variants) would contain 113 collisions affecting 265 records, all from distinct long alleles at the same locus that happen to share the same length. The Rabin fingerprint separates every one of these cases, bringing collisions to zero. This works because the vast majority of variants are short (SNVs are 1 bp) and stored exactly in string mode, so the fingerprint only needs to discriminate among the small number of length-mode alleles that share a locus and length.
UUIDv5¶
Every UVID can be converted to a standard UUIDv5 for interoperability with systems that expect UUIDs:
uvid = UVID.encode("chr1", 100, "A", "G")
print(uvid.uuid5()) # e.g. a1b2c3d4-e5f6-5789-abcd-ef0123456789
The namespace UUID is 2696985c-755c-53de-b6b9-1745af20d0fd, derived from uuid5(NAMESPACE_OID, "UVID"). The name bytes are the raw 128-bit UVID value.
UUIDv5 does not preserve sort order
While the UVID itself sorts by genomic position, the UUIDv5 representation does not. Use the raw UVID (hex or integer) when sort order matters.
Sort Order¶
UVIDs are designed to sort in a useful order when compared as unsigned 128-bit integers:
- Linearized position (bits 127-96) -- genomic coordinate
- Assembly (bits 95-94) -- GRCh37 before GRCh38
- REF mode (bit 93) -- string mode (exact) before length mode
- REF content (bits 92-47) -- bases or length+fingerprint
- ALT mode (bit 46) -- string mode before length mode
- ALT content (bits 45-0) -- bases or length+fingerprint
This means variants at the same position are grouped together, with short exact alleles sorting before longer ones.