Bit Layout

Visual diagrams of the 128-bit UVID structure using Mermaid packet diagrams. All bit numbering is MSB-first (bit 0 is the most significant bit in the diagram).

Bit numbering convention

Mermaid packet diagrams number bits from 0 at the top-left. In these diagrams, bit 0 corresponds to UVID bit 127 (MSB). The mapping is: diagram_bit = 127 - uvid_bit.
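The mapping can be sketched in Python, treating a UVID as a plain 128-bit integer. The helper names `diagram_to_uvid_bit` and `extract_field` are hypothetical and chosen only for this illustration; the field ranges used in the example (position at diagram bits 0-31, assembly at 32-33) are taken from the layouts below.

```python
UVID_BITS = 128


def diagram_to_uvid_bit(diagram_bit: int) -> int:
    """Map a diagram bit index (0 = MSB) to a UVID bit index (0 = LSB)."""
    if not 0 <= diagram_bit < UVID_BITS:
        raise ValueError("bit index out of range")
    return UVID_BITS - 1 - diagram_bit


def extract_field(uvid: int, first: int, last: int) -> int:
    """Return the field spanning diagram bits first..last (inclusive)."""
    width = last - first + 1
    shift = UVID_BITS - 1 - last  # bits to the right of the field
    return (uvid >> shift) & ((1 << width) - 1)


# Illustrative value: top 32 bits are 0xDEADBEEF, the next 2 bits are 0b10.
example = (0xDEADBEEF << 96) | (0b10 << 94)
position = extract_field(example, 0, 31)   # diagram bits 0-31
assembly = extract_field(example, 32, 33)  # diagram bits 32-33
```

Working with diagram indices directly keeps the code aligned with the figures; the subtraction from 127 happens in one place only.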

Full 128-bit Layouts

String Mode

Both alleles in string mode (exact 2-bit DNA, up to 20 bases each):

packet-beta
  0-31: "Linearized Position (32 bits)"
  32-33: "Asm"
  34: "0"
  35-39: "REF Len (5b)"
  40-63: "REF 2-bit DNA ←"
  64-79: "→ REF DNA (40b total)"
  80: "0"
  81-85: "ALT Len (5b)"
  86-95: "ALT DNA ←"
  96-127: "→ ALT DNA (40b total)"

Row breakdown (32 bits per row):

Row  Bits    Content
1    0-31    Linearized genome position (32 bits)
2    32-63   Assembly (2b), then REF: mode=0 (1b), length (5b), DNA start (24b)
3    64-95   REF DNA continued (16b), then ALT: mode=0 (1b), length (5b), DNA start (10b)
4    96-127  ALT DNA continued (32b)

Length Mode

Both alleles in length mode (28-bit length + 17-bit Rabin fingerprint each):

packet-beta
  0-31: "Linearized Position (32 bits)"
  32-33: "Asm"
  34: "1"
  35-62: "REF Sequence Length (28b)"
  63-79: "REF Rabin FP (17b)"
  80: "1"
  81-95: "ALT Length ←"
  96-108: "→ ALT Len (28b)"
  109-127: "ALT Rabin FP (17b)"

Row breakdown (32 bits per row):

Row  Bits    Content
1    0-31    Linearized genome position (32 bits)
2    32-63   Assembly (2b), then REF: mode=1 (1b), length (28b), Rabin fingerprint start (1b)
3    64-95   REF Rabin fingerprint continued (16b), then ALT: mode=1 (1b), length start (15b)
4    96-127  ALT length continued (13b), ALT Rabin fingerprint (17b)

Allele Detail

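Each allele field is 46 bits: a mode bit followed by 45 bits whose interpretation depends on that mode. The position (32 bits), assembly (2 bits), and the two allele fields together account for 126 bits, so the full layouts above draw two more bits than the allele fields described here account for.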

String Mode Allele (mode = 0)

---
config:
  packet:
    bitsPerRow: 46
---
packet-beta
  0: "M=0"
  1-5: "Length (5b)"
  6-45: "2-bit DNA (40b = up to 20 bases)"

• Mode (1 bit): 0 = string mode
• Length (5 bits): number of bases, 1-20
• DNA (40 bits): 2-bit encoding per base (A=00, C=01, G=10, T=11), left-aligned
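A minimal sketch of packing and unpacking a string-mode allele follows. The function names are hypothetical, and two details are assumptions not stated above: unused low-order DNA bits are zero-padded, and the first base occupies the most significant DNA bits (one reading of "left-aligned").

```python
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
MAX_BASES = 20  # 40 DNA bits / 2 bits per base


def encode_string_allele(seq: str) -> int:
    """Pack a 1-20 base sequence into a 46-bit string-mode field."""
    if not 1 <= len(seq) <= MAX_BASES:
        raise ValueError("string mode holds 1 to 20 bases")
    dna = 0
    for base in seq:
        dna = (dna << 2) | BASE_CODES[base]  # KeyError on non-ACGT input
    dna <<= 2 * (MAX_BASES - len(seq))       # left-align (assumed zero padding)
    mode = 0
    return (mode << 45) | (len(seq) << 40) | dna


def decode_string_allele(field: int) -> str:
    """Recover the sequence from a 46-bit string-mode field."""
    if (field >> 45) & 1:
        raise ValueError("mode bit is 1: not a string-mode allele")
    length = (field >> 40) & 0x1F
    if length > MAX_BASES:
        raise ValueError("length exceeds 20 bases")
    dna = field & ((1 << 40) - 1)
    # Base i sits 2*i bits below the top of the 40-bit DNA area.
    return "".join("ACGT"[(dna >> (38 - 2 * i)) & 0b11] for i in range(length))
```

Because the length is stored explicitly, trailing padding is never mistaken for `A` bases even though both are encoded as `00`.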

Length Mode Allele (mode = 1)

---
config:
  packet:
    bitsPerRow: 46
---
packet-beta
  0: "M=1"
  1-28: "Sequence Length (28b)"
  29-45: "Rabin Fingerprint (17b)"

• Mode (1 bit): 1 = length mode
• Length (28 bits): sequence length in bases (max 268,435,455 -- sufficient for chr1 at ~249M bp)
• Fingerprint (17 bits): Rabin fingerprint using polynomial x^17 + x^3 + 1
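The following sketch shows one way a length-mode field could be assembled. It is not confirmed as the reference implementation: how the sequence is serialized before fingerprinting is not specified above, so this example assumes the same 2-bit base codes as string mode, fed most significant bit first, with the fingerprint taken as the remainder of that bit stream modulo x^17 + x^3 + 1 over GF(2). The function names are hypothetical.

```python
POLY = (1 << 17) | (1 << 3) | 1  # x^17 + x^3 + 1
FP_BITS = 17
LENGTH_BITS = 28
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed serialization


def rabin_fingerprint(seq: str) -> int:
    """Remainder of the sequence's bit stream modulo POLY over GF(2)."""
    fp = 0
    for base in seq:
        code = BASE_CODES[base]
        for bit in ((code >> 1) & 1, code & 1):
            fp = (fp << 1) | bit
            if fp >> FP_BITS:  # degree reached 17, so subtract (XOR) POLY
                fp ^= POLY
    return fp


def encode_length_allele(seq: str) -> int:
    """Pack mode=1, the 28-bit length, and the 17-bit fingerprint."""
    length = len(seq)
    if length >= 1 << LENGTH_BITS:
        raise ValueError("sequence too long for a 28-bit length")
    return (1 << 45) | (length << FP_BITS) | rabin_fingerprint(seq)
```

Under this assumed serialization, sequences made only of `A` all fingerprint to 0, but their differing lengths still keep the 46-bit fields distinct.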

Design Rationale

Why 5-bit length (not 6)?

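A string-mode allele has to fit the mode bit, the length field, and 40 DNA bits into 46 bits. With a 5-bit length that is exactly 1 + 5 + 40 = 46 bits. A 6-bit length would need 47 bits, so the length field would overlap the first DNA base. Five bits can represent values up to 31, but the length is capped at 20 bases because 40 bits / 2 bits per base = 20.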
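The bit budget can be checked with a few lines of arithmetic. The constant and function names are illustrative only.

```python
ALLELE_BITS = 46
MODE_BITS = 1
MAX_BASES = 20
DNA_BITS = 2 * MAX_BASES  # 40


def string_mode_width(length_bits: int) -> int:
    """Total bits needed for mode + length field + DNA."""
    return MODE_BITS + length_bits + DNA_BITS


width_with_5 = string_mode_width(5)       # fits the 46-bit field exactly
width_with_6 = string_mode_width(6)       # one bit too many
min_length_bits = MAX_BASES.bit_length()  # fewest bits that can hold the value 20
```

Five bits is therefore both the smallest field that can express a length of 20 and the largest that still leaves room for 40 DNA bits.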

Why 28-bit length (not more)?

The longest human chromosome (chr1) is 248,956,422 bp, which needs 28 bits to represent: 2^27 = 134,217,728 is too small, while 2^28 = 268,435,456 is enough. A naive approach might dedicate all 45 remaining bits (after the mode bit) to length, but that would allow lengths up to ~35 trillion, far beyond any biological sequence, and would waste bits. Capping the length at 28 bits is sufficient for any human allele and frees 17 bits for a Rabin fingerprint, which distinguishes alleles that share a locus and a length.
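These figures can be reproduced directly; the variable names are illustrative.

```python
CHR1_LENGTH = 248_956_422  # bp, as quoted above

bits_needed = CHR1_LENGTH.bit_length()   # smallest width that holds chr1's length
max_28_bit_length = (1 << 28) - 1        # largest length a 28-bit field can store
max_45_bit_length = (1 << 45) - 1        # what a 45-bit length field would allow
fingerprint_bits = 45 - bits_needed      # bits left over for the fingerprint
```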

Why 17-bit Rabin fingerprint?

Without a fingerprint, two length-mode alleles at the same locus with the same sequence length would be indistinguishable, so a length-only encoding collides wherever this occurs. A 17-bit fingerprint provides 2^17 = 131,072 distinct values, reducing the per-pair collision probability to 1/131,072, or ~7.6 x 10^-6. The polynomial x^17 + x^3 + 1 is irreducible over GF(2), which helps spread fingerprints evenly across those values. Across 4.4 million ClinVar records, a length-only encoding would produce 113 collisions; with the fingerprint there are zero.
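Both numeric claims can be checked independently. The sketch below computes the per-pair collision probability, assuming fingerprints are uniformly distributed, and tests irreducibility with Rabin's criterion specialized to a prime degree: a polynomial f of prime degree n over GF(2) is irreducible exactly when x^(2^n) ≡ x (mod f) and gcd(f, x^2 + x) = 1. Polynomials are represented as Python integers with bit i holding the coefficient of x^i; all function names are hypothetical.

```python
POLY = (1 << 17) | (1 << 3) | 1  # x^17 + x^3 + 1


def gf2_mod(a: int, m: int) -> int:
    """Remainder of a divided by m, as polynomials over GF(2)."""
    while a.bit_length() >= m.bit_length():
        a ^= m << (a.bit_length() - m.bit_length())
    return a


def gf2_mulmod(a: int, b: int, m: int) -> int:
    """Product a*b modulo m over GF(2); a and b must already be reduced."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a.bit_length() == m.bit_length():
            a ^= m
    return result


def gf2_gcd(a: int, b: int) -> int:
    """Greatest common divisor of two polynomials over GF(2)."""
    while b:
        a, b = b, gf2_mod(a, b)
    return a


def is_irreducible_prime_degree(poly: int) -> bool:
    """Rabin's test; valid only when the degree of poly is prime."""
    n = poly.bit_length() - 1
    x = 0b10
    power = x
    for _ in range(n):
        power = gf2_mulmod(power, power, poly)  # repeated squaring: x^(2^k)
    return power == x and gf2_gcd(poly, 0b110) == 1  # 0b110 is x^2 + x


distinct_values = 1 << 17
per_pair_collision = 1 / distinct_values  # uniform-distribution assumption
poly_is_irreducible = is_irreducible_prime_degree(POLY)
```

The shortcut in `is_irreducible_prime_degree` relies on 17 being prime; a general-degree test would need one gcd check per prime factor of the degree.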

Why symmetric 46/46?

Both REF and ALT alleles get identical 46-bit fields. This simplifies the encoding logic and ensures both alleles benefit from the same exact-storage threshold (20 bases) and fingerprint quality.