Variant Normalization

Different VCF representations can describe the same biological variant. For example, the same single-base deletion in a poly-A run can be written at multiple positions depending on the tool that produced the VCF. Without normalization, these representations produce different UVIDs, undermining cross-dataset comparisons.

UVID includes optional variant normalization based on the Tan et al. 2015 algorithm (left-alignment plus parsimonious trimming), so that equivalent representations of the same variant map to the same UVID.

Reference

Adrian Tan, Gonçalo R. Abecasis, Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics 31(13):2202–2204, 2015. doi:10.1093/bioinformatics/btv112

Algorithm

Normalization applies three operations (right-trim, left-trim, and left-alignment of simple indels) in a loop until an iteration leaves the variant unchanged:

flowchart TD
    A[Input: CHROM, POS, REF, ALT] --> B{Symbolic or<br/>missing allele?}
    B -- Yes --> C[Skip: pass through unchanged]
    B -- No --> D[Right-trim: remove matching<br/>trailing bases]
    D --> E[Left-trim: remove matching<br/>leading bases, adjust POS]
    E --> F{Simple indel?<br/>One allele = 1 base}
    F -- No --> G{Changed<br/>this iteration?}
    F -- Yes --> H[Left-align: while last bases<br/>match, prepend upstream base,<br/>shift POS left]
    H --> G
    G -- Yes --> D
    G -- No --> I[Output: normalized variant]
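The flowchart can be read as the following minimal Python sketch. It is an illustration of the described loop, not UVID's actual implementation: the function names `is_skippable` and `normalize` are hypothetical, and the reference is assumed to be a plain in-memory string holding one contig, indexed from position 1.

```python
def is_skippable(ref: str, alt: str) -> bool:
    """Symbolic, missing, and breakend alleles pass through unchanged."""
    if ref == "." or alt in (".", "*"):
        return True
    return alt.startswith("<") or "[" in alt or "]" in alt


def normalize(seq: str, pos: int, ref: str, alt: str):
    """Normalize one (POS, REF, ALT) against `seq` (assumed: one contig, 1-based POS)."""
    if is_skippable(ref, alt):
        return pos, ref, alt
    changed = True
    while changed:
        changed = False
        # Right-trim: drop matching trailing bases, keeping at least one base per allele.
        while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # Left-trim: drop matching leading bases and move POS right.
        while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
            ref, alt = ref[1:], alt[1:]
            pos += 1
            changed = True
        # Left-align simple indels only (exactly one allele is a single base).
        if (len(ref) == 1) != (len(alt) == 1):
            while pos > 1 and ref[-1] == alt[-1]:
                upstream = seq[pos - 2]  # reference base at 1-based position pos - 1
                ref = upstream + ref[:-1]
                alt = upstream + alt[:-1]
                pos -= 1
                changed = True
    return pos, ref, alt
```

With an illustrative reference in which position 100 is G and positions 101 to 110 are A, `normalize(seq, 105, "AAAA", "A")` and `normalize(seq, 107, "AAAA", "A")` both return `(100, "GAAA", "G")`. The sketch does not check that REF actually matches the reference, which a real implementation would likely do.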

What gets normalized

Variant type	Action
SNV (both alleles 1 base)	Pass through
Over-specified SNV	Trim shared prefix/suffix
Simple indel	Full left-alignment through the reference
Complex / MNV (both alleles >1 base after trim)	Trim only, no left-alignment
Symbolic allele (`<DEL>`, `<INS>`, `*`)	Skip entirely
Missing allele (`.`)	Skip entirely
Breakend notation	Skip entirely
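The table can be expressed as a small classifier. This is a hedged sketch: the function names and the returned labels are hypothetical, and the checks for symbolic and breakend alleles are simple string tests assumed from VCF conventions rather than taken from UVID's source.

```python
def trim(ref: str, alt: str):
    """Trim shared suffix, then shared prefix, keeping at least one base per allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def classify(ref: str, alt: str) -> str:
    """Map a REF/ALT pair to the row of the table it falls under."""
    if ref == "." or alt == ".":
        return "missing"            # skipped entirely
    if alt == "*" or alt.startswith("<"):
        return "symbolic"           # skipped entirely
    if "[" in alt or "]" in alt:
        return "breakend"           # skipped entirely
    if len(ref) == 1 and len(alt) == 1:
        return "snv"                # passes through
    trimmed_ref, trimmed_alt = trim(ref, alt)
    if len(trimmed_ref) == 1 and len(trimmed_alt) == 1:
        return "overspecified_snv"  # trimming alone yields an SNV
    if len(trimmed_ref) == 1 or len(trimmed_alt) == 1:
        return "simple_indel"       # eligible for left-alignment
    return "complex"                # trimmed only
```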

Per-allele normalization

Multi-allelic records are normalized per-allele (each ALT allele is normalized independently against the reference). This produces the same result as tools that normalize multi-allelic records as a unit in the vast majority of cases; in rare cases where sibling alleles constrain each other, per-allele normalization may left-align further.
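To illustrate what "per-allele" means, the sketch below trims each ALT allele of a multi-allelic record independently against REF. It covers trimming only (left-alignment is omitted for brevity), the function names are hypothetical, and it says nothing about how UVID writes the results back into a single output record.

```python
from typing import List, Tuple


def trim_allele(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim shared suffix, then shared prefix, keeping at least one base per allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def normalize_per_allele(pos: int, ref: str, alt_field: str) -> List[Tuple[int, str, str]]:
    """Split a comma-separated ALT field and trim every allele on its own."""
    return [trim_allele(pos, ref, alt) for alt in alt_field.split(",")]
```

For example, `normalize_per_allele(100, "ACT", "AGT,ACG")` returns `[(101, "C", "G"), (102, "T", "G")]`: each allele reaches its own minimal form, whereas a record-level trim could not remove any base because the two alleles share neither a common prefix nor a common suffix with REF at the same time.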

Setup

Normalization requires a reference genome file to fetch upstream bases during left-alignment. UVID auto-discovers reference files from a data directory, so no path parameters need to be passed.

Quick start

The easiest way to install reference genomes is with `uvid setup`:

# Download both GRCh37 and GRCh38 (~800 MB each)
uvid setup

# Download only GRCh38
uvid setup -a GRCh38

If you run `uvid vcf --normalize` without a reference genome installed and your terminal is interactive, UVID offers to download it for you.

Reference files

Place one of these files in your UVID data directory:

File	Size	Description
`GRCh38.2bit`	~800 MB	GRCh38 in UCSC 2bit format (preferred)
`GRCh37.2bit`	~800 MB	GRCh37 in UCSC 2bit format
`GRCh38.fa` + `GRCh38.fa.fai`	~3 GB	GRCh38 in indexed FASTA format
`GRCh37.fa` + `GRCh37.fa.fai`	~3 GB	GRCh37 in indexed FASTA format

The `.2bit` format is recommended: it is smaller and faster to load.
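As a rough picture of what discovery could look like, the sketch below searches a data directory for the file names in the table and prefers `.2bit` over indexed FASTA. The function name `find_reference` and the exact lookup order are assumptions for illustration; UVID's own discovery logic may differ.

```python
from pathlib import Path
from typing import Optional


def find_reference(data_dir: Path, assembly: str) -> Optional[Path]:
    """Return the reference file for `assembly` in `data_dir`, or None if absent."""
    # Prefer the smaller, faster-loading 2bit file.
    twobit = data_dir / f"{assembly}.2bit"
    if twobit.is_file():
        return twobit
    # Fall back to FASTA, which is only usable together with its .fai index.
    fasta = data_dir / f"{assembly}.fa"
    index = data_dir / f"{assembly}.fa.fai"
    if fasta.is_file() and index.is_file():
        return fasta
    return None
```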

Downloading reference genomes

The recommended way to install reference genomes is `uvid setup`, which downloads files to the correct location automatically. For manual installation, download the file and move it into the data directory, running only the `mv` line for your platform:

# GRCh38 2bit from UCSC
curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
mv hg38.2bit ~/.local/share/uvid/GRCh38.2bit  # Linux only
mv hg38.2bit ~/Library/Application\ Support/uvid/GRCh38.2bit  # macOS only

# GRCh37 2bit from UCSC
curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
mv hg19.2bit ~/.local/share/uvid/GRCh37.2bit  # Linux only
mv hg19.2bit ~/Library/Application\ Support/uvid/GRCh37.2bit  # macOS only

Data directory

The default data directory is platform-specific:

Platform	Default path
Linux	`~/.local/share/uvid/`
macOS	`~/Library/Application Support/uvid/`
Windows	`C:\Users\<user>\AppData\Roaming\uvid\`

Override the default by setting the `UVID_DATA_DIR` environment variable:

export UVID_DATA_DIR=/path/to/my/references
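The lookup described above can be sketched in Python as follows. The function name `resolve_data_dir` and its parameters are hypothetical; the platform defaults mirror the table, and the use of `APPDATA` on Windows is an assumption about how the Roaming path would normally be found.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(
    env: Optional[Mapping[str, str]] = None,
    platform: Optional[str] = None,
    home: Optional[str] = None,
) -> Path:
    """Return the data directory: UVID_DATA_DIR if set, else the platform default."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    home_dir = Path.home() if home is None else Path(home)

    override = env.get("UVID_DATA_DIR")
    if override:
        return Path(override)
    if platform == "darwin":
        return home_dir / "Library" / "Application Support" / "uvid"
    if platform.startswith("win"):
        appdata = env.get("APPDATA")  # assumed source of the Roaming directory
        base = Path(appdata) if appdata else home_dir / "AppData" / "Roaming"
        return base / "uvid"
    return home_dir / ".local" / "share" / "uvid"
```

Passing `env`, `platform`, and `home` explicitly keeps the function easy to test without touching the real environment.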

Usage

CLI

Add the `--normalize` (or `-n`) flag to the `vcf` command:

# Normalize and assign UVIDs
uvid vcf input.vcf output.vcf --normalize -a GRCh38

# Normalize with auto-detected assembly
uvid vcf input.vcf output.vcf --normalize

# Combine with other options
uvid vcf input.vcf.gz output.vcf.gz --normalize --uuid

# Short form
uvid vcf input.vcf output.vcf -n -a GRCh38

When normalization is active, the output VCF contains the normalized POS, REF, and ALT values alongside the UVID. This means the output VCF is a valid normalized VCF, not just an ID-annotated copy.

Python

from uvid import vcf_passthrough

# With normalization
count = vcf_passthrough("input.vcf", "output.vcf", normalize=True, assembly="GRCh38")

# Auto-detect assembly
count = vcf_passthrough("input.vcf", "output.vcf", normalize=True)

What changes in the output

Without normalization, the passthrough only replaces the ID column. With normalization enabled:

ID column is set to the UVID (computed from normalized coordinates)
POS column is updated to the normalized position
REF column is updated to the normalized reference allele
ALT column is updated to the normalized alternate allele(s)

All other columns (QUAL, FILTER, INFO, FORMAT, samples) are passed through unchanged.
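One way to check this on your own files is to compare an input data line with the corresponding output line and list the columns that differ. The helper below is a hypothetical sketch that only assumes tab-separated VCF data lines; the ID value in the usage note is a placeholder, not a real UVID.

```python
from typing import List

FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def column_name(index: int) -> str:
    """Name a VCF column by its 0-based index."""
    if index < len(FIXED_COLUMNS):
        return FIXED_COLUMNS[index]
    if index == len(FIXED_COLUMNS):
        return "FORMAT"
    return f"SAMPLE{index - len(FIXED_COLUMNS)}"


def changed_columns(before_line: str, after_line: str) -> List[str]:
    """List the names of columns whose values differ between two data lines."""
    before = before_line.rstrip("\n").split("\t")
    after = after_line.rstrip("\n").split("\t")
    return [
        column_name(i)
        for i, (old, new) in enumerate(zip(before, after))
        if old != new
    ]
```

For an over-specified SNV, comparing `"chr1\t100\t.\tAGT\tACT\t50\tPASS\tDP=10"` with `"chr1\t101\tUVID_PLACEHOLDER\tG\tC\t50\tPASS\tDP=10"` returns `["POS", "ID", "REF", "ALT"]`.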

Examples

Indel in a homopolymer

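The same deletion written at different positions in a poly-A run normalizes to one leftmost representation:

# Reference context (illustrative): base 100 is G, bases 101-110 are A
# Input:   chr1  105  AAAA  A      (3-base deletion written at pos 105)
# Input:   chr1  107  AAAA  A      (same deletion written at pos 107)
# Both normalize to:
#          chr1  100  GAAA  G      (leftmost position, anchored on the base before the run)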

Both produce the same UVID, enabling cross-dataset matching.

Over-specified SNV

An unnecessarily padded variant is trimmed to its minimal form:

# Input:   chr1  100  AGT  ACT
# Normalizes to:
#          chr1  101  G    C       (simple SNV)

Complex variant (no left-alignment)

Complex variants, where both alleles are still longer than one base after trimming, are trimmed but not left-aligned:

# Input:   chr1  100  CGGTA  CACA
# Normalizes to:
#          chr1  101  GGT    AC     (shared prefix 'C' and suffix 'A' trimmed)

Cross-validation

The normalization implementation is validated against test suites from three independent tools:

Source	Cases	Assembly	Description
bcftools	28	Synthetic	SNVs, indels, complex, symbolic, boundary cases
vt	194	GRCh37	Indels on chr20 from the original Tan et al. implementation
GATK	19	GRCh38	LeftAlignAndTrimVariants: insertions, deletions, repeats

The bcftools tests run against a small inline reference and are included in every test run. The vt and GATK tests require real reference genome files and are automatically skipped when those files are not available.