Python API

API reference auto-generated from the uvid type stubs.

UVID - Universal Variant ID for human genetic variation.

Provides compact 128-bit identifiers for genomic variants with DuckDB-backed collection storage.

NAMESPACE_UVID = UUID('2696985c-755c-53de-b6b9-1745af20d0fd') module-attribute

The UUID namespace used by UVID.uuid5() to derive UUIDv5 values. It is a standard-library uuid.UUID instance, so it is immutable and hashable and exposes the usual read-only attributes (bytes, hex, int, urn, variant, version, and so on).
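As a quick check that needs only the standard library, the namespace value shown above can be parsed with uuid.UUID to confirm that it is a version 5, RFC 4122 UUID. The variable names below are local to this sketch and are not part of the uvid API.

```python
import uuid

# Value copied from the NAMESPACE_UVID module attribute documented above.
namespace_uvid = uuid.UUID("2696985c-755c-53de-b6b9-1745af20d0fd")

# Standard uuid.UUID attributes; no uvid import is needed for this check.
version = namespace_uvid.version
variant = namespace_uvid.variant
is_v5_rfc4122 = version == 5 and variant == uuid.RFC_4122
```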

UVID

A 128-bit Universal Variant ID encoding a human genomic variant.

encode(chr, pos, ref_seq, alt_seq, assembly='GRCh38') staticmethod

Encode a variant as a UVID.

Parameters:

Name Type Description Default
chr str

Chromosome name (e.g. "chr1", "1", "chrX", "X", "chrM").

required
pos int

1-based genomic position.

required
ref_seq str

Reference allele sequence (e.g. "A", "ACGT").

required
alt_seq str

Alternate allele sequence (e.g. "G", "T", ".").

required
assembly str

Genome assembly ("GRCh37", "GRCh38", "hg19", "hg38").

'GRCh38'

Returns:

Type Description
UVID

A UVID instance.

Raises:

Type Description
ValueError

If any parameter is invalid.
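A minimal sketch of calling encode with error handling. The helper name encode_or_none is hypothetical, and the UVID class is passed in as an argument so the helper can be exercised without the package installed. In real use you would presumably pass uvid.UVID, assuming the class is importable from the top-level uvid package (this page does not state the import path).

```python
def encode_or_none(uvid_cls, chr, pos, ref_seq, alt_seq, assembly="GRCh38"):
    """Return uvid_cls.encode(...) or None when the input is rejected.

    uvid_cls is expected to be the UVID class documented above
    (hypothetical wiring: the caller supplies it).
    """
    try:
        # Arguments are passed positionally, in the documented order.
        return uvid_cls.encode(chr, pos, ref_seq, alt_seq, assembly)
    except ValueError:
        # encode() is documented to raise ValueError for any invalid parameter.
        return None
```

Returning None keeps a batch loop going when a few records are malformed; code that must not drop records silently should let the ValueError propagate instead.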

decode()

Decode a UVID back to its component fields.

Returns:

Type Description
dict[str, object]

A dict with keys: chr (str), pos (int), ref (str), alt (str), ref_len (int), alt_len (int), ref_is_exact (bool), alt_is_exact (bool), ref_fingerprint (int | None), alt_fingerprint (int | None), assembly (str).

When ref_is_exact or alt_is_exact is False, the corresponding sequence is returned as N-repeats (the actual bases can be recovered from the reference genome).

ref_fingerprint and alt_fingerprint are 17-bit Rabin fingerprints of the original sequence, present only for length-mode alleles (None for string-mode alleles).

Raises:

Type Description
ValueError

If the UVID data is malformed.
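The decoded dict can be consumed with plain dictionary access. The sketch below formats it as a short label and reports the length instead of the sequence for length-mode alleles, since those come back as N-repeats. The helper name summarize_decoded is hypothetical; it relies only on the keys listed above.

```python
def summarize_decoded(fields):
    """Format the dict returned by UVID.decode() as a short label."""

    def allele(seq_key, len_key, exact_key):
        if fields[exact_key]:
            return fields[seq_key]
        # Length-mode allele: the sequence is N-repeats, so show its length.
        return f"<{fields[len_key]}bp>"

    ref = allele("ref", "ref_len", "ref_is_exact")
    alt = allele("alt", "alt_len", "alt_is_exact")
    return f"{fields['assembly']} {fields['chr']}:{fields['pos']} {ref}>{alt}"
```

Any dict with the documented keys works as input, so the function can be tested with hand-written values before it is pointed at real decode() output.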

to_hex()

Get the hex string representation (format: XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX).

from_hex(hex_str) staticmethod

Create a UVID from a hex string (with or without dashes).

Raises:

Type Description
ValueError

If the hex string cannot be parsed.

as_int()

Get the raw 128-bit integer value.

from_int(value) staticmethod

Create a UVID from a raw 128-bit integer value.
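To make the relationship between the hex and integer forms concrete, here is a standard-library sketch of the documented layout: four dash-separated groups of eight hex digits, parsed with or without dashes. It illustrates the format only and is not the library's implementation. The function names are hypothetical, and lower-case output is an assumption, since the page does not specify letter case.

```python
import string


def format_uvid_hex(value: int) -> str:
    """Render a 128-bit integer as XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX."""
    if not 0 <= value < (1 << 128):
        raise ValueError("value does not fit in 128 bits")
    digits = f"{value:032x}"  # 32 hex digits, zero-padded
    return "-".join(digits[i:i + 8] for i in range(0, 32, 8))


def parse_uvid_hex(text: str) -> int:
    """Parse a 32-digit hex string, with or without dashes, into an integer."""
    digits = text.replace("-", "")
    if len(digits) != 32 or any(c not in string.hexdigits for c in digits):
        raise ValueError(f"not a 128-bit hex string: {text!r}")
    return int(digits, 16)
```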

range(chr, start_pos, end_pos, assembly='GRCh38') staticmethod

Compute UVID range bounds for a genomic region.

Parameters:

Name Type Description Default
chr str

Chromosome name.

required
start_pos int

Start position (1-based, inclusive).

required
end_pos int

End position (1-based, inclusive).

required
assembly str

Genome assembly ("GRCh37", "GRCh38").

'GRCh38'

Returns:

Type Description
tuple[UVID, UVID]

A (lower, upper) tuple of UVIDs bounding the region.

Raises:

Type Description
ValueError

If the chromosome or position is invalid.
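Range bounds like these are typically used to filter a sorted set of identifiers. The sketch below assumes that the integer values from as_int() sort in genomic order within one assembly and that both bounds are inclusive; neither assumption is stated on this page. It works on plain integers and the helper name slice_by_bounds is hypothetical.

```python
from bisect import bisect_left, bisect_right


def slice_by_bounds(sorted_ids, lower, upper):
    """Return the ids with lower <= id <= upper from an ascending list."""
    start = bisect_left(sorted_ids, lower)   # first index with id >= lower
    stop = bisect_right(sorted_ids, upper)   # first index with id > upper
    return sorted_ids[start:stop]
```

With real data, lower and upper would come from calling as_int() on the two UVIDs returned by range().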

uuid5()

Convert this UVID to a deterministic UUIDv5.

Uses the UVID namespace (derived from OID namespace + "UVID") and the raw 128-bit integer bytes as the name.

Returns:

Type Description
UUID

A Python uuid.UUID with version=5.
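The derivation described above can be sketched with the standard library: UUIDv5 is the SHA-1 hash of the namespace bytes followed by the name, truncated to 16 bytes with the version and variant bits set. This is not the library's code. The function name uuid5_from_int is hypothetical, and big-endian byte order for the 128-bit integer is an assumption, so the output may not match UVID.uuid5() if the library serializes the integer differently.

```python
import hashlib
import uuid

# NAMESPACE_UVID as listed at the top of this page.
NAMESPACE = uuid.UUID("2696985c-755c-53de-b6b9-1745af20d0fd")


def uuid5_from_int(value: int, namespace: uuid.UUID = NAMESPACE) -> uuid.UUID:
    """Derive a UUIDv5 from a raw 128-bit integer used as the name."""
    name = value.to_bytes(16, "big")  # assumption: big-endian byte order
    digest = hashlib.sha1(namespace.bytes + name).digest()
    # uuid.UUID sets the version and variant bits when version=5 is given.
    return uuid.UUID(bytes=digest[:16], version=5)
```

Because the result depends only on the namespace and the integer, the same variant always maps to the same UUID.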

AssemblyNotDetectedError

Bases: builtins.ValueError

Raised when assembly cannot be detected from the VCF header.

Subclass of ValueError so it can be caught as either AssemblyNotDetectedError or ValueError.

Collection(path)

A .uvid collection file backed by DuckDB.

Open or create a .uvid collection file.

Parameters:

Name Type Description Default
path str

Path to the .uvid file.

required

Raises:

Type Description
OSError

If the file cannot be opened or created.

add_vcf(vcf_path, assembly='GRCh38')

Add a VCF file to the collection.

Parameters:

Name Type Description Default
vcf_path str

Path to a VCF file (.vcf or .vcf.gz).

required
assembly str

Genome assembly ("GRCh37", "GRCh38", "hg19", "hg38").

'GRCh38'

Raises:

Type Description
OSError

If the VCF file cannot be read or parsed.

ValueError

If the assembly is invalid.

list_samples()

List all samples in the collection.

Returns:

Type Description
list[tuple[str, str, str]]

A list of (table_name, source_file, sample_name) tuples.

Raises:

Type Description
OSError

If the query fails.

list_sources()

List all source files in the collection.

Returns:

Type Description
list[str]

A list of source file names.

Raises:

Type Description
OSError

If the query fails.

search_region(table_name, chr, start_pos, end_pos, assembly='GRCh38')

Search for variants in a genomic region.

Parameters:

Name Type Description Default
table_name str

Sample table name (from list_samples).

required
chr str

Chromosome name.

required
start_pos int

Start position (1-based, inclusive).

required
end_pos int

End position (1-based, inclusive).

required
assembly str

Genome assembly ("GRCh37", "GRCh38").

'GRCh38'

Returns:

Type Description
list[dict[str, object]]

A list of dicts with keys: uvid, allele1, allele2, phased, dp, gq, qual, filter, multiallelic.

Raises:

Type Description
ValueError

If the chromosome is invalid.

OSError

If the search fails.
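A short sketch of combining list_samples and search_region to count the variants each sample has in a region. The helper name count_region_hits is hypothetical. It takes an already opened Collection as an argument, so it relies only on the two methods documented above.

```python
def count_region_hits(collection, chr, start_pos, end_pos):
    """Map each sample table name to its number of variants in the region."""
    counts = {}
    # list_samples() yields (table_name, source_file, sample_name) tuples.
    for table_name, _source_file, _sample_name in collection.list_samples():
        rows = collection.search_region(table_name, chr, start_pos, end_pos)
        counts[table_name] = len(rows)
    return counts
```

Results are keyed by table name rather than sample name because the same sample name could appear in more than one source file.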

ReferenceNotFoundError

Bases: builtins.ValueError

Raised when a reference genome file is not found in the data directory.

Subclass of ValueError so it can be caught as either ReferenceNotFoundError or ValueError.

To install reference genomes, run uvid setup or set UVID_DATA_DIR to a directory containing the reference files.

data_dir() builtin

Return the platform-specific data directory for UVID reference files.

Resolution order:
  1. UVID_DATA_DIR environment variable
  2. Platform default (Linux: ~/.local/share/uvid, macOS: ~/Library/Application Support/uvid, Windows: AppData\Roaming\uvid)

Returns:

Type Description
str | None

The data directory path as a string, or None if no platform data directory can be determined and UVID_DATA_DIR is not set.
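The resolution order above can be approximated with the standard library. This sketch is not the library's implementation: the function name resolve_data_dir and its parameters are hypothetical, and the real function may consult other platform settings (for example XDG_DATA_HOME or %APPDATA%) that this page does not mention.

```python
import os
import sys
from pathlib import Path


def resolve_data_dir(env=None, platform=None, home=None):
    """Approximate the documented lookup: env override, then platform default."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform

    # 1. An explicit UVID_DATA_DIR always wins.
    override = env.get("UVID_DATA_DIR")
    if override:
        return override

    # 2. Platform default, built under the home directory (an assumption).
    if home is None:
        try:
            home = Path.home()
        except RuntimeError:
            return None  # home directory cannot be determined
    home = Path(home)
    if platform.startswith("linux"):
        return str(home / ".local" / "share" / "uvid")
    if platform == "darwin":
        return str(home / "Library" / "Application Support" / "uvid")
    if platform == "win32":
        return str(home / "AppData" / "Roaming" / "uvid")
    return None  # unknown platform and no override
```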

hgvs_to_uvid(hgvs, reference=None, assembly=None) builtin

Convert an HGVS genomic variant string to a UVID.

Parameters:

Name Type Description Default
hgvs str

HGVS string (e.g. "NC_000001.11:g.12345A>G"). Only genomic (g.) and mitochondrial (m.) coordinate systems are supported.

required
reference str | None

Optional path to a reference genome file (.2bit or indexed .fa). Required for indels, duplications, and inversions (anything that needs an anchor base or deleted sequence from the reference).

None
assembly str | None

Optional expected assembly ("GRCh37" or "GRCh38"). If provided, the assembly inferred from the RefSeq accession version is validated against this value.

None

Returns:

Type Description
UVID

A UVID object.

Raises:

Type Description
ValueError

On parse errors, unknown accessions, assembly mismatches, or when a reference genome is required but not provided.
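To illustrate the input format, the sketch below splits a simple genomic substitution such as "NC_000001.11:g.12345A>G" into its parts with a regular expression. It handles only single-base substitutions in g. or m. coordinates and is not the library's parser. The function name split_hgvs_substitution is hypothetical.

```python
import re

# accession.version : g|m . position ref > alt  (single-base substitutions only)
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):"
    r"(?P<kind>[gm])\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def split_hgvs_substitution(hgvs):
    """Split a simple HGVS substitution into its components."""
    match = _SUBSTITUTION.match(hgvs)
    if match is None:
        raise ValueError(f"not a simple g./m. substitution: {hgvs!r}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["pos"] = int(parts["pos"])
    return parts
```

The accession version is kept separate because, as noted above, it is what identifies the assembly.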

uvid_to_hgvs(uvid, detect_dup_inv=False, reference=None) builtin

Convert a UVID back to HGVS genomic notation.

Parameters:

Name Type Description Default
uvid str

Hex string of the UVID to convert.

required
detect_dup_inv bool

If True, attempt to detect duplications and inversions by comparing the variant against the reference genome. More expensive but produces richer notation. Defaults to False.

False
reference str | None

Optional path to a reference genome file. Required when detect_dup_inv=True for duplication detection.

None

Returns:

Type Description
tuple[str, list[str]]

A tuple of (hgvs_string, warnings) where warnings is a list of strings describing any approximations (e.g. length-mode alleles whose exact sequence is unavailable).

Raises:

Type Description
ValueError

If the UVID cannot be decoded or the reference genome cannot be opened.
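Because the function returns a tuple, callers need to unpack it and decide what to do with the warnings. A minimal sketch follows. The helper name describe_hgvs is hypothetical, and the converter is passed in as to_hgvs so the example does not depend on a particular import path for uvid_to_hgvs.

```python
def describe_hgvs(to_hgvs, uvid_hex):
    """Return the HGVS string, annotated when the conversion was approximate."""
    # to_hgvs is expected to behave like uvid_to_hgvs documented above.
    hgvs, warnings = to_hgvs(uvid_hex)
    if warnings:
        return f"{hgvs} (approximate: {'; '.join(warnings)})"
    return hgvs
```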

vcf_passthrough(input, output=None, use_uuid=False, assembly=None, normalize=False) builtin

Process a VCF file, replacing the ID column with UVID identifiers.

Parameters:

Name Type Description Default
input str | PathLike[str]

Path to input VCF file (.vcf or .vcf.gz).

required
output str | PathLike[str] | None

Path to the output file (None for stdout). If the path ends in .vcf.gz, the output is BGZF-compressed.

None
use_uuid bool

If True, emit UUIDv5 instead of UVID hex.

False
assembly str | None

Assembly override ("GRCh37", "GRCh38", etc.). None to auto-detect from header.

None
normalize bool

If True, normalize variants (Tan et al. 2015, https://doi.org/10.1093/bioinformatics/btv112) before encoding. Requires a reference genome file in the data directory.

False

Returns:

Type Description
int

Number of data records processed.

Raises:

Type Description
AssemblyNotDetectedError

If assembly cannot be detected and no override given.

OSError

On I/O errors.

ValueError

On normalization errors (e.g. reference genome not found).
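One way to handle a missing assembly header is to try auto-detection first and retry with an explicit assembly only when AssemblyNotDetectedError is raised. The sketch below takes the function and the exception class as arguments so it can run without the package. The helper name annotate_with_fallback and its parameter names are hypothetical. In real use you would presumably pass uvid.vcf_passthrough and uvid.AssemblyNotDetectedError, assuming both are exported from the top-level package.

```python
def annotate_with_fallback(passthrough, not_detected_error, src, dst=None,
                           fallback_assembly="GRCh38"):
    """Run passthrough with auto-detection, retrying with an explicit assembly.

    Returns the number of data records processed, as reported by passthrough.
    """
    try:
        # assembly is left at its default (None) so the header is inspected.
        return passthrough(src, dst)
    except not_detected_error:
        # Catch only the specific subclass; other ValueErrors
        # (e.g. normalization failures) still propagate.
        return passthrough(src, dst, assembly=fallback_assembly)
```

Choose the fallback deliberately: encoding a GRCh37 file as GRCh38 would yield identifiers for the wrong coordinates.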