Python API
API reference auto-generated from the uvid type stubs.
UVID - Universal Variant ID for human genetic variation.
Provides compact 128-bit identifiers for genomic variants with DuckDB-backed collection storage.
NAMESPACE_UVID = UUID('2696985c-755c-53de-b6b9-1745af20d0fd')
module-attribute
Instances of the UUID class represent UUIDs as specified in RFC 4122. UUID objects are immutable, hashable, and usable as dictionary keys. Converting a UUID to a string with str() yields something in the form '12345678-1234-1234-1234-123456789abc'. The UUID constructor accepts five possible forms: a similar string of hexadecimal digits, or a tuple of six integer fields (with 32-bit, 16-bit, 16-bit, 8-bit, 8-bit, and 48-bit values respectively) as an argument named 'fields', or a string of 16 bytes (with all the integer fields in big-endian order) as an argument named 'bytes', or a string of 16 bytes (with the first three fields in little-endian order) as an argument named 'bytes_le', or a single 128-bit integer as an argument named 'int'.
UUIDs have these read-only attributes:

- `bytes`: the UUID as a 16-byte string (containing the six integer fields in big-endian byte order)
- `bytes_le`: the UUID as a 16-byte string (with time_low, time_mid, and time_hi_version in little-endian byte order)
- `fields`: a tuple of the six integer fields of the UUID, which are also available as six individual attributes and two derived attributes:
    - `time_low`: the first 32 bits of the UUID
    - `time_mid`: the next 16 bits of the UUID
    - `time_hi_version`: the next 16 bits of the UUID
    - `clock_seq_hi_variant`: the next 8 bits of the UUID
    - `clock_seq_low`: the next 8 bits of the UUID
    - `node`: the last 48 bits of the UUID
    - `time`: the 60-bit timestamp
    - `clock_seq`: the 14-bit sequence number
- `hex`: the UUID as a 32-character hexadecimal string
- `int`: the UUID as a 128-bit integer
- `urn`: the UUID as a URN as specified in RFC 4122
- `variant`: the UUID variant (one of the constants RESERVED_NCS, RFC_4122, RESERVED_MICROSOFT, or RESERVED_FUTURE)
- `version`: the UUID version number (1 through 5, meaningful only when the variant is RFC_4122)
- `is_safe`: an enum indicating whether the UUID has been generated in a way that is safe for multiprocessing applications, via uuid_generate_time_safe(3)
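The attributes listed above can be inspected on the namespace value with the standard library `uuid` module. The sketch below rebuilds the UUID from the literal shown for `NAMESPACE_UVID` instead of importing it from `uvid`, so it runs without the package installed; the variable names are illustrative only.

```python
import uuid

# Literal copied from the NAMESPACE_UVID entry above. With uvid installed,
# the module attribute itself could be used instead.
namespace = uuid.UUID("2696985c-755c-53de-b6b9-1745af20d0fd")

print(namespace.version)  # version number stored in the time_hi_version field
print(namespace.variant)  # variant constant, e.g. uuid.RFC_4122
print(namespace.hex)      # 32 hexadecimal characters, no dashes
print(namespace.urn)      # 'urn:uuid:' followed by the canonical string form

# The constructor forms are interchangeable: round-trip through int and bytes.
same_from_int = uuid.UUID(int=namespace.int)
same_from_bytes = uuid.UUID(bytes=namespace.bytes)
```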
UVID
A 128-bit Universal Variant ID encoding a human genomic variant.
encode(chr, pos, ref_seq, alt_seq, assembly='GRCh38')
staticmethod
Encode a variant as a UVID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chr` | `str` | Chromosome name (e.g. "chr1", "1", "chrX", "X", "chrM"). | required |
| `pos` | `int` | 1-based genomic position. | required |
| `ref_seq` | `str` | Reference allele sequence (e.g. "A", "ACGT"). | required |
| `alt_seq` | `str` | Alternate allele sequence (e.g. "G", "T", "."). | required |
| `assembly` | `str` | Genome assembly ("GRCh37", "GRCh38", "hg19", "hg38"). | `'GRCh38'` |

Returns:

| Type | Description |
|---|---|
| `UVID` | A UVID instance. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If any parameter is invalid. |
decode()
Decode a UVID back to its component fields.
Returns:

| Type | Description |
|---|---|
| `dict[str, object]` | A dict with keys: `chr` (str), `pos` (int), `ref` (str), `alt` (str), `ref_len` (int), `alt_len` (int), `ref_is_exact` (bool), `alt_is_exact` (bool), `ref_fingerprint` (int \| None), `alt_fingerprint` (int \| None), `assembly` (str). When `ref_is_exact` or `alt_is_exact` is false, the corresponding sequence is returned as N-repeats (the actual bases can be recovered from the reference genome). `ref_fingerprint` and `alt_fingerprint` are fingerprints of the original sequence, present only for length-mode alleles and `None` otherwise. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the UVID data is malformed. |
to_hex()
Get the hex string representation (format: XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX).
from_hex(hex_str)
staticmethod
Create a UVID from a hex string (with or without dashes).
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the hex string cannot be parsed. |
as_int()
Get the raw 128-bit integer value.
from_int(value)
staticmethod
Create a UVID from a raw 128-bit integer value.
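The relationship between the hex form and the raw 128-bit integer can be illustrated in plain Python. This is a minimal sketch of the documented string layout (four dash-separated groups of eight hex digits, dashes optional on input), not the library's implementation. The function names are hypothetical, and uppercase output is an assumption because the reference does not state the letter case.

```python
import string

MASK_128 = (1 << 128) - 1


def format_uvid_hex(value: int) -> str:
    """Format a 128-bit integer as XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX."""
    if not 0 <= value <= MASK_128:
        raise ValueError("value does not fit in 128 bits")
    digits = f"{value:032X}"  # assumption: uppercase hex digits
    return "-".join(digits[i:i + 8] for i in range(0, 32, 8))


def parse_uvid_hex(hex_str: str) -> int:
    """Parse the hex form, with or without dashes, back to an integer."""
    digits = hex_str.replace("-", "")
    if len(digits) != 32 or any(c not in string.hexdigits for c in digits):
        raise ValueError(f"not a 128-bit hex string: {hex_str!r}")
    return int(digits, 16)
```

With the package installed, the equivalent round trip would presumably go through `UVID.from_hex(...)`, `to_hex()`, `as_int()` and `UVID.from_int(...)`.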
range(chr, start_pos, end_pos, assembly='GRCh38')
staticmethod
Compute UVID range bounds for a genomic region.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chr` | `str` | Chromosome name. | required |
| `start_pos` | `int` | Start position (1-based, inclusive). | required |
| `end_pos` | `int` | End position (1-based, inclusive). | required |
| `assembly` | `str` | Genome assembly ("GRCh37", "GRCh38"). | `'GRCh38'` |

Returns:

| Type | Description |
|---|---|
| `tuple[UVID, UVID]` | A (lower, upper) tuple of UVIDs bounding the region. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the chromosome or position is invalid. |
uuid5()
Convert this UVID to a deterministic UUIDv5.
Uses the UVID namespace (derived from OID namespace + "UVID") and the raw 128-bit integer bytes as the name.
Returns:

| Type | Description |
|---|---|
| `UUID` | A Python `uuid.UUID` with version=5. |
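The derivation described above (namespace plus the raw 128-bit integer bytes as the name) can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: the helper name is hypothetical, and big-endian byte order for the integer is assumed because the reference does not specify it. The hashing step follows the standard UUIDv5 construction (SHA-1 over namespace bytes followed by the name).

```python
import hashlib
import uuid

# Literal copied from the NAMESPACE_UVID entry in this reference.
NAMESPACE_UVID = uuid.UUID("2696985c-755c-53de-b6b9-1745af20d0fd")


def uvid_int_to_uuid5(value: int) -> uuid.UUID:
    """Derive a UUIDv5 from a raw 128-bit UVID integer (hypothetical helper)."""
    # Assumption: the 16 name bytes are the integer in big-endian order.
    name = value.to_bytes(16, "big")
    # Standard UUIDv5: SHA-1 of namespace bytes + name, truncated to 16 bytes,
    # with the version and variant bits set by the UUID constructor.
    digest = hashlib.sha1(NAMESPACE_UVID.bytes + name).digest()
    return uuid.UUID(bytes=digest[:16], version=5)
```

The result is deterministic, so the same UVID integer always maps to the same UUID.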
AssemblyNotDetectedError
Bases: builtins.ValueError
Raised when assembly cannot be detected from the VCF header.
Subclass of ValueError so it can be caught as either
AssemblyNotDetectedError or ValueError.
__module__ = 'uvid._core'
class-attribute
__weakref__
property
list of weak references to the object
Collection(path)
A .uvid collection file backed by DuckDB.
Open or create a .uvid collection file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the .uvid file. | required |

Raises:

| Type | Description |
|---|---|
| `OSError` | If the file cannot be opened or created. |
add_vcf(vcf_path, assembly='GRCh38')
Add a VCF file to the collection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vcf_path` | `str` | Path to a VCF file (.vcf or .vcf.gz). | required |
| `assembly` | `str` | Genome assembly ("GRCh37", "GRCh38", "hg19", "hg38"). | `'GRCh38'` |

Raises:

| Type | Description |
|---|---|
| `OSError` | If the VCF file cannot be read or parsed. |
| `ValueError` | If the assembly is invalid. |
list_samples()
List all samples in the collection.
Returns:

| Type | Description |
|---|---|
| `list[tuple[str, str, str]]` | A list of (table_name, source_file, sample_name) tuples. |

Raises:

| Type | Description |
|---|---|
| `OSError` | If the query fails. |
list_sources()
List all source files in the collection.
Returns:

| Type | Description |
|---|---|
| `list[str]` | A list of source file names. |

Raises:

| Type | Description |
|---|---|
| `OSError` | If the query fails. |
search_region(table_name, chr, start_pos, end_pos, assembly='GRCh38')
Search for variants in a genomic region.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Sample table name (from `list_samples`). | required |
| `chr` | `str` | Chromosome name. | required |
| `start_pos` | `int` | Start position (1-based, inclusive). | required |
| `end_pos` | `int` | End position (1-based, inclusive). | required |
| `assembly` | `str` | Genome assembly. | `'GRCh38'` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, object]]` | A list of dicts with keys: uvid, allele1, allele2, phased, dp, gq, qual, filter, multiallelic. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the chromosome is invalid. |
| `OSError` | If the search fails. |
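Because `search_region` returns plain dicts, its results can be post-processed with ordinary Python. The sketch below filters rows by the documented `dp` and `gq` keys. The helper name and the thresholds are hypothetical, and it assumes that `dp` and `gq` are numeric or `None`, which the reference does not state.

```python
from typing import Any, Iterable


def high_confidence_uvids(
    rows: Iterable[dict[str, Any]], min_dp: int = 10, min_gq: int = 20
) -> list[Any]:
    """Return the 'uvid' value of every row meeting both thresholds."""
    kept = []
    for row in rows:
        dp, gq = row.get("dp"), row.get("gq")
        # Assumption: missing depth or genotype quality is reported as None.
        if dp is None or gq is None:
            continue
        if dp >= min_dp and gq >= min_gq:
            kept.append(row["uvid"])
    return kept
```

With a collection open, the rows would come from something like `collection.search_region(table_name, "chr1", 1000000, 2000000)`, where `table_name` is taken from `list_samples()`.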
ReferenceNotFoundError
Bases: builtins.ValueError
Raised when a reference genome file is not found in the data directory.
Subclass of ValueError so it can be caught as either
ReferenceNotFoundError or ValueError.
To install reference genomes, run uvid setup or set
UVID_DATA_DIR to a directory containing the reference files.
__module__ = 'uvid._core'
class-attribute
__weakref__
property
list of weak references to the object
data_dir()
builtin
Return the platform-specific data directory for UVID reference files.
Resolution order:

- `UVID_DATA_DIR` environment variable
- Platform default (Linux: `~/.local/share/uvid`, macOS: `~/Library/Application Support/uvid`, Windows: `AppData\Roaming\uvid`)

Returns:

| Type | Description |
|---|---|
| `str \| None` | The data directory path as a string, or `None` if no data directory can be determined and `UVID_DATA_DIR` is not set. |
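The resolution order can be sketched in pure Python. This is an illustration of the documented lookup, not the function's actual implementation: `resolve_data_dir` and its parameters are hypothetical, and the parameters exist only so the behaviour can be exercised without changing the real environment. Only the platform defaults listed above are covered.

```python
import os
import sys
from pathlib import Path


def resolve_data_dir(environ=None, platform=None, home=None):
    """Return the data directory as a string, or None if undetermined."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    # 1. An explicit override through the environment wins.
    override = environ.get("UVID_DATA_DIR")
    if override:
        return override

    # 2. Otherwise fall back to the platform default documented above.
    home = Path.home() if home is None else Path(home)
    if platform.startswith("linux"):
        return str(home / ".local" / "share" / "uvid")
    if platform == "darwin":
        return str(home / "Library" / "Application Support" / "uvid")
    if platform.startswith("win"):
        return str(home / "AppData" / "Roaming" / "uvid")
    return None  # unknown platform and no override
```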
hgvs_to_uvid(hgvs, reference=None, assembly=None)
builtin
Convert an HGVS genomic variant string to a UVID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `hgvs` | `str` | HGVS genomic variant string. | required |
| `reference` | `str \| None` | Optional path to a reference genome file. | `None` |
| `assembly` | `str \| None` | Optional expected assembly. | `None` |

Returns:

| Type | Description |
|---|---|
| `UVID` | A UVID instance. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | On parse errors, unknown accessions, assembly mismatches, or when a reference genome is required but not provided. |
uvid_to_hgvs(uvid, detect_dup_inv=False, reference=None)
builtin
Convert a UVID back to HGVS genomic notation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `uvid` | `str` | Hex string of the UVID to convert. | required |
| `detect_dup_inv` | `bool` | If `True`, detect duplications and inversions. | `False` |
| `reference` | `str \| None` | Optional path to a reference genome file. Required when `detect_dup_inv` is `True`. | `None` |

Returns:

| Type | Description |
|---|---|
| `tuple[str, list[str]]` | A tuple of the HGVS string and a list of strings describing any approximations (e.g. length-mode alleles whose exact sequence is unavailable). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the UVID cannot be decoded or the reference genome cannot be opened. |
vcf_passthrough(input, output=None, use_uuid=False, assembly=None, normalize=False)
builtin
Process a VCF file, replacing the ID column with UVID identifiers.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `str \| PathLike[str]` | Path to the input VCF file (.vcf or .vcf.gz). | required |
| `output` | `str \| PathLike[str] \| None` | Path to the output file (`None` for stdout). If the path ends in .vcf.gz, the output is bgzf-compressed. | `None` |
| `use_uuid` | `bool` | If `True`, emit UUIDv5 instead of UVID hex. | `False` |
| `assembly` | `str \| None` | Assembly override ("GRCh37", "GRCh38", etc.). `None` to auto-detect from the header. | `None` |
| `normalize` | `bool` | If `True`, normalize variants (Tan et al. 2015, https://doi.org/10.1093/bioinformatics/btv112) before encoding. Requires a reference genome file in the data directory. | `False` |

Returns:

| Type | Description |
|---|---|
| `int` | Number of data records processed. |

Raises:

| Type | Description |
|---|---|
| `AssemblyNotDetectedError` | If the assembly cannot be detected and no override is given. |
| `OSError` | On I/O errors. |
| `ValueError` | On normalization errors (e.g. reference genome not found). |
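Since `AssemblyNotDetectedError` is a subclass of `ValueError`, a caller can handle it specifically and leave other `ValueError` cases to propagate. The sketch below retries with an explicit assembly when auto-detection fails. The wrapper name and its parameters are hypothetical; the function and exception class are passed in as arguments so the pattern does not depend on how the package exports them.

```python
def passthrough_with_fallback(
    passthrough,
    not_detected_error,
    input_path,
    output_path=None,
    fallback_assembly="GRCh38",
):
    """Run a passthrough, retrying with an explicit assembly if detection fails."""
    try:
        # First attempt: let the assembly be auto-detected from the VCF header.
        return passthrough(input_path, output_path)
    except not_detected_error:
        # Only the detection failure is handled here; any other ValueError
        # (for example a normalization error) propagates to the caller.
        return passthrough(input_path, output_path, assembly=fallback_assembly)
```

With the package installed, the first two arguments would be `vcf_passthrough` and `AssemblyNotDetectedError`. Falling back silently to a default assembly is a design choice that suits batch jobs; interactive tools may prefer to surface the error.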