Seqhash is a simple algorithm to produce consistent identifiers for any genetic sequence https://blog.libredna.org/post/seqhash/
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
This repo is archived. You can view files and clone it, but cannot push or open issues/pull-requests.
ohm 373529c0d3 Last commit before archive (moved to biotorrents/biophp) 2 months ago
.gitignore First commit 4 months ago
license.md Add tests output to readme 4 months ago
readme.md Last commit before archive (moved to biotorrents/biophp) 2 months ago
seqhash.class.php Errors and tests 4 months ago
seqhash.tests.php Errors and tests 4 months ago

readme.md

What is Seqhash?

There is a big problem with current sequence databases - they all use different identifiers and accession numbers. This means cross-referencing databases is a complicated exercise, especially as the quantity of databases increases, or if you need to compare “wild” DNA sequences.

Seqhash is a simple algorithm to produce consistent identifiers for any genetic sequence. The basic premise of the Seqhash algorithm is to hash sequences with the hash being a robust cross-database identifier. Sequences themselves shouldn’t be used as a database index (often, they’re too big), so a hash based off of a sequence is the next best thing. Usability wise, you should be able to Seqhash any rotation of a sequence in any direction and get a consistent hash.

For example, plasmids from different providers such as Addgene, iGEM, or FreeGenes could have slightly different names as they are passed from person to person to organization. A Seqhash for the plasmids in those repositories would ensure stable cross-platform identification of plasmid material.

What does a Seqhash look like?

A Seqhash is separated into 3 different elements divided by underscores. It looks like the following:

v1_DCD_4b0616d1b3fc632e42d78521deb38b44fba95cca9fde159e01cd567fa996ceb9

The first element is the version tag (v1 for version 1). If there is ever a Seqhash version 2, this tag will differentiate seqhashes. The second element is the metadata tag, which has 3 letters. The first letter codes for the sequenceType (D for DNA, R for RNA, and P for Protein). The second letter codes for whether or not the sequence is circular (C for Circular, L for Linear). The final letter codes for whether or not the sequence is double stranded (D for Double stranded, S for Single stranded). The final element is the blake3 hash of the sequence (once rotated and complemented).

How does Seqhash work?

The Seqhash algorithm makes several opinionated design choices, primarily to make working with Seqhashes more consistent and easy. The Seqhash algorithm only uses a single hash function, Blake3, and only operates on DNA, RNA, and Protein sequences. These identifiers will be seen by human beings, so versioning and metadata is attached to the front of the hashes so that a human operator can quickly identify problems with hashing.

If the sequence is DNA or RNA, the Seqhash algorithm needs to know whether or not the nucleic acid is circular and/or double stranded. If circular, the sequence is rotated to a deterministic point. If double stranded, the sequence is compared to its reverse complement, and the lexiographically minimal sequence is taken (whether or not the min or max is used doesn’t matter, just needs to be consistent). If the sequence is a Protein sequence, the Seqhash algorithm supports circularity in the rare case that you are working with a circular protein.

If the sequence is RNA, the sequence will be converted to DNA before hashing. While the full Seqhash will still be different between RNA and DNA (due to the metadata string), the hash afterwards will be the same. This makes it easy to cross reference DNA and RNA sequences. This becomes important in deduplication efforts for large sequence databases.

For DNA or RNA sequences, only ATUGCYRSWKMBDHVNZ characters are allowed. For Proteins, only ACDEFGHIKLMNPQRSTVWYUO*BXZ characters are allowed in sequences. Selenocysteine (Sec; U) and pyrrolysine (Pyl; O) are included in the protein character set - usually U and O don’t occur within protein sequences, but for certain organisms they do, and it is certainly a relevant amino acid for those particular proteins. B (Asx), X (Xaa), and Z (Glx) are included in the UniProtKB/SwissProt data dump, so these are included as valid amino acid sequences, even if their occurrence is historical baggage from a different time.

How do I get Seqhash?

The reference implementation for Seqhash is in the Golang Poly package. Seqhash is also available for Python through pypi. If you are interested in Seqhash in your own language, please contact me. It is a fairly simple algorithm to implement.

Keoni Gandall

PS: I had this idea for a while but didn’t know what the algorithm was called for deterministic points until I saw this Ginkgo Bioworks blog post. Thanks Ginkgo!

Seqhash tests output

$ php seqhash.tests.php
Caught exception: SequenceType must be one of [DNA, RNA, PROTEIN]. Got TNA.
Caught exception: Only letters ATUGCYRSWKMBDHVNZ are allowed for DNA/RNA. Got X.
Caught exception: Only letters ACDEFGHIKLMNPQRSTVWYUO*BXZ are allowed for proteins. Got J.
Caught exception: Proteins can't be double stranded.
Circular double stranded hashing passed.
Circular single stranded hashing passed.
Linear double stranded hashing passed.
Linear single stranded hashing passed.
Linear single stranded hashing (RNA) passed.
Linear single stranded hashing (protein) passed.
Complement mapping passed.
Reverse complement mapping passed.
Booth sequence rotation passed.
Caught exception: Invalid version info. Got v2.
Caught exception: Invalid metadata. Got ABC.
Caught exception: Invalid Blake3 hash. Expected a 64-character hex string. Got xyz616d1b3fc632e42d78521deb38b44fba95cca9fde159e01cd567fa996ceb9.