A new development in genomic research, the Genomic Tokenizer (GT), has been introduced to enhance DNA sequence analysis through a biology-driven tokenization method. The approach follows the central dogma of molecular biology by tokenizing on codons, the three-nucleotide sequences that each encode an amino acid or a stop signal. The Genomic Tokenizer incorporates start codons, synonymous codons, and stop codons into the tokenizer interface of the HuggingFace Transformers package, which lets it handle shifts in reading frame caused by nucleotide additions or deletions. The tokenizer also preserves the biological nuances inherent in genetic variations during tokenization. Its vocabulary includes all possible codons, but synonymous codons that code for the same amino acid are assigned the same ID, which improves the tokenizer's efficiency.
Genomic Tokenizer: Toward a biology-driven tokenization in transformer models for DNA sequences https://t.co/yN90IYcZA7 🧬🖥️🧪 https://t.co/uBG9ryWF8a
Genomic Tokenizer ensures that the biological nuances inherent in genetic variations are preserved during tokenization. The vocabulary includes all possible codons, but synonymous codons coding for the same amino acid are assigned the same ID, improving the efficiency of the tokenizer.
Genomic Tokenizer (GT) incorporates start codons, synonymous codons, and stop codons into the tokenizer interface of the HuggingFace Transformers package, giving it the ability to handle shifts in reading frames caused by nucleotide additions or deletions within DNA sequences.
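
To make the codon idea concrete, here is a minimal, standalone Python sketch of codon-level tokenization in which synonymous codons share one ID. It is not the Genomic Tokenizer's actual code and does not use the HuggingFace interface. The names `tokenize`, `CODON_TO_SYMBOL`, `SYMBOL_TO_ID`, and `UNK_ID`, the ID numbering, and the `frame` parameter are all assumptions made for illustration. The only fixed input is the standard genetic code table.

```python
# Illustrative sketch only: names and ID numbering are hypothetical,
# not taken from the Genomic Tokenizer package.

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino-acid letter ('*' marks a stop codon).
CODON_TO_SYMBOL = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# One ID per amino acid (plus one for stop), so synonymous codons collide
# on purpose. ID 0 is reserved for codons that contain unknown bases.
UNK_ID = 0
SYMBOL_TO_ID = {s: n for n, s in enumerate(sorted(set(AMINO)), start=1)}


def tokenize(seq, frame=0):
    """Split seq into codons starting at offset `frame` and return token IDs.

    A trailing partial codon (one or two leftover bases) is dropped.
    """
    seq = seq.upper()
    ids = []
    for pos in range(frame, len(seq) - 2, 3):
        symbol = CODON_TO_SYMBOL.get(seq[pos:pos + 3])
        ids.append(SYMBOL_TO_ID.get(symbol, UNK_ID))
    return ids


# CTT and CTG both encode leucine, so the two sequences tokenize identically.
print(tokenize("CTTATG") == tokenize("CTGATG"))

# Inserting one base after the start codon shifts every downstream codon.
print(tokenize("ATGGCCTAA"))
print(tokenize("ATGAGCCTAA"))
```

In this sketch the 64 codons collapse to 21 IDs (20 amino acids plus stop), which is one way to read the efficiency claim above. The last two calls show why a single-nucleotide insertion matters: every codon downstream of the insertion changes, so a codon-based tokenizer needs some strategy for choosing or recovering the reading frame. How GT itself does this is not described here.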