Tmranalysis

summary

"Sequencing meets cryptography: quadratic substitution error suppression through homotrimer redundancy". Please check a preprint here .

The homotrimer unique molecular identifier (UMI), a simple yet powerful design in which each nucleotide is tripled (e.g., A → AAA), has demonstrated strong empirical performance in reducing sequencing errors¹.

We provide a theoretical foundation for this approach by framing it through the lens of information theory and error-correcting codes, specifically triple modular redundancy (TMR) and the (3,1) repetition code used in cryptography. Unlike a binary bit flip, a nucleotide substitution can produce any of three alternative bases, so the erroneous bases within a block may disagree with one another and majority voting can end in a tie.

We derive an analytical model that accounts for both deterministic majority voting and stochastic tie-breaking, and show that the probability of a decoding error scales quadratically with the per-base substitution rate. Compared to conventional UMIs and classical binary redundancy models, homotrimer UMIs provide high resilience under sequencing error conditions. The theoretical framework supports the robustness of homotrimer redundancy and offers a new route to optimising UMI design, bridging principles from cryptography and molecular biology.

Theoretical Derivation of Error Probabilities

1. TMR

We provide a conceptual analogy between repetition code error correction in cryptography and nucleotide-level error correction in sequencing using TMR.

Fig 1. Overview of TMR applications in cryptography and sequencing.

2. Binary (3,1) Repetition Code

In digital communication, the canonical characters Alice and Bob are often used to represent the sender and receiver of a message. Suppose Alice wishes to transmit the bit 0. To protect against transmission errors, she applies TMR to encode 0 as 000, also known as the (3,1) repetition code. Due to potential noise in the channel, Bob may receive any of eight possible 3-bit combinations. As shown in Fig. 2, Bob applies majority voting to decode them into the original bit: if two or more of the received bits agree, that value is interpreted as the intended message. If a single bit is flipped, regardless of its position, the voting mechanism still recovers the original message 0. Although four of the eight possible received words decode to the wrong bit, each of them requires at least two flips, so the actual probability of an incorrect interpretation remains low when the per-bit flip probability is small.

A classic (3,1) repetition code scheme is illustrated in a binary system, where Alice sends a bit 0 encoded as 000, and Bob decodes received messages using majority voting. A probability model quantifies the probability of correct or incorrect interpretation based on the number and pattern of bit flips.
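The decoding rule described above can be sketched in a few lines of Python. This is an illustrative example only; the helper name `majority_vote` is hypothetical and not part of UMIche. It lists every 3-bit word Bob could receive when Alice sends 000, together with the number of flipped bits and the decoded value.

```python
from itertools import product

def majority_vote(bits):
    """Return the value that appears at least twice among three bits."""
    return 1 if sum(bits) >= 2 else 0

# Alice sends 0 encoded as 000; enumerate every word Bob could receive.
decoded = {word: majority_vote(word) for word in product((0, 1), repeat=3)}

for word, bit in decoded.items():
    flips = sum(word)  # number of flipped bits relative to 000
    print("".join(map(str, word)), "flips:", flips, "decoded:", bit)

# Count the received words that decode to the wrong bit.
n_wrong = sum(bit != 0 for bit in decoded.values())
print("wrongly decoded words:", n_wrong, "of", len(decoded))
```

Every wrongly decoded word contains two or three flips, which is why the block error probability has no term linear in the flip probability.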

Fig 2. TMR in cryptography.

If each bit has an independent probability \(p\) of flipping, then the probability \(p_\text{rc_block}\) of a decoding error after applying majority voting is:

\[ p_\text{rc_block} = 3p^2(1 - p) + p^3 = 3p^2 - 2p^3 \]
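As a sanity check, the closed form can be compared against a brute-force sum over all flip patterns. The sketch below assumes independent flips with probability \(p\), as stated above; the function name `rc_block_error` is hypothetical. Exact rational arithmetic is used so the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import product

def rc_block_error(p):
    """Sum the probabilities of all flip patterns that majority voting decodes wrongly."""
    total = Fraction(0)
    for flips in product((0, 1), repeat=3):  # 1 marks a flipped bit
        k = sum(flips)
        if k >= 2:  # two or three flips outvote the original bit
            total += p**k * (1 - p) ** (3 - k)
    return total

p = Fraction(1, 100)  # example flip probability, chosen only for illustration
enumerated = rc_block_error(p)
closed_form = 3 * p**2 - 2 * p**3
print(enumerated, closed_form, enumerated == closed_form)
```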

3. Homotrimer Block Error Probability

We extend this logic to sequencing, where each base is encoded as a homotrimer (e.g., A as AAA). Unlike a binary channel, DNA sequencing works over a four-base alphabet, so a substitution error replaces a base with one of three alternatives (e.g., A → C, G, or T). Majority-vote decoding is retained whenever a majority exists; when all three bases differ (e.g., ACG), the block is collapsed by selecting one of them at random. We derive a probability model that quantifies correct versus incorrect block interpretation as a function of the per-base error rate \(p\), accounting for the combinatorics of the 1-, 2-, and 3-substitution scenarios.
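The collapsing rule can be illustrated with a minimal sketch. The function `collapse_homotrimer` below is a hypothetical helper written for this explanation, not UMIche's implementation; it assumes the tie-break is a uniform random pick among the three observed bases.

```python
import random
from collections import Counter

def collapse_homotrimer(block, rng=random):
    """Collapse a 3-base block to one base: majority vote, else a random pick."""
    if len(block) != 3:
        raise ValueError("a homotrimer block must contain exactly three bases")
    base, count = Counter(block).most_common(1)[0]
    if count >= 2:
        return base  # at least two bases agree
    return rng.choice(block)  # all three differ: uniform random selection

print(collapse_homotrimer("AAA"))  # no error
print(collapse_homotrimer("ACA"))  # one substitution, majority still A
print(collapse_homotrimer("CCA"))  # two identical substitutions outvote A
print(collapse_homotrimer("ACG", random.Random(0)))  # tie, resolved at random
```

Passing a seeded `random.Random` instance makes the tie-break reproducible in tests.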

Fig 3. TMR in sequencing.

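Assume each substitution is equally likely to produce any of the three alternative bases. A block with no substitution or a single substitution is always decoded correctly. Two substitutions occur with probability \(3p^2(1-p)\): with probability \(\frac{1}{3}\) the two erroneous bases are identical and outvote the original base, and with probability \(\frac{2}{3}\) all three bases differ, so the random selection is wrong with probability \(\frac{2}{3}\), giving a conditional error probability of \(\frac{1}{3} + \frac{2}{3}\cdot\frac{2}{3} = \frac{7}{9}\). Three substitutions occur with probability \(p^3\) and always yield a wrong base. Therefore, the probability \(p_\text{ht_block}\) of a mistakenly decoded nucleotide from a homotrimer block is: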

\[ p_\text{ht_block} = \frac{7}{3}p^2(1-p) + p^3 = \frac{7}{3}p^2 - \frac{4}{3}p^3 \]

Here, \(p\) represents the per-base substitution error rate.
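The formula can be checked by enumerating all \(4^3 = 64\) possible received blocks. The sketch below is a hypothetical verification script, not part of UMIche. It assumes that substitutions are independent, that each substitution lands on any of the three alternative bases with probability \(p/3\), and that ties are broken uniformly at random.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BASES = "ACGT"

def ht_block_error(p, true_base="A"):
    """Exact probability that a homotrimer block is collapsed to a wrong base."""
    total = Fraction(0)
    for block in product(BASES, repeat=3):
        # Probability of observing this block given the true base.
        prob = Fraction(1)
        for b in block:
            prob *= (1 - p) if b == true_base else p / 3
        base, count = Counter(block).most_common(1)[0]
        if count >= 2:
            # A majority exists: the call is wrong only if the majority base is wrong.
            wrong = Fraction(int(base != true_base))
        else:
            # Three distinct bases: uniform random pick among them.
            wrong = Fraction(sum(b != true_base for b in block), 3)
        total += prob * wrong
    return total

p = Fraction(1, 100)  # example rate, chosen only for illustration
enumerated = ht_block_error(p)
closed_form = Fraction(7, 3) * p**2 - Fraction(4, 3) * p**3
print(enumerated, closed_form, enumerated == closed_form)
```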

4. Example: Substitution error rate of \(10^{-5}\)

Given \(p = 0.00001\), the homotrimer block decoding error becomes:

\[ p_{\text{ht_block}} = \frac{7}{3}(0.00001)^2 - \frac{4}{3}(0.00001)^3 = \frac{7}{3} \cdot 10^{-10} - \frac{4}{3} \cdot 10^{-15} \approx 2.3333 \times 10^{-10} - 1.3333 \times 10^{-15} \approx 2.33332 \times 10^{-10} \]

This demonstrates the quadratic suppression of substitution errors enabled by homotrimer redundancy.
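The arithmetic above can be reproduced without UMIche using exact fractions. This is a small illustrative sketch; the variable names are arbitrary, and the ratio of the per-base rate to the block rate is included only to show the size of the suppression.

```python
from fractions import Fraction

p = Fraction(1, 10**5)  # per-base substitution error rate

# Closed-form homotrimer block error probability.
ht_block = Fraction(7, 3) * p**2 - Fraction(4, 3) * p**3
print(f"{float(ht_block):.10e}")  # 2.3333200000e-10

# How many times smaller the block error is than the per-base error.
suppression = p / ht_block
print(f"suppression factor: {float(suppression):.1f}")
```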

Error rate calculation using UMIche

Let’s start with some preparation. We can use UMIche to calculate UMI error rates under various scenarios as follows. We initialize an error rate (\(p=0.00001\)) representing the probability of an error occurring at a single nucleotide.

import umiche as uc

ht_tmr = uc.homotrimer.tmr(error_rate=0.00001)

1. Block error rate

We consider a building block to be erroneous if majority voting fails to correctly identify the original nucleotide. Based on this definition, we can calculate the probability of such an error.

ht_tmr.homotrimer_block_error