2.4 Anchor simulation

To evaluate the effectiveness and robustness of anchor-based UMI identification versus a positional-based method under varying error rates (from \(10^{-5}\) to 0.1), in the presence of:

PCR errors
Sequencing errors
Insertion/deletion errors

Please see this tutorial for the Tresor application with UMIche.

1. Efficiency metric for UMI identification¶

To assess how well UMIs can be discovered under either strategy, we define:

\[ P = \frac{n}{N} \]

where:

\(N = 5000\): Total number of subsampled reads
\(n\): Number of reads whose UMIs were successfully identified

This is used to compare the performance between:

Positional strategy (based on fixed offsets in primer sequence)
Anchor strategy (using inserted sequences such as BAGC and a terminal base V)

2. Robustness against indel errors¶

To assess how well UMIs can be extracted in the presence of indels, we define:

\[ Q = \frac{m}{N} \]

Where:

\(m\): Number of reads with successfully identified UMIs under anchor design
\(N\): Total number of simulated reads
The denominator remains the same as in \(P\), allowing for fair comparison

This metric isolates the impact of indel tolerance, contrasting anchor-based extraction vs. methods without anchor (assuming indel-free reads).

3. Logistic fit function for captured read quantity¶

A logistic function is fitted to characterise how performance (i.e., estimated numbers of captured reads) degrades as error rate increases:

\[ l(x) = \frac{a}{1 + e^{-b(x - c)}} + \varepsilon \]

Where:

\(x\): P and Q
\(a, b, c\): Parameters to fit

This model helps quantify robustness of each method to increasing error rates.

Both simulations used a wide range of error rates:

\[ 10^{-5},\ 2.5 \times 10^{-5},\ 5 \times 10^{-5},\ 7.5 \times 10^{-5},\ 0.0001,\ 0.00025,\ 0.0005,\ 0.00075,\ 0.001,\ 0.0025,\ 0.005,\ 0.0075,\ 0.01,\ 0.025,\ 0.05,\ 0.075,\ 0.1 \]

These allow for comprehensive simulation-based benchmarking of UMI extraction robustness under real-world sequencing noise conditions.