2.3 Error propagation
We developed a method for estimating error propagation through PCR amplification. Specifically, we track all types of nucleotide-level errors introduced during library preparation, along with their error types. This information is recorded in a table that provides a per-sequence breakdown of error status. We had a specific analysis here using the method.
The table contains four columns: the first three correspond to substitution, deletion, and insertion errors for mut
, del
, and ins
, respectively. The fourth column indicates whether the sequence acquired an error during PCR amplification. At the time of initial library preparation, this fourth column is set to False for all entries. With each round of PCR amplification, the table is updated accordingly. Each row represents a single sequence.
Before sampling, the table looks like:
mut del ins pcr_err_mark
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False True False False
8 False False False False
9 False False False False
10 False False False False
11 False False False False
12 False False False False
13 False False False False
14 False False False False
15 False True False False
16 False False False False
17 False False False False
18 False False False False
19 False False False False
20 False True False False
21 False False False False
22 False False False False
23 False False False False
24 False True False False
25 False False False False
26 False False False False
27 False False False False
28 False False False False
29 False False False False
30 False False False False
31 False False False False
32 False False False False
33 False False False False
34 False False False False
35 False False True False
36 False False False False
37 False False True False
38 False False False False
39 False True False False
40 False False False False
41 False False False False
42 False False False False
43 False False True False
44 False False False False
45 False False False False
46 False False True False
47 False False False False
48 False False False False
49 False False False False
Before sampling, the table looks like:
mut del ins pcr_err_mark
0 True False False False
1 False False False False
2 False False False False
3 False False False False
4 False True False False
... ... ... ... ...
23809 False False False True
23810 False False False False
23811 False True False False
23812 False False True False
23813 False False False False
[23814 rows x 4 columns]
After sampling (500 reads as sequencing depth), the table looks like:
mut del ins pcr_err_mark
0 False False False False
1 False False False False
2 False False False False
3 False True False False
4 False False False False
.. ... ... ... ...
495 False True False False
496 False False False False
497 False False False False
498 False True False False
499 False False False False
[500 rows x 4 columns]
Error Rate Estimation: Bead Synthesis and PCR Amplification¶
To quantify error propagation in sequencing, we define the following two key metrics based on per-read error flags. We first define a few symbols.
- Let \( N \) be the total number of reads.
-
For each read \( i \) (where \( i = 1, 2, \dots, N \))
-
\( d_i = 1 \) if a deletion error occurred during bead synthesis, otherwise \( d_i = 0 \).
- \( p_i = 1 \) if an error occurred during PCR amplification, otherwise \( p_i = 0 \).
We calculate the proportion of reads with errors from either bead synthesis or PCR amplification.
where \( \mathbb{f}(\cdot) \) is a function that is equal to 1 if the condition is true, and 0 otherwise.
We calculate the proportion of reads that have a deletion error during bead synthesis, regardless of PCR status.
These equations were directly computed.
err_union = df['del'] | df['pcr_err_mark']
p_bead_or_pcr = err_union[err_union].shape[0] / err_union.shape[0]
p_bead = df[df['del']].shape[0] / df.shape[0]