5. Directional
Since Cluster
and Adjacency
often led to an underestimated number and an overestimated number of unique UMIs, the Directional
method was developed to seek for a more balanced estimation between somewhat of the two extremes. Its deduplication process can coarsely be described as a directed edge-visiting strategy, that is, merging node B by node A if the count of node A is at least two-fold greater than that of node B. The Directional
method in the UMI-tools suite is reported to gain the highest accuracy in identifying PCR duplicates1.
Similar to the Adjacency
method, Directional
also draws on the information about the count of UMIs. In the example graph, the counts of the 7 unique UMIs are given in node_val_sorted
.
1 2 3 4 5 6 7 8 9 |
|
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
console
30/07/2024 15:52:33 logger: ======>visited UMI nodes: {'A'}
30/07/2024 15:52:33 logger: {'A'}
30/07/2024 15:52:33 logger: =========>the neighbor: B
30/07/2024 15:52:33 logger: ======>visited UMI nodes: {'A', 'B'}
30/07/2024 15:52:33 logger: {'A', 'B'}
30/07/2024 15:52:33 logger: =========>the neighbor: A
30/07/2024 15:52:33 logger: =========>the neighbor: C
30/07/2024 15:52:33 logger: =========>the neighbor: C
30/07/2024 15:52:33 logger: ======>visited UMI nodes: {'A', 'B', 'C'}
30/07/2024 15:52:33 logger: {'A', 'B', 'C'}
30/07/2024 15:52:33 logger: =========>the neighbor: A
30/07/2024 15:52:33 logger: =========>the neighbor: B
30/07/2024 15:52:33 logger: =========>the neighbor: D
30/07/2024 15:52:33 logger: remaining: {'D', 'F', 'G', 'E'}
30/07/2024 15:52:33 logger: disapproval [['B', 'C'], ['A', 'D']]
30/07/2024 15:52:33 logger: ======>visited UMI nodes: {'D'}
30/07/2024 15:52:33 logger: {'D'}
30/07/2024 15:52:33 logger: =========>the neighbor: A
30/07/2024 15:52:33 logger: =========>the neighbor: E
30/07/2024 15:52:33 logger: =========>the neighbor: F
30/07/2024 15:52:33 logger: ======>visited UMI nodes: {'D', 'F'}
30/07/2024 15:52:33 logger: {'D', 'F'}
30/07/2024 15:52:33 logger: =========>the neighbor: D
30/07/2024 15:52:33 logger: =========>the neighbor: G
30/07/2024 15:52:33 logger: remaining: {'G', 'E'}
30/07/2024 15:52:33 logger: disapproval [['D', 'E'], ['F', 'G']]
30/07/2024 15:52:33 logger: ======>visited UMI nodes: {'E'}
30/07/2024 15:52:33 logger: {'E'}
30/07/2024 15:52:33 logger: =========>the neighbor: D
30/07/2024 15:52:33 logger: =========>the neighbor: G
30/07/2024 15:52:33 logger: ======>visited UMI nodes: {'G', 'E'}
30/07/2024 15:52:33 logger: {'G', 'E'}
30/07/2024 15:52:33 logger: =========>the neighbor: E
30/07/2024 15:52:33 logger: =========>the neighbor: F
30/07/2024 15:52:33 logger: remaining: set()
30/07/2024 15:52:33 logger: disapproval []
deduplicated count:
3
deduplicated clusters:
{'cc_0': {'node_A': ['A', 'B', 'C'], 'node_D': ['D', 'F'], 'node_E': ['G', 'E']}}
deduplicated clusters decomposed:
{0: ['A', 'B', 'C'], 1: ['D', 'F'], 2: ['G', 'E']}
Deduplicated UMI count
There are 3 connected subcomponents (node_A
, node_D
, and node_E
) in connected component cc_0
and therefore, the deduplicated count of UMIs from 7 unique UMIs is 3 at the single locus.
In addition, if we want to explore if merged UMIs are of the same or different origin, we can also output the statistics.
Python
1 2 3 4 5 |
|
console
{'cc_0': {'node_A': [['A', 'B'], ['A', 'C']], 'node_D': [['D', 'F']], 'node_E': [['E', 'G']]}}
{'cc_0': {'node_A': [['B', 'C'], ['A', 'D']], 'node_D': [['D', 'E'], ['F', 'G']], 'node_E': []}}
We implemented the Directional
method in UMIche.
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |
|
The directional_search
function utilises a depth-first search (DFS) algorithm for travesing the UMI graph. It is written this way
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
|
-
Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976. ↩