1. UMI graph
In this tutorial, I would walk you through basic operations on a graph constructed by unique UMIs observed at a single genomic locus.
UMI and graph nodes¶
UMIs are imagined as nodes in a graph. There are two UMI graphs from UMI-tools1 and mclUMI2.
UMI-tools graph: 6 unique UMIs observed at a locus from raw reads. Its adjacent list:
Python
1 2 3 4 5 6 7 8 |
|
mclUMI graph: 7 unique UMIs observed at a locus from raw reads. Its adjacent list:
Python
1 2 3 4 5 6 7 8 9 |
|
For graph_adj_mclumi
, the graph is visually represented this way.

Build a UMI graph¶
Python
1 2 3 4 5 6 |
|
console
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
If you want to turn it to a graph with digital nodes.
Python
1 |
|
console
30/07/2024 12:02:10 logger: ===>the graph is being mapped
30/07/2024 12:02:10 logger: ======>key map: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
30/07/2024 12:02:10 logger: ======>the graph is a dict
30/07/2024 12:02:10 logger: ======>the mapped graph: {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 4, 5], 4: [3, 6], 5: [3, 6], 6: [4, 5]}
{0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 4, 5], 4: [3, 6], 5: [3, 6], 6: [4, 5]}
Python dictionary
, which is the same as the form of the graph adjacent list.
Python
1 |
|
console
{'A': ['B', 'C', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['A', 'E', 'F'], 'E': ['D', 'G'], 'F': ['D', 'G'], 'G': ['E', 'F']}
Python set
Python
1 |
|
console
{'A': {'B', 'C', 'D'}, 'B': {'A', 'C'}, 'C': {'A', 'B'}, 'D': {'A', 'E', 'F'}, 'E': {'G', 'D'}, 'F': {'G', 'D'}, 'G': {'E', 'F'}}
Python list
Python
1 |
|
console
[['B', 'C', 'D'], ['A', 'C'], ['A', 'B'], ['A', 'E', 'F'], ['D', 'G'], ['D', 'G'], ['E', 'F']]
Python numpy ndarray
: adjacency matrix
Python
1 |
|
console
[[0. 1. 1. 1. 0. 0. 0.]
[1. 0. 1. 0. 0. 0. 0.]
[1. 1. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 1. 1. 0.]
[0. 0. 0. 1. 0. 0. 1.]
[0. 0. 0. 1. 0. 0. 1.]
[0. 0. 0. 0. 1. 1. 0.]]
Edge list: symmetrical matrix containing repeated edges
Python
1 |
|
console
[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B'), ('D', 'A'), ('D', 'E'), ('D', 'F'), ('E', 'D'), ('E', 'G'), ('F', 'D'), ('F', 'G'), ('G', 'E'), ('G', 'F')]
Edge list: triangular matrix extracted from symmetrical matrix. You need to set rr
as True
.
Python
1 2 3 |
|
console
[('C', 'B'), ('C', 'A'), ('B', 'A'), ('G', 'E'), ('E', 'D'), ('G', 'F'), ('D', 'A'), ('F', 'D')]
UMI edges¶
A graph can be represented by an edge list as illustrated above. We put the above edge list to_edge_list(rr=False)
to uc.graph.edge
and we will be able to operate on the edges.
Python
1 2 3 4 5 6 7 8 |
|
console
graph: [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B'), ('D', 'A'), ('D', 'E'), ('D', 'F'), ('E', 'D'), ('E', 'G'), ('F', 'D'), ('F', 'G'), ('G', 'E'), ('G', 'F')]
key_mapped: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
nodes: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
If we want to get non-redundant edges in the edge list, we can do
Python
1 |
|
console
[('G', 'E'), ('F', 'D'), ('E', 'D'), ('C', 'A'), ('B', 'A'), ('D', 'A'), ('G', 'F'), ('C', 'B')]
Then, we can still output the properties of the edge-based graph.
Python
1 2 3 4 |
|
console
[('G', 'E'), ('F', 'D'), ('E', 'D'), ('C', 'A'), ('B', 'A'), ('D', 'A'), ('G', 'F'), ('C', 'B')]
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
[('G', 'E'), ('F', 'D'), ('E', 'D'), ('C', 'A'), ('B', 'A'), ('D', 'A'), ('G', 'F'), ('C', 'B')]
We can convert it back the adjacency list and a new graph with digits mapped from the original graph.
Python
1 2 |
|
console
adjacency list: {'A': ['C', 'B', 'D'], 'B': ['A', 'C'], 'C': ['A', 'B'], 'D': ['F', 'E', 'A'], 'E': ['G', 'D'], 'F': ['D', 'G'], 'G': ['E', 'F']}
graph_mapped: [(6, 4), (5, 3), (4, 3), (2, 0), (1, 0), (3, 0), (6, 5), (2, 1)]
-
Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976. ↩
-
Jianfeng Sun and Adam Cribss. mclUMI: Markov clustering of unique molecular identifiers allows removing PCR duplicates dynamically. https://github.com/2003100127/mclumi ↩