3. Cluster
Connected components¶
Connected components are a key concept in graph theory, a field within mathematics and computer science that explores the properties of graphs. A graph is made up of vertices (or nodes) and edges that link pairs of vertices. In an undirected graph, a connected component is defined as a maximal subgraph where each pair of vertices is connected by a path. This means that within a connected component, any vertex can be reached from any other vertex through some path, and there are no additional vertices in the overall graph that can be included in this component without losing this property.
Breadth-First Search (BFS) is an effective algorithm for finding connected components in an undirected graph. It starts from a given node, explores all its neighbors, then proceeds to explore the neighbors' neighbors, and so on. A queue is used to manage the vertices that need to be explored next. To identify all connected components using BFS, follow these steps:
Tip
Initialization
-
Visited List/Set: Maintain a list or set to track which vertices have been visited.
-
Component List: Initialize a list or counter to store the connected components found.
Iteration Over All Vertices
For each vertex in the graph, check if it has already been visited. If it hasn't been visited, it marks the beginning of a new connected component.
BFS Traversal
- Start a BFS from this unvisited vertex.
- Use a queue to manage the exploration process. Enqueue the starting vertex and mark it as visited.
- While the queue is not empty:
- Dequeue a vertex.
- For each of its neighbors, if they have not been visited, mark them as visited and enqueue them.
- All vertices visited during this BFS are part of the current connected component.
Repeated operation
Repeat the process for every unvisited vertex, thereby identifying new connected components.
Output
A list of connected components, with each component consisting of a set of vertices that are interconnected.
Programming BFS algorithm¶
We programmed a Python function for breadth-First Search for connected components in a UMI graph with two methods, deque
and set
.
The
deque
method
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
console
30/07/2024 13:47:17 logger: ======> root A has not been visited
30/07/2024 13:47:17 logger: ======> a queue built by root A is deque(['A'])
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['A'])
30/07/2024 13:47:17 logger: =========> node: A
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['B', 'C', 'D'])
30/07/2024 13:47:17 logger: =========> node: B
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['C', 'D'])
30/07/2024 13:47:17 logger: =========> node: C
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['D'])
30/07/2024 13:47:17 logger: =========> node: D
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['E', 'F'])
30/07/2024 13:47:17 logger: =========> node: E
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['F', 'G'])
30/07/2024 13:47:17 logger: =========> node: F
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['G'])
30/07/2024 13:47:17 logger: =========> node: G
30/07/2024 13:47:17 logger: =========> visited nodes {'A', 'B', 'D', 'F', 'E', 'G', 'C'}
30/07/2024 13:47:17 logger: =========> root B has been visited
30/07/2024 13:47:17 logger: =========> root C has been visited
30/07/2024 13:47:17 logger: =========> root D has been visited
30/07/2024 13:47:17 logger: =========> root E has been visited
30/07/2024 13:47:17 logger: =========> root F has been visited
30/07/2024 13:47:17 logger: =========> root G has been visited
[['A', 'B', 'C', 'D', 'E', 'F', 'G']]
The deque
function
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
In addition, there is a
set
method.
Python
1 2 3 4 5 6 7 8 |
|
console
30/07/2024 13:47:17 logger: ======> root A has not been visited
30/07/2024 13:47:17 logger: ======> a queue built by root A is ['A']
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['A']
30/07/2024 13:47:17 logger: ======> node: A
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['B', 'C', 'D']
30/07/2024 13:47:17 logger: ======> node: B
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['C', 'D']
30/07/2024 13:47:17 logger: ======> node: C
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['D']
30/07/2024 13:47:17 logger: ======> node: D
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['E', 'F']
30/07/2024 13:47:17 logger: ======> node: E
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['F', 'G']
30/07/2024 13:47:17 logger: ======> node: F
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['G']
30/07/2024 13:47:17 logger: ======> node: G
30/07/2024 13:47:17 logger: ======> visited nodes {'A', 'B', 'D', 'F', 'E', 'G', 'C'}
30/07/2024 13:47:17 logger: =========>root B has been visited
30/07/2024 13:47:17 logger: =========>root C has been visited
30/07/2024 13:47:17 logger: =========>root D has been visited
30/07/2024 13:47:17 logger: =========>root E has been visited
30/07/2024 13:47:17 logger: =========>root F has been visited
30/07/2024 13:47:17 logger: =========>root G has been visited
[['A', 'B', 'C', 'D', 'E', 'F', 'G']]
The set
function
Python
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
Cluster
for UMI deduplication¶
There are two methods to get the connected components from a UMI graph, a deque
method based on our customised code and a networkx
method using NetworkX
(installed via pip install pip install networkx
). The number of connected components are reported as the Cluster
method in UMI-tools1.
The
deque
method.
Python
1 2 3 4 5 6 7 |
|
console
{0: ['A', 'B', 'C', 'D', 'E', 'F', 'G']}
Deduplicated UMI count
There is only one cluster 0, and therefore, the deduplicated count of UMIs from 7 unique UMIs is 1 at the single locus.
The deque
function
Python
1 2 3 4 5 |
|
The
networkx
method. This method requires a graph as input represented by an edge list.
Python
1 2 3 4 5 6 7 8 9 10 11 |
|
console
{0: NodeView(('B', 'A', 'D', 'C', 'F', 'G', 'E'))}
The networkx
function
Python
1 2 3 4 5 6 7 8 |
|
-
Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976. ↩