Skip to content

3. Cluster

Connected components

Connected components are a key concept in graph theory, a field within mathematics and computer science that explores the properties of graphs. A graph is made up of vertices (or nodes) and edges that link pairs of vertices. In an undirected graph, a connected component is defined as a maximal subgraph where each pair of vertices is connected by a path. This means that within a connected component, any vertex can be reached from any other vertex through some path, and there are no additional vertices in the overall graph that can be included in this component without losing this property.

Breadth-First Search (BFS) is an effective algorithm for finding connected components in an undirected graph. It starts from a given node, explores all its neighbors, then proceeds to explore the neighbors' neighbors, and so on. A queue is used to manage the vertices that need to be explored next. To identify all connected components using BFS, follow these steps:

Tip

Initialization

  • Visited List/Set: Maintain a list or set to track which vertices have been visited.

  • Component List: Initialize a list or counter to store the connected components found.

Iteration Over All Vertices

For each vertex in the graph, check if it has already been visited. If it hasn't been visited, it marks the beginning of a new connected component.

BFS Traversal

  • Start a BFS from this unvisited vertex.
  • Use a queue to manage the exploration process. Enqueue the starting vertex and mark it as visited.
  • While the queue is not empty:
  • Dequeue a vertex.
  • For each of its neighbors, if they have not been visited, mark them as visited and enqueue them.
  • All vertices visited during this BFS are part of the current connected component.

Repeated operation

Repeat the process for every unvisited vertex, thereby identifying new connected components.

Output

A list of connected components, with each component consisting of a set of vertices that are interconnected.

Programming BFS algorithm

We programmed a Python function for breadth-First Search for connected components in a UMI graph with two methods, deque and set.

⭐ The deque method

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import umiche as uc

graph_adj_mclumi = {
    'A': ['B', 'C', 'D'],
    'B': ['A', 'C'],
    'C': ['A', 'B'],
    'D': ['A', 'E', 'F'],
    'E': ['D', 'G'],
    'F': ['D', 'G'],
    'G': ['E', 'F'],
}

connected_components = uc.graph.cc(
    graph_adj=graph_adj_mclumi,
    method='deque',
    verbose=True
)
print(list(connected_components))

console

30/07/2024 13:47:17 logger: ======> root A has not been visited
30/07/2024 13:47:17 logger: ======> a queue built by root A is deque(['A'])
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['A'])
30/07/2024 13:47:17 logger: =========> node: A
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['B', 'C', 'D'])
30/07/2024 13:47:17 logger: =========> node: B
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['C', 'D'])
30/07/2024 13:47:17 logger: =========> node: C
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['D'])
30/07/2024 13:47:17 logger: =========> node: D
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['E', 'F'])
30/07/2024 13:47:17 logger: =========> node: E
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['F', 'G'])
30/07/2024 13:47:17 logger: =========> node: F
30/07/2024 13:47:17 logger: =========> a queue built by each root node deque(['G'])
30/07/2024 13:47:17 logger: =========> node: G
30/07/2024 13:47:17 logger: =========> visited nodes {'A', 'B', 'D', 'F', 'E', 'G', 'C'}
30/07/2024 13:47:17 logger: =========> root B has been visited
30/07/2024 13:47:17 logger: =========> root C has been visited
30/07/2024 13:47:17 logger: =========> root D has been visited
30/07/2024 13:47:17 logger: =========> root E has been visited
30/07/2024 13:47:17 logger: =========> root F has been visited
30/07/2024 13:47:17 logger: =========> root G has been visited
[['A', 'B', 'C', 'D', 'E', 'F', 'G']]

The deque function

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
def deque(
        graph : Dict,
):
    visited = set()
    for root, nbrs in graph.items():
        if root not in visited:
            visited.add(root)
            component = []
            queue = deque([root])
            while queue:
                node = queue.popleft()
                component.append(node)
                for nbr in graph[node]:
                    if nbr not in visited:
                        visited.add(nbr)
                        queue.append(nbr)
            yield component
        else:
            continue

⭐ In addition, there is a set method.

Python

1
2
3
4
5
6
7
8
import umiche as uc

connected_components = uc.graph.cc(
    graph_adj=graph_adj_mclumi,
    method='set',
    verbose=True
)
print(connected_components)

console

30/07/2024 13:47:17 logger: ======> root A has not been visited
30/07/2024 13:47:17 logger: ======> a queue built by root A is ['A']
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['A']
30/07/2024 13:47:17 logger: ======> node: A
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['B', 'C', 'D']
30/07/2024 13:47:17 logger: ======> node: B
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['C', 'D']
30/07/2024 13:47:17 logger: ======> node: C
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['D']
30/07/2024 13:47:17 logger: ======> node: D
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['E', 'F']
30/07/2024 13:47:17 logger: ======> node: E
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['F', 'G']
30/07/2024 13:47:17 logger: ======> node: F
30/07/2024 13:47:17 logger: ======> a queue built by each root node ['G']
30/07/2024 13:47:17 logger: ======> node: G
30/07/2024 13:47:17 logger: ======> visited nodes {'A', 'B', 'D', 'F', 'E', 'G', 'C'}
30/07/2024 13:47:17 logger: =========>root B has been visited
30/07/2024 13:47:17 logger: =========>root C has been visited
30/07/2024 13:47:17 logger: =========>root D has been visited
30/07/2024 13:47:17 logger: =========>root E has been visited
30/07/2024 13:47:17 logger: =========>root F has been visited
30/07/2024 13:47:17 logger: =========>root G has been visited
[['A', 'B', 'C', 'D', 'E', 'F', 'G']]

The set function

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def set(
        graph : Dict,
) -> List:
    visited = set()
    components = []
    for root, nbrs in graph.items():
        if root not in visited:
            visited.add(root)
            component = []
            queue = [root]
            while queue:
                node = queue.pop(0)
                component.append(node)
                for nbr in graph[node]:
                    if nbr not in visited:
                        visited.add(nbr)
                        queue.append(nbr)
            components.append(component)
        else:
            print('=========>root {} has been visited'.format(root))
    return components

Cluster for UMI deduplication

There are two methods to get the connected components from a UMI graph, a deque method based on our customised code and a networkx method using NetworkX (installed via pip install pip install networkx). The number of connected components are reported as the Cluster method in UMI-tools1.

⭐ The deque method.

Python

1
2
3
4
5
6
7
import umiche as uc

cc_cluster = uc.dedup.cluster(
    graph=graph_adj_mclumi,
    method='deque',
)
print(cc_cluster)

console

{0: ['A', 'B', 'C', 'D', 'E', 'F', 'G']}

Deduplicated UMI count

There is only one cluster 0, and therefore, the deduplicated count of UMIs from 7 unique UMIs is 1 at the single locus.

The deque function

Python

1
2
3
4
5
def cc(
        graph_adj,
):
    connected_components = list(gbfscc().deque(graph_adj))
    return {i: cc for i, cc in enumerate(connected_components)}

⭐ The networkx method. This method requires a graph as input represented by an edge list.

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
import umiche as uc

adj = uc.graph.adjacency(
    graph_adj=graph_adj_mclumi,
)

cc_cluster = uc.dedup.cluster(
    graph=adj.to_edge_list(rr=True),
    method='networkx',
)
print(cc_cluster)

console

{0: NodeView(('B', 'A', 'D', 'C', 'F', 'G', 'E'))}

The networkx function

Python

1
2
3
4
5
6
7
8
def ccnx(
        edge_list,
):
    import networkx as nx
    G = nx.Graph()
    for edge in edge_list:
        G.add_edge(edge[0], edge[1])
    return {i: G.subgraph(cc).nodes() for i, cc in enumerate(nx.connected_components(G))}


  1. Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017 Mar;27(3):491-499. doi: 10.1101/gr.209601.116. Epub 2017 Jan 18. PMID: 28100584; PMCID: PMC5340976.