Skip to content

1. Splitting UMIs

Splitting UMIs

We designed two strategies for splitting homotrimer UMIs into monomer UMIs.

Abstract

  1. The first strategy involves splitting monomer UMIs from homotrimer UMIs by resolving nucleotide heterogeneity obviated in each trimer block using majority vote. This can be accessed via ``.

  2. The second strategy, spALL, splits monomer UMIs without addressing nucleotide heterogeneity conflicts in trimer blocks. This can be accessed via ``.

Consider a homotrimer UMI sequence without an error on it.

trimer = AAACCCGGGTTTGGGAAATTTGGGCCCCCC

To split it into monomer UMIs with the majority vote, we can do

Python

1
2
3
4
5
import umiche as uc

splitter = uc.homotrimer.collapse()

print(splitter.split_by_mv('AAACCCGGGTTTGGGAAATTTGGGCCCCCC', recur_len=3))

console

{'ACGTGATGCC'}

We can add some errors to the trimer sequence. For example, we can split the following 4 erroneous UMIs.

Python

1
2
3
4
print(splitter.split_to_mv('AAATCCGGATTTCGGAAATTTGGGCACCCC', recur_len=3))
print(splitter.split_to_mv('AAATCCGGATTTCGGAAATTTGGGCCCCCC', recur_len=3))
print(splitter.split_to_mv('AAATCCGGATTTGGGAAATTTGGGCCCCCC', recur_len=3))
print(splitter.split_to_mv('AAATCCGGGTTTGGGAAATTTGGGCCCCCC', recur_len=3))

Using majority vote, all four UMI sequences are all split into the same UMI. This is because errors added to the sequences per trimer block. Thus, conflicts within trimer blocks can be addressed.

console

{'ACGTGATGCC'}
{'ACGTGATGCC'}
{'ACGTGATGCC'}
{'ACGTGATGCC'}

But if we use the split_to_all strategy, the errors lead to different numbers of split UMIs for 4 homotrimer UMIs, respectively.

Python

1
2
3
4
print(splitter.split_to_all('AAATCCGGATTTCGGAAATTTGGGCACCCC', recur_len=3))
print(splitter.split_to_all('AAATCCGGATTTCGGAAATTTGGGCCCCCC', recur_len=3))
print(splitter.split_to_all('AAATCCGGATTTGGGAAATTTGGGCCCCCC', recur_len=3))
print(splitter.split_to_all('AAATCCGGGTTTGGGAAATTTGGGCCCCCC', recur_len=3))

console

{'ATGTCATGCC', 'ACATGATGAC', 'ATATCATGAC', 'ACGTCATGCC', 'ACATGATGCC', 'ACGTCATGAC', 'ATGTCATGAC', 'ATATCATGCC', 'ACATCATGCC', 'ATATGATGAC', 'ATGTGATGAC', 'ATGTGATGCC', 'ACATCATGAC', 'ACGTGATGAC', 'ACGTGATGCC', 'ATATGATGCC'}
{'ATGTCATGCC', 'ACGTCATGCC', 'ACATGATGCC', 'ATATCATGCC', 'ACATCATGCC', 'ATATGATGCC', 'ACGTGATGCC', 'ATGTGATGCC'}
{'ACATGATGCC', 'ACGTGATGCC', 'ATATGATGCC', 'ATGTGATGCC'}
{'ACGTGATGCC', 'ATGTGATGCC'}

Homotrimer block conflicts

We can address conflicts within a homotrimer block in two ways.

  1. Randomly picking out a base from each block
  2. Sequentially picking out a base from each block
  3. Voting the most-common base from each block

Python

1
2
print(splitter.majority_vote('AAATCCGGGTTTGGGAAATTTGGGCCCCCC', recur_len=3))
print(splitter.take_by_order('AAATCCGGGTTTGGGAAATTTGGGCCCCCC', pos=0, recur_len=3))

Using take_by_order, we can sequentially pick out 1st base (pos as 0) from each block. While using majority_vote, the error in the 2nd block can be addressed by voting the most-common base C.

console

ACGTGATGCC
ATGTGATGCC

Majority vote

We can better understand how we split a homotrimer UMI with or without conflicts in blocks like below.

Python

1
2
3
print(splitter.vote('AAA', recur_len=3))
print(splitter.vote('TAA', recur_len=3))
print(splitter.vote('TGA', recur_len=3))

console

{'A'}
{'A'}
{'T', 'G', 'A'}