Convert
1. Combining single proteins into one
1.1 Command
1.2 Results
>1aigL
ALLSFERKYRVPGGTLVGGNLFDFWVGPFYVGFFGVATFFFAALGIILIAWSAVLQGTWNPQLISVYPPALEYGLGGAPLAKGGLWQIITICATGAFVSWALREVEICRKLGIGYHIPFAFAFAILAYLTLVLFRPVMMGAWGYAFPYGIWTHLDWVSNTGYTYGNFHYNPAHMIAISFFFTNALALALHGALVLSAANPEKGKEMRTPDHEDTFFRDLVGYSIGTLGIHRLGLLLSLSAVFFSALCMIITGTIWFDQWVDWWQWWVKLPWWANIPGGING
>1aijL
ALLSFERKYRVPGGTLVGGNLFDFWVGPFYVGFFGVATFFFAALGIILIAWSAVLQGTWNPQLISVYPPALEYGLGGAPLAKGGLWQIITICATGAFVSWALREVEICRKLGIGYHIPFAFAFAILAYLTLVLFRPVMMGAWGYAFPYGIWTHLDWVSNTGYTYGNFHYNPAHMIAISFFFTNALALALHGALVLSAANPEKGKEMRTPDHEDTFFRDLVGYSIGTLGIHRLGLLLSLSAVFFSALCMIITGTIWFDQWVDWWQWWVKLPWWANIPGGING
>1xqfA
AVADKADNAFMMICTALVLFMTIPGIALFYGGLIRGKNVLSMLTQVTVTFALVCILWVVYGYSLAFGEGNNFFGNINWLMLKNIELTAVMGSIYQYIHVAFQGSFACITVGLIVGALAERIRFSAVLIFVVVWLTLSYIPIAHMVWGGGLLASHGALDFAGGTVVHINAAIAGLVGAYLPHNLPMVFTGTAILYIGWFGFNAGSAGTANEIAALAFVNTVVATAAAILGWIFGEWALRGKPSLLGACSGAIAGLVGVTPACGYIGVGGALIIGVVAGLAGLWGVTMPCDVFGVHGVCGIVGCIMTGIFAASSLGGVGFAEGVTMGHQLLVQLESIAITIVWSGVVAFIGYKLADLTVGLRVP
27/07/2024 01:36:24 logger: =========>integrate 3 protein sequences
27/07/2024 01:36:24 logger: =========>save fasta to files
27/07/2024 01:36:24 logger: =========>save finished
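The idea behind this step can be sketched with the standard library alone: read several single-record FASTA files and write them into one multi-record file. This is only an illustration, not the pp API. The function name combine_fasta, the '.fasta' suffix, and the shortened toy sequences are assumptions.

```python
from pathlib import Path
import tempfile


def combine_fasta(fasta_dir, names, out_path, suffix=".fasta"):
    """Concatenate single-record FASTA files into one multi-record file.

    Hypothetical helper for illustration; not part of the pp library.
    """
    records = []
    for name in names:
        text = (Path(fasta_dir) / f"{name}{suffix}").read_text()
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        header, seq = lines[0], "".join(lines[1:])
        if not header.startswith(">"):
            raise ValueError(f"{name}{suffix} does not start with a FASTA header")
        records.append(f"{header}\n{seq}\n")
    Path(out_path).write_text("".join(records))
    return len(records)


# Toy demo: the sequences are shortened placeholders, not full chains.
with tempfile.TemporaryDirectory() as tmp:
    for name, seq in {"1aigL": "ALLSFERKY", "1xqfA": "AVADKADNAF"}.items():
        (Path(tmp) / f"{name}.fasta").write_text(f">{name}\n{seq}\n")
    out = Path(tmp) / "combined.fasta"
    n_records = combine_fasta(tmp, ["1aigL", "1xqfA"], out)
    combined_text = out.read_text()

print(n_records)
print(combined_text)
```

Each sequence is written on a single line, matching the layout of the results shown above.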
2. Splitting into single proteins
2.1 Proteins from DrugBank
2.2 FASTA from UniProt
Extract only human proteins by setting the species to 'HUMAN'.
After deduplication, 2888 FASTA files are written to the output directory ('data/msa/').
27/07/2024 02:17:26 logger: =========>content of the dataframe:
Index(['fasta_ids', 'fasta_seqs', 'fasta_names', 'fasta_dpts', 'target_ids',
'drug_ids'],
dtype='object')
27/07/2024 02:17:26 logger: =========>target IDs:
0 P45059
1 P19113
2 Q9UI32
3 P00488
4 P35228
...
3272 P07766
3273 P09693
3274 P20963
3275 P03420
3276 P11209
Name: target_ids, Length: 3277, dtype: object
27/07/2024 02:17:26 logger: =========>start to split into single fasta files
27/07/2024 02:17:26 logger: =========># of proteins: 3277
27/07/2024 02:17:26 logger: =========># of proteins after deduplication: 2888
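
The drop from 3277 to 2888 proteins in the log comes from deduplication before writing. A minimal sketch of that split-and-deduplicate step is given here, assuming that duplicates are identified by target ID and that the first occurrence is kept. Both are assumptions, as are the function name and the placeholder sequences.

```python
from pathlib import Path
import tempfile


def split_to_single_fasta(ids, seqs, out_dir, suffix=".fasta"):
    """Write one FASTA file per unique ID, keeping the first sequence seen.

    Hypothetical helper for illustration; not part of the pp library.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    unique = {}
    for id_, seq in zip(ids, seqs):
        unique.setdefault(id_, seq)  # first occurrence wins
    for id_, seq in unique.items():
        (out_dir / f"{id_}{suffix}").write_text(f">{id_}\n{seq}\n")
    return len(unique)


# Toy demo: real target IDs from the log, placeholder sequences.
ids = ["P45059", "P19113", "P45059"]
seqs = ["MKV", "MEP", "MKV"]
with tempfile.TemporaryDirectory() as tmp:
    n_written = split_to_single_fasta(ids, seqs, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())

print(len(ids), "->", n_written)
print(written)
```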

3. Extract sequences from MSA
pp.convert.msa2fas extracts each homolog from an MSA and writes it in FASTA format.
Python
Output
28/07/2024 09:37:52 logger: =========>extract E to save
28/07/2024 09:37:52 logger: =========>extract UR100_A0A023PSW1 to save
28/07/2024 09:37:52 logger: =========>extract UR100_A0A088DKU1 to save
28/07/2024 09:37:52 logger: =========>extract UR100_A0A0K1Z002 to save
...
28/07/2024 09:37:52 logger: =========>extract UR100_U3M734 to save
28/07/2024 09:37:52 logger: =========>extract UR100_V5N926 to save
'Finished'
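To make the extraction concrete, here is a rough standard-library sketch of pulling each homolog out of an aligned FASTA file. It is not the implementation of pp.convert.msa2fas. In particular, removing the gap characters '-' and '.' is an assumption about what "extract" means here, and the aligned toy sequences are invented.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def ungap(aligned):
    """Remove alignment gap characters (assumed to be '-' and '.')."""
    return aligned.replace("-", "").replace(".", "")


# Toy MSA: names taken from the log above, aligned sequences are invented.
msa_text = ">E\nMK-LV\n>UR100_A0A023PSW1\nM-ALV\n"
homologs = {name: ungap(seq) for name, seq in parse_fasta(msa_text).items()}

for name, seq in homologs.items():
    print(f">{name}\n{seq}")
```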
4. Convert between MSA formats
pp.convert.msa_reformat converts an MSA from one format to another, e.g., from FASTA format to Stockholm format.
Python
Output
28/07/2024 09:53:30 logger: =========> convert from fasta to stockholm
'Finished'
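For readers unfamiliar with the target format, the sketch below writes a minimal Stockholm file by hand: a '# STOCKHOLM 1.0' header, one 'name sequence' line per aligned record, and a closing '//'. It does not reproduce pp.convert.msa_reformat, whose internals are not shown here. The function names and toy alignment are assumptions, and optional Stockholm annotation lines are omitted.

```python
def read_aligned_fasta(text):
    """Return a list of (name, aligned_sequence) in file order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].split()[0], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def to_stockholm(records):
    """Format aligned records as a minimal single-block Stockholm string."""
    if not records:
        raise ValueError("no sequences to write")
    if len({len(seq) for _, seq in records}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name, _ in records)
    lines = ["# STOCKHOLM 1.0"]
    lines += [f"{name.ljust(width)} {seq}" for name, seq in records]
    lines.append("//")
    return "\n".join(lines) + "\n"


# Toy alignment: invented sequences of equal length.
fasta_text = ">E\nMK-LV\n>UR100_A0A023PSW1\nM-ALV\n"
stockholm_text = to_stockholm(read_aligned_fasta(fasta_text))
print(stockholm_text)
```

The equal-length check matters because Stockholm stores an alignment, so ungapped sequences of different lengths cannot be written this way.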
5. Convert from PDB to FASTA
This conversion uses the pp.str module.
Python
Output
28/07/2024 17:00:02 logger: ============>No0. protein 1aig chain L
D:\Programming\anaconda3\envs\prot\Lib\site-packages\Bio\PDB\PDBParser.py:395: PDBConstructionWarning: Ignoring unrecognized record 'END' at line 2234
warnings.warn(
28/07/2024 17:00:02 logger: ===============>successfully converted
28/07/2024 17:00:02 logger: ============>No1. protein 1aij chain L
D:\Programming\anaconda3\envs\prot\Lib\site-packages\Bio\PDB\PDBParser.py:395: PDBConstructionWarning: Ignoring unrecognized record 'END' at line 2234
warnings.warn(
28/07/2024 17:00:02 logger: ===============>successfully converted
28/07/2024 17:00:02 logger: ============>No2. protein 1xqf chain A
D:\Programming\anaconda3\envs\prot\Lib\site-packages\Bio\PDB\PDBParser.py:395: PDBConstructionWarning: Ignoring unrecognized record 'END' at line 2634
warnings.warn(
28/07/2024 17:00:03 logger: ===============>successfully converted
finished
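The warnings in the log indicate that Biopython's PDBParser is involved in the actual conversion. As a dependency-free illustration of the underlying idea, the sketch below reads the ATOM records of one chain using the fixed-column PDB layout and maps three-letter residue names to one-letter codes. It is a simplified stand-in, not the pp.str implementation: it ignores HETATM records, writes 'X' for unknown residue names, and stops at the first model. The helper names and the toy structure are assumptions.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def pdb_chain_to_seq(pdb_text, chain_id):
    """Return the one-letter sequence of one chain from ATOM records."""
    seq, seen = [], set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # first model only
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        if line[21] != chain_id:  # chain ID is column 22 (index 21)
            continue
        res_key = (line[22:26], line[26])  # residue number + insertion code
        if res_key in seen:
            continue  # already counted this residue
        seen.add(res_key)
        seq.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(seq)


def atom_line(serial, atom, res, chain, resseq):
    """Build a truncated ATOM line (no coordinates) for the toy demo."""
    return f"ATOM  {serial:>5} {atom:<4} {res:>3} {chain}{resseq:>4} "


# Toy structure: three residues on chain L, one on chain A.
toy_pdb = "\n".join([
    atom_line(1, "N", "ALA", "L", 1),
    atom_line(2, "CA", "ALA", "L", 1),
    atom_line(3, "N", "LEU", "L", 2),
    atom_line(4, "CA", "LEU", "L", 3),
    atom_line(5, "CA", "MET", "A", 1),
    "END",
])
seq_L = pdb_chain_to_seq(toy_pdb, "L")
print(f">toyL\n{seq_L}")
```

Residues are counted once per residue number and insertion code, so the several atoms of one residue yield a single letter.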