Fetch data#

Cath[1] is a protein structural classification database that currently has more than 100 million protein domains that are classified into thousands of superfamilies.

TMKit provides a module, tmkit.cath, to access the Cath database.

Details about domain and family#

Similarly, we run the following code.

import tmkit as tmk

res = tmk.cath.fetch_by_id(
    id='1cukA01'
)
print(res)

Output#

The detailed information about its domain is made in JSON format as shown below.

{'atom_length': 66, 's35_id': '2.40.50.140.109', 'combs_segments': [{'pdb_code': '1cuk', 'stop': '66', 'chain_code': 'A', 'start': '1'}], 'residues': [{'pdbres': '1', 'seqres': 1, 'aa': 'M'}, {'pdbres': '2', 'seqres': 2, 'aa': 'I'}, {'pdbres': '3', 'seqres': 3, 'aa': 'G'}, {'pdbres': '4', 'seqres': 4, 'aa': 'R'}, {'pdbres': '5', 'seqres': 5, 'aa': 'L'}, {'pdbres': '6', 'seqres': 6, 'aa': 'R'}, {'pdbres': '7', 'seqres': 7, 'aa': 'G'}, {'pdbres': '8', 'seqres': 8, 'aa': 'I'}, {'pdbres': '9', 'seqres': 9, 'aa': 'I'}, {'pdbres': '10', 'seqres': 10, 'aa': 'I'}, {'pdbres': '11', 'seqres': 11, 'aa': 'E'}, {'pdbres': '12', 'seqres': 12, 'aa': 'K'}, {'pdbres': '13', 'seqres': 13, 'aa': 'Q'}, {'pdbres': '14', 'seqres': 14, 'aa': 'P'}, {'pdbres': '15', 'seqres': 15, 'aa': 'P'}, {'pdbres': '16', 'seqres': 16, 'aa': 'L'}, {'pdbres': '17', 'seqres': 17, 'aa': 'V'}, {'pdbres': '18', 'seqres': 18, 'aa': 'L'}, {'pdbres': '19', 'seqres': 19, 'aa': 'I'}, {'pdbres': '20', 'seqres': 20, 'aa': 'E'}, {'pdbres': '21', 'seqres': 21, 'aa': 'V'}, {'pdbres': '22', 'seqres': 22, 'aa': 'G'}, {'pdbres': '23', 'seqres': 23, 'aa': 'G'}, {'pdbres': '24', 'seqres': 24, 'aa': 'V'}, {'pdbres': '25', 'seqres': 25, 'aa': 'G'}, {'pdbres': '26', 'seqres': 26, 'aa': 'Y'}, {'pdbres': '27', 'seqres': 27, 'aa': 'E'}, {'pdbres': '28', 'seqres': 28, 'aa': 'V'}, {'pdbres': '29', 'seqres': 29, 'aa': 'H'}, {'pdbres': '30', 'seqres': 30, 'aa': 'M'}, {'pdbres': '31', 'seqres': 31, 'aa': 'P'}, {'pdbres': '32', 'seqres': 32, 'aa': 'M'}, {'pdbres': '33', 'seqres': 33, 'aa': 'T'}, {'pdbres': '34', 'seqres': 34, 'aa': 'C'}, {'pdbres': '35', 'seqres': 35, 'aa': 'F'}, {'pdbres': '36', 'seqres': 36, 'aa': 'Y'}, {'pdbres': '37', 'seqres': 37, 'aa': 'E'}, {'pdbres': '38', 'seqres': 38, 'aa': 'L'}, {'pdbres': '39', 'seqres': 39, 'aa': 'P'}, {'pdbres': '40', 'seqres': 40, 'aa': 'E'}, {'pdbres': '41', 'seqres': 41, 'aa': 'A'}, {'pdbres': '42', 'seqres': 42, 'aa': 'G'}, {'pdbres': '43', 'seqres': 43, 'aa': 'Q'}, {'pdbres': '44', 'seqres': 44, 'aa': 'E'}, {'pdbres': '45', 'seqres': 45, 'aa': 'A'}, {'pdbres': '46', 'seqres': 46, 'aa': 'I'}, {'pdbres': '47', 'seqres': 47, 'aa': 'V'}, {'pdbres': '48', 'seqres': 48, 'aa': 'F'}, {'pdbres': '49', 'seqres': 49, 'aa': 'T'}, {'pdbres': '50', 'seqres': 50, 'aa': 'H'}, {'pdbres': '51', 'seqres': 51, 'aa': 'F'}, {'pdbres': '52', 'seqres': 52, 'aa': 'V'}, {'pdbres': '53', 'seqres': 53, 'aa': 'V'}, {'pdbres': '54', 'seqres': 54, 'aa': 'R'}, {'pdbres': '55', 'seqres': 55, 'aa': 'E'}, {'pdbres': '56', 'seqres': 56, 'aa': 'D'}, {'pdbres': '57', 'seqres': 57, 'aa': 'A'}, {'pdbres': '58', 'seqres': 58, 'aa': 'Q'}, {'pdbres': '59', 'seqres': 59, 'aa': 'L'}, {'pdbres': '60', 'seqres': 60, 'aa': 'L'}, {'pdbres': '61', 'seqres': 61, 'aa': 'Y'}, {'pdbres': '62', 'seqres': 62, 'aa': 'G'}, {'pdbres': '63', 'seqres': 63, 'aa': 'F'}, {'pdbres': '64', 'seqres': 64, 'aa': 'N'}, {'pdbres': '65', 'seqres': 65, 'aa': 'N'}, {'pdbres': '66', 'seqres': 66, 'aa': 'K'}], 'funfam_number': 58874, 'cath_id': '2.40.50.140.109.1.1.1.1', 'superfamily_id': '2.40.50.140', 'domain_id': '1cukA01', 'pdb_segments': [{'pdb_code': '1cuk', 'stop': '66', 'chain_code': 'A', 'start': '1'}], 'ssg5_number': 12, 'ec_terms': [{'alternative_names': None, 'cofactors': None, 'reaction': 'ATP + H(2)O = ADP + phosphate.', 'comments': "-!- DNA helicases utilize the energy from ATP hydrolysis to unwind double-stranded DNA. -!- Some of them unwind duplex DNA with a 3' to 5' polarity (1,3,5,8), other show 5' to 3' polarity (10,11,12,13) or unwind DNA in both directions (14,15). -!- Some helicases unwind DNA as well as RNA (4,9). -!- May be identical with EC 3.6.4.13 (RNA helicase).", 'description': 'DNA helicase.', 'ec_code': '3.6.4.12', 'uniprot_acc': 'P0A809'}], 'go_terms': [{'term_type': 'biological_process', 'go_acc': 'GO:0000725', 'name': 'recombinational repair', 'evidence': 'IMP', 'definition': 'A DNA repair process that involves the exchange, reciprocal or nonreciprocal, of genetic material between the broken DNA molecule and a homologous region of DNA.'}, {'term_type': 'molecular_function', 'go_acc': 'GO:0009378', 'name': 'four-way junction helicase activity', 'evidence': 'IDA', 'definition': 'Catalysis of the unwinding of the DNA helix of DNA containing four-way junctions, including Holliday junctions.'}, {'term_type': 'cellular_component', 'go_acc': 'GO:0009379', 'name': 'Holliday junction helicase complex', 'evidence': 'IDA', 'definition': 'A DNA helicase complex that forms part of a Holliday junction resolvase complex, where the helicase activity is involved in the migration of the junction branch point. The best-characterized example is the E. coli RuvAB complex, in which a hexamer of RuvB subunits possesses helicase activity that is modulated by association with RuvA.'}, {'term_type': 'biological_process', 'go_acc': 'GO:0009432', 'name': 'SOS response', 'evidence': 'IEP', 'definition': 'An error-prone process for repairing damaged microbial DNA.'}, {'term_type': 'biological_process', 'go_acc': 'GO:0009432', 'name': 'SOS response', 'evidence': 'RCA', 'definition': 'An error-prone process for repairing damaged microbial DNA.'}, {'term_type': 'cellular_component', 'go_acc': 'GO:0048476', 'name': 'Holliday junction resolvase complex', 'evidence': 'IDA', 'definition': 'A protein complex that mediates the conversion of a Holliday junction into two separate duplex DNA molecules; the complex includes a single- or multisubunit helicase that catalyzes the extension of heteroduplex DNA by branch migration and a nuclease that resolves the junction by nucleolytic cleavage.'}], 'pdb_code': '1cuk', 'ssg9_number': 12, 'atom_sequence': 'MIGRLRGIIIEKQPPLVLIEVGGVGYEVHMPMTCFYELPEAGQEAIVFTHFVVREDAQLLYGFNNK', 'combs_sequence': 'MIGRLRGIIIEKQPPLVLIEVGGVGYEVHMPMTCFYELPEAGQEAIVFTHFVVREDAQLLYGFNNK'}

The entire Cath database#

If you would like to access the domain information for all available proteins, we can use the tmk.cath.read function. Before that, we need to retrieve a Cath database. By default, we get the newest version and save it in ./data/cath/ as shown below.

import tmkit as tmk

tmk.cath.download(
    version='newest',
    sv_fp='./data/cath/'
)

The file will be automatically decompressed into ./data/cath/cath-b-newest-all.txt.

import tmkit as tmk

df = tmk.cath.read(
    cath_fpn='./data/cath/cath-b-newest-all.txt',
    groupby='version',
    group='v4_2_0',
)
print(df)

Attributes#

Attribute

Description

version

version of the Cath database

sv_fpn

path to save the Cath database

cath_fpn

path to the downloaded Cath database

groupby

metric used to group data, e.g., version. There are 4 metrics in total, i.e., domain, version, superfamily, and bound

group

value of a metric. For example, if version is chosen, there are two, namely, v4_2_0 and putative

Output#

The result is made as a Pandas dataframe.

======>reading CATH...
======>CATH features are:
=========>No.1: domain
=========>No.2: version
=========>No.3: superfamily
=========>No.4: bound
         domain version  superfamily    bound
0       101mA00  v4_2_0  1.10.490.10  0-153:A
1       102lA00  v4_2_0  1.10.530.40  1-162:A
2       102mA00  v4_2_0  1.10.490.10  0-153:A
3       103lA00  v4_2_0  1.10.530.40  1-162:A
4       103mA00  v4_2_0  1.10.490.10  0-153:A
...         ...     ...          ...      ...
439144  9xiaA00  v4_2_0  3.20.20.150  1-387:A
439145  9ximA00  v4_2_0  3.20.20.150  3-394:A
439146  9ximB00  v4_2_0  3.20.20.150  3-394:B
439147  9ximC00  v4_2_0  3.20.20.150  4-394:C
439148  9ximD00  v4_2_0  3.20.20.150  3-394:D

[439149 rows x 4 columns]