for `chembl.drug_mechanisms`, can we add CHEMBL.TARGET ID mapping to UniProt? #151

colleenXu · 2022-12-28T21:21:27Z

I asked in biothings/pending.api#100 whether we could make a new pending API for the ChEMBL drug mechanisms data: https://mychem.info/v1/query?q=_exists_:%22chembl.drug_mechanisms%22.

One reason was that the current API uses CHEMBL.TARGET IDs (the target entity involved in the drug mechanism of ChEMBL compound X). It looks like the CHEMBL.TARGETs are mostly Gene/Protein entities, so the data would be easier to use with a more universal ID namespace...

Related: Chunlei has found mappings between CHEMBL.TARGET and UniProtKB IDs here biothings/mygene.info#105 (comment)

colleenXu · 2022-12-28T21:28:00Z

Example from Andrew:

The chemical Evacetrapib has a chembl drug mechanism entry that says the target is CHEMBL3572 aka Cholesteryl ester transfer protein (SINGLE PROTEIN in Homo sapiens).

The Chembl website shows a UniProt accession for this CHEMBL3572: P11597. So it seems like CHEMBL.TARGET -> UniProtKB mappings exist.

erikyao · 2023-01-19T18:41:08Z

Related code: https://github.com/biothings/mychem.info/blob/master/src/hub/dataload/sources/chembl/chembl_parser.py#L688

erikyao · 2023-02-02T07:28:52Z

Source Data

It's not necessary to use the csv/tsv that Chunlei suggested. We can collect the target["target_components"]["accession"] fields for each target parsed from the already dumped https://www.ebi.ac.uk/chembl/api/data/target.json files.

E.g. from the dumped target.json files, CHEMBL1995 has only one component, which has the accession ID P43119:

{
    'organism': 'Homo sapiens',
    'pref_name': 'Prostanoid IP receptor',
    'target_chembl_id': 'CHEMBL1995',
    'target_components': [
        {
            'accession': 'P43119',
            'component_description': 'Prostacyclin receptor',
            'component_id': 325,
            'component_type': 'PROTEIN',
            'relationship': 'SINGLE PROTEIN',
        }
    ],
    'target_type': 'SINGLE PROTEIN',
    'tax_id': 9606
}

The result is identical to the accession shown on https://www.ebi.ac.uk/chembl/g/#search_results/all/query=CHEMBL1995
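
A minimal sketch of that collection step, assuming each dumped target record is a dict shaped like the CHEMBL1995 example above. The function name `collect_accessions` is hypothetical; this is not the code in `chembl_parser.py`.

```python
def collect_accessions(target):
    """Return the accession of every component of one ChEMBL target record.

    Assumes `target` is a dict shaped like the CHEMBL1995 example.
    Components whose accession is None or missing are skipped.
    """
    components = target.get("target_components") or []
    return [c["accession"] for c in components if c.get("accession")]


# Trimmed-down copy of the CHEMBL1995 record shown above.
target = {
    "target_chembl_id": "CHEMBL1995",
    "target_components": [
        {"accession": "P43119", "component_type": "PROTEIN"},
    ],
}
print(collect_accessions(target))
```

A target with no components, or with components that carry no accession, simply yields an empty list.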

Special Cases

A ChEMBL target may have zero, one, or multiple component accessions.

Some components have no accession ID (the value is None).
- E.g. CHEMBL2364096
- Solution: Discard those values

Not all accessions found in target.json are UniProt accessions.
- E.g. CHEMBL1615321 has accession ENSG00000207827
- Solution: Use a regex to tell them apart; see UniProt Help: Accession
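
The two solutions above could be sketched roughly as follows. The UniProt pattern is the one published on the UniProt accession help page; the Ensembl gene pattern (`ENSG` plus 11 digits) and the function name `classify_accession` are assumptions made for illustration, not the code on the fix branch.

```python
import re

# UniProt accession pattern as published on the UniProt help page.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumption: human Ensembl gene IDs are "ENSG" followed by 11 digits.
ENSEMBL_GENE_RE = re.compile(r"ENSG[0-9]{11}")


def classify_accession(accession):
    """Return 'uniprot', 'ensembl_gene', or None for an unusable value."""
    if not accession:
        return None  # missing accession: discard
    if UNIPROT_RE.fullmatch(accession):
        return "uniprot"
    if ENSEMBL_GENE_RE.fullmatch(accession):
        return "ensembl_gene"
    return None  # unrecognized namespace


for acc in ("P43119", "ENSG00000207827", None):
    print(acc, classify_accession(acc))
```

Using `fullmatch` rather than `search` keeps a longer string that merely contains an accession-like substring from being accepted.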

The Fix

The initial fix is committed to branch issue-151-fix

Sample document (CHEMBL238804 with target CHEMBL1995 and component P43119):

{
    '_id': 'QXWZQTURMXZVHJ-UHFFFAOYSA-N',
    'chembl': {
        'molecule_chembl_id': 'CHEMBL238804',
        'inchi_key': 'QXWZQTURMXZVHJ-UHFFFAOYSA-N',
        'smiles': 'CC(C)N(CCCCOCC(=O)NS(C)(=O)=O)c1cnc(-c2ccccc2)c(-c2ccccc2)n1',
        'inchi': 'InChI=1S/C26H32N4O4S/c1-20(2)30(16-10-11-17-34-19-24(31)29-35(3,32)33)23-18-27-25(21-12-6-4-7-13-21)26(28-23)22-14-8-5-9-15-22/h4-9,12-15,18,20H,10-11,16-17,19H2,1-3H3,(H,29,31)',
        'drug_indications': [ ... ],
        'drug_mechanisms': [
            {
                'action_type': 'AGONIST',
                'mechanism_refs': [
                    {
                        'id': 'label/2015/207947s000lbl.pdf',
                        'type': 'FDA',
                        'url': 'http://www.accessdata.fda.gov/drugsatfda_docs/label/2015/207947s000lbl.pdf',
                        'FDA': 'label/2015/207947s000lbl.pdf'
                    }
                ],
                'target_chembl_id': 'CHEMBL1995',
                'target_components': ['P43119'],
                'target_type': 'SINGLE PROTEIN',
                'target_organism': 'Homo sapiens',
                'target_name': 'Prostanoid IP receptor'
            }
        ]
    }
}

TODO: field structure

Several questions for @colleenXu:

- Is target_components a good field name? Would you prefer, say, target_component_accessions, or something else?
- Would you prefer a list of raw accession IDs like 'target_components': ['P43119', ...], or something like {'target_components': [{'uniprot': 'P43119'}, {'uniprot': ...}, ...]}, or any other structure?
- Do you want to keep those Ensembl Gene accessions?

colleenXu · 2023-02-02T21:37:39Z

@erikyao

Some questions:

Are the target components always protein/gene IDs? I noticed here that some targets might not be...
- I don't know if that affects the target field name and structure
- I think it's okay to keep the Ensembl gene IDs, but it'll probably be helpful to have a separate field for them, like

target_components: {
    ensembl_gene: [ 'ENSG00000207827', ...],
    uniprot_protein: ['P43119', ...],
    ...   
}

I also notice that some drug mechanisms entries don't seem to have info like target name/ID. Does your new code catch and handle these?

@andrewsu and @newgene may also have thoughts?

colleenXu · 2023-02-02T21:45:24Z

@erikyao Feedback on the ENSEMBL ID part:

The ENSG ID and that entity (microRNA 30a) don't seem to exist in UniProt, which I guess makes sense since it doesn't seem to code for a protein...

I guess the structure I proposed above (which I edited since I made a mistake) would still work...

erikyao · 2023-02-02T21:58:38Z

Thank you, @colleenXu

> Are the target components always protein/gene IDs?

From the data I parsed from https://www.ebi.ac.uk/chembl/api/data/target.json, a component is either None, an Ensembl Gene ID, or a UniProt accession ID.

> I guess the structure I proposed above (which I edited since I made a mistake) would still work...

Sure, I can make it in the parser.
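
A rough sketch of how the grouping could look, using the key names from the structure proposed above (`ensembl_gene`, `uniprot_protein`). It relies on the observation that a component is either None, an Ensembl Gene ID, or a UniProt accession, so the `ENSG` prefix check is an assumption that only holds for this data. The function name `group_components` is hypothetical.

```python
def group_components(accessions):
    """Group raw accessions by namespace, dropping None values.

    Assumes every non-None accession is either an Ensembl gene ID
    (starting with 'ENSG') or a UniProt accession.
    """
    grouped = {}
    for acc in accessions:
        if acc is None:
            continue
        key = "ensembl_gene" if acc.startswith("ENSG") else "uniprot_protein"
        grouped.setdefault(key, []).append(acc)
    return grouped


print(group_components(["P43119", None, "ENSG00000207827"]))
```

Namespaces with no accessions are left out entirely rather than emitted as empty lists, so a target without usable components produces an empty dict.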

erikyao · 2023-02-02T22:01:40Z

> I also notice that some drug mechanisms entries don't seem to have info like target name/ID (biothings/pending.api#100 (comment)). Does your new code catch and handle these?

Nope, my current code cannot fix this problem. Do you have any idea (like a second file/API to fill the blanks)?

colleenXu · 2023-02-03T00:09:51Z

- Do you think the uniprot IDs are always proteins? If not, you can change the field name I proposed from uniprot_protein to uniprot...
- Oops, I didn't put the right link in my post for viewing the 369 chemicals where drug_mechanisms.action_type is null (and it looks like target_chembl_id is also null), so I fixed that. I'm wondering whether these entries make sense to remove entirely, or whether they can be kept and the fields with null values can be removed... I don't know how to fill in those blanks.

erikyao · 2023-02-03T00:14:08Z

> Do you think the uniprot IDs are always proteins?

I am not sure... and I think uniprot is a better prefix.

> I'm wondering whether these entries make sense to remove entirely, or whether they can be kept and the fields with null values can be removed

My latest code will remove those null fields. Those documents will be kept.
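
For illustration only, removing null fields while keeping the entry might look like the sketch below. The function name `drop_null_fields` and the sample entry are hypothetical, and this is not the actual parser code.

```python
def drop_null_fields(mechanism):
    """Return a copy of one drug_mechanisms entry without None-valued fields."""
    return {key: value for key, value in mechanism.items() if value is not None}


# Hypothetical entry resembling the "null" cases discussed above.
entry = {
    "action_type": None,
    "target_chembl_id": None,
    "mechanism_refs": [{"type": "FDA"}],
}
print(drop_null_fields(entry))
```

The entry itself survives with whatever non-null fields it has, which matches keeping the documents while dropping only the empty values.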

colleenXu · 2023-02-03T00:15:15Z

Here are the ChEMBL pages for some chemicals that have this "null" issue

Looks like CHEMBL's website also lacks the target info / any mechanism info.

erikyao · 2023-02-03T00:18:14Z

> Looks like CHEMBL's website also lacks the target info / any mechanism info.

Yep, I think we can first align our documents to the CHEMBL report cards, and then fill in the blanks in the future if necessary.

erikyao · 2023-02-03T23:40:59Z

Side note: I found a possible error in CHEMBL, and have reported it to chembl/GLaDOS#1310

erikyao · 2023-02-13T20:38:34Z

Check the latest release of MyChem for CHEMBL238804: https://mychem.info/v1/chem/CHEMBL238804?fields=chembl

The target_components field has been added.

colleenXu · 2023-02-27T19:56:26Z

Looks good to me. @erikyao, can we close this issue?

erikyao · 2023-02-27T20:04:12Z

@colleenXu issue closed.

One more thing: although the target_components source data contains ensembl_gene entries, none of them is mapped to any of our target records. Therefore, we currently only have target_components with uniprot in our document collection.

colleenXu · 2023-02-28T21:18:55Z

@erikyao I'm confused. I see 60 documents that seem to have both target_components.ensembl fields and target_chembl_id fields? https://mychem.info/v1/query?q=_exists_:%22chembl.drug_mechanisms.target_components.ensembl_gene%22%20AND%20_exists_:%22chembl.drug_mechanisms.target_chembl_id%22&fields=chembl

erikyao · 2023-02-28T21:25:39Z

@colleenXu sorry, my bad. I found that I was analyzing only a subset of the source data... Fortunately, we have both the uniprot and ensembl_gene fields indexed.
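
As a small aid for repeating this kind of check, here is a sketch that builds an `_exists_` query URL like the one in the comment above. Only the endpoint and the `q`/`fields` parameters come from that URL; the helper name `exists_query_url` is hypothetical, and the URL is only constructed here, not fetched.

```python
from urllib.parse import urlencode

BASE_URL = "https://mychem.info/v1/query"


def exists_query_url(*field_paths, return_fields="chembl"):
    """Build a query URL matching documents where every field path exists."""
    q = " AND ".join(f'_exists_:"{path}"' for path in field_paths)
    return f"{BASE_URL}?{urlencode({'q': q, 'fields': return_fields})}"


print(
    exists_query_url(
        "chembl.drug_mechanisms.target_components.ensembl_gene",
        "chembl.drug_mechanisms.target_chembl_id",
    )
)
```

`urlencode` percent-encodes the quotes and colons and turns spaces into `+`, so the result looks slightly different from the hand-written URL but carries the same query.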

erikyao self-assigned this Jan 19, 2023

erikyao added a commit that referenced this issue Feb 4, 2023

adjust target_components structure (see #151 (comment))

7bd7296

erikyao mentioned this issue Feb 4, 2023

Issue 151 fix #154

Merged

erikyao closed this as completed Feb 27, 2023

colleenXu mentioned this issue Feb 27, 2023

more specific operations for MyChem chembl.drug_mechanisms data biothings/pending.api#100

Closed
