for chembl.drug_mechanisms, can we add CHEMBL.TARGET ID mapping to UniProt? #151

Closed
colleenXu opened this issue Dec 28, 2022 · 17 comments
@colleenXu

I asked here biothings/pending.api#100 whether we could make a new pending API for the chembl drug mechanisms data https://mychem.info/v1/query?q=_exists_:%22chembl.drug_mechanisms%22.

One reason was that the current API uses CHEMBL.TARGET IDs (the target entity involved in the drug mechanism of ChEMBL compound X). It looks like the CHEMBL.TARGETs are mostly Gene/Protein entities, so the data would be easier to use with a more universal ID namespace...

Related: Chunlei has found mappings between CHEMBL.TARGET and UniProtKB IDs here biothings/mygene.info#105 (comment)

@colleenXu
Author

colleenXu commented Dec 28, 2022

Example from Andrew:

The chemical Evacetrapib has a chembl drug mechanism entry that says the target is CHEMBL3572 aka Cholesteryl ester transfer protein (SINGLE PROTEIN in Homo sapiens).

The Chembl website shows a UniProt accession for this CHEMBL3572: P11597. So it seems like CHEMBL.TARGET -> UniProtKB mappings exist.

@erikyao erikyao self-assigned this Jan 19, 2023
@erikyao
Contributor

erikyao commented Jan 19, 2023

@erikyao
Contributor

erikyao commented Feb 2, 2023

Source Data

It's not necessary to use the csv/tsv that Chunlei suggested. We can collect the accession field of every entry in target["target_components"] for each target parsed from the already dumped https://www.ebi.ac.uk/chembl/api/data/target.json files.

E.g. from the dumped target.json files, CHEMBL1995 has only one component, which has the accession ID P43119:

{
    'organism': 'Homo sapiens',
    'pref_name': 'Prostanoid IP receptor',
    'target_chembl_id': 'CHEMBL1995',
    'target_components': [
        {
            'accession': 'P43119',
            'component_description': 'Prostacyclin receptor',
            'component_id': 325,
            'component_type': 'PROTEIN',
            'relationship': 'SINGLE PROTEIN',
        }
    ],
    'target_type': 'SINGLE PROTEIN',
    'tax_id': 9606
}

The result is identical to the accession shown on https://www.ebi.ac.uk/chembl/g/#search_results/all/query=CHEMBL1995
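As a rough illustration of the collection step described above, here is a minimal sketch. The helper name `component_accessions` is hypothetical and is not taken from the actual parser; the record is the CHEMBL1995 example shown above.

```python
def component_accessions(target):
    """Return the raw accession value of every component of one target record.

    Hypothetical helper for illustration only; a missing or null
    'target_components' list is treated as "no components".
    """
    components = target.get("target_components") or []
    return [component.get("accession") for component in components]


# The CHEMBL1995 record from the dumped target.json files (see above).
target = {
    "organism": "Homo sapiens",
    "pref_name": "Prostanoid IP receptor",
    "target_chembl_id": "CHEMBL1995",
    "target_components": [
        {
            "accession": "P43119",
            "component_description": "Prostacyclin receptor",
            "component_id": 325,
            "component_type": "PROTEIN",
            "relationship": "SINGLE PROTEIN",
        }
    ],
    "target_type": "SINGLE PROTEIN",
    "tax_id": 9606,
}

print(component_accessions(target))  # ['P43119']
```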

Special Cases

  1. A ChEMBL target may have zero, one, or multiple component accessions.
  2. A component's accession may be None.
  3. Not all accessions found in target.json are UniProt accessions.
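To get a feel for how often the first two special cases occur, one could tally them over the parsed targets. This is only a sketch: `summarize_components` and the toy records below are made up for illustration and are not real ChEMBL data.

```python
from collections import Counter


def summarize_components(targets):
    """Count targets by number of components, plus components whose accession is None."""
    counts = Counter()
    for target in targets:
        components = target.get("target_components") or []
        if not components:
            counts["zero_components"] += 1
        elif len(components) == 1:
            counts["one_component"] += 1
        else:
            counts["multiple_components"] += 1
        # Special case 2: the component exists but carries no accession.
        counts["missing_accession"] += sum(
            1 for component in components if component.get("accession") is None
        )
    return counts


# Toy records (hypothetical IDs, for illustration only).
toy_targets = [
    {"target_chembl_id": "T1", "target_components": []},
    {"target_chembl_id": "T2", "target_components": [{"accession": "P43119"}]},
    {"target_chembl_id": "T3", "target_components": [{"accession": None},
                                                     {"accession": "P11597"}]},
]
summary = summarize_components(toy_targets)
print(dict(summary))
```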

The Fix

The initial fix is committed to branch issue-151-fix

Sample document (CHEMBL238804 with target CHEMBL1995 and component P43119):

{
    '_id': 'QXWZQTURMXZVHJ-UHFFFAOYSA-N',
    'chembl': {
        'molecule_chembl_id': 'CHEMBL238804',
        'inchi_key': 'QXWZQTURMXZVHJ-UHFFFAOYSA-N',
        'smiles': 'CC(C)N(CCCCOCC(=O)NS(C)(=O)=O)c1cnc(-c2ccccc2)c(-c2ccccc2)n1',
        'inchi': 'InChI=1S/C26H32N4O4S/c1-20(2)30(16-10-11-17-34-19-24(31)29-35(3,32)33)23-18-27-25(21-12-6-4-7-13-21)26(28-23)22-14-8-5-9-15-22/h4-9,12-15,18,20H,10-11,16-17,19H2,1-3H3,(H,29,31)',
        'drug_indications': [ ... ],
        'drug_mechanisms': [
            {
                'action_type': 'AGONIST',
                'mechanism_refs': [
                    {
                        'id': 'label/2015/207947s000lbl.pdf',
                        'type': 'FDA',
                        'url': 'http://www.accessdata.fda.gov/drugsatfda_docs/label/2015/207947s000lbl.pdf',
                        'FDA': 'label/2015/207947s000lbl.pdf'
                    }
                ],
                'target_chembl_id': 'CHEMBL1995',
                'target_components': ['P43119'],
                'target_type': 'SINGLE PROTEIN',
                'target_organism': 'Homo sapiens',
                'target_name': 'Prostanoid IP receptor'
            }
        ]
    }
}

TODO: field structure

Several questions to @colleenXu :

  1. Is target_components a good field name? Would you prefer, say target_component_accessions, or something else?
  2. Would you prefer a list of raw accession IDs like 'target_components': ['P43119', ...], or something like {'target_components': [{'uniprot': 'P43119'}, {'uniprot': ...}, ...]}, or any other structure?
  3. Do you want to keep those Ensembl Gene accessions?

@colleenXu
Author

colleenXu commented Feb 2, 2023

@erikyao

Some questions:

  • Are the target components always protein/gene IDs? I noticed here that some targets might not be...
    • I don't know if that affects the target field name and structure
    • I think it's okay to keep ensembl gene IDs, but it'll probably be helpful to have a separate field for them like
target_components: {
    ensembl_gene: [ 'ENSG00000207827', ...],
    uniprot_protein: ['P43119', ...],
    ...   
}

@andrewsu and @newgene may also have thoughts?

@colleenXu
Author

@erikyao Feedback on the ENSEMBL ID part:

The ENSG ID and that entity (microRNA 30a) don't seem to exist in UniProt, which I guess makes sense since it doesn't seem to code for a protein...

I guess the structure I proposed above (which I edited since I made a mistake) would still work...

@erikyao
Contributor

erikyao commented Feb 2, 2023

Thank you, @colleenXu

Are the target components always protein/gene IDs?

From the data I parsed from https://www.ebi.ac.uk/chembl/api/data/target.json, a component is either None, an Ensembl Gene ID, or a UniProt accession ID.

I guess the structure I proposed above (which I edited since I made a mistake) would still work...

Sure, I can make it in the parser.
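A minimal sketch of how the proposed grouped structure could be built, not the actual parser code. It assumes, per the observation above, that every non-None accession is either an Ensembl Gene ID or a UniProt accession, and it uses a simple "ENS" prefix check as a stand-in for real ID validation. The function name `group_accessions` is hypothetical, and the key names follow the field names discussed in this thread.

```python
def group_accessions(accessions):
    """Group raw accessions by ID namespace, skipping None values.

    Assumption: anything that starts with 'ENS' is an Ensembl Gene ID and
    everything else is a UniProt accession. UniProt accessions have a digit
    in the second position, so they cannot start with 'ENS'.
    """
    grouped = {}
    for accession in accessions:
        if accession is None:
            continue
        key = "ensembl_gene" if accession.startswith("ENS") else "uniprot"
        grouped.setdefault(key, []).append(accession)
    return grouped


print(group_accessions(["P43119", None, "ENSG00000207827"]))
# {'uniprot': ['P43119'], 'ensembl_gene': ['ENSG00000207827']}
```

Only namespaces that actually occur appear as keys, so a target with no usable accession yields an empty dict.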

@erikyao
Contributor

erikyao commented Feb 2, 2023

I also notice that biothings/pending.api#100 (comment). Does your new code catch and handle these?

Nope, my current code cannot fix this problem. Do you have any idea (like a second file/API to fill the blanks)?

@colleenXu
Author

colleenXu commented Feb 3, 2023

  • Do you think the uniprot IDs are always proteins? If not, you can change the field name I proposed from uniprot_protein to uniprot...
  • Oops I didn't put the right link for viewing the 369 chemicals where the drug_mechanisms.action_type is null (and it looks like the target_chembl_id are also null) in my post so I fixed that. I'm wondering whether these entries make sense to remove entirely, or whether they can be kept and the fields with null values can be removed...I don't know how to fill in those blanks

@erikyao
Contributor

erikyao commented Feb 3, 2023

Do you think the uniprot IDs are always proteins?

I am not sure... and I think uniprot is a better prefix.

I'm wondering whether these entries make sense to remove entirely, or whether they can be kept and the fields with null values can be removed

My latest code will remove those null fields. Those documents will be kept.
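One way such null-field removal could look is sketched below. It is not the code on the branch: `drop_null_fields` is a hypothetical name, and dropping None items inside lists is an assumption of this sketch.

```python
def drop_null_fields(value):
    """Recursively drop dict entries (and list items) whose value is None.

    The enclosing document is always kept; only the null fields disappear.
    """
    if isinstance(value, dict):
        return {k: drop_null_fields(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_null_fields(v) for v in value if v is not None]
    return value


# Toy document with the "null" issue (placeholder values, not real data).
doc = {
    "_id": "EXAMPLE-KEY",
    "chembl": {
        "molecule_chembl_id": "CHEMBL0000",
        "drug_mechanisms": [{"action_type": None, "target_chembl_id": None}],
    },
}
cleaned = drop_null_fields(doc)
print(cleaned)
```

In this sketch a mechanism entry whose fields are all null ends up as an empty dict; whether such empty entries should also be pruned is a separate decision.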

@colleenXu
Author

colleenXu commented Feb 3, 2023

Here are the ChEMBL pages for some chemicals that have this "null" issue

Looks like CHEMBL's website also lacks the target info / any mechanism info.

@erikyao
Contributor

erikyao commented Feb 3, 2023

Looks like CHEMBL's website also lacks the target info / any mechanism info.

Yep, I think we can first align our documents with the CHEMBL report cards, and then fill in the blanks in the future if necessary.

@erikyao
Contributor

erikyao commented Feb 3, 2023

Side note: I found a possible error in CHEMBL, and have reported it to chembl/GLaDOS#1310

@erikyao
Contributor

erikyao commented Feb 13, 2023

Check the latest release of MyChem for CHEMBL238804: https://mychem.info/v1/chem/CHEMBL238804?fields=chembl

The target_components field has been added.

@colleenXu
Author

Looks good to me. @erikyao, can we close this issue?

@erikyao
Contributor

erikyao commented Feb 27, 2023

@colleenXu issue closed.

One more thing: although we have target_components source data containing ensembl_gene entries, none of them is mapped to any of our target records. Therefore currently we only have target_components with uniprot in our document collection.

@colleenXu
Author

@erikyao I'm confused. I see 60 documents that seem to have both target_components.ensembl fields and target_chembl_id fields? https://mychem.info/v1/query?q=_exists_:%22chembl.drug_mechanisms.target_components.ensembl_gene%22%20AND%20_exists_:%22chembl.drug_mechanisms.target_chembl_id%22&fields=chembl
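For anyone who wants to rebuild this kind of check, here is a small sketch that assembles an `_exists_` query URL like the one above. The helper name `exists_query_url` is hypothetical; the code only builds and re-parses the URL and does not send a request.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def exists_query_url(field_paths, fields="chembl",
                     base="https://mychem.info/v1/query"):
    """Build a query URL requiring that every given field path exists."""
    q = " AND ".join(f'_exists_:"{path}"' for path in field_paths)
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base}?{urlencode({'q': q, 'fields': fields}, quote_via=quote)}"


url = exists_query_url([
    "chembl.drug_mechanisms.target_components.ensembl_gene",
    "chembl.drug_mechanisms.target_chembl_id",
])

# Round-trip check: decoding the URL gives back the intended query string.
decoded_q = parse_qs(urlsplit(url).query)["q"][0]
print(decoded_q)
```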

@erikyao
Contributor

erikyao commented Feb 28, 2023

@colleenXu sorry, my bad. I found I was analyzing only a subset of the source data... Fortunately, we have both the uniprot and ensembl fields indexed.
