[ensembl-dev] VEP - protein "domains" results with the REST API and the command line tool

Thu Jun 23 09:40:13 BST 2022

Hi Likhitha,

many thanks for your prompt reply. I thought I was already using the
Ensembl transcript cache, as I wasn't using the `--refseq` command line
switch. Nevertheless, I installed the cache again and re-ran VEP, but I
still see the same result in the domains output.

To install the new cache + fasta, I did the following:

    $ perl INSTALL.pl --AUTO c --CACHEDIR ../../vep_106 --SPECIES "homo_sapiens" --ASSEMBLY GRCh38
    WARNING: DBD::mysql module not found. VEP can only run in offline (--offline) mode without DBD::mysql installed

    http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#requirements
     - getting list of available cache files
     - downloading ftp://ftp.ensembl.org/pub/release-106/variation/indexed_vep_cache/homo_sapiens_vep_106_GRCh38.tar.gz
     - unpacking homo_sapiens_vep_106_GRCh38.tar.gz
     - converting cache, this may take some time but will allow VEP to look up variants and frequency data much faster
     - use CTRL-C to cancel if you do not wish to convert this cache now (you may run convert_cache.pl later)
    2022-06-22 18:11:42 - Processing homo_sapiens
    2022-06-22 18:11:42 - Processing version 106_GRCh38
    2022-06-22 18:11:42 - No unprocessed types remaining, skipping
    2022-06-22 18:11:42 - All done!

    All done

    $ perl INSTALL.pl --AUTO f --CACHEDIR ../../vep_106 --SPECIES "homo_sapiens" --ASSEMBLY GRCh38
    WARNING: DBD::mysql module not found. VEP can only run in offline (--offline) mode without DBD::mysql installed

    http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#requirements
     - downloading Homo_sapiens.GRCh38.dna.toplevel.fa.gz

    All done
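
To double-check which cache flavours are installed, a small sketch like the one below could list them. It assumes the usual VEP layout, in which the Ensembl transcript cache sits under `<dir_cache>/homo_sapiens/<version>_<assembly>` and the RefSeq and merged caches under `homo_sapiens_refseq` and `homo_sapiens_merged`. The helper name `installed_caches` is hypothetical and not part of VEP.

```python
from pathlib import Path

# Suffix VEP appends to the species directory for each cache flavour
# (assumed standard layout; adjust if your installation differs).
FLAVOURS = {"": "ensembl", "_refseq": "refseq", "_merged": "merged"}


def installed_caches(cache_dir, species="homo_sapiens",
                     version="106", assembly="GRCh38"):
    """Return the sorted cache flavours found under cache_dir (hypothetical helper)."""
    found = []
    for suffix, label in FLAVOURS.items():
        candidate = Path(cache_dir) / f"{species}{suffix}" / f"{version}_{assembly}"
        if candidate.is_dir():
            found.append(label)
    return sorted(found)
```

For the setup in this thread it would be called as `installed_caches("/opt/bioResources/vep_106")`.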

Then I ran VEP:

    vep --offline --cache --assembly GRCh38 \
        --dir_cache /opt/bioResources/vep_106 \
        --fasta /opt/bioResources/vep_106/homo_sapiens/106_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
        --input_file T790M.vcf --json --output_file T790M.vep.json \
        --force_overwrite --domains

Is this the right way to do it? I'm still not getting protein domain
information from any database (including Pfam) other than
`ENSP_mappings`...
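
To make the difference concrete, the `domains` entries in the two JSON outputs can be grouped by source database and compared. The following is only a minimal sketch: it assumes each entry in a `domains` list is an object with `db` and `name` keys, and that the command-line `--json` output holds one JSON object per line. The function names are hypothetical.

```python
import json
from collections import defaultdict


def domains_by_db(record):
    """Group domain names by source database for one VEP JSON record."""
    grouped = defaultdict(set)
    for consequence in record.get("transcript_consequences", []):
        # Transcripts without overlapping domains have no "domains" key.
        for domain in consequence.get("domains", []):
            grouped[domain["db"]].add(domain["name"])
    return {db: sorted(names) for db, names in grouped.items()}


def load_cli_records(path):
    """Read command-line --json output, assumed to be one JSON object per line."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

The REST response is a JSON array, so `domains_by_db` would be applied to each element of it. Comparing the keys of the two results shows which databases are missing from the offline run.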

Many thanks,
Pedro

On Wed, 22 Jun 2022 at 09:12, Likhitha Surapaneni <likhithas at ebi.ac.uk>
wrote:

> Hi Pedro,
>
> I am sorry to hear that you are facing an issue with the VEP command
> line tool.
>
> Could you please confirm whether you were using the RefSeq cache? The
> RefSeq cache lacks some classes of data present in the Ensembl
> transcript cache, one of them being protein domains
> (https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#refseq).
>
> Could you please try with the Ensembl transcript cache and see if you
> are facing the same issue?
>
> Hope this helps and please let me know if you have further questions.
>
> Thanks and regards,
>
> Likhitha
>
> On 21/06/2022 18:01, Pedro Almeida wrote:
> > Hi all,
> >
> > I've been trying to get information on overlapping protein domains for
> > one variant using VEP, but it looks as if the REST API returns more
> > domains than the command line tool. Domains here means the output of
> > the command line switch `--domains`, which, as far as I can tell, is
> > the same as `domains=1` with the `GET vep/:species/id/:id` API request.
> >
> > For example, for this single variant I'm using for testing, EGFR
> > T790M, with the GET method above
> > `https://rest.ensembl.org/vep/human/id/rs121434569?domains=1&content-type=application/json`
> > the `domains` list of the `transcript_consequences` object, lists
> > several ENSP_mappings and also information from CDD, Pfam,
> > PROSITE_profiles, and others. I'm more interested in the Pfam
> > information, which in this case corresponds to a protein tyrosine and
> > serine/threonine kinase, PF07714.
> >
> > However, when I run this same variant in the command line (using a VCF
> > file with this single variant as input), I can only obtain information
> > from the ENSP_mappings, but all other databases appear to be missing.
> > The command used was the following:
> >
> > ```
> > vep --domains --dir_cache /opt/bioResources/vep_106/ \
> >   --fasta /opt/bioResources/vep_106/homo_sapiens_refseq/106_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
> >   --input_file T790M.vcf --output_file T790M.vep.json --cache --offline \
> >   --json --force_overwrite
> > ```
> >
> > Does anyone know if this is expected, or how to get the same output of
> > the REST API (regarding the list of protein domains) when using the
> > command line tool? Are custom annotations needed for these cases?
> >
> > Many thanks,
> > Pedro
> >
> >
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > Posting guidelines and subscribe/unsubscribe info:
> > https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> > Ensembl Blog: http://www.ensembl.info/
>