[ensembl-dev] variant_effect_predictor and gtf2vep version 82, usage with personnal genome

Mon Nov 2 14:08:51 GMT 2015

Hi Maria,

Thanks for your email. I'll try my best to answer your questions inline
below. A couple of things to bear in mind:

1) The GTF/GFF parser is still in development. It originally supported only
GTF, and a strict subset of GTF at that. We had multiple requests to
support GFF as well, hence where we are now. As a result there may be
lingering issues, and as you've noted the documentation is currently very
sparse and incomplete.

2) GFF is a poorly defined specification and is not well adhered to in many
cases, making it difficult to parse consistently and reliably

3) The GFF parser was written to support parsing GFF files as produced by
Ensembl (for Ensembl genes) and NCBI (for RefSeq genes). The NCBI GFF in
particular has many quirks that make it more complicated to parse; this
explains the sometimes odd logic in the script.

On 30 October 2015 at 15:33, Maria <maria.bernard at jouy.inra.fr> wrote:

>
> Dear developpers,
>
> I am a bioinformatic engineer working on various species, and some of them
> are not in EnsEMBL database (such as the trout genome, for now).
> Biologists are interested in the annotation of their variant, and we given
> them the possibility to use Variant_effect_predictor API via our Galaxy
> instance.
>
> For non EnsEMBL species I use gtf2vep to construct the "annotation
> database" before VEP. Here are the command lines I use:
> gtf2vep.pl -i genome.gff -f genome.fa -s genomeName --dir
> database_directory
> variant_effect_predictor.pl --species genomeName -i genome.vcf --format
> vcf -o genome.vep --offline --no_stats --dir_cache database_directory
> --verbose
>
> I have 3 different questions/comments about the use of these two programs
> together. I will take the trout and human genomes to illustrate.
>
>    1. *GFF format specifications*
>
> I clearly understand the GFF/GFT format asked. But I could not find
> anywhere the minimum necessary informations and the way to express them in
> each field of the annotation file.
>
> Trout annotation file (that you could find here :
> <http://www.genoscope.cns.fr/trout/data/>
> http://www.genoscope.cns.fr/trout/data/), looks like this:
>
> seq_name    source    feature_type    start    stop    score    strand
> frame attribute
> chr_1    Gaze    Gene    94130    196267    36.8621    +    .    Gene
> GSONMG00071408001
> chr_1    Gaze    mRNA    94130    196267    36.8621    +    .    mRNA
> GSONMT00071408001
> chr_1    Gaze    CDS    94130    94318    6.0000    +    .    CDS
> GSONMT00071408001
> chr_1    Gaze    CDS    115360    115542    10.5000    +    .    CDS
> GSONMT00071408001
> chr_1    Gaze    CDS    116586    116839    11.0000    +    .    CDS
> GSONMT00071408001
> chr_1    Gaze    CDS    154563    154705    5.0000    +    .    CDS
> GSONMT00071408001
> chr_1    Gaze    CDS    195996    196267    6.0000    +    .    CDS
> GSONMT00071408001
>
> Here is what I founded in the gtf2vep code (version 81), maybe I passed
> through someones.
>
>    - particular feature_type are read and the other are excluded:
>
> included : gene transcript mrna rrna trna ncrna primary_transcript
> c_gene_segment d_gene_segment vd_gene_segment j_gene_segment v_gene_segment
> lincrna_gene mirna_gene rrna_gene snorna_gene  snrna_gene lincrna mirna
> snorna snrna nmd_transcript_variant pseudogene processed_pseudogene
> pseudogenic_transcript aberrant_processed_transcript nc_primary_transcript
> processed_transcript gene_segment exon cds
>

Correct. Note that some types are effectively duplicates of others to
support the different GFF specifications produced by Ensembl and NCBI.

>
>    - for transcript lines, particular attributes are necessery, and some
>    are optionnal.
>       - necesary : gene_id , transcript_id, transcript_biotype (if
>       feature_type is not a particular transcript type)
>
> gene_id is not required. If gene entries are present in the file (they may
be used to store gene symbols, for example), they can be referred to by
"gene_id" or by "parent_id"

transcript_id may also be "ID" or "name" on transcript lines.

>
>    - optionnal : transcript_biotype (if feature type is a particular
>       transcript type such as mRNA), transcript_protein ...
>
>
>    - "protein_coding" transcript type (indicated as mRNA in feature type
>    column or whith "protein_coding" as transcript_type attribute), must be
>    reliated with exon/CDS line(s).
>
> Correct. The logic to infer biotype can be very complex, especially with
the NCBI GFF which does not use a consistent way of describing them. If you
have biotype data available then the script will work better if you enter
it as a transcript_biotype attribute, though make sure to use biotypes from
the Sequence Ontology.

>
>
>
>    - CDS lines must overlap exon lines and must have a frame (0,1,or 2
>    and not "." in column 8), and (not shure) must have a transcript_id
>    attribute (other attribute are optionnal)
>
>
>    - exon lines must have at least a transcript_id attribute.
>
> Yes; the parent_id attribute may also be used to identify the relevant
transcript for both exon and CDS lines.

Furthermore, exon and CDS lines must occur in the file _after_ their parent
transcript entries. It looks from your example data like you are following
this pragma.

>
>
> So finally, my trout GFF file looks now like this:
>
> chr_1    Gaze    Gene    94130    196267    36.8621    +    .    gene_id
> GSONMG00071408001
> chr_1    Gaze    mRNA    94130    196267    36.8621    +    .
> transcript_id GSONMT00071408001
> chr_1    Gaze    exon    94130    94318    6.0000    +    .
> transcript_id GSONMT00071408001
> chr_1    Gaze    CDS    94130    94318    6.0000    +    0
> transcript_id GSONMT00071408001
> chr_1    Gaze    exon    115360    115542    10.5000    +    .
> transcript_id GSONMT00071408001
> chr_1    Gaze    CDS    115360    115542    10.5000    +    0
> transcript_id GSONMT00071408001
> chr_1    Gaze    exon    116586    116839    11.0000    +    .
> transcript_id GSONMT00071408001
> chr_1    Gaze    CDS    116586    116839    11.0000    +    0
> transcript_id GSONMT00071408001
> chr_1    Gaze    exon    154563    154705    5.0000    +    .
> transcript_id GSONMT00071408001
> chr_1    Gaze    CDS    154563    154705    5.0000    +    0
> transcript_id GSONMT00071408001
> chr_1    Gaze    exon    195996    196267    6.0000    +    .
> transcript_id GSONMT00071408001
> chr_1    Gaze    CDS    195996    196267    6.0000    +    0
> transcript_id GSONMT00071408001
>
> As I said, I found those informations by reading gtf2vep.pl code version
> 81. Do I miss something (may be for gene lines, or gene_id in transcript
> line ...)?
> So it would be very helpfull if you can publish those kind of
> informations, because I am not shure to take time to read future API
> versions code to find out.
>
You've done a great job at interpreting it, and I do apologise for the lack
of documentation. We will aim to rectify this for the next Ensembl release.

>     *2. format of sequence name in the fasta file*
>
> As you can see, the sequence of the trout genome are currently named
> chr_1, chr_Un1, .... and chrUn (it will certainly change when included in
> EnsEMBL).
> It seem's that having the presence of "chr" causes VEP to crash. I made
> some tests (on version 81 and 82) with or without "chr" in the gff, and/or
> fasta, and/or vcf files.
>
>    - GTF2VEP tests
>
> gtf2vep works fine if you give these combinations:
>     - GFF and FASTA with "chr" ==> database called trout_chr,
>     - GFF and FASTA without "chr" ==> database called trout,
>     - GFF with "chr" and FASTA without "chr" ==> database called trout_gchr
>
> The constructed database stores fasta name sequences and change the gff
> sequence name (if it does not correspond to the fasta sequence names) by
> deleting « chr » at the begining.
> But this is not working if fasta contains « chr » and gff not.
>
>    - VEP tests
>
> For the 3 precedent databases, I tested VEP whith a VCF file containing or
> not "chr" in the sequence names.
>
> VEP looks for sequence names whithout "chr" even if « chr » is specified
> in the VCF. So the full pipeline works only if the fasta file does not
> contain « chr » in the fasta sequence name (gff name does not matter).
>
> Why sequence names are not simply stored as they are indicated, and
> softwares crash if GFF FASTA and VCF are not corresponding to each other?
> Maybe I missed something, but could you also indicate those kind of
> information somewhere in the documentation. (shame on me if you already
> have ...)
>

Ensembl prefers to work in a system without "chr" prefixed to the names.
This creates unfortunate side effects as you've seen here; if your FASTA
and GFF have "chr" prefixes, the cache will be created with "chr" names but
the VEP won't be able to use it as it trims "chr" from the chromosome names
of your input variant data before looking them up.

We'll look into ways of making this easier to deal with.

>
>     *3. **how VEP deals with rsid in the VCF file*
>
> Last thing I wanted to comment is the way rsid are stored in VEP output.
>
> If VCF file (for an EnsEMBL genomie) does not contain rsID,
> variant_effect_predictor may add it in the output file in the
> "Existing_variation" column.
> exemple on the human genome:
>
>     #CHROM    POS    *ID*    REF    ALT    QUAL    FILTER    INFO
> FORMAT    Pickrell    Sample1
>     chr12    939302    *.*    A    G    .    PASS
> AC=1;ADP=587;AF=0.500;AN=2;HET=1;HOM=0;NC=0;WT=0;set=Varscan
> GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP    ./.
> 0/1:30:213:85:128:587:36,29%:99:3,2364E-75:27:370:149:221:651
>
> become after VEP (one line exemple among many others)
>
>     *#Uploaded_variation*    Location    Allele    Gene    Feature
> Feature_type    Consequence    cDNA_position    CDS_position
> Protein_position    Amino_acids    Codons    *Existing_variation*    Extra
>     *12_939302_A/G*    12:939302    G    ENSG00000002016
> ENST00000544742    Transcript    intron_variant    -    -    -    -    -
> *rs139381800*    IMPACT=MODIFIER;STRAND=-1
>
> but if the variant caller used we already annotated variant with existing
> rsID, rsID are managed differently:
>
>     #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
> FORMAT    Pickrell    Sample1
>     chr12    406292    *rs2229351*    G    A    1002.77    PASS
> AC=2;ADP=163;AF=0.500;AN=4;BaseQRankSum=3.502;DB;DP=69;Dels=0.00;FS=4.853;HET=1;HOM=0;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQ0=0;MQRankSum=0.943;NC=0;QD=14.53;ReadPosRankSum=-1.219;WT=0;set=Intersection
> GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PL:PVAL:RBQ:RD:RDF:RDR:SDP
> 0/1:.:33,36:.:.:66:.:99:1031,0,999
> 0/1:26:39:21:18:163:23,93%:99:.:1,3713E-13:30:124:66:58:169
> become (only one line result)
>
>     *#Uploaded_variation*    Location    Allele    Gene    Feature
> Feature_type    Consequence    cDNA_position    CDS_position
> Protein_position    Amino_acids    Codons    *Existing_variation*    Extra
> *    rs2229351*    12:406292    A    ENSG00000120647
> ENST00000535052    Transcript    intron_variant    -    -    -    -    -   *
> -*    IMPACT=MODIFIER;STRAND=1
>
> First case, uploaded variation id is construct with sequence name,
> position and alleles, and rsID is added in Existing_variation column.
> Second case, uploaded variation id is the known rsID, and
> Existing_variation stays empty.
> Would it be possible to stay in case 1 even if the VCF file includes the
> rsID ?
>

This should work for your second example; I get the following:

> echo "chr12    406292    rs2229351    G    A" | perl
variant_effect_predictor.pl -force -cache -check_existing -port 3337 -o
stdout | grep -v # | head -n1
rs2229351       12:406292       A       ENSG00000073614 ENST00000382815
Transcript      synonymous_variant      4512    4149    1383    S   tcC/tcT
 rs2229351       IMPACT=LOW;STRAND=-1

Perhaps you have not enabled --check_existing, or you are using the wrong
genome (this appears to be a location from GRCh37; the VEP default is to
use GRCh38)?

The ID you enter in your input has no bearing on what gets looked up when
using --check_existing; the two are treated independently.

>
> In any cases thank you for those API which evolved since the last few
> versions. A little bit more documentation would be great (not a funny task,
> I know), some behaviour corrections that facilitate my life would be very
> very very much appreciated but I know that we can't satisfy every one.
>

Thanks for sticking with it, we are always working to improve and hopefully
we will have some improved GFF-parsing documentation for the next Ensembl
release.

Regards

Will McLaren
Ensembl Variation

>
> Again thank you for these tools.
>
> Best Regards
>
> --
> Maria Bernard
> Bioinformatics
>
> Sigenae team
> GABI Unity
>
> INRA de Jouy en Josas - batiment 320
> Domaine de Vilvert
> 78352 Jouy en josas
> FRANCE
>
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151102/19ffce30/attachment.html>