[ensembl-dev] variant_effect_predictor and gtf2vep version 82, usage with personnal genome

Fri Oct 30 15:33:32 GMT 2015

Dear developpers,

I am a bioinformatic engineer working on various species, and some of 
them are not in EnsEMBL database (such as the trout genome, for now).
Biologists are interested in the annotation of their variant, and we 
given them the possibility to use Variant_effect_predictor API via our 
Galaxy instance.

For non EnsEMBL species I use gtf2vep to construct the "annotation 
database" before VEP. Here are the command lines I use:
gtf2vep.pl -i genome.gff -f genome.fa -s genomeName --dir database_directory
variant_effect_predictor.pl --species genomeName -i genome.vcf --format 
vcf -o genome.vep --offline --no_stats --dir_cache database_directory 
--verbose

I have 3 different questions/comments about the use of these two 
programs together. I will take the trout and human genomes to illustrate.

 1. _GFF format specifications_

I clearly understand the GFF/GFT format asked. But I could not find 
anywhere the minimum necessary informations and the way to express them 
in each field of the annotation file.

Trout annotation file (that you could find here : 
http://www.genoscope.cns.fr/trout/data/), looks like this:

seq_name source    feature_type    start    stop    score    strand 
frame attribute
chr_1    Gaze    Gene    94130    196267    36.8621    + .    Gene 
GSONMG00071408001
chr_1    Gaze    mRNA    94130    196267    36.8621    + .    mRNA 
GSONMT00071408001
chr_1    Gaze    CDS    94130    94318    6.0000    +    . CDS 
GSONMT00071408001
chr_1    Gaze    CDS    115360    115542    10.5000    + .    CDS 
GSONMT00071408001
chr_1    Gaze    CDS    116586    116839    11.0000    + .    CDS 
GSONMT00071408001
chr_1    Gaze    CDS    154563    154705    5.0000    + .    CDS 
GSONMT00071408001
chr_1    Gaze    CDS    195996    196267    6.0000    + .    CDS 
GSONMT00071408001

Here is what I founded in the gtf2vep code (version 81), maybe I passed 
through someones.

  * particular feature_type are read and the other are excluded:

included : gene transcript mrna rrna trna ncrna primary_transcript 
c_gene_segment d_gene_segment vd_gene_segment j_gene_segment 
v_gene_segment lincrna_gene mirna_gene rrna_gene snorna_gene  snrna_gene 
lincrna mirna snorna snrna nmd_transcript_variant pseudogene 
processed_pseudogene pseudogenic_transcript 
aberrant_processed_transcript nc_primary_transcript processed_transcript 
gene_segment exon cds

  * for transcript lines, particular attributes are necessery, and some
    are optionnal.
      o necesary : gene_id , transcript_id, transcript_biotype (if
        feature_type is not a particular transcript type)
      o optionnal : transcript_biotype (if feature type is a particular
        transcript type such as mRNA), transcript_protein ...

  * "protein_coding" transcript type (indicated as mRNA in feature type
    column or whith "protein_coding" as transcript_type attribute), must
    be reliated with exon/CDS line(s).

  * CDS lines must overlap exon lines and must have a frame (0,1,or 2
    and not "." in column 8), and (not shure) must have a transcript_id
    attribute (other attribute are optionnal)

  * exon lines must have at least a transcript_id attribute.

So finally, my trout GFF file looks now like this:

chr_1 Gaze    Gene    94130    196267    36.8621    +    . gene_id 
GSONMG00071408001
chr_1    Gaze    mRNA    94130    196267    36.8621 +    .    
transcript_id GSONMT00071408001
chr_1    Gaze    exon    94130    94318    6.0000    + .    
transcript_id GSONMT00071408001
chr_1    Gaze    CDS    94130    94318    6.0000    + 0    transcript_id 
GSONMT00071408001
chr_1    Gaze    exon    115360    115542    10.5000 +    .    
transcript_id GSONMT00071408001
chr_1    Gaze    CDS    115360    115542    10.5000 +    0    
transcript_id GSONMT00071408001
chr_1    Gaze    exon    116586    116839    11.0000 +    .    
transcript_id GSONMT00071408001
chr_1    Gaze    CDS    116586    116839    11.0000 +    0    
transcript_id GSONMT00071408001
chr_1    Gaze    exon    154563    154705    5.0000 +    .    
transcript_id GSONMT00071408001
chr_1    Gaze    CDS    154563    154705    5.0000    + 0    
transcript_id GSONMT00071408001
chr_1    Gaze    exon    195996    196267    6.0000 +    .    
transcript_id GSONMT00071408001
chr_1    Gaze    CDS    195996    196267    6.0000    + 0    
transcript_id GSONMT00071408001

As I said, I found those informations by reading gtf2vep.pl code version 
81. Do I miss something (may be for gene lines, or gene_id in transcript 
line ...)?
So it would be very helpfull if you can publish those kind of 
informations, because I am not shure to take time to read future API 
versions code to find out.

_2. format of sequence name in the fasta file_

As you can see, the sequence of the trout genome are currently named 
chr_1, chr_Un1, .... and chrUn (it will certainly change when included 
in EnsEMBL).
It seem's that having the presence of "chr" causes VEP to crash. I made 
some tests (on version 81 and 82) with or without "chr" in the gff, 
and/or fasta, and/or vcf files.

  * GTF2VEP tests

gtf2vep works fine if you give these combinations:
     - GFF and FASTA with "chr" ==> database called trout_chr,
     - GFF and FASTA without "chr" ==> database called trout,
     - GFF with "chr" and FASTA without "chr" ==> database called trout_gchr

The constructed database stores fasta name sequences and change the gff 
sequence name (if it does not correspond to the fasta sequence names) by 
deleting « chr » at the begining.
But this is not working if fasta contains « chr » and gff not.

  * VEP tests

For the 3 precedent databases, I tested VEP whith a VCF file containing 
or not "chr" in the sequence names.

VEP looks for sequence names whithout "chr" even if « chr » is specified 
in the VCF. So the full pipeline works only if the fasta file does not 
contain « chr » in the fasta sequence name (gff name does not matter).

Why sequence names are not simply stored as they are indicated, and 
softwares crash if GFF FASTA and VCF are not corresponding to each 
other? Maybe I missed something, but could you also indicate those kind 
of information somewhere in the documentation. (shame on me if you 
already have ...)

_3. __how VEP deals with rsid in the VCF file_

Last thing I wanted to comment is the way rsid are stored in VEP output.

If VCF file (for an EnsEMBL genomie) does not contain rsID, 
variant_effect_predictor may add it in the output file in the 
"Existing_variation" column.
exemple on the human genome:

     #CHROM    POS *ID* REF    ALT    QUAL    FILTER    INFO    
FORMAT    Pickrell Sample1
     chr12    939302 *.* A    G    .    PASS 
AC=1;ADP=587;AF=0.500;AN=2;HET=1;HOM=0;NC=0;WT=0;set=Varscan 
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP    ./. 
0/1:30:213:85:128:587:36,29%:99:3,2364E-75:27:370:149:221:651

become after VEP (one line exemple among many others)

*#Uploaded_variation* Location    Allele    Gene    Feature    
Feature_type Consequence    cDNA_position    CDS_position 
Protein_position    Amino_acids    Codons *Existing_variation*    Extra
*12_939302_A/G* 12:939302    G    ENSG00000002016    ENST00000544742 
Transcript    intron_variant    -    -    -    -    - *rs139381800* 
IMPACT=MODIFIER;STRAND=-1

but if the variant caller used we already annotated variant with 
existing rsID, rsID are managed differently:

     #CHROM    POS ID    REF    ALT    QUAL    FILTER    INFO    FORMAT 
Pickrell    Sample1
     chr12    406292 *rs2229351* G    A    1002.77    PASS 
AC=2;ADP=163;AF=0.500;AN=4;BaseQRankSum=3.502;DB;DP=69;Dels=0.00;FS=4.853;HET=1;HOM=0;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQ0=0;MQRankSum=0.943;NC=0;QD=14.53;ReadPosRankSum=-1.219;WT=0;set=Intersection 
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PL:PVAL:RBQ:RD:RDF:RDR:SDP 
0/1:.:33,36:.:.:66:.:99:1031,0,999 
0/1:26:39:21:18:163:23,93%:99:.:1,3713E-13:30:124:66:58:169
become (only one line result)

*#Uploaded_variation* Location    Allele    Gene    Feature    
Feature_type Consequence    cDNA_position    CDS_position 
Protein_position    Amino_acids    Codons *Existing_variation*    Extra
*    rs2229351* 12:406292    A    ENSG00000120647    ENST00000535052 
Transcript    intron_variant    -    -    -    -    - *-*    
IMPACT=MODIFIER;STRAND=1

First case, uploaded variation id is construct with sequence name, 
position and alleles, and rsID is added in Existing_variation column.
Second case, uploaded variation id is the known rsID, and 
Existing_variation stays empty.
Would it be possible to stay in case 1 even if the VCF file includes the 
rsID ?

In any cases thank you for those API which evolved since the last few 
versions. A little bit more documentation would be great (not a funny 
task, I know), some behaviour corrections that facilitate my life would 
be very very very much appreciated but I know that we can't satisfy 
every one.

Again thank you for these tools.

Best Regards

-- 
Maria Bernard
Bioinformatics

Sigenae team
GABI Unity

INRA de Jouy en Josas - batiment 320
Domaine de Vilvert
78352 Jouy en josas
FRANCE

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151030/e3bd0d13/attachment.html>