[ensembl-dev] variant_effect_predictor and gtf2vep version 82, usage with personnal genome
Maria
maria.bernard at jouy.inra.fr
Fri Oct 30 15:33:32 GMT 2015
Dear developpers,
I am a bioinformatic engineer working on various species, and some of
them are not in EnsEMBL database (such as the trout genome, for now).
Biologists are interested in the annotation of their variant, and we
given them the possibility to use Variant_effect_predictor API via our
Galaxy instance.
For non EnsEMBL species I use gtf2vep to construct the "annotation
database" before VEP. Here are the command lines I use:
gtf2vep.pl -i genome.gff -f genome.fa -s genomeName --dir database_directory
variant_effect_predictor.pl --species genomeName -i genome.vcf --format
vcf -o genome.vep --offline --no_stats --dir_cache database_directory
--verbose
I have 3 different questions/comments about the use of these two
programs together. I will take the trout and human genomes to illustrate.
1. _GFF format specifications_
I clearly understand the GFF/GFT format asked. But I could not find
anywhere the minimum necessary informations and the way to express them
in each field of the annotation file.
Trout annotation file (that you could find here :
http://www.genoscope.cns.fr/trout/data/), looks like this:
seq_name source feature_type start stop score strand
frame attribute
chr_1 Gaze Gene 94130 196267 36.8621 + . Gene
GSONMG00071408001
chr_1 Gaze mRNA 94130 196267 36.8621 + . mRNA
GSONMT00071408001
chr_1 Gaze CDS 94130 94318 6.0000 + . CDS
GSONMT00071408001
chr_1 Gaze CDS 115360 115542 10.5000 + . CDS
GSONMT00071408001
chr_1 Gaze CDS 116586 116839 11.0000 + . CDS
GSONMT00071408001
chr_1 Gaze CDS 154563 154705 5.0000 + . CDS
GSONMT00071408001
chr_1 Gaze CDS 195996 196267 6.0000 + . CDS
GSONMT00071408001
Here is what I founded in the gtf2vep code (version 81), maybe I passed
through someones.
* particular feature_type are read and the other are excluded:
included : gene transcript mrna rrna trna ncrna primary_transcript
c_gene_segment d_gene_segment vd_gene_segment j_gene_segment
v_gene_segment lincrna_gene mirna_gene rrna_gene snorna_gene snrna_gene
lincrna mirna snorna snrna nmd_transcript_variant pseudogene
processed_pseudogene pseudogenic_transcript
aberrant_processed_transcript nc_primary_transcript processed_transcript
gene_segment exon cds
* for transcript lines, particular attributes are necessery, and some
are optionnal.
o necesary : gene_id , transcript_id, transcript_biotype (if
feature_type is not a particular transcript type)
o optionnal : transcript_biotype (if feature type is a particular
transcript type such as mRNA), transcript_protein ...
* "protein_coding" transcript type (indicated as mRNA in feature type
column or whith "protein_coding" as transcript_type attribute), must
be reliated with exon/CDS line(s).
* CDS lines must overlap exon lines and must have a frame (0,1,or 2
and not "." in column 8), and (not shure) must have a transcript_id
attribute (other attribute are optionnal)
* exon lines must have at least a transcript_id attribute.
So finally, my trout GFF file looks now like this:
chr_1 Gaze Gene 94130 196267 36.8621 + . gene_id
GSONMG00071408001
chr_1 Gaze mRNA 94130 196267 36.8621 + .
transcript_id GSONMT00071408001
chr_1 Gaze exon 94130 94318 6.0000 + .
transcript_id GSONMT00071408001
chr_1 Gaze CDS 94130 94318 6.0000 + 0 transcript_id
GSONMT00071408001
chr_1 Gaze exon 115360 115542 10.5000 + .
transcript_id GSONMT00071408001
chr_1 Gaze CDS 115360 115542 10.5000 + 0
transcript_id GSONMT00071408001
chr_1 Gaze exon 116586 116839 11.0000 + .
transcript_id GSONMT00071408001
chr_1 Gaze CDS 116586 116839 11.0000 + 0
transcript_id GSONMT00071408001
chr_1 Gaze exon 154563 154705 5.0000 + .
transcript_id GSONMT00071408001
chr_1 Gaze CDS 154563 154705 5.0000 + 0
transcript_id GSONMT00071408001
chr_1 Gaze exon 195996 196267 6.0000 + .
transcript_id GSONMT00071408001
chr_1 Gaze CDS 195996 196267 6.0000 + 0
transcript_id GSONMT00071408001
As I said, I found those informations by reading gtf2vep.pl code version
81. Do I miss something (may be for gene lines, or gene_id in transcript
line ...)?
So it would be very helpfull if you can publish those kind of
informations, because I am not shure to take time to read future API
versions code to find out.
_2. format of sequence name in the fasta file_
As you can see, the sequence of the trout genome are currently named
chr_1, chr_Un1, .... and chrUn (it will certainly change when included
in EnsEMBL).
It seem's that having the presence of "chr" causes VEP to crash. I made
some tests (on version 81 and 82) with or without "chr" in the gff,
and/or fasta, and/or vcf files.
* GTF2VEP tests
gtf2vep works fine if you give these combinations:
- GFF and FASTA with "chr" ==> database called trout_chr,
- GFF and FASTA without "chr" ==> database called trout,
- GFF with "chr" and FASTA without "chr" ==> database called trout_gchr
The constructed database stores fasta name sequences and change the gff
sequence name (if it does not correspond to the fasta sequence names) by
deleting « chr » at the begining.
But this is not working if fasta contains « chr » and gff not.
* VEP tests
For the 3 precedent databases, I tested VEP whith a VCF file containing
or not "chr" in the sequence names.
VEP looks for sequence names whithout "chr" even if « chr » is specified
in the VCF. So the full pipeline works only if the fasta file does not
contain « chr » in the fasta sequence name (gff name does not matter).
Why sequence names are not simply stored as they are indicated, and
softwares crash if GFF FASTA and VCF are not corresponding to each
other? Maybe I missed something, but could you also indicate those kind
of information somewhere in the documentation. (shame on me if you
already have ...)
_3. __how VEP deals with rsid in the VCF file_
Last thing I wanted to comment is the way rsid are stored in VEP output.
If VCF file (for an EnsEMBL genomie) does not contain rsID,
variant_effect_predictor may add it in the output file in the
"Existing_variation" column.
exemple on the human genome:
#CHROM POS *ID* REF ALT QUAL FILTER INFO
FORMAT Pickrell Sample1
chr12 939302 *.* A G . PASS
AC=1;ADP=587;AF=0.500;AN=2;HET=1;HOM=0;NC=0;WT=0;set=Varscan
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP ./.
0/1:30:213:85:128:587:36,29%:99:3,2364E-75:27:370:149:221:651
become after VEP (one line exemple among many others)
*#Uploaded_variation* Location Allele Gene Feature
Feature_type Consequence cDNA_position CDS_position
Protein_position Amino_acids Codons *Existing_variation* Extra
*12_939302_A/G* 12:939302 G ENSG00000002016 ENST00000544742
Transcript intron_variant - - - - - *rs139381800*
IMPACT=MODIFIER;STRAND=-1
but if the variant caller used we already annotated variant with
existing rsID, rsID are managed differently:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
Pickrell Sample1
chr12 406292 *rs2229351* G A 1002.77 PASS
AC=2;ADP=163;AF=0.500;AN=4;BaseQRankSum=3.502;DB;DP=69;Dels=0.00;FS=4.853;HET=1;HOM=0;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQ0=0;MQRankSum=0.943;NC=0;QD=14.53;ReadPosRankSum=-1.219;WT=0;set=Intersection
GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PL:PVAL:RBQ:RD:RDF:RDR:SDP
0/1:.:33,36:.:.:66:.:99:1031,0,999
0/1:26:39:21:18:163:23,93%:99:.:1,3713E-13:30:124:66:58:169
become (only one line result)
*#Uploaded_variation* Location Allele Gene Feature
Feature_type Consequence cDNA_position CDS_position
Protein_position Amino_acids Codons *Existing_variation* Extra
* rs2229351* 12:406292 A ENSG00000120647 ENST00000535052
Transcript intron_variant - - - - - *-*
IMPACT=MODIFIER;STRAND=1
First case, uploaded variation id is construct with sequence name,
position and alleles, and rsID is added in Existing_variation column.
Second case, uploaded variation id is the known rsID, and
Existing_variation stays empty.
Would it be possible to stay in case 1 even if the VCF file includes the
rsID ?
In any cases thank you for those API which evolved since the last few
versions. A little bit more documentation would be great (not a funny
task, I know), some behaviour corrections that facilitate my life would
be very very very much appreciated but I know that we can't satisfy
every one.
Again thank you for these tools.
Best Regards
--
Maria Bernard
Bioinformatics
Sigenae team
GABI Unity
INRA de Jouy en Josas - batiment 320
Domaine de Vilvert
78352 Jouy en josas
FRANCE
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20151030/e3bd0d13/attachment.html>
More information about the Dev
mailing list