[ensembl-dev] Custom Annotation
Derek Conkle-Gutierrez
dconklegutierrez at sdsu.edu
Tue Feb 13 17:35:20 GMT 2018
Hello,
Adding those elements to our GFF worked. Thank you.
________________________________
From: Laurent Gil <lgil at ebi.ac.uk>
Sent: Tuesday, February 13, 2018 6:26:53 AM
To: Ensembl developers list; Derek Conkle-Gutierrez
Subject: Re: [ensembl-dev] Custom Annotation
Dear Derek,
Your GFF file is missing the "biotype" and the "parent" parameters for the CDS lines.
e.g. using your input example:
NC_000962.3 Modlin et. al. 2018 CDS 1 1524 . + . ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
Furthermore, you need to add "exon" line(s) after the CDS line (and using the "parent" attribute), e.g.:
NC_000962.3 Modlin et. al. 2018 CDS 1 1524 . + . ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
NC_000962.3 Modlin et. al. 2018 exon 1 1524 . + . ID=exon1;parent=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
We will try to improve our documentation regarding the GFF files in VEP.
Best regards,
Laurent
Ensembl Variation
On 12/02/2018 19:10, Derek Conkle-Gutierrez wrote:
Hello,
I work for Dr. Faramarz Valafar at San Diego State University. Previously we have used Ensembl's VEP program on our vcf files of Mycobacterium tuberculosis sequences, using annotation from a cache file downloaded from your website. However, recently we have developed additional annotations (mostly from running I-TASSER on ambiguously annotated genes) that we would like to include. To that end I converted our custom annotation file to a GFF3 format, and followed your website's instructions for running VEP with that as the annotation source. This ran, but unfortunately it identified every variant as intergenic, even when they were within one of our annotated CDS features. I assume this is due to a formatting error on my part with our GFF file, though I've been following the specifications described here https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
I'm using ensembl-vep version 91.3
Here's a bit of our gff:
NC_000962.3 Modlin et. al. 2018 gene 1 1524 . + . ID=gene1;locus_tag=Rv0001;alias=dnaA;experiment=DESCRIPTION:Mutation analysis, gene expression[PMID: 10375628];Dbxref=GeneID:885041
NC_000962.3 Modlin et. al. 2018 CDS 1 1524 . + . ID=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
NC_000962.3 Modlin et. al. 2018 gene 2052 3260 . + . ID=gene2;locus_tag=Rv0002;alias=dnaN;Dbxref=GeneID:887092
NC_000962.3 Modlin et. al. 2018 CDS 2052 3260 . + . ID=CDS2;locus_tag=Rv0002;product=DNA polymerase III (beta chain) DnaN (DNA nucleotidyltransferase);note=FunctionalCategory: information pathways
Here's a bit of our test input vcf:
##fileformat=VCFv4.0
##source=pbhooverV1.0.0a8
##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting reads">
##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting reads">
##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FILTER=<ID=LOW,Description="Position with too low of depth">
##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
##FILTER=<ID=HETERO,Description="Enough support to call reference and variant (mixed population)">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 8 . A AGT 7.22 NO RSR=4;VSR=1;VF=0.2;DP=5
etc.
1 2050 . C CC 11.36 NO RSR=15;VSR=2;VF=0.117647058824;DP=21
1 2051 . A AA 13.36 NO RSR=18;VSR=2;VF=0.1;DP=20
1 2051 . AA A 15.75 NO RSR=20;VSR=1;VF=0.047619047619;DP=21
1 2052 . AT A 15.22 NO RSR=19;VSR=1;VF=0.05;DP=21
1 2053 . TG T 33.02 NO RSR=15;VSR=2;VF=0.117647058824;DP=18
1 2054 . GG G 45.75 NO RSR=17;VSR=3;VF=0.15;DP=21
1 2056 . A ACG 12.46 NO RSR=17;VSR=1;VF=0.0555555555556;DP=21
1 2057 . C CAC 13.28 NO RSR=20;VSR=1;VF=0.047619047619;DP=21
I used these commands on our gff and vcf files:
grep -v "#" mannotation-with-computation-4vep.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | tabix/bgzip -c > mannotation-with-computation-4vep.gff.gz
tabix/tabix -p gff mannotation-with-computation-4vep.gff.gz
ensembl-vep/vep --force_overwrite --synonyms synonyms-hyp.txt --format vcf --vcf --species mycobacterium_tuberculosis --symbol --variant_class --flag_pick --everything -i test1-0006.vcf -gff mannotation-with-computation-4vep.gff.gz -fasta H37Rv.fasta.gz -o test1-0006-annotated2.vcf
And the output vcf looks like this:
##fileformat=VCFv4.0
##source=pbhooverV1.0.0a8
##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting reads">
##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting reads">
##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FILTER=<ID=LOW,Description="Position with too low of depth">
##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
##FILTER=<ID=HETERO,Description="Enough support to call reference and variant (mixed population)">
##VEP="v91" time="2018-02-12 10:57:23" ensembl-variation=91.c78d8b4 ensembl-funcgen=91.4681d69 ensembl=91.18ee742 ensembl-io=91.923d668
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SOURCE|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|mannotation-with-computation.gff.gz">
##INFO=<ID=mannotation-with-computation.gff.gz,Number=.,Type=String,Description="mannotation-with-computation.gff.gz (overlap)">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 8 . A AGT 7.22 NO RSR=4;VSR=1;VF=0.2;DP=5;CSQ=GT|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
etc.
1 2050 . C CC 11.36 NO RSR=15;VSR=2;VF=0.117647058824;DP=21;CSQ=C|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
1 2051 . A AA 13.36 NO RSR=18;VSR=2;VF=0.1;DP=20;CSQ=A|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
1 2051 . AA A 15.75 NO RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1 2052 . AT A 15.22 NO RSR=19;VSR=1;VF=0.05;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1 2053 . TG T 33.02 NO RSR=15;VSR=2;VF=0.117647058824;DP=18;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1 2054 . GG G 45.75 NO RSR=17;VSR=3;VF=0.15;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1 2056 . A ACG 12.46 NO RSR=17;VSR=1;VF=0.0555555555556;DP=21;CSQ=CG|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
1 2057 . C CAC 13.28 NO RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=AC|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
I've also tried adding 'Is_circular=true' to the attribute column of the first entry in the GFF, replacing 'locus_tag' with 'Name', and capitalizing 'alias', in case those deviations from the format described in the GFF documentation were the problem. I also added 'biotype' attributes to the GFF, after seeing this discussion in the forums: http://lists.ensembl.org/pipermail/dev/2018-January/012867.html, though I was unsure if that advice was meant for the GFF or VCF, or whether it was applicable to whole genome reads vs transcripts. None of this changed the resulting output.
Do you have an example of a GFF annotation file that's worked with VEP, so I can compare it with ours to see what I've done wrong? Or is there a tool we can use to create our own cache files?
Thank you for your assistance.
_______________________________________________
Dev mailing list Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20180213/8fe07aa4/attachment.html>
More information about the Dev
mailing list