[ensembl-dev] Custom Annotation
Laurent Gil
lgil at ebi.ac.uk
Tue Feb 13 14:26:53 GMT 2018
Dear Derek,
Your GFF file is missing the "biotype" and "parent" attributes on the
CDS lines.
For example, using your input:
NC_000962.3 Modlin et. al. 2018 CDS 1 1524 . + . ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
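To find every CDS record affected by this, the check can be sketched in Python. This is a minimal sketch and not part of VEP: the function names are hypothetical, and it assumes a tab-separated, nine-column GFF3 file. Attribute keys are compared case-insensitively because the GFF3 specification spells the reserved attribute "Parent" while this thread writes "parent".

```python
# Attributes this thread says each CDS line needs (compared in lower case).
REQUIRED_CDS_ATTRIBUTES = ("parent", "biotype")


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict with lower-cased keys."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip().lower()] = value.strip()
    return attributes


def missing_cds_attributes(gff_lines):
    """Return (ID, missing keys) for each CDS record lacking a required attribute."""
    problems = []
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # Only well-formed CDS records are checked.
        if len(columns) != 9 or columns[2] != "CDS":
            continue
        attributes = parse_attributes(columns[8])
        missing = [key for key in REQUIRED_CDS_ATTRIBUTES if key not in attributes]
        if missing:
            problems.append((attributes.get("id", "?"), missing))
    return problems
```

Calling missing_cds_attributes(open("annotation.gff")) on an uncompressed file (the file name here is only a placeholder) returns an empty list when every CDS line carries both attributes.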
Furthermore, you need to add one or more "exon" lines after each CDS
line, each with a "parent" attribute that refers to the CDS, e.g.:
NC_000962.3 Modlin et. al. 2018 CDS 1 1524 . + . ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
NC_000962.3 Modlin et. al. 2018 exon 1 1524 . + . ID=exon1;parent=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
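For a file with many CDS records, the two changes described above could be applied with a short script. The following is a minimal Python sketch under stated assumptions, not a VEP tool: all function and parameter names are hypothetical. It assumes tab-separated GFF3 input, and it assumes each CDS shares a "locus_tag" with its gene (as in the example file) so that the gene ID can be looked up. Each CDS is given a single exon with the same coordinates. The parent attribute is written in lower case as in this thread; the hypothetical parent_key argument allows "Parent", the spelling used by the GFF3 specification.

```python
def split_attributes(column9):
    """Return the attribute column as an ordered list of (key, value) pairs."""
    pairs = []
    for field in column9.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            pairs.append((key.strip(), value.strip()))
    return pairs


def join_attributes(pairs):
    """Serialise (key, value) pairs back into a GFF3 attribute column."""
    return ";".join(f"{key}={value}" for key, value in pairs)


def lookup(pairs, name):
    """Case-insensitive attribute lookup; returns None when absent."""
    for key, value in pairs:
        if key.lower() == name.lower():
            return value
    return None


def add_parents_and_exons(gff_lines, biotype="protein_coding", parent_key="parent"):
    """Add parent/biotype to CDS records and emit one exon record after each CDS."""
    records = [line.rstrip("\n") for line in gff_lines]

    # First pass: map each locus_tag to the ID of its gene record.
    gene_by_locus = {}
    for record in records:
        columns = record.split("\t")
        if len(columns) == 9 and columns[2] == "gene":
            pairs = split_attributes(columns[8])
            locus = lookup(pairs, "locus_tag")
            if locus is not None:
                gene_by_locus[locus] = lookup(pairs, "ID")

    # Second pass: rewrite CDS records; everything else is copied unchanged.
    output = []
    exon_number = 0
    for record in records:
        columns = record.split("\t")
        if record.startswith("#") or len(columns) != 9 or columns[2] != "CDS":
            output.append(record)
            continue
        pairs = split_attributes(columns[8])
        cds_id = lookup(pairs, "ID")
        if cds_id is None:
            # Without an ID the exon cannot refer to this CDS; leave it alone.
            output.append(record)
            continue
        locus = lookup(pairs, "locus_tag")
        gene_id = gene_by_locus.get(locus) if locus is not None else None
        added = []
        if lookup(pairs, parent_key) is None and gene_id is not None:
            added.append((parent_key, gene_id))
        if lookup(pairs, "biotype") is None:
            added.append(("biotype", biotype))
        # Place the new attributes directly after ID, as in the example above.
        position = next((i + 1 for i, (key, _) in enumerate(pairs) if key.lower() == "id"), 0)
        cds_pairs = pairs[:position] + added + pairs[position:]
        output.append("\t".join(columns[:8] + [join_attributes(cds_pairs)]))

        # The exon copies the CDS coordinates and descriptive attributes.
        exon_number += 1
        skipped = ("id", parent_key.lower(), "biotype")
        exon_pairs = [("ID", f"exon{exon_number}"), (parent_key, cds_id)]
        exon_pairs += [(key, value) for key, value in pairs if key.lower() not in skipped]
        output.append("\t".join(columns[:2] + ["exon"] + columns[3:8] + [join_attributes(exon_pairs)]))
    return output
```

The returned list holds one record per element, without trailing newlines, so it can be written out with "\n".join(...) before sorting, compressing and indexing the file again.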
We will try to improve our documentation regarding the GFF files in VEP.
Best regards,
Laurent
Ensembl Variation
On 12/02/2018 19:10, Derek Conkle-Gutierrez wrote:
>
> Hello,
>
> I work for Dr. Faramarz Valafar at San Diego State University.
> Previously we have used Ensembl's VEP program on our vcf files of
> Mycobacterium tuberculosis sequences, using annotation from a cache
> file downloaded from your website. However, recently we have developed
> additional annotations (mostly from running I-TASSER on ambiguously
> annotated genes) that we would like to include. To that end I
> converted our custom annotation file to a GFF3 format, and followed
> your website's instructions for running VEP with that as the
> annotation source. This ran, but unfortunately it identified every
> variant as intergenic, even when they were within one of our annotated
> CDS features. I assume this is due to a formatting error on my part
> with our GFF file, though I've been following the specifications
> described here
> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
>
> I'm using ensembl-vep version 91.3
> Here's a bit of our gff:
> NC_000962.3 Modlin et. al. 2018 gene 1 1524 . + . ID=gene1;locus_tag=Rv0001;alias=dnaA;experiment=DESCRIPTION:Mutation analysis, gene expression[PMID: 10375628];Dbxref=GeneID:885041
> NC_000962.3 Modlin et. al. 2018 CDS 1 1524 . + . ID=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
> NC_000962.3 Modlin et. al. 2018 gene 2052 3260 . + . ID=gene2;locus_tag=Rv0002;alias=dnaN;Dbxref=GeneID:887092
> NC_000962.3 Modlin et. al. 2018 CDS 2052 3260 . + . ID=CDS2;locus_tag=Rv0002;product=DNA polymerase III (beta chain) DnaN (DNA nucleotidyltransferase);note=FunctionalCategory: information pathways
>
> Here's a bit of our test input vcf:
> ##fileformat=VCFv4.0
> ##source=pbhooverV1.0.0a8
> ##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting reads">
> ##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting reads">
> ##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
> ##FILTER=<ID=LOW,Description="Position with too low of depth">
> ##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
> ##FILTER=<ID=HETERO,Description="Enough support to call reference and variant (mixed population)">
> #CHROM POS ID REF ALT QUAL FILTER INFO
> 1 8 . A AGT 7.22 NO RSR=4;VSR=1;VF=0.2;DP=5
> etc.
> 1 2050 . C CC 11.36 NO RSR=15;VSR=2;VF=0.117647058824;DP=21
> 1 2051 . A AA 13.36 NO RSR=18;VSR=2;VF=0.1;DP=20
> 1 2051 . AA A 15.75 NO RSR=20;VSR=1;VF=0.047619047619;DP=21
> 1 2052 . AT A 15.22 NO RSR=19;VSR=1;VF=0.05;DP=21
> 1 2053 . TG T 33.02 NO RSR=15;VSR=2;VF=0.117647058824;DP=18
> 1 2054 . GG G 45.75 NO RSR=17;VSR=3;VF=0.15;DP=21
> 1 2056 . A ACG 12.46 NO RSR=17;VSR=1;VF=0.0555555555556;DP=21
> 1 2057 . C CAC 13.28 NO RSR=20;VSR=1;VF=0.047619047619;DP=21
>
>
> I used these commands on our gff and vcf files:
> grep -v "#" mannotation-with-computation-4vep.gff | sort -k1,1 -k4,4n
> -k5,5n -t$'\t' | tabix/bgzip -c > mannotation-with-computation-4vep.gff.gz
> tabix/tabix -p gff mannotation-with-computation-4vep.gff.gz
> ensembl-vep/vep --force_overwrite --synonyms synonyms-hyp.txt --format
> vcf --vcf --species mycobacterium_tuberculosis --symbol
> --variant_class --flag_pick --everything -i test1-0006.vcf -gff
> mannotation-with-computation-4vep.gff.gz -fasta H37Rv.fasta.gz -o
> test1-0006-annotated2.vcf
>
> And the output vcf looks like this:
> ##fileformat=VCFv4.0
> ##source=pbhooverV1.0.0a8
> ##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting
> reads">
> ##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting
> reads">
> ##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
> ##FILTER=<ID=LOW,Description="Position with too low of depth">
> ##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
> ##FILTER=<ID=HETERO,Description="Enough support to call reference and
> variant (mixed population)">
> ##VEP="v91" time="2018-02-12 10:57:23" ensembl-variation=91.c78d8b4
> ensembl-funcgen=91.4681d69 ensembl=91.18ee742 ensembl-io=91.923d668
> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence
> annotations from Ensembl VEP. Format:
> Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SOURCE|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|mannotation-with-computation.gff.gz">
> ##INFO=<ID=mannotation-with-computation.gff.gz,Number=.,Type=String,Description="mannotation-with-computation.gff.gz
> (overlap)">
> #CHROM POS ID REF ALT QUAL FILTER INFO
> 1 8 . A AGT 7.22 NO
> RSR=4;VSR=1;VF=0.2;DP=5;CSQ=GT|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> etc.
> 1 2050 . C CC 11.36 NO
> RSR=15;VSR=2;VF=0.117647058824;DP=21;CSQ=C|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> 1 2051 . A AA 13.36 NO
> RSR=18;VSR=2;VF=0.1;DP=20;CSQ=A|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> 1 2051 . AA A 15.75 NO
> RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1 2052 . AT A 15.22 NO
> RSR=19;VSR=1;VF=0.05;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1 2053 . TG T 33.02 NO
> RSR=15;VSR=2;VF=0.117647058824;DP=18;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1 2054 . GG G 45.75 NO
> RSR=17;VSR=3;VF=0.15;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1 2056 . A ACG 12.46 NO
> RSR=17;VSR=1;VF=0.0555555555556;DP=21;CSQ=CG|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> 1 2057 . C CAC 13.28 NO
> RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=AC|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
>
> I've also tried adding 'Is_circular=true' to the attribute column of
> the first entry in the GFF, replacing 'locus_tag' with 'Name', and
> capitalizing 'alias', in case those deviations from the format
> described in the GFF documentation were the problem. I also added
> 'biotype' attributes to the GFF, after seeing this discussion in the
> forums:
> http://lists.ensembl.org/pipermail/dev/2018-January/012867.html,
> though I was unsure if that advice was meant for the GFF or VCF, or
> whether it was applicable to whole genome reads vs transcripts. None
> of this changed the resulting output.
>
> Do you have an example of a GFF annotation file that's worked with
> VEP, so I can compare it with ours to see what I've done wrong? Or is
> there a tool we can use to create our own cache files?
>
> Thank you for your assistance.
>
>
>