[ensembl-dev] Custom Annotation

Derek Conkle-Gutierrez dconklegutierrez at sdsu.edu
Tue Feb 13 17:35:20 GMT 2018


Hello,


Adding those elements to our GFF worked. Thank you.

________________________________
From: Laurent Gil <lgil at ebi.ac.uk>
Sent: Tuesday, February 13, 2018 6:26:53 AM
To: Ensembl developers list; Derek Conkle-Gutierrez
Subject: Re: [ensembl-dev] Custom Annotation


Dear Derek,


Your GFF file is missing the "biotype" and the "parent" parameters for the CDS lines.

e.g. using your input example:

NC_000962.3    Modlin et. al. 2018    CDS    1    1524    .    +    .    ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways


Furthermore, you need to add "exon" line(s) after the CDS line (and using the "parent" attribute), e.g.:

NC_000962.3    Modlin et. al. 2018    CDS    1    1524    .    +    .    ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways

NC_000962.3    Modlin et. al. 2018    exon    1    1524    .    +    .    ID=exon1;parent=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways


We will try to improve our documentation regarding the GFF files in VEP.


Best regards,

Laurent
Ensembl Variation


On 12/02/2018 19:10, Derek Conkle-Gutierrez wrote:

Hello,

I work for Dr. Faramarz Valafar at San Diego State University. Previously we have used Ensembl's VEP program on our vcf files of Mycobacterium tuberculosis sequences, using annotation from a cache file downloaded from your website. However, recently we have developed additional annotations (mostly from running I-TASSER on ambiguously annotated genes) that we would like to include. To that end I converted our custom annotation file to a GFF3 format, and followed your website's instructions for running VEP with that as the annotation source. This ran, but unfortunately it identified every variant as intergenic, even when they were within one of our annotated CDS features. I assume this is due to a formatting error on my part with our GFF file, though I've been following the specifications described here https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I'm using ensembl-vep version 91.3
Here's a bit of our gff:
NC_000962.3    Modlin et. al. 2018    gene    1    1524    .    +    .    ID=gene1;locus_tag=Rv0001;alias=dnaA;experiment=DESCRIPTION:Mutation analysis, gene expression[PMID: 10375628];Dbxref=GeneID:885041
NC_000962.3    Modlin et. al. 2018    CDS    1    1524    .    +    .    ID=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
NC_000962.3    Modlin et. al. 2018    gene    2052    3260    .    +    .    ID=gene2;locus_tag=Rv0002;alias=dnaN;Dbxref=GeneID:887092
NC_000962.3    Modlin et. al. 2018    CDS    2052    3260    .    +    .    ID=CDS2;locus_tag=Rv0002;product=DNA polymerase III (beta chain) DnaN (DNA nucleotidyltransferase);note=FunctionalCategory: information pathways

Here's a bit of our test input vcf:
##fileformat=VCFv4.0
##source=pbhooverV1.0.0a8
##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting reads">
##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting reads">
##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FILTER=<ID=LOW,Description="Position with too low of depth">
##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
##FILTER=<ID=HETERO,Description="Enough support to call reference and variant (mixed population)">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
1    8    .    A    AGT    7.22    NO    RSR=4;VSR=1;VF=0.2;DP=5
etc.
1    2050    .    C    CC    11.36    NO    RSR=15;VSR=2;VF=0.117647058824;DP=21
1    2051    .    A    AA    13.36    NO    RSR=18;VSR=2;VF=0.1;DP=20
1    2051    .    AA    A    15.75    NO    RSR=20;VSR=1;VF=0.047619047619;DP=21
1    2052    .    AT    A    15.22    NO    RSR=19;VSR=1;VF=0.05;DP=21
1    2053    .    TG    T    33.02    NO    RSR=15;VSR=2;VF=0.117647058824;DP=18
1    2054    .    GG    G    45.75    NO    RSR=17;VSR=3;VF=0.15;DP=21
1    2056    .    A    ACG    12.46    NO    RSR=17;VSR=1;VF=0.0555555555556;DP=21
1    2057    .    C    CAC    13.28    NO    RSR=20;VSR=1;VF=0.047619047619;DP=21


I used these commands on our gff and vcf files:
grep -v "#" mannotation-with-computation-4vep.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | tabix/bgzip -c > mannotation-with-computation-4vep.gff.gz
tabix/tabix -p gff mannotation-with-computation-4vep.gff.gz
ensembl-vep/vep --force_overwrite --synonyms synonyms-hyp.txt --format vcf --vcf --species mycobacterium_tuberculosis --symbol --variant_class --flag_pick --everything -i test1-0006.vcf -gff mannotation-with-computation-4vep.gff.gz -fasta H37Rv.fasta.gz -o test1-0006-annotated2.vcf

And the output vcf looks like this:
##fileformat=VCFv4.0
##source=pbhooverV1.0.0a8
##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting reads">
##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting reads">
##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FILTER=<ID=LOW,Description="Position with too low of depth">
##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
##FILTER=<ID=HETERO,Description="Enough support to call reference and variant (mixed population)">
##VEP="v91" time="2018-02-12 10:57:23" ensembl-variation=91.c78d8b4 ensembl-funcgen=91.4681d69 ensembl=91.18ee742 ensembl-io=91.923d668
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SOURCE|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|mannotation-with-computation.gff.gz">
##INFO=<ID=mannotation-with-computation.gff.gz,Number=.,Type=String,Description="mannotation-with-computation.gff.gz (overlap)">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
1    8    .    A    AGT    7.22    NO    RSR=4;VSR=1;VF=0.2;DP=5;CSQ=GT|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
etc.
1    2050    .    C    CC    11.36    NO    RSR=15;VSR=2;VF=0.117647058824;DP=21;CSQ=C|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
1    2051    .    A    AA    13.36    NO    RSR=18;VSR=2;VF=0.1;DP=20;CSQ=A|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
1    2051    .    AA    A    15.75    NO    RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1    2052    .    AT    A    15.22    NO    RSR=19;VSR=1;VF=0.05;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1    2053    .    TG    T    33.02    NO    RSR=15;VSR=2;VF=0.117647058824;DP=18;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1    2054    .    GG    G    45.75    NO    RSR=17;VSR=3;VF=0.15;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
1    2056    .    A    ACG    12.46    NO    RSR=17;VSR=1;VF=0.0555555555556;DP=21;CSQ=CG|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
1    2057    .    C    CAC    13.28    NO    RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=AC|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||

I've also tried adding 'Is_circular=true' to the attribute column of the first entry in the GFF, replacing 'locus_tag' with 'Name', and capitalizing  'alias', in case those deviations from the format described in the GFF documentation were the problem. I also added 'biotype' attributes to the GFF, after seeing this discussion in the forums: http://lists.ensembl.org/pipermail/dev/2018-January/012867.html, though I was unsure if that advice was meant for the GFF or VCF, or whether it was applicable to whole genome reads vs transcripts. None of this changed the resulting output.

Do you have an example of a GFF annotation file that's worked with VEP, so I can compare it with ours to see what I've done wrong? Or is there a tool we can use to create our own cache files?

Thank you for your assistance.




_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20180213/8fe07aa4/attachment.html>


More information about the Dev mailing list