[ensembl-dev] Custom Annotation

Laurent Gil lgil at ebi.ac.uk
Tue Feb 13 14:26:53 GMT 2018


Dear Derek,


Your GFF file is missing the "biotype" and "parent" attributes on the 
CDS lines.

e.g. using your input example:

NC_000962.3    Modlin et. al. 2018    CDS    1    1524    .    +    .    ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
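
As a minimal sketch, a check like the following could flag CDS lines that lack these two attributes before the file is passed to VEP. The function names and the required-attribute list are hypothetical and are based only on the two attributes named above; comparing attribute keys case-insensitively is also an assumption, not confirmed VEP behaviour.

```python
# Assumption: only the two attributes named in this reply are checked.
REQUIRED_CDS_ATTRIBUTES = ("parent", "biotype")


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value pairs joined by ';') into a dict.

    Keys are lowercased so that 'Parent' and 'parent' are treated alike
    (an assumption made for this sketch).
    """
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip().lower()] = value.strip()
    return attributes


def find_incomplete_cds(gff_lines):
    """Return (ID, missing attribute names) for each CDS line lacking a required attribute."""
    problems = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only well-formed CDS records are inspected
        attributes = parse_attributes(fields[8])
        missing = [name for name in REQUIRED_CDS_ATTRIBUTES if name not in attributes]
        if missing:
            problems.append((attributes.get("id", "?"), missing))
    return problems
```

Calling `find_incomplete_cds(open("mannotation-with-computation-4vep.gff"))` on the uncompressed file would list each CDS that still needs fixing.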


Furthermore, you need to add one or more "exon" lines after each CDS line, 
each using the "parent" attribute to refer to that CDS, e.g.:

NC_000962.3    Modlin et. al. 2018    CDS    1    1524    .    +    .    ID=CDS1;parent=gene1;biotype=protein_coding;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways

NC_000962.3    Modlin et. al. 2018    exon    1    1524    .    +    .    ID=exon1;parent=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;note=FunctionalCategory: information pathways
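
The changes described above could be applied to a whole file with something like the sketch below. It is only an illustration under stated assumptions: each CDS is taken to belong to the most recent preceding gene line (as in your excerpt), every CDS is assumed to be protein coding, and the function names and the `exon_<CDS ID>` naming are hypothetical. Note also that the GFF3 specification spells the reserved tag `Parent`; the lowercase spelling used here simply follows the example lines above.

```python
PARENT_KEY = "parent"  # follows the example above; the GFF3 spec spells it "Parent"


def attribute_value(column9, key):
    """Return the value of `key` in a GFF3 attribute column, or None if absent."""
    for pair in column9.split(";"):
        name, _, value = pair.partition("=")
        if name.strip() == key:
            return value.strip()
    return None


def add_cds_parents_and_exons(gff_lines, biotype="protein_coding"):
    """Yield GFF3 lines with parent/biotype added to each CDS and an exon line after it.

    Assumes each CDS belongs to the most recent preceding gene line.
    """
    last_gene_id = None
    for line in gff_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if line.startswith("#") or len(fields) != 9:
            yield line  # pass comments and malformed lines through unchanged
            continue
        feature_type, attributes = fields[2], fields[8].strip()
        cds_id = attribute_value(attributes, "ID")
        if feature_type == "gene":
            last_gene_id = cds_id  # for a gene line this is the gene ID
            yield line
        elif feature_type == "CDS" and last_gene_id and cds_id:
            # Keep all other attributes, dropping any ID/parent/biotype already present.
            rest = [p for p in attributes.split(";")
                    if p and p.partition("=")[0].strip().lower() not in ("id", "parent", "biotype")]
            cds_attrs = [f"ID={cds_id}", f"{PARENT_KEY}={last_gene_id}", f"biotype={biotype}"] + rest
            yield "\t".join(fields[:8] + [";".join(cds_attrs)])
            # One exon spanning the same coordinates, pointing back at the CDS.
            exon_attrs = [f"ID=exon_{cds_id}", f"{PARENT_KEY}={cds_id}"] + rest
            yield "\t".join(fields[:2] + ["exon"] + fields[3:8] + [";".join(exon_attrs)])
        else:
            yield line
```

The output would still need to be sorted, bgzipped and indexed with tabix, as in the commands you already used.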


We will try to improve our documentation regarding the GFF files in VEP.


Best regards,

Laurent
Ensembl Variation

On 12/02/2018 19:10, Derek Conkle-Gutierrez wrote:
>
> Hello,
>
> I work for Dr. Faramarz Valafar at San Diego State University. 
> Previously we have used Ensembl's VEP program on our vcf files of 
> Mycobacterium tuberculosis sequences, using annotation from a cache 
> file downloaded from your website. However, recently we have developed 
> additional annotations (mostly from running I-TASSER on ambiguously 
> annotated genes) that we would like to include. To that end I 
> converted our custom annotation file to a GFF3 format, and followed 
> your website's instructions for running VEP with that as the 
> annotation source. This ran, but unfortunately it identified every 
> variant as intergenic, even when they were within one of our annotated 
> CDS features. I assume this is due to a formatting error on my part 
> with our GFF file, though I've been following the specifications 
> described here 
> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
>
> I'm using ensembl-vep version 91.3
> Here's a bit of our gff:
> NC_000962.3    Modlin et. al. 2018    gene    1    1524  .    +    . 
>  ID=gene1;locus_tag=Rv0001;alias=dnaA;experiment=DESCRIPTION:Mutation 
> analysis, gene expression[PMID: 10375628];Dbxref=GeneID:885041
> NC_000962.3    Modlin et. al. 2018    CDS    1    1524  .    +    .   
>  ID=CDS1;locus_tag=Rv0001;product=Chromosomal replication initiator 
> protein DnaA;note=FunctionalCategory: information pathways
> NC_000962.3    Modlin et. al. 2018    gene    2052    3260  .    +   
>  .  ID=gene2;locus_tag=Rv0002;alias=dnaN;Dbxref=GeneID:887092
> NC_000962.3    Modlin et. al. 2018    CDS    2052    3260  .    +   
>  .    ID=CDS2;locus_tag=Rv0002;product=DNA polymerase III (beta chain) 
> DnaN (DNA nucleotidyltransferase);note=FunctionalCategory: information 
> pathways
>
> Here's a bit of our test input vcf:
> ##fileformat=VCFv4.0
> ##source=pbhooverV1.0.0a8
> ##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting 
> reads">
> ##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting 
> reads">
> ##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
> ##FILTER=<ID=LOW,Description="Position with too low of depth">
> ##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
> ##FILTER=<ID=HETERO,Description="Enough support to call reference and 
> variant (mixed population)">
> #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
> 1    8    .    A    AGT    7.22    NO  RSR=4;VSR=1;VF=0.2;DP=5
> etc.
> 1    2050    .    C    CC    11.36    NO 
>  RSR=15;VSR=2;VF=0.117647058824;DP=21
> 1    2051    .    A    AA    13.36    NO  RSR=18;VSR=2;VF=0.1;DP=20
> 1    2051    .    AA    A    15.75    NO 
>  RSR=20;VSR=1;VF=0.047619047619;DP=21
> 1    2052    .    AT    A    15.22    NO  RSR=19;VSR=1;VF=0.05;DP=21
> 1    2053    .    TG    T    33.02    NO 
>  RSR=15;VSR=2;VF=0.117647058824;DP=18
> 1    2054    .    GG    G    45.75    NO  RSR=17;VSR=3;VF=0.15;DP=21
> 1    2056    .    A    ACG    12.46    NO 
>  RSR=17;VSR=1;VF=0.0555555555556;DP=21
> 1    2057    .    C    CAC    13.28    NO 
>  RSR=20;VSR=1;VF=0.047619047619;DP=21
>
>
> I used these commands on our gff and vcf files:
> grep -v "#" mannotation-with-computation-4vep.gff | sort -k1,1 -k4,4n 
> -k5,5n -t$'\t' | tabix/bgzip -c > mannotation-with-computation-4vep.gff.gz
> tabix/tabix -p gff mannotation-with-computation-4vep.gff.gz
> ensembl-vep/vep --force_overwrite --synonyms synonyms-hyp.txt --format 
> vcf --vcf --species mycobacterium_tuberculosis --symbol 
> --variant_class --flag_pick --everything -i test1-0006.vcf -gff 
> mannotation-with-computation-4vep.gff.gz -fasta H37Rv.fasta.gz -o 
> test1-0006-annotated2.vcf
>
> And the output vcf looks like this:
> ##fileformat=VCFv4.0
> ##source=pbhooverV1.0.0a8
> ##INFO=<ID=RSR,Number=1,Type=Integer,Description="Reference-supporting 
> reads">
> ##INFO=<ID=VSR,Number=1,Type=Integer,Description="Variant-supporting 
> reads">
> ##INFO=<ID=VF,Number=1,Type=Float,Description="Variant frequency">
> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
> ##FILTER=<ID=LOW,Description="Position with too low of depth">
> ##FILTER=<ID=NO,Description="Does not meet criteria to call variant">
> ##FILTER=<ID=HETERO,Description="Enough support to call reference and 
> variant (mixed population)">
> ##VEP="v91" time="2018-02-12 10:57:23" ensembl-variation=91.c78d8b4 
> ensembl-funcgen=91.4681d69 ensembl=91.18ee742 ensembl-io=91.923d668
> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence 
> annotations from Ensembl VEP. Format: 
> Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SOURCE|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|mannotation-with-computation.gff.gz">
> ##INFO=<ID=mannotation-with-computation.gff.gz,Number=.,Type=String,Description="mannotation-with-computation.gff.gz 
> (overlap)">
> #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
> 1    8    .    A    AGT    7.22    NO 
>  RSR=4;VSR=1;VF=0.2;DP=5;CSQ=GT|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> etc.
> 1    2050    .    C    CC    11.36    NO 
>  RSR=15;VSR=2;VF=0.117647058824;DP=21;CSQ=C|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> 1    2051    .    A    AA    13.36    NO 
>  RSR=18;VSR=2;VF=0.1;DP=20;CSQ=A|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> 1    2051    .    AA    A    15.75    NO 
>  RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1    2052    .    AT    A    15.22    NO 
>  RSR=19;VSR=1;VF=0.05;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1    2053    .    TG    T    33.02    NO 
>  RSR=15;VSR=2;VF=0.117647058824;DP=18;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1    2054    .    GG    G    45.75    NO 
>  RSR=17;VSR=3;VF=0.15;DP=21;CSQ=-|intergenic_variant|MODIFIER|||||||||||||||||||1|deletion||||||||||||||||||||||||||||||||||||||||||||
> 1    2056    .    A    ACG    12.46    NO 
>  RSR=17;VSR=1;VF=0.0555555555556;DP=21;CSQ=CG|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
> 1    2057    .    C    CAC    13.28    NO 
>  RSR=20;VSR=1;VF=0.047619047619;DP=21;CSQ=AC|intergenic_variant|MODIFIER|||||||||||||||||||1|insertion||||||||||||||||||||||||||||||||||||||||||||
>
> I've also tried adding 'Is_circular=true' to the attribute column of 
> the first entry in the GFF, replacing 'locus_tag' with 'Name', and 
> capitalizing  'alias', in case those deviations from the format 
> described in the GFF documentation were the problem. I also added 
> 'biotype' attributes to the GFF, after seeing this discussion in the 
> forums: 
> http://lists.ensembl.org/pipermail/dev/2018-January/012867.html, 
> though I was unsure if that advice was meant for the GFF or VCF, or 
> whether it was applicable to whole genome reads vs transcripts. None 
> of this changed the resulting output.
>
> Do you have an example of a GFF annotation file that's worked with 
> VEP, so I can compare it with ours to see what I've done wrong? Or is 
> there a tool we can use to create our own cache files?
>
> Thank you for your assistance.
