[ensembl-dev] problem importing GenBank file into local core DB

梁薰文 a24681012142002 at gmail.com
Mon Jul 11 17:21:10 BST 2016


Hi Paul and Dan,

Thanks for your quick response, and sorry for failing to attach the code to my last mail.
I have included it here as an attachment. (It's a little long, but that seems necessary...)

We understand it's extremely hard to take care of each strain. Thus, we are now importing those RefSeq genomes into a local core database to solve this problem.

How about origin-spanning features?
I only found this: http://dev.ensembl.narkive.com/9KWpLOT4/circular-sequences, which is an ensembl-dev topic from 2005.
Is there any more recent documentation?
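
To make the question concrete, here is a minimal Python sketch (not the attached import script, and not the Ensembl API) of what an origin-spanning feature looks like in coordinates. In a GenBank file for a circular replicon, such a feature is written as a join that runs to the last base and restarts at base 1, e.g. join(951..1000,1..30). The sketch assumes, and this is not confirmed in this thread, that the feature would be stored on a circular seq_region with start greater than end. The function names and the genome length of 1000 are hypothetical, and the parts are assumed to be forward-strand and contiguous apart from the wrap.

```python
# Illustrative sketch only; names and numbers are hypothetical.

def collapse_join(parts):
    """Collapse forward-strand join() parts, given in GenBank order as
    1-based inclusive (start, end) pairs, into a single (start, end).
    If the join wraps past the origin, the result has start > end."""
    if not parts:
        raise ValueError("at least one (start, end) part is required")
    return parts[0][0], parts[-1][1]


def spans_origin(start, end):
    """Assumed convention: start > end marks an origin-spanning feature."""
    return start > end


def feature_length(start, end, genome_length):
    """Length in bases, counting around the origin when start > end."""
    if start > end:
        return (genome_length - start + 1) + end
    return end - start + 1


GENOME_LENGTH = 1000  # hypothetical circular replicon length

# join(951..1000,1..30) on a 1000 bp circular sequence
start, end = collapse_join([(951, 1000), (1, 30)])
print(start, end, spans_origin(start, end),
      feature_length(start, end, GENOME_LENGTH))
```

A feature that does not cross the origin, such as 100..250, goes through the same functions unchanged and keeps start <= end.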

Thanks, 
Susan



> On Jul 11, 2016, at 11:56 PM, Paul Kersey <pkersey at ebi.ac.uk> wrote:
> 
> Hi Susan
> 
>> On 11 Jul 2016, at 16:45, 梁薰文 <a24681012142002 at gmail.com> wrote:
>> 
>> Hi Dan,
>> 
>> We are importing thirteen Klebsiella pneumoniae strains; one of them is PMK1 (ASM76461v1).
>> After your note, I did find those genomes in EnsemblBacteria.
>> However, I noticed that NCBI RefSeq provides updates to the annotation of the genomes of these strains.
>> The main difference between the RefSeq and GenBank assemblies lies in the feature annotation and the feature counts, such as the number of genes and proteins. Here are the predicted numbers for PMK1 in the GenBank and RefSeq assemblies:
>> 	a. GenBank version: 5,705 genes, 5,594 proteins
>> 	b. RefSeq version:    5,879 genes, 5,672 proteins
>> 
>> Detailed information for the PMK1 strain is listed below for reference. (NCBI RefSeq URL: http://www.ncbi.nlm.nih.gov/refseq/)
>> Strain: PMK1 (direct URL to the assembly record: http://www.ncbi.nlm.nih.gov/assembly/GCA_000764615.1)
>> GenBank assembly accession: GCA_000764615.1 (latest)
>> gb file URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000764615.1_ASM76461v1/GCA_000764615.1_ASM76461v1_genomic.gbff.gz
>> RefSeq assembly accession: GCF_000764615.1 (latest)
>> gb file URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000764615.1_ASM76461v1/GCF_000764615.1_ASM76461v1_genomic.gbff.gz
> 
> At present we offer access to bacterial annotation as submitted to the ENA/GenBank/DDBJ archives.  For 30,000 species, we cannot judge the merits of rival annotations in individual cases.  It is possible that we might switch to using the RefSeq-provided annotations at some point in future, but there are not presently plans to do this.
> 
> best wishes,
> 
> Paul
> 
> 
>> 
>> Last time you also mentioned "origin-spanning features", but I only found this on Google.
>> It's an ensembl-dev topic from 2005.
>> Please tell me if I have found the wrong thing.
>> http://dev.ensembl.narkive.com/9KWpLOT4/circular-sequences
>> 
>> Thanks very much for your help in improving our code.
>> Please find it in the attachment.
>> 
>> Susan
>> 
>> 
>>> On Jul 7, 2016, at 4:52 PM, Dan Staines <dstaines at ebi.ac.uk> wrote:
>>> 
>>> Hi Susan,
>>> 
>>> Ensembl does support origin-spanning features - we have these in Ensembl
>>> Bacteria. Can you please share with me a small piece of code showing how
>>> you are storing the data so we can see what the problem might be?
>>> 
>>> Out of interest, which prokaryotic genomes are you importing? There are
>>> over 40,000 in Ensembl Bacteria which come from EMBL/GenBank so it's
>>> possible that the genomes you are interested in are already present.
>>> 
>>> Thanks,
>>> 
>>> Dan.
>>> 
>>> -- 
>>> Dan Staines, PhD
>>> Genomics Technology Infrastructure Coordinator
>>> EMBL-EBI, Wellcome Trust Genome Campus
>>> Cambridge CB10 1SD, UK
>>> Tel: +44-(0)1223-492507
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> 
> ---
> Dr. Paul Kersey
> Team Leader, Non-vertebrate Genomics
> European Bioinformatics Institute
> European Molecular Biology Laboratory             Tel: +44-(0)1223-494601
> Wellcome Genome Campus, Hinxton                Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK                                email: pkersey at ebi.ac.uk

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: store_gene_from_gb.txt
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160712/f17b2692/attachment.txt>