[ensembl-dev] Create local ensembl SQL database for new organism

James Allen jallen at ebi.ac.uk
Fri Aug 14 15:06:21 BST 2015


Hello,
There are scripts that will allow you to load an assembly and geneset into a core db; all the code is in git repositories, and while there's documentation within the scripts, I'm not aware of anything that provides an overview. Below is the methodology I have used successfully - I make no claim that this is the best/right way to do it...


Schema:
Load the schema from the ensembl repo: https://github.com/Ensembl/ensembl/blob/master/sql/table.sql
There are a few lookup tables which will need to be populated, but I'm not sure how easy this is outside of the EBI; if the populate_production_db_tables.pl script (in the ensembl-production repo: https://github.com/Ensembl/ensembl-production/tree/master/scripts/production_database) doesn't work, copy these tables from one of the core databases on the public mysql server: attrib_type, external_db, misc_set, unmapped_reason.
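
If you go the copy route, a minimal sketch of building the dump command is below. It assumes the public server is reachable at ensembldb.ensembl.org with the read-only "anonymous" user on port 3306; the source database name is only an example, so check which core databases actually exist on the server before using it. The function only builds the command, it doesn't run it.

```python
import shlex
from typing import List

# The four lookup tables named above.
LOOKUP_TABLES = ["attrib_type", "external_db", "misc_set", "unmapped_reason"]


def dump_command(source_db: str,
                 host: str = "ensembldb.ensembl.org",  # assumption: public server
                 user: str = "anonymous",              # assumption: read-only user
                 port: int = 3306) -> List[str]:
    """Build a mysqldump command that exports only the lookup tables."""
    return [
        "mysqldump",
        "--host", host,
        "--port", str(port),
        "--user", user,
        # A read-only account normally cannot lock tables.
        "--skip-lock-tables",
        source_db,
        *LOOKUP_TABLES,
    ]


if __name__ == "__main__":
    # Example database name only; substitute a core db that exists on the server.
    print(shlex.join(dump_command("homo_sapiens_core_81_38")))
```

The output of that command can be redirected to a file and then loaded into your new core db with the mysql client.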


Assembly:
The scripts for loading an assembly are in the ensembl-pipeline repo: https://github.com/Ensembl/ensembl-pipeline/tree/master/scripts/. You'll need the contig sequences in fasta format, and an AGP file; if you don't have AGP, it's fairly easy to generate from scaffold fasta - I've used this script in the past: http://hmpdacc.org/doc/fasta2apg.pl
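
To illustrate what generating AGP from scaffold fasta involves, here is a minimal sketch (not the script linked above, and not necessarily what it does). It assumes that runs of at least 10 Ns mark gaps between contigs, names contigs "<scaffold>_contigN", and writes "paired-ends" as the linkage evidence; all three are placeholders to adjust for your assembly. It doesn't treat scaffolds that begin or end with Ns specially.

```python
import re
from typing import List


def scaffold_to_agp(name: str, seq: str, min_gap: int = 10) -> List[str]:
    """Split one scaffold at runs of N and return AGP 2.0-style rows."""
    gap_re = re.compile(r"[Nn]{%d,}" % min_gap)

    # Collect (start, end, is_gap) segments as 0-based, half-open intervals.
    segments = []
    pos = 0
    for match in gap_re.finditer(seq):
        if match.start() > pos:
            segments.append((pos, match.start(), False))
        segments.append((match.start(), match.end(), True))
        pos = match.end()
    if pos < len(seq):
        segments.append((pos, len(seq), False))

    rows = []
    contig_no = 0
    for part, (start, end, is_gap) in enumerate(segments, 1):
        length = end - start
        # AGP coordinates are 1-based and inclusive.
        common = [name, str(start + 1), str(end), str(part)]
        if is_gap:
            # Gap of known size; linkage evidence is a placeholder assumption.
            fields = common + ["N", str(length), "scaffold", "yes", "paired-ends"]
        else:
            contig_no += 1
            contig = f"{name}_contig{contig_no}"  # hypothetical naming scheme
            fields = common + ["W", contig, "1", str(length), "+"]
        rows.append("\t".join(fields))
    return rows


if __name__ == "__main__":
    for row in scaffold_to_agp("scaffold1", "ACGT" + "N" * 10 + "GG"):
        print(row)
```

The contig sequences themselves (the non-gap segments) would need to be written to a fasta file under the same names, so that the AGP and the contig fasta agree.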

1. run load_seq_region.pl with the -agp_file parameter to create the scaffolds
2. run load_seq_region.pl with the -fasta_file parameter to create the contigs
3. run load_agp.pl with the -agp_file parameter to create the links between scaffolds and contigs
4. run set_toplevel.pl to add some metadata about the scaffolds
5. run load_taxonomy.pl to add some metadata about the species
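
The order of the steps above matters, so it may help to script them. The sketch below only builds the command lines in order; it uses just the -agp_file and -fasta_file parameters mentioned above, and "common_args" is a hypothetical placeholder for whatever other options (database connection details and so on) each script's own documentation asks for.

```python
import shlex
from typing import List


def assembly_load_plan(agp_file: str, fasta_file: str,
                       common_args: List[str]) -> List[List[str]]:
    """Return the five assembly-loading commands in the order they should run.

    common_args is a placeholder for options not covered here; consult each
    script's built-in documentation for what it actually requires.
    """
    return [
        ["perl", "load_seq_region.pl", *common_args, "-agp_file", agp_file],      # scaffolds
        ["perl", "load_seq_region.pl", *common_args, "-fasta_file", fasta_file],  # contigs
        ["perl", "load_agp.pl", *common_args, "-agp_file", agp_file],             # scaffold/contig links
        ["perl", "set_toplevel.pl", *common_args],                                # scaffold metadata
        ["perl", "load_taxonomy.pl", *common_args],                               # species metadata
    ]


if __name__ == "__main__":
    for command in assembly_load_plan("assembly.agp", "contigs.fa", []):
        print(shlex.join(command))
```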


Genes:
Loading genes from GFF can be complicated, because the GFF3 spec allows quite a lot of variation in formatting, even if the spec was religiously adhered to (which it pretty much never is). There's a git repo (https://github.com/dsth/GffDoc) with code for this; use the GffDoc.pl script, which is documented (after a fashion) here: http://www.ebi.ac.uk/~jallen/GffDoc.html.
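
Before attempting an import, a quick sanity check of the GFF3 can save some time. The sketch below is not part of GffDoc; it is a minimal, assumption-laden check that every feature line has nine tab-separated columns and that every Parent attribute refers to an ID defined somewhere in the file. It is nowhere near a full validation against the spec.

```python
from typing import Iterable, List


def check_gff3(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems found in GFF3 lines."""
    problems = []
    ids = set()
    parents = []  # (line number, parent id) pairs, checked once all IDs are known

    for line_no, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append(
                f"line {line_no}: expected 9 tab-separated columns, found {len(cols)}"
            )
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attrs[key.strip()] = value.strip()
        if "ID" in attrs:
            ids.add(attrs["ID"])
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.append((line_no, parent))

    for line_no, parent in parents:
        if parent not in ids:
            problems.append(f"line {line_no}: Parent '{parent}' has no matching ID")
    return problems


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for problem in check_gff3(handle):
            print(problem)
```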

If you struggle to get the GFF import working, then please let me know; I have an alternative script (that only works if your GFF is valid), but it's undocumented and not in a public repo, so I'd need to provide more detailed guidance on using it...


Please let me know if you have any questions/problems...

Cheers,
James


On Fri, 14 Aug 2015 13:12:35 +0000
Luke Goodsell <Luke.Goodsell at ogt.com> wrote:

> Hi Albert,
> 
> Unfortunately, there isn't any documentation (that I know of) for the creation of the tables for a new species, probably because it's quite a varied
> process depending on the origin of the data. If you want to pursue this approach, I'd suggest studying the schema documentation
> (http://www.ensembl.org/info/docs/api/core/core_schema.html) and trying to replicate the structure for your new species, using one of the simpler
> species' architecture as a template. This would be a time-consuming task, though.
> 
> Kind regards,
> Luke
> 
> -----Original Message-----
> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Zhou Albert
> Sent: 14 August 2015 11:11
> To: Ensembl developers list
> Subject: Re: [ensembl-dev] Create local ensembl SQL database for new organism
> 
> Hi Luke,
> 
> Many thanks for the response! 
> 
> Yes, I have read these pages. However, what I would like to do is create a new SQL database in the core schema that contains the new organism’s genome data
> (currently in GFF and FASTA), so that it can be recognized and used in the web code. I’m looking for the proper tool to accomplish this task.
> 
> best wishes,
> Albert
> 
> 
> > On 14 August 2015, at 10:15 AM, Luke Goodsell <Luke.Goodsell at ogt.com> wrote:
> > 
> > Hi again, Albert,
> > 
> > This section might also be useful: http://www.ensembl.org/info/docs/webcode/custom/index.html
> > 
> > And more generally: http://www.ensembl.org/info/docs/webcode/index.html
> > 
> > Kind regards,
> > Luke
> > 
> > -----Original Message-----
> > From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Luke Goodsell
> > Sent: 14 August 2015 09:30
> > To: Ensembl developers list
> > Subject: Re: [ensembl-dev] Create local ensembl SQL database for new organism
> > 
> > Dear Albert
> > 
> > Have you seen this section of the EnsEMBL website: http://www.ensembl.org/info/docs/webcode/mirror/index.html
> > 
> > Kind regards,
> > Luke
> > 
> > -----Original Message-----
> > From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Zhou Albert
> > Sent: 12 August 2015 13:16
> > To: dev at ensembl.org
> > Subject: [ensembl-dev] Create local ensembl SQL database for new organism
> > 
> > Dear all,
> > 
> > I'm currently working on building a local Ensembl database web server, in which we would like to include our own genome data from a new organism.
> > However, after googling this subject, I still can't find any documents explaining how this can be done.
> > 
> > Could someone please show me where I can find such a document / guide, or does Ensembl simply not provide such a function?
> > 
> > Many thanks!
> > 
> > Albert
> > 
> > 
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> > Ensembl Blog: http://www.ensembl.info/
> > 
> 
> 



