[ensembl-dev] Create local ensembl SQL database for new organism

Tue Aug 18 13:51:54 BST 2015

Hi Albert,
James' method should get you where you want to be; I've just added some comments below.

> On 14 Aug 2015, at 15:41, Zhou Albert <030bug at gmail.com> wrote:
> 
> Hi all,
> 
> Thanks for all the responses.
> 
> James: your method looks great. I will try it in the next few weeks.
> 
> best wishes,
> Albert
> 
> 
>> On 14 Aug 2015, at 15:06, James Allen <jallen at ebi.ac.uk> wrote:
>> 
>> Hello,
>> There are scripts that will allow you to load an assembly and geneset into a core db; all the code is in git repositories, and while there's documentation within the scripts, I'm not aware of anything that provides an overview. Below is a methodology I have used successfully; I make no claim that this is the best/right way to do it...
>> 
>> 
>> Schema:
>> Load the schema from the ensembl repo: https://github.com/Ensembl/ensembl/blob/master/sql/table.sql
>> There are a few lookup tables which will need to be populated, but I'm not sure how easy this is outside of the EBI; if the populate_production_db_tables.pl (in the ensembl-production repo: https://github.com/Ensembl/ensembl-production/tree/master/scripts/production_database) doesn't work, copy these tables from one of the cores on the public mysql server: attrib_type, external_db, misc_set, unmapped_reason.

The database needed is called ensembl_production_XX (XX is the release version), and you can find it on our public servers: http://www.ensembl.org/info/data/mysql.html
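
If the populate script does not work, copying the four lookup tables from a public core db could look roughly like the sketch below. It only builds the command line and does not run anything. Assumptions that are not from this thread: the public host ensembldb.ensembl.org with user "anonymous" on port 3306, the example database names, the local user "ensadmin", and that mysqldump and mysql are installed.

```python
import shlex

# The four lookup tables named in the quoted message above.
LOOKUP_TABLES = ["attrib_type", "external_db", "misc_set", "unmapped_reason"]


def dump_command(source_db, host="ensembldb.ensembl.org",
                 user="anonymous", port=3306):
    """Build the mysqldump argument list for the lookup tables.

    Host, user and port are assumptions about the public server.
    """
    return ["mysqldump", "--host", host, "--port", str(port),
            "--user", user, "--skip-lock-tables", source_db] + LOOKUP_TABLES


def load_command(target_db, host="localhost", user="ensadmin"):
    """Build the mysql argument list for the local target database.

    "ensadmin" is a hypothetical local account; add a password option
    if your server needs one.
    """
    return ["mysql", "--host", host, "--user", user, target_db]


def shell_pipeline(source_db, target_db):
    """Return one shell line that pipes the dump into the local database."""
    dump = " ".join(shlex.quote(arg) for arg in dump_command(source_db))
    load = " ".join(shlex.quote(arg) for arg in load_command(target_db))
    return f"{dump} | {load}"


if __name__ == "__main__":
    # Both database names are examples only.
    print(shell_pipeline("homo_sapiens_core_81_38", "my_species_core_81_1"))
```

Check the printed line before running it. The target database must already contain the schema from table.sql.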

>> 
>> 
>> Assembly:
>> The scripts for loading an assembly are in the ensembl-pipeline repo: https://github.com/Ensembl/ensembl-pipeline/tree/master/scripts/. You'll need the contig sequences in fasta format, and an AGP file; if you don't have AGP, it's fairly easy to generate from scaffold fasta - I've used this script in the past: http://hmpdacc.org/doc/fasta2apg.pl
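
A minimal sketch of how AGP can be derived from scaffold fasta, in case the linked script is unavailable. It is not that script. It assumes that runs of at least 10 Ns mark the gaps between contigs, writes AGP 2.0 style lines, and uses "scaffold" / "yes" / "paired-ends" as gap type, linkage and evidence; the threshold, those values and the contig naming scheme are all assumptions to adjust for your data.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scaffold_to_agp(name, seq, min_gap=10):
    """Split one scaffold at runs of N and describe it in AGP.

    Returns (agp_lines, contigs), where contigs is a list of
    (contig_id, sequence) to be written out as the contig fasta.
    """
    gap_re = re.compile(r"[Nn]{%d,}" % min_gap)
    rows, contigs = [], []

    def add_contig(start, end):
        # start/end are 0-based, end-exclusive; AGP is 1-based, inclusive.
        if end <= start:
            return
        contig_id = f"{name}_contig{len(contigs) + 1}"
        contigs.append((contig_id, seq[start:end]))
        rows.append((name, start + 1, end, len(rows) + 1, "W",
                     contig_id, 1, end - start, "+"))

    pos = 0
    for gap in gap_re.finditer(seq):
        add_contig(pos, gap.start())
        rows.append((name, gap.start() + 1, gap.end(), len(rows) + 1, "N",
                     gap.end() - gap.start(), "scaffold", "yes",
                     "paired-ends"))
        pos = gap.end()
    add_contig(pos, len(seq))
    agp_lines = ["\t".join(str(value) for value in row) for row in rows]
    return agp_lines, contigs


if __name__ == "__main__":
    example = [">scaffold1 demo", "ACGTACGT", "NNNNNNNNNNNN", "GGCC"]
    for scaffold_name, sequence in read_fasta(example):
        agp, contig_list = scaffold_to_agp(scaffold_name, sequence)
        print("\n".join(agp))
```

Runs of N shorter than min_gap stay inside a contig, so pick the threshold to match how your assembler pads gaps.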
>> 
>> 1. run load_seq_region.pl with the -agp_file parameter to create the scaffolds
>> 2. run load_seq_region.pl with the -fasta_file parameter to create the contigs
>> 3. run load_agp.pl with the -agp_file parameter to create the links between scaffolds and contigs
>> 4. run set_toplevel.pl to add some metadata about the scaffolds
>> 5. run load_taxonomy.pl to add some metadata about the species

Unfortunately, step 5 relies on an internal database, so the script won't work for you. You can use the NCBI taxonomy instead. The script sets all the 'species.classification' and 'species.alias' keys in the meta table of the database.
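
A minimal sketch of writing those meta keys by hand, assuming you have looked up the lineage in the NCBI taxonomy yourself. The function name, the example lineage and aliases, and the ordering (most specific name first) are assumptions. 'species.taxonomy_id' is an extra key added here on the assumption that a core db normally carries it. The rows are meant for a DB-API driver's executemany() against the meta table (species_id, meta_key, meta_value).

```python
# Parameterised statement for a MySQL DB-API driver (the "%s" paramstyle
# is an assumption about the driver you use).
INSERT_META_SQL = (
    "INSERT INTO meta (species_id, meta_key, meta_value) "
    "VALUES (%s, %s, %s)"
)


def taxonomy_meta_rows(taxon_id, lineage, aliases, species_id=1):
    """Return (species_id, meta_key, meta_value) rows for the meta table.

    lineage: taxon names copied by hand from NCBI, assumed to be
             ordered from the species upwards.
    aliases: alternative names the species should be found under.
    """
    rows = [(species_id, "species.taxonomy_id", str(taxon_id))]
    rows += [(species_id, "species.classification", name) for name in lineage]
    rows += [(species_id, "species.alias", alias) for alias in aliases]
    return rows


if __name__ == "__main__":
    # Illustrative, truncated lineage; replace with your organism's.
    example_rows = taxonomy_meta_rows(
        9606,
        ["Homo sapiens", "Homininae", "Hominidae", "Primates", "Mammalia"],
        ["human", "homo_sapiens"],
    )
    print(INSERT_META_SQL)
    for row in example_rows:
        print(row)
```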

>> 
>> 
>> Genes:
>> Loading genes from GFF can be complicated, because the GFF3 spec allows quite a lot of variation in formatting, even if the spec was religiously adhered to (which it pretty much never is). There's a git repo (https://github.com/dsth/GffDoc) with code for this; use the GffDoc.pl script, which is documented (after a fashion) here: http://www.ebi.ac.uk/~jallen/GffDoc.html.

If GffDoc.pl imports what you want without too much tweaking, use it; we didn't have that luck. I really hate to say it, but most of the time it's faster to write your own GFF parser. We have a script to load the NCBI annotation, ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl, which may not work for you but can be a good way to see how genes, transcripts, exons and translations are created before being written to the database. The script uses https://github.com/Ensembl/ensembl-io
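
To give an idea of what writing your own parser involves, here is a minimal sketch that groups GFF3 lines into gene -> transcript -> exon/CDS via the ID and Parent attributes. It is not what parse_ncbi_gff3.pl or GffDoc.pl do. It assumes the feature types are spelled "gene", "mRNA"/"transcript", "exon" and "CDS", and it silently skips anything else; storing the result through the Ensembl API is left out entirely.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Parse GFF3 column 9 into {key: [values]} (values are %-decoded)."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # Split on commas before decoding, as escaped commas are %2C.
            attrs[key.strip()] = [unquote(v) for v in value.strip().split(",")]
    return attrs


def parse_gff3(lines):
    """Return a list of feature dicts from an iterable of GFF3 lines."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; stop reading features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # malformed line: skipped in this sketch
        features.append({
            "seqid": cols[0], "type": cols[2],
            "start": int(cols[3]), "end": int(cols[4]),
            "strand": cols[6], "phase": cols[7],
            "attrs": parse_attributes(cols[8]),
        })
    return features


def build_genes(features):
    """Group features into genes[gene_id]["transcripts"][transcript_id]."""
    genes, gene_of_transcript = {}, {}
    for f in features:
        if f["type"] == "gene" and "ID" in f["attrs"]:
            genes[f["attrs"]["ID"][0]] = {"feature": f, "transcripts": {}}
    for f in features:
        if f["type"] in ("mRNA", "transcript") and "ID" in f["attrs"]:
            for gene_id in f["attrs"].get("Parent", []):
                if gene_id in genes:
                    tid = f["attrs"]["ID"][0]
                    genes[gene_id]["transcripts"][tid] = {
                        "feature": f, "exons": [], "cds": []}
                    gene_of_transcript[tid] = gene_id
    for f in features:
        if f["type"] in ("exon", "CDS"):
            key = "exons" if f["type"] == "exon" else "cds"
            for tid in f["attrs"].get("Parent", []):
                gene_id = gene_of_transcript.get(tid)
                if gene_id is not None:
                    genes[gene_id]["transcripts"][tid][key].append(
                        (f["start"], f["end"]))
    for gene in genes.values():
        for transcript in gene["transcripts"].values():
            transcript["exons"].sort()
            transcript["cds"].sort()
    return genes
```

The CDS spans are what you would use to work out where the translation starts and ends within the exons. Real files usually need extra cases (missing gene lines, exons attached directly to genes, and so on), which is where most of the tweaking goes.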

Hope this helps,
Thibaut
>> 
>> If you struggle to get the gff import working, then please let me know, I have an alternative script (that only works if your GFF is valid) but it's undocumented and not in a public repo, so I'd need to provide more detailed guidance on using it...
>> 
>> 
>> Please let me know if you have any questions/problems...
>> 
>> Cheers,
>> James
>> 
>> 
>> On Fri, 14 Aug 2015 13:12:35 +0000
>> Luke Goodsell <Luke.Goodsell at ogt.com> wrote:
>> 
>>> Hi Albert,
>>> 
>>> Unfortunately, there isn't any documentation (that I know of) for the creation of the tables for a new species, probably because it's quite a varied
>>> process depending on the origin of the data. If you want to pursue this approach, I'd suggest studying the schema documentation
>>> (http://www.ensembl.org/info/docs/api/core/core_schema.html) and trying to replicate the structure for your new species, using one of the simpler
>>> species' architecture as a template. This would be a time-consuming task, though.
>>> 
>>> Kind regards,
>>> Luke
>>> 
>>> -----Original Message-----
>>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Zhou Albert
>>> Sent: 14 August 2015 11:11
>>> To: Ensembl developers list
>>> Subject: Re: [ensembl-dev] Create local ensembl SQL database for new organism
>>> 
>>> Hi Luke,
>>> 
>>> Many thanks for the response! 
>>> 
>>> Yes, I have read these pages. However, what I would like to do is create a new SQL database in the core schema that contains the new organism’s genome data
>>> (currently in GFF and FASTA), so that it can be recognized and used by the web code. I’m looking for the proper tool for this task. 
>>> 
>>> best wishes,
>>> Albert
>>> 
>>> 
>>>> On 14 Aug 2015, at 10:15, Luke Goodsell <Luke.Goodsell at ogt.com> wrote:
>>>> 
>>>> Hi again, Albert,
>>>> 
>>>> This section might also be useful: http://www.ensembl.org/info/docs/webcode/custom/index.html
>>>> 
>>>> And more generally: http://www.ensembl.org/info/docs/webcode/index.html
>>>> 
>>>> Kind regards,
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Luke Goodsell
>>>> Sent: 14 August 2015 09:30
>>>> To: Ensembl developers list
>>>> Subject: Re: [ensembl-dev] Create local ensembl SQL database for new organism
>>>> 
>>>> Dear Albert
>>>> 
>>>> Have you seen this section of the EnsEMBL website: http://www.ensembl.org/info/docs/webcode/mirror/index.html
>>>> 
>>>> Kind regards,
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Zhou Albert
>>>> Sent: 12 August 2015 13:16
>>>> To: dev at ensembl.org
>>>> Subject: [ensembl-dev] Create local ensembl SQL database for new organism
>>>> 
>>>> Dear all,
>>>> 
>>>> I'm currently working on building a local ensembl database web server, within which we would like to include our own genome data from a new organism.
>>>> However, after googling this subject, I still can't find any documents explaining how this can be done. 
>>>> 
>>>> Could someone please show me where I can find such a document / guide, or does ensembl perhaps simply not provide such a function?
>>>> 
>>>> Many thanks!
>>>> 
>>>> Albert
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>> 
>>> 
>>> 
> 
> 