[ensembl-dev] Pipeline: Loading genome into DB - Question on input files

Will Chow wc2 at sanger.ac.uk
Thu Feb 7 10:46:28 GMT 2013


Michael is correct, and I might add that the relationship between the coordinate systems is stored in the meta table.

Check out this row in the meta table of the recent C. elegans database:

|      23 |          1 | assembly.mapping             | chromosome:WBcel215|clone:WBcel215 | 


If you have multiple coord_systems, you can configure the relationships like so,
e.g. if it's a contig-supercontig-chr relationship, with contig being the lowest level:

assembly.mapping = contig|chr
assembly.mapping = supercontig|contig|chr
etc.

Check out the human meta tables for examples. But since C. elegans has only two coordinate systems, it should be fine if the assembly.mapping row shown above is in the meta table.
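
For illustration only, here is a minimal Python sketch of how such a value is put together and stored. It is not the Ensembl loading code: it assumes an in-memory SQLite table as a stand-in for the core MySQL meta table (column names as in the Ensembl core schema), and mapping_value, add_mapping and list_mappings are made-up helper names.

```python
import sqlite3

# Simplified stand-in for the core "meta" table (assumption: SQLite instead
# of MySQL, and only the four columns visible in the row quoted above).
SCHEMA = """
CREATE TABLE meta (
    meta_id    INTEGER PRIMARY KEY,
    species_id INTEGER DEFAULT 1,
    meta_key   TEXT NOT NULL,
    meta_value TEXT NOT NULL
)
"""

def mapping_value(coord_systems):
    """Join (name, version) pairs with '|'; version may be None."""
    parts = []
    for name, version in coord_systems:
        parts.append(f"{name}:{version}" if version else name)
    return "|".join(parts)

def add_mapping(conn, coord_systems):
    """Store one assembly.mapping row and return the value written."""
    value = mapping_value(coord_systems)
    conn.execute(
        "INSERT INTO meta (species_id, meta_key, meta_value) "
        "VALUES (1, 'assembly.mapping', ?)",
        (value,),
    )
    return value

def list_mappings(conn):
    """Return all assembly.mapping values in insertion order."""
    rows = conn.execute(
        "SELECT meta_value FROM meta "
        "WHERE meta_key = 'assembly.mapping' ORDER BY meta_id"
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# The two-level C. elegans case from the row quoted above.
add_mapping(conn, [("chromosome", "WBcel215"), ("clone", "WBcel215")])
print(list_mappings(conn))
```

With more than two coordinate systems you would call add_mapping once per relationship, so each one ends up as its own row with the assembly.mapping key.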


Also, as Michael suggested, you can read more about the AGP specification here:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml
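
If you have not worked with AGP before, a rough Python sketch of reading one line may help. It assumes the tab-separated, nine-column layout from the specification, skips gap lines (component type N or U) and comments, and uses made-up object and component names; AgpComponent and parse_agp_line are hypothetical names, not part of any Ensembl script.

```python
from typing import NamedTuple, Optional

class AgpComponent(NamedTuple):
    obj: str             # the assembled object, e.g. a chromosome
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    component_id: str    # the piece placed on the object, e.g. a clone
    comp_beg: int
    comp_end: int
    orientation: str

def parse_agp_line(line: str) -> Optional[AgpComponent]:
    """Return a component record, or None for blank, comment and gap lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 5:
        raise ValueError(f"too few columns: {line!r}")
    if cols[4] in ("N", "U"):
        # Gap lines describe missing sequence, not a placed component.
        return None
    if len(cols) < 9:
        raise ValueError(f"component line needs 9 columns: {line!r}")
    return AgpComponent(
        cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4],
        cols[5], int(cols[6]), int(cols[7]), cols[8],
    )

# Made-up example lines: one placed clone and one gap.
component = parse_agp_line("chrI\t1\t1000\t1\tF\tcloneA\t1\t1000\t+")
gap = parse_agp_line("chrI\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends")
print(component)
print(gap)
```

Each component line says which stretch of the lower-level piece sits at which position on the higher-level object, which is the same information the assembly table holds.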

Will


On 7 Feb 2013, at 10:11, Michael Paulini wrote:

> I stand corrected: we got a two level deep assembly in EnsEMBL now, so
> it seems that all EnsEMBL code can handle it fine.
> 
> M
> 
> On 07/02/13 10:07, Michael Paulini wrote:
>> On 07/02/13 07:55, Marc Hoeppner wrote:
>>> Hi EnsEMBL team,
>>> 
>>> I am currently on a quest to learn the gene annotation pipeline - and
>>> am already stuck.
>>> 
>>> In the examples I could find, it is assumed that the genome I want to
>>> load into a fresh DB is provided in 3 files:
>>> 
>>> Chromosome-to-scaffold relationship (AGP)
>>> Scaffold-to-contig relationship (AGP)
>>> And a contig fasta file
>>> 
>>> Using the load_seq_region.pl script, these files are parsed and used
>>> to populate the db , assigning ranks to seq_regions, reconstructing
>>> the sequence-level to chromosome-level relationships and so on.
>>> 
>>> However, since I decided to use a fairly mature genome for testing
>>> purposes (C. elegans), it seems non-trivial to actually get hold of
>>> anything that is not a full assembly (like:
>>> ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/assemblies/).
>>> 
>>> I assume if I just load the fasta file without
>>> chromosome-scaffold-contig relationships, stuff is going to break down
>>> the line.
>>> 
>>> I would really appreciate it, if someone could give me a hint on how
>>> to deal with that issue. Do I have to dig a little deeper to find the
>>> original assembly details, or is there a way to load the full assembly
>>> 'as is'? (I will also admit that I haven't really worked with AGP
>>> files before, so maybe that is the problem...).
>>> 
>>> Cheers,
>>> 
>>> Marc
>> Hi Marc,
>> 
>> we only have one level of assembly AGP files for C.elegans available
>> from WormBase:
>> clones -> chromosomes
>> 
>> It was clone-by-clone sequenced and an excellent genetic map helped with
>> the scaffolding.
>> 
>> What you can do is just create a fake supercontig AGP file that
>> mirrors either the clones or the chromosomes, and introduce that as an
>> artificial third assembly layer.
>> Technically the core API can handle just 2 layers, but there are other
>> bits of Ensembl code that assume a three-layer-deep assembly.
>> 
>> There are also a few more peculiarities of the WormBase->EnsEMBL
>> conversion, for example:
>> Basically use the transcripts as E! transcripts and keep the protein
>> names only in the xrefs, as the WB protein identifiers are unique to a
>> protein sequence, while the E! translation identifiers are unique to a
>> transcript sequence.
>> ... and don't forget to use a different translation table for the
>> mitochondrion and the selenoproteins.
>> 
>> I would therefore recommend using a less curated genome for your first
>> steps in EnsEMBL, and also recommend looking into the AGP file format,
>> as it will come in handy for understanding the assembly table.
>> 
>> For example, Trichinella spiralis and Loa loa are in the INSDC and don't
>> have any translation exceptions in their annotation. You still might
>> need to fake the third AGP layer, but the rest should work smoothly.
>> 
>> Michael
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
