[ensembl-dev] Pipeline: Loading genome into DB - Question on input files

Thu Feb 7 10:11:57 GMT 2013

I stand corrected: we have a two-level-deep assembly in EnsEMBL now, so
it seems that all EnsEMBL code can handle it fine.

M

On 07/02/13 10:07, Michael Paulini wrote:
> On 07/02/13 07:55, Marc Hoeppner wrote:
>> Hi EnsEMBL team,
>>
>> I am currently on a quest to learn the gene annotation pipeline - and
>> am already stuck.
>>
>> In the examples I could find, it is assumed that the genome I want to
>> load into a fresh DB is provided in 3 files:
>>
>> Chromosome-to-scaffold relationship (AGP)
>> Scaffold-to-contig relationship (AGP)
>> And a contig fasta file
>>
>> Using the load_seq_region.pl script, these files are parsed and used
>> to populate the db, assigning ranks to seq_regions, reconstructing
>> the sequence-level to chromosome-level relationships and so on.
>>
>> However, since I decided to use a fairly mature genome for testing
>> purposes (C. elegans), it seems non-trivial to actually get hold of
>> anything that is not a full assembly (like:
>> ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/assemblies/).
>>
>> I assume if I just load the fasta file without
>> chromosome-scaffold-contig relationships, stuff is going to break down
>> the line.
>>
>> I would really appreciate it, if someone could give me a hint on how
>> to deal with that issue. Do I have to dig a little deeper to find the
>> original assembly details, or is there a way to load the full assembly
>> 'as is'? (I will also admit that I haven't really worked with AGP
>> files before, so maybe that is the problem...).
>>
>> Cheers,
>>
>> Marc
> Hi Marc,
>
> we only have one level of assembly AGP files for C.elegans available
> from WormBase:
> clones -> chromosomes
>
> It was clone-by-clone sequenced and an excellent genetic map helped with
> the scaffolding.
>
> What you can do is create a fake supercontig AGP file that simply
> mirrors either the clones or the chromosomes, and so introduce an
> artificial third assembly layer.
> Technically the core API can handle just two layers, but there are other
> bits of Ensembl code that assume a three-layer-deep assembly.
>
> There are also a few more peculiarities of the WormBase->EnsEMBL
> conversion, for example:
> Basically use the transcripts as E! transcripts and keep the protein
> names only in the xrefs, as the WB protein identifiers are unique to a
> protein sequence, while the E! translation identifiers are unique to a
> transcript sequence.
> ... and don't forget to use a different translation table for the
> mitochondrion and the selenoprotein.
>
> I would therefore recommend using a less curated genome for your first
> steps in EnsEMBL and
> also recommend looking into the AGP file format, as it will come in handy
> for understanding the assembly table.
>
> For example, Trichinella spiralis and Loa loa are in the INSDC and don't
> have any translation exceptions in their annotation. You still might
> need to fake the third AGP layer, but the rest should work smoothly.
>
> Michael
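
The "fake supercontig" layer Michael describes could be produced with a few lines of scripting. What follows is only a minimal sketch, assuming the artificial layer mirrors the chromosomes one-to-one. The function name, the "_sc" suffix, the component type "F" and the lengths in the usage line are all hypothetical placeholders, and whether a particular loading script accepts the result has not been checked here. The nine tab-separated columns follow the standard AGP layout for a component line: object, object start, object end, part number, component type, component id, component start, component end, orientation.

```python
def identity_agp_lines(lengths, suffix="_sc", component_type="F"):
    """Build AGP lines mapping each top-level object onto one fake
    component spanning its full length.

    lengths        -- dict of object name -> length in bp (caller-supplied)
    suffix         -- hypothetical suffix used to name the fake supercontig
    component_type -- AGP component type column; "F" is an assumption here
    """
    lines = []
    for name, length in lengths.items():
        fields = [
            name,            # object (e.g. chromosome)
            1,               # object start
            length,          # object end
            1,               # part number within the object
            component_type,  # component type
            name + suffix,   # component id (the fake supercontig)
            1,               # component start
            length,          # component end
            "+",             # orientation
        ]
        lines.append("\t".join(str(f) for f in fields))
    return lines


if __name__ == "__main__":
    # Placeholder names and lengths, not real assembly values.
    for line in identity_agp_lines({"chrA": 1000, "chrB": 2500}):
        print(line)
```

With a layer like this, the existing clones -> chromosomes AGP would also need its object column renamed to the fake supercontig names, so that the three levels chain together as chromosome -> supercontig -> clone.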