[ensembl-dev] Pipeline: Loading genome into DB - Question on input files

Thu Feb 7 10:07:45 GMT 2013

On 07/02/13 07:55, Marc Hoeppner wrote:
> Hi EnsEMBL team,
>
> I am currently on a quest to learn the gene annotation pipeline - and
> am already stuck.
>
> In the examples I could find, it is assumed that the genome I want to
> load into a fresh DB is provided in 3 files:
>
> Chromosome-to-scaffold relationship (AGP)
> Scaffold-to-contig relationship (AGP)
> And a contig fasta file
>
> Using the load_seq_region.pl script, these files are parsed and used
> to populate the db, assigning ranks to seq_regions, reconstructing
> the sequence-level to chromosome-level relationships and so on.
>
> However, since I decided to use a fairly mature genome for testing
> purposes (C. elegans), it seems non-trivial to actually get hold of
> anything that is not a full assembly (like:
> ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/assemblies/).
>
> I assume if I just load the fasta file without
> chromosome-scaffold-contig relationships, stuff is going to break down
> the line.
>
> I would really appreciate it, if someone could give me a hint on how
> to deal with that issue. Do I have to dig a little deeper to find the
> original assembly details, or is there a way to load the full assembly
> 'as is'? (I will also admit that I haven't really worked with AGP
> files before, so maybe that is the problem...).
>
> Cheers,
>
> Marc
Hi Marc,

We only have one level of assembly AGP files for C. elegans available
from WormBase:
clones -> chromosomes

The genome was sequenced clone by clone, and an excellent genetic map
helped with the scaffolding.

What you can do is create a fake supercontig AGP file that simply
mirrors either the clones or the chromosomes, and so introduce an
artificial third assembly layer.
Technically the core API can handle just two layers, but there are other
bits of Ensembl code that assume a three-layer assembly.
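
As a rough illustration of the "mirror the chromosomes" variant, here is
a minimal Python sketch that writes one AGP line per chromosome, with a
single fake supercontig spanning the whole chromosome. The function
name, the "fake_sc_" prefix, the component type "W" and the lengths in
the usage part are all made up for the example; they are not something
the Ensembl loading scripts prescribe, so check the output against
whatever your loader expects.

```python
def identity_agp_lines(lengths, prefix="fake_sc_"):
    """Return AGP lines in which each object is covered by exactly one
    component of the same length (an artificial 1:1 assembly layer).

    lengths: dict mapping chromosome name -> length in bp (hypothetical input)
    prefix:  hypothetical naming scheme for the fake supercontigs
    """
    lines = []
    for name, length in lengths.items():
        if length < 1:
            raise ValueError(f"length of {name} must be positive, got {length}")
        fields = [
            name,           # object (the chromosome)
            1,              # object_beg
            length,         # object_end
            1,              # part_number: only one part per chromosome
            "W",            # component_type, an assumption for this sketch
            prefix + name,  # component_id (the fake supercontig)
            1,              # component_beg
            length,         # component_end
            "+",            # orientation
        ]
        lines.append("\t".join(str(f) for f in fields))
    return lines


if __name__ == "__main__":
    # Placeholder lengths, NOT real C. elegans chromosome sizes.
    for line in identity_agp_lines({"I": 1000, "II": 2500}):
        print(line)
```

The clone -> chromosome AGP from WormBase would then need its object
column pointed at the fake supercontig names instead of the chromosome
names, so that the three layers chain together.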

There are also a few more peculiarities of the WormBase -> EnsEMBL
conversion. For example:
Basically, use the WormBase transcripts as E! transcripts and keep the
protein names only in the xrefs, as the WB protein identifiers are
unique to a protein sequence, while the E! translation identifiers are
unique to a transcript sequence.
... and don't forget to use a different translation table for the
mitochondrion and the selenoprotein.

I would therefore recommend using a less curated genome for your first
steps in EnsEMBL. I also recommend looking into the AGP file format, as
it will come in handy for understanding the assembly table.
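
If it helps to see the format spelled out: AGP is tab-delimited, and a
component line has nine columns (object, start and end on the object,
part number, component type, component id, start and end on the
component, orientation). Gap lines (type N or U) carry gap length, gap
type and linkage in columns 6-8 instead. The object and component
coordinates correspond roughly to the asm_* and cmp_* columns of the
assembly table. Below is a minimal Python sketch of a line parser; the
function name and the dictionary keys are my own choice for the example,
not part of any Ensembl code.

```python
def parse_agp_line(line):
    """Parse one AGP line into a dict; return None for blank/comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
    record = {
        "object": cols[0],
        "object_beg": int(cols[1]),
        "object_end": int(cols[2]),
        "part_number": int(cols[3]),
        "component_type": cols[4],
    }
    if cols[4] in ("N", "U"):
        # Gap line: no component, just a gap description.
        record["gap_length"] = int(cols[5])
        record["gap_type"] = cols[6]
        record["linkage"] = cols[7]
    else:
        if len(cols) < 9:
            raise ValueError(f"component line needs 9 columns: {line!r}")
        record["component_id"] = cols[5]
        record["component_beg"] = int(cols[6])
        record["component_end"] = int(cols[7])
        record["orientation"] = cols[8]
    return record


if __name__ == "__main__":
    # Invented example line; "cloneA" is a placeholder, not a real clone.
    print(parse_agp_line("I\t1\t100\t1\tF\tcloneA\t1\t100\t+"))
```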

For example, Trichinella spiralis and Loa loa are in the INSDC and don't
have any translation exceptions in their annotation. You might still
need to fake the third AGP layer, but the rest should work smoothly.

Michael