[ensembl-dev] Pipeline: Loading genome into DB - Question on input files

Thu Feb 7 07:55:52 GMT 2013

Hi EnsEMBL team,

I am currently on a quest to learn the gene annotation pipeline - and am 
already stuck.

In the examples I could find, it is assumed that the genome I want to 
load into a fresh DB is provided in 3 files:

Chromosome-to-scaffold relationship (AGP)
Scaffold-to-contig relationship (AGP)
And a contig fasta file

Using the load_seq_region.pl script, these files are parsed and used to 
populate the DB, assigning ranks to seq_regions, reconstructing the 
sequence-level to chromosome-level relationships and so on.
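
As far as I understand the AGP format, each tab-separated row places one 
component (e.g. a contig) at a coordinate range on a larger object (e.g. 
a chromosome), or marks a gap. Below is a minimal Python sketch of 
reading such a row. It is not the pipeline's own parser; the column 
layout is assumed from the public AGP specification, and the names 
"chrI" and "contig_1" are made up for illustration.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    """One AGP row: a span on a parent object and what fills it."""
    obj: str                     # parent sequence, e.g. a chromosome
    obj_start: int               # 1-based start on the parent
    obj_end: int                 # 1-based inclusive end on the parent
    part_number: int             # running index of the part within the parent
    component_type: str          # e.g. "W" for WGS contig, "N"/"U" for gaps
    component_id: Optional[str]  # child sequence name; None for gap rows
    comp_start: Optional[int]    # start on the child; None for gap rows
    comp_end: Optional[int]      # end on the child; None for gap rows
    orientation: Optional[str]   # "+", "-", ...; None for gap rows


GAP_TYPES = {"N", "U"}  # component types that describe gaps, not sequence


def parse_agp_line(line: str) -> Optional[AgpRow]:
    """Parse one AGP line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 5:
        raise ValueError(f"too few columns in AGP line: {line!r}")
    obj, ctype = cols[0], cols[4]
    start, end, part = int(cols[1]), int(cols[2]), int(cols[3])
    if ctype in GAP_TYPES:
        # Gap rows carry gap length/type instead of a component.
        return AgpRow(obj, start, end, part, ctype, None, None, None, None)
    if len(cols) < 9:
        raise ValueError(f"component line needs 9 columns: {line!r}")
    return AgpRow(obj, start, end, part, ctype,
                  cols[5], int(cols[6]), int(cols[7]), cols[8])


# Hypothetical row: bases 1-5000 of chrI come from contig_1, forward strand.
example = "chrI\t1\t5000\t1\tW\tcontig_1\t1\t5000\t+"
row = parse_agp_line(example)
print(row.obj, "<-", row.component_id)
```

Read this way, the chromosome-to-scaffold and scaffold-to-contig files 
are two layers of the same kind of mapping.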

However, since I decided to use a fairly mature genome for testing 
purposes (C. elegans), it seems non-trivial to actually get hold of 
anything other than the full assembly (such as the files under 
ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/assemblies/).

I assume that if I just load the fasta file without the 
chromosome-scaffold-contig relationships, later steps of the pipeline 
are going to break.

I would really appreciate it if someone could give me a hint on how to 
deal with this issue. Do I have to dig a little deeper to find the 
original assembly details, or is there a way to load the full assembly 
'as is'? (I will also admit that I haven't really worked with AGP files 
before, so maybe that is the problem...)

Cheers,

Marc