[ensembl-dev] Pipeline: Loading genome into DB - Question on input files
Marc Hoeppner
mphoeppner at gmail.com
Thu Feb 7 07:55:52 GMT 2013
Hi EnsEMBL team,
I am currently on a quest to learn the gene annotation pipeline - and am
already stuck.
In the examples I could find, it is assumed that the genome I want to
load into a fresh DB is provided in 3 files:
Chromosome-to-scaffold relationship (AGP)
Scaffold-to-contig relationship (AGP)
Contig sequences (FASTA)
Using the load_seq_region.pl script, these files are parsed and used to
populate the DB, assigning ranks to seq_regions, reconstructing the
sequence-level to chromosome-level relationships and so on.
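For reference, this is how I currently understand the AGP layout: each
line is tab-separated, with the object (e.g. chromosome), its start and
end, a part number and a component type, followed either by a component
(id, start, end, orientation) or by a gap description. Below is a rough
Python sketch of reading one such line. The names AgpRow and
parse_agp_line are my own and are not taken from the Ensembl scripts, and
the example IDs are made up, so please treat it as an illustration only.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    """One AGP line (hypothetical helper type, not part of Ensembl)."""
    obj: str                      # object being assembled, e.g. a chromosome
    obj_start: int                # 1-based start on the object
    obj_end: int                  # 1-based inclusive end on the object
    part_number: int
    component_type: str           # e.g. W (WGS contig), N or U (gap)
    component_id: Optional[str]   # None for gap lines
    comp_start: Optional[int]
    comp_end: Optional[int]
    orientation: Optional[str]


def parse_agp_line(line: str) -> Optional[AgpRow]:
    """Parse a single AGP line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    obj, obj_start, obj_end, part, ctype = fields[:5]
    if ctype in ("N", "U"):
        # Gap line: the remaining columns describe the gap, not a component.
        return AgpRow(obj, int(obj_start), int(obj_end), int(part), ctype,
                      None, None, None, None)
    if len(fields) < 9:
        raise ValueError("component lines need 9 columns")
    return AgpRow(obj, int(obj_start), int(obj_end), int(part), ctype,
                  fields[5], int(fields[6]), int(fields[7]), fields[8])


# Made-up example line: a chromosome built from a single scaffold.
row = parse_agp_line("chrI\t1\t1000\t1\tW\tscaffold_1\t1\t1000\t+")
```

If that picture is right, a chromosome-to-scaffold AGP and a
scaffold-to-contig AGP would simply be two files of this shape at
different levels.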
However, since I decided to use a fairly mature genome for testing
purposes (C. elegans), it seems non-trivial to actually get hold of
anything other than the full assembly (like the files at
ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/assemblies/).
I assume if I just load the fasta file without
chromosome-scaffold-contig relationships, stuff is going to break down
the line.
I would really appreciate it if someone could give me a hint on how to
deal with that issue. Do I have to dig a little deeper to find the
original assembly details, or is there a way to load the full assembly
'as is'? (I will also admit that I haven't really worked with AGP files
before, so maybe that is the problem...)
Cheers,
Marc