[ensembl-dev] compara blastz coordinates

Javier Herrero jherrero at ebi.ac.uk
Wed Apr 11 11:56:09 BST 2012


Hi Elena

The term "reference sequence" can be confusing. It can refer either to 
the species that was used as the reference when running the netting 
algorithm or to your query sequence when you retrieve the alignments.

As Stephen showed, we tend to store the reference sequence (the one 
used in the netting step) on the positive strand. For instance, in the 
human-mouse alignments, all the human aligned sequences are on the 
positive strand in the database.

Now, if you retrieve the alignments using the other genome as your 
query, you will find that some of the sequences are on the negative 
strand. This doesn't really matter, as the script you are using 
reverse-complements the whole alignment to give it to you on the 
positive strand. Our data model does not really assume any reference 
sequence anyway, and the fact that we store the netting reference 
sequence on the positive strand is only incidental.
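
To illustrate what reverse-complementing a whole alignment means, here 
is a minimal Python sketch. It is not the Ensembl API: the function 
names and the (name, strand, aligned_sequence) tuple layout are 
assumptions made up for this example. Every gapped row is 
reverse-complemented and every strand is flipped, so the alignment 
columns still line up afterwards.

```python
# Complement table for DNA letters; gap characters ('-') are not in the
# table, so str.translate leaves them untouched.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement_gapped(seq):
    """Reverse-complement an aligned (gapped) sequence string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_alignment(rows):
    """Reverse-complement every row of an alignment block.

    rows is a list of (name, strand, aligned_seq) tuples, where strand
    is 1 or -1 (hypothetical layout, chosen for this sketch only).
    """
    return [
        (name, -strand, reverse_complement_gapped(seq))
        for name, strand, seq in rows
    ]
```

For example, a block with the query on strand -1 and the other species 
on strand 1 comes back with the query on strand 1 and the other species 
on strand -1. Applying flip_alignment twice returns the original block.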

When the sequence is on the reverse strand, the coordinates are given 
with respect to the end of the chromosome. This is required by the axt 
format and by many other formats commonly used by the UCSC genome 
browser. In Ensembl, we always refer to the start of the chromosome (on 
the positive strand). UCSC has a wiki page that explains how to 
transform coordinates to their system: 
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
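
As a rough illustration of the conversion, the sketch below maps a 
reverse-strand interval counted from the end of the chromosome onto the 
positive strand. It assumes 1-based, inclusive coordinates (as in axt) 
and that you supply the chromosome length yourself; the function name 
is hypothetical and is not part of any Ensembl or UCSC tool.

```python
def reverse_to_forward(start, end, chrom_length):
    """Convert a 1-based inclusive interval given on the reverse strand
    (counted from the end of the chromosome) to positive-strand
    coordinates counted from the start of the chromosome.
    """
    if not (1 <= start <= end <= chrom_length):
        raise ValueError("expected 1 <= start <= end <= chrom_length")
    # Position p counted from the far end is position L - p + 1 counted
    # from the start, so the interval endpoints swap roles.
    return chrom_length - end + 1, chrom_length - start + 1
```

The same function also converts in the other direction, since applying 
it twice gives back the original interval.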

PS: It is worth mentioning that you are using a script that hasn't been 
modified in 6 years and still works. This is one clear example of how 
the Ensembl API can make your life easier: write your script once and 
run it for years by simply updating the API version! :-)

I hope this helps

Javier

On 11/04/12 11:23, Elena Grassi wrote:
> On Wed, Apr 4, 2012 at 3:05 PM, Stephen Fitzgerald<stephenf at ebi.ac.uk>  wrote:
>> Hi Elena,
>> the reference genome is always positive(1) but the non-reference can be
>> positive or negative(-1):
> Thank you for your reply, I did know that the non-reference sequence
> can be on positive or negative strand, what I'm not sure about is what
> kind of coordinates are given in the latter case because evidence and
> documentation seem to tell different things.
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>

-- 
Javier Herrero, PhD
Ensembl Coordinator and Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK




