[ensembl-dev] retrieve conservation score of all human exons

Javier Herrero (TGAC) Javier.Herrero at tgac.ac.uk
Thu Jan 16 11:17:52 GMT 2014


Dear Julien

To optimise the display of scores, these are stored at different levels of detail, say for each 1-bp, 10-bp, 50-bp, 100-bp and 500-bp window. (Note: These are not necessarily the actual window sizes). With the display_size option, you are telling the API how many scores you need to get back. For instance, if the image is 1200 pixels in width, it doesn’t make sense to return 1,000,000 scores for a 1Mb slice. The API takes care of choosing which window size is more appropriate taking into account the slice and display sizes. In your case, I would recommend to get a display_size equal to the slice size to get precise scores and average them as you seem fit.

I think you could do faster with your algorithm if you would chunk the genome in 1 or 5 Mb slices, get the scores and the exons for that slice and treat the results as you seem fit. Note that the full list of exons will include coding and non-coding exons from all sorts of genes (including pseudogenes, ncRNA, etc). I suspect you may want to restrict your analysis a to a given type of exons. To do that, use the sliding slice approach, get the genes you are interested in (make sure the biotype is what you expect), get the corresponding transcripts (again, you may want to consider filtering untranslated transcripts) and get the exons for those transcripts. I should note that the exons can overlap entirely (in that case they will have the same stable_id) or partially (they would have different stable_id).

I hope this helps

Javier

On 16 Jan 2014, at 08:03, Julien Roux <julien.roux at unil.ch<mailto:julien.roux at unil.ch>> wrote:

Dear list,
I have been trying to extract the mean GERP conservation score for all human exons using the Ensembl API.
My script is working properly, however it is quite slow (only ~1000 exons per hour) and I am wondering if there would be a way to speed it up.
I have tried to set "display_size" to 1, but t doesn't seem to give yield the exact mean score (because it is an average of precalculated averages I guess).
I am putting the script below.
Thanks for your help
Julien

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Exception qw(throw);

# This script is inspired by the example script at http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-compara/scripts/examples/dna_getConservationScores.pl?root=ensembl&view=markup

my $reg = "Bio::EnsEMBL::Registry";
my $species = "Homo sapiens";
$reg->load_registry_from_db(
     -host => "ensembldb.ensembl.org<http://ensembldb.ensembl.org>",
     -user => "anonymous",
);

#get slice adaptor for $species
my $slice_adaptor = $reg->get_adaptor($species, 'core', 'Slice');

#get exon adaptor for $species
my $exon_adaptor = $reg->get_adaptor($species, 'Core', 'Exon');

#get method_link_species_set adaptor
my $mlss_adaptor = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");

#get method_link_species_set object for gerp conservation scores for mammals
my $mlss = $mlss_adaptor->fetch_by_method_link_type_species_set_name("GERP_CONSERVATION_SCORE", "mammals");

#get conservation score adaptor
my $cs_adaptor = $reg->get_adaptor("Multi", 'compara', 'ConservationScore');

# fetch all exon IDs
my @exon_ids = @{$exon_adaptor->list_stable_ids()};

foreach my $exon (@exon_ids){
 print $exon, "\n";
 my $slice = $slice_adaptor->fetch_by_exon_stable_id($exon);

 #get one score per base in the slice
 my $display_size = $slice->end - $slice->start + 1;
 my $scores = $cs_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice, $display_size);

 # average the scores in the slice
 my $sum = 0;
 my $count = 0;
 my $diff_score;
 foreach my $score (@$scores) {
   if (defined $score->diff_score) {
     $sum += $score->diff_score;
     $count++;
   }
 }
 if ($count > 0){
   $diff_score = $sum / $count;
 }
 if (defined $diff_score){
   print "$exon\t$diff_score\n";
 }
 else {
   print "$exon\tNA\n";
 }
}
close OUT;
exit;

--
Julien Roux, PhD
Gilad lab, Department of Human Genetics, University of Chicago
http://giladlab.uchicago.edu/
920 East 58th Street, CLSC 317, Chicago, IL 60637, USA
tel: +1-773-834-1984   fax: +1-773-834-8470


--
Julien Roux, PhD
Gilad lab, Department of Human Genetics, University of Chicago
http://giladlab.uchicago.edu/
920 East 58th Street, CLSC 317, Chicago, IL 60637, USA
tel: +1-773-834-1984   fax: +1-773-834-8470

_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

--
Javier Herrero, PhD
Comparative Genomics Project Leader
TGAC, Norwich Research Park
Norwich, NR4 7UH, UK

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20140116/8ee7f1d1/attachment.html>


More information about the Dev mailing list