[ensembl-dev] Problem when retrieving CDS coordinates

Wed Jun 27 14:52:10 BST 2012

Hi Joerg,

You seem to have stumbled onto a section of the API which is a little bit unclear about how it deals with retrieval of sequences on sub-slices. When you do this you are requesting features which overlap this on the requested region but then projected to the given locations. Those -ve numbers are telling you the value is in a location prior to this slice. If you want a more natural view of these coordinates then you will have to transfer the Gene to the full length chromosome before proceeding with anything else:

foreach my $raw_gene ( @{ $slice->get_all_Genes } ) {
  my $gene = $raw_gene->transfer($slice->seq_region_Slice());
}

Hope this helps

Andy

Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/

On 27 Jun 2012, at 13:59, Joerg Fallmann wrote:

> Dear Ensembl Dev-team,
> 
> I'm using the Ensembl Perl API to get more information on the regions where I have mapped sequencing reads to.
> 
> I have a local MySQL database with the Ensembl Mus_musculus_67_37 dataset and use the according Ensembl module 67.
> 
> What I want to find out is wheter a read could be mapped into exonic,intronic,intergenic region of annotated genes,
> furthermore if they could be mapped into 5' or 3' UTR, CDS and so on, the more info the better.
> 
> Currently I'm running into problems when I try to get the genomic coordinates of the CDS start and end.
> With this piece of code:
> 
> ....
> ## I get the slice where the read could be mapped to like this:
> 
>     my $slice = $slice_adaptor -> fetch_by_region( 'chromosome', "$chromosome", "$readstart", "$readend","$strand");
> 
> ## then I collect a list of all genes on that slice and the correct strand
> 
>    foreach my $gene ( @{ $slice -> get_all_Genes } ) {
> 
>        next unless ($gene->strand() eq $strand);
>        my $genstr = $gene->display_id();
>        my $start  = $gene->seq_region_start();
>        my $end    = $gene->seq_region_end();
> 
> ## Now I want a list of transcripts that overlap with my mapped read
> 
>        foreach my $trans ( @{$gene -> get_all_Transcripts} ){
> 
>            next unless ( $trans -> display_id() && $trans -> is_known);  ## Not sure if this is necessary
> 
>            my $transtr = $trans-> display_id() ;
>            my $trst = $trans->seq_region_start();
>            my $tren = $trans->seq_region_end();
> 
>            if ($strand eq "-1"){
>            $tren = $trans->seq_region_start();
>            $trst = $trans->seq_region_end();
>            }
> 
> ## Divide into coding and non coding transcripts
> 
>            my ( $CDS_start,$CDS_end );
> 
>            if ( $trans->biotype() eq "protein_coding" ){
> 
> # Get CDS start and end via coding_region_start method, should return the end of the coding region of this transcript in genomic coordinates
> # as written at: http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#a381da43072278012a2bc41244e91d340 
> 
>            my $cst = $trans->coding_region_start;
>            my $cen = $trans->coding_region_end;
> 
>            if ( $strand ne "-1" ){
>                $CDS_start = $cst;
>                $CDS_end = $cen;
>            }
>            else{
>                $CDS_start = $cen;
>                $CDS_end = $cst;
>            }
> 
>        print "Gene:$genstr,$strand,$start,$end Transcript: $transtr,$trst,$tren CDS: $CDS_start,$CDS_end\n";
> 
> 
> Instead I get negative values for the CDS start and end if the transcript is on the plus strand, and in general weird numbers no matter where my transcript lies.
> 
> Here a minimal example output of the above script:
> 
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000092728,3557880,3495969 CDS:248746,187777
> 
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000169460,3557732,3309059 CDS:248746,104550
> 
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000100088,3557732,3309340 CDS:248746,65375
> 
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000092734,3587948,3496683 CDS:248746,187777
> 
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000056385,3587948,3496683 CDS:248746,187777
> 
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000165719,3557942,3308332 CDS:248746,460
> 
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000105607,3587948,3495928 CDS:248746,187777
> 
> Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: ENSMUST00000165510,3366076,3460176 CDS: -68110,21087
> 
> Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: ENSMUST00000170680,3323356,3434998 CDS:-84985,-353
> 
> Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: ENSMUST00000086896,3294063,3434998 CDS:-84985,-353
> 
> 
> 
> Could you please help me solve this problem as I tried hard for some time now and still have no clue what's wrong here.
> 
> Thanks a lot!
> 
> Joerg
> 
> BTW: I also tried to get the CDS start and end via the cdna_start and end method and map them back to genomic coordinates via the
> cdna2genomic method, but this didn't work for me either.
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/