[ensembl-dev] Problem when retrieving CDS coordinates

Joerg Fallmann fall at tbi.univie.ac.at
Wed Jun 27 13:59:17 BST 2012


Dear Ensembl Dev-team,

I'm using the Ensembl Perl API to get more information on the regions 
where I have mapped sequencing reads to.

I have a local MySQL database with the Ensembl Mus_musculus_67_37 
dataset and use the according Ensembl module 67.

What I want to find out is wheter a read could be mapped into 
exonic,intronic,intergenic region of annotated genes,
furthermore if they could be mapped into 5' or 3' UTR, CDS and so on, 
the more info the better.

Currently I'm running into problems when I try to get the genomic 
coordinates of the CDS start and end.
With this piece of code:

....
## I get the slice where the read could be mapped to like this:

      my $slice = $slice_adaptor -> fetch_by_region( 'chromosome', 
"$chromosome", "$readstart", "$readend","$strand");

## then I collect a list of all genes on that slice and the correct strand

     foreach my $gene ( @{ $slice -> get_all_Genes } ) {

         next unless ($gene->strand() eq $strand);
         my $genstr = $gene->display_id();
         my $start  = $gene->seq_region_start();
         my $end    = $gene->seq_region_end();

## Now I want a list of transcripts that overlap with my mapped read

         foreach my $trans ( @{$gene -> get_all_Transcripts} ){

             next unless ( $trans -> display_id() && $trans -> 
is_known);  ## Not sure if this is necessary

             my $transtr = $trans-> display_id() ;
             my $trst = $trans->seq_region_start();
             my $tren = $trans->seq_region_end();

             if ($strand eq "-1"){
             $tren = $trans->seq_region_start();
             $trst = $trans->seq_region_end();
             }

## Divide into coding and non coding transcripts

             my ( $CDS_start,$CDS_end );

             if ( $trans->biotype() eq "protein_coding" ){

# Get CDS start and end via coding_region_start method, should return 
the end of the coding region of this transcript in genomic coordinates
# as written at: 
http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#a381da43072278012a2bc41244e91d340 


             my $cst = $trans->coding_region_start;
             my $cen = $trans->coding_region_end;

             if ( $strand ne "-1" ){
                 $CDS_start = $cst;
                 $CDS_end = $cen;
             }
             else{
                 $CDS_start = $cen;
                 $CDS_end = $cst;
             }

         print "Gene:$genstr,$strand,$start,$end Transcript: 
$transtr,$trst,$tren CDS: $CDS_start,$CDS_end\n";


Instead I get negative values for the CDS start and end if the 
transcript is on the plus strand, and in general weird numbers no matter 
where my transcript lies.

Here a minimal example output of the above script:

  Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: 
ENSMUST00000092728,3557880,3495969 CDS:248746,187777

  Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: 
ENSMUST00000169460,3557732,3309059 CDS:248746,104550

  Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: 
ENSMUST00000100088,3557732,3309340 CDS:248746,65375

  Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: 
ENSMUST00000092734,3587948,3496683 CDS:248746,187777

  Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: 
ENSMUST00000056385,3587948,3496683 CDS:248746,187777

  Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: 
ENSMUST00000165719,3557942,3308332 CDS:248746,460

  Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: 
ENSMUST00000105607,3587948,3495928 CDS:248746,187777

  Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: 
ENSMUST00000165510,3366076,3460176 CDS: -68110,21087

  Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: 
ENSMUST00000170680,3323356,3434998 CDS:-84985,-353

  Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: 
ENSMUST00000086896,3294063,3434998 CDS:-84985,-353



Could you please help me solve this problem as I tried hard for some 
time now and still have no clue what's wrong here.

Thanks a lot!

Joerg

BTW: I also tried to get the CDS start and end via the cdna_start and 
end method and map them back to genomic coordinates via the
cdna2genomic method, but this didn't work for me either.




More information about the Dev mailing list