[ensembl-dev] Problem when retrieving CDS coordinates
Joerg Fallmann
fall at tbi.univie.ac.at
Wed Jun 27 13:59:17 BST 2012
Dear Ensembl Dev-team,
I'm using the Ensembl Perl API to get more information on the regions
where I have mapped sequencing reads to.
I have a local MySQL database with the Ensembl Mus_musculus_67_37
dataset and use the according Ensembl module 67.
What I want to find out is wheter a read could be mapped into
exonic,intronic,intergenic region of annotated genes,
furthermore if they could be mapped into 5' or 3' UTR, CDS and so on,
the more info the better.
Currently I'm running into problems when I try to get the genomic
coordinates of the CDS start and end.
With this piece of code:
....
## I get the slice where the read could be mapped to like this:
my $slice = $slice_adaptor -> fetch_by_region( 'chromosome',
"$chromosome", "$readstart", "$readend","$strand");
## then I collect a list of all genes on that slice and the correct strand
foreach my $gene ( @{ $slice -> get_all_Genes } ) {
next unless ($gene->strand() eq $strand);
my $genstr = $gene->display_id();
my $start = $gene->seq_region_start();
my $end = $gene->seq_region_end();
## Now I want a list of transcripts that overlap with my mapped read
foreach my $trans ( @{$gene -> get_all_Transcripts} ){
next unless ( $trans -> display_id() && $trans ->
is_known); ## Not sure if this is necessary
my $transtr = $trans-> display_id() ;
my $trst = $trans->seq_region_start();
my $tren = $trans->seq_region_end();
if ($strand eq "-1"){
$tren = $trans->seq_region_start();
$trst = $trans->seq_region_end();
}
## Divide into coding and non coding transcripts
my ( $CDS_start,$CDS_end );
if ( $trans->biotype() eq "protein_coding" ){
# Get CDS start and end via coding_region_start method, should return
the end of the coding region of this transcript in genomic coordinates
# as written at:
http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#a381da43072278012a2bc41244e91d340
my $cst = $trans->coding_region_start;
my $cen = $trans->coding_region_end;
if ( $strand ne "-1" ){
$CDS_start = $cst;
$CDS_end = $cen;
}
else{
$CDS_start = $cen;
$CDS_end = $cst;
}
print "Gene:$genstr,$strand,$start,$end Transcript:
$transtr,$trst,$tren CDS: $CDS_start,$CDS_end\n";
Instead I get negative values for the CDS start and end if the
transcript is on the plus strand, and in general weird numbers no matter
where my transcript lies.
Here a minimal example output of the above script:
Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript:
ENSMUST00000092728,3557880,3495969 CDS:248746,187777
Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript:
ENSMUST00000169460,3557732,3309059 CDS:248746,104550
Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript:
ENSMUST00000100088,3557732,3309340 CDS:248746,65375
Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript:
ENSMUST00000092734,3587948,3496683 CDS:248746,187777
Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript:
ENSMUST00000056385,3587948,3496683 CDS:248746,187777
Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript:
ENSMUST00000165719,3557942,3308332 CDS:248746,460
Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript:
ENSMUST00000105607,3587948,3495928 CDS:248746,187777
Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript:
ENSMUST00000165510,3366076,3460176 CDS: -68110,21087
Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript:
ENSMUST00000170680,3323356,3434998 CDS:-84985,-353
Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript:
ENSMUST00000086896,3294063,3434998 CDS:-84985,-353
Could you please help me solve this problem as I tried hard for some
time now and still have no clue what's wrong here.
Thanks a lot!
Joerg
BTW: I also tried to get the CDS start and end via the cdna_start and
end method and map them back to genomic coordinates via the
cdna2genomic method, but this didn't work for me either.
More information about the Dev
mailing list