[ensembl-dev] Problem when retrieving CDS coordinates
Andy Yates
ayates at ebi.ac.uk
Wed Jun 27 14:52:10 BST 2012
Hi Joerg,
You seem to have stumbled onto a section of the API which is a little bit unclear about how it deals with retrieval of sequences on sub-slices. When you do this you are requesting features which overlap this on the requested region but then projected to the given locations. Those -ve numbers are telling you the value is in a location prior to this slice. If you want a more natural view of these coordinates then you will have to transfer the Gene to the full length chromosome before proceeding with anything else:
foreach my $raw_gene ( @{ $slice->get_all_Genes } ) {
my $gene = $raw_gene->transfer($slice->seq_region_Slice());
}
Hope this helps
Andy
Andrew Yates Ensembl Core Software Project Leader
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensembl.org/
On 27 Jun 2012, at 13:59, Joerg Fallmann wrote:
> Dear Ensembl Dev-team,
>
> I'm using the Ensembl Perl API to get more information on the regions where I have mapped sequencing reads to.
>
> I have a local MySQL database with the Ensembl Mus_musculus_67_37 dataset and use the according Ensembl module 67.
>
> What I want to find out is wheter a read could be mapped into exonic,intronic,intergenic region of annotated genes,
> furthermore if they could be mapped into 5' or 3' UTR, CDS and so on, the more info the better.
>
> Currently I'm running into problems when I try to get the genomic coordinates of the CDS start and end.
> With this piece of code:
>
> ....
> ## I get the slice where the read could be mapped to like this:
>
> my $slice = $slice_adaptor -> fetch_by_region( 'chromosome', "$chromosome", "$readstart", "$readend","$strand");
>
> ## then I collect a list of all genes on that slice and the correct strand
>
> foreach my $gene ( @{ $slice -> get_all_Genes } ) {
>
> next unless ($gene->strand() eq $strand);
> my $genstr = $gene->display_id();
> my $start = $gene->seq_region_start();
> my $end = $gene->seq_region_end();
>
> ## Now I want a list of transcripts that overlap with my mapped read
>
> foreach my $trans ( @{$gene -> get_all_Transcripts} ){
>
> next unless ( $trans -> display_id() && $trans -> is_known); ## Not sure if this is necessary
>
> my $transtr = $trans-> display_id() ;
> my $trst = $trans->seq_region_start();
> my $tren = $trans->seq_region_end();
>
> if ($strand eq "-1"){
> $tren = $trans->seq_region_start();
> $trst = $trans->seq_region_end();
> }
>
> ## Divide into coding and non coding transcripts
>
> my ( $CDS_start,$CDS_end );
>
> if ( $trans->biotype() eq "protein_coding" ){
>
> # Get CDS start and end via coding_region_start method, should return the end of the coding region of this transcript in genomic coordinates
> # as written at: http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#a381da43072278012a2bc41244e91d340
>
> my $cst = $trans->coding_region_start;
> my $cen = $trans->coding_region_end;
>
> if ( $strand ne "-1" ){
> $CDS_start = $cst;
> $CDS_end = $cen;
> }
> else{
> $CDS_start = $cen;
> $CDS_end = $cst;
> }
>
> print "Gene:$genstr,$strand,$start,$end Transcript: $transtr,$trst,$tren CDS: $CDS_start,$CDS_end\n";
>
>
> Instead I get negative values for the CDS start and end if the transcript is on the plus strand, and in general weird numbers no matter where my transcript lies.
>
> Here a minimal example output of the above script:
>
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000092728,3557880,3495969 CDS:248746,187777
>
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000169460,3557732,3309059 CDS:248746,104550
>
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000100088,3557732,3309340 CDS:248746,65375
>
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000092734,3587948,3496683 CDS:248746,187777
>
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000056385,3587948,3496683 CDS:248746,187777
>
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000165719,3557942,3308332 CDS:248746,460
>
> Gene:ENSMUSG00000000766,-1,3308332,3588035 Transcript: ENSMUST00000105607,3587948,3495928 CDS:248746,187777
>
> Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: ENSMUST00000165510,3366076,3460176 CDS: -68110,21087
>
> Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: ENSMUST00000170680,3323356,3434998 CDS:-84985,-353
>
> Gene:ENSMUSG00000064065,1,3294063,3460745 Transcript: ENSMUST00000086896,3294063,3434998 CDS:-84985,-353
>
>
>
> Could you please help me solve this problem as I tried hard for some time now and still have no clue what's wrong here.
>
> Thanks a lot!
>
> Joerg
>
> BTW: I also tried to get the CDS start and end via the cdna_start and end method and map them back to genomic coordinates via the
> cdna2genomic method, but this didn't work for me either.
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
More information about the Dev
mailing list