[ensembl-dev] question regarding refseq exons retrieval

Kieron Taylor ktaylor at ebi.ac.uk
Tue Mar 10 15:22:00 GMT 2015


Dear Duarte,

The issue you have exposed is subtle. You seem to be printing “exon stable IDs” but expecting them to be RefSeq accessions. Our mistake was to use the RefSeq IDs as arbitrary identifiers for internal use, but I must stress that what Ensembl calls a Stable ID must never be assumed to have any meaning outside of an Ensembl database. What you want are display labels. The exon labels were generated by picking only the first of any possible RefSeq IDs, so an exon shared by several transcripts carries the ID of just one of them. Hence you cannot get everything you want in this way.

The correct way to handle this in your code is to loop over the transcripts, fetch each transcript's name and print that alongside each of its exons, as RefSeq IDs refer to transcripts and not exons.
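
As a rough illustration of that logic only (not Ensembl API code), here is a small Python sketch. It assumes the transcripts and their exons have already been fetched into a plain dictionary; the Exon type, the TRANSCRIPTS table and the exon_rows function are hypothetical names made up for this example, and only the three exons reported as shared in this thread are included. The label is built from the transcript name plus the exon's rank within that transcript, so a shared exon is reported once per transcript.

```python
from typing import Dict, List, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical minimal exon record (not an Ensembl API object)."""
    chrom: str
    start: int
    end: int
    strand: str


# The three exons that the thread says are shared by both transcripts.
# The remaining exons of each transcript are deliberately left out.
SHARED_EXONS: List[Exon] = [
    Exon("chr20", 30946147, 30946635, "+"),
    Exon("chr20", 30954187, 30954269, "+"),
    Exon("chr20", 30955530, 30955532, "+"),
]

# Assumed stand-in for "all transcripts of the gene, with their exons".
TRANSCRIPTS: Dict[str, List[Exon]] = {
    "NM_001164603.1": SHARED_EXONS,
    "NM_015338.5": SHARED_EXONS,
}


def exon_rows(query: str, transcripts: Dict[str, List[Exon]]) -> List[Tuple]:
    """Return one row per (transcript, exon) pair.

    The id column is the transcript name plus the exon's rank in that
    transcript, rather than the exon's own stable ID.
    """
    rows = []
    for transcript_name, exons in transcripts.items():
        for rank, exon in enumerate(exons, start=1):
            label = f"{transcript_name}.{rank}"
            rows.append((query, "Exon", label, exon.chrom,
                         exon.start, exon.end, exon.strand))
    return rows


if __name__ == "__main__":
    for row in exon_rows("ASXL1", TRANSCRIPTS):
        print("\t".join(str(field) for field in row))
```

Because the label comes from the transcript and not from the exon, the first three exons appear under both NM_001164603.1 and NM_015338.5.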


Regards,

Kieron


Kieron Taylor PhD.
Ensembl Core senior software developer

EMBL, European Bioinformatics Institute





> On 10 Mar 2015, at 11:57, Duarte Molha <duartemolha at gmail.com> wrote:
> 
> Dear developers 
> 
> I have written a script (attached) that gets me the RefSeq exons for a given input gene.
> 
> However, when I run this code with the gene ASXL1 as an example, the output is:
> 
> test_query.pl ASXL1
> 
> QueryName	feature_type	common_name	Biotype	id	chr	start	end	strand
> ASXL1	Exon	ASXL1	protein_coding	NM_001164603.1.1	chr20	30946147	30946635	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_001164603.1.2	chr20	30954187	30954269	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_001164603.1.3	chr20	30955530	30955532	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_001164603.1.4	chr20	30956818	30956926	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.5	chr20	31015931	31016051	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.6	chr20	31016128	31016225	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.7	chr20	31017141	31017234	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.8	chr20	31017704	31017856	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.9	chr20	31019124	31019287	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.10	chr20	31019386	31019482	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.11	chr20	31020683	31020788	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.12	chr20	31021087	31021720	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.13	chr20	31022235	31027122	+	
> 
> 
> As you can see, I am missing some of the exons for transcript NM_015338.5.
> In this case, the first three exons of transcript NM_015338.5 are identical to those of NM_001164603.1, but I would expect to have them listed as:
> 
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.1	chr20	30946147	30946635	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.2	chr20	30954187	30954269	+	
> ASXL1	Exon	ASXL1	protein_coding	NM_015338.5.3	chr20	30955530	30955532	+
> 
> Can you tell me what is wrong with my approach and how I can retrieve the missing data?
> 
> Best regards
> 
> Duarte
> <test_query.pl>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/




