[ensembl-dev] Fwd: Ensembl API - pseudoautosomal regions

Wed Dec 7 15:08:47 GMT 2016

Dear Julia,

I have just responded to your similar query on Biostars: https://www.biostars.org/p/225802/#226037

But to directly answer your three questions:

-How come the positions are identical for some of the genes? 
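Some genes have identical coordinates because of how the PARs are mapped between X & Y. The first region below (PAR1) has the same coordinates on both chromosomes, while the second (PAR2) does not: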
Y:10001-2781479 to X:10001-2781479
and
Y:56887903-57217415 to X:155701383-156030895
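
As a rough illustration of the mapping above, converting a Y position inside a PAR to its X position is just a constant offset per region. The sketch below is plain Perl and makes no Ensembl API calls; the @PARS table and the y_to_x helper are hypothetical names, and the coordinates are the ones listed above.

```perl
use strict;
use warnings;

# PAR mapping as listed above: the Y range and the X start it maps to.
# (@PARS and y_to_x are illustrative names, not part of the Ensembl API.)
my @PARS = (
    { y_start => 10001,    y_end => 2781479,  x_start => 10001 },
    { y_start => 56887903, y_end => 57217415, x_start => 155701383 },
);

# Return the X coordinate for a Y position inside a PAR,
# or undef if the position lies outside both PARs.
sub y_to_x {
    my ($y_pos) = @_;
    for my $par (@PARS) {
        if ( $y_pos >= $par->{y_start} && $y_pos <= $par->{y_end} ) {
            return $y_pos - $par->{y_start} + $par->{x_start};
        }
    }
    return undef;
}
```

For the two examples in your message, y_to_x(2691179) gives 2691179 and y_to_x(57067813) gives 155881293, which matches the X coordinates you report.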

-Why do I get these duplicate gene entries?
You are seeing duplicates as these genes occur on both the X & Y chromosomes, in the PARs. 

-How can I prevent this? 
If you would rather see only one copy of each gene, for example the copy on the X chromosome, you can restrict your query on the Y chromosome to the regions outside the coordinates above, which map to the X.
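
One possible way to do this is to filter the Y genes after fetching them, rather than changing the query itself. This is a minimal sketch, assuming each gene has been reduced to a hash with id, chr, start and end keys (with the API these values would come from $gene->stable_id, $gene->seq_region_name, $gene->seq_region_start and $gene->seq_region_end). The names in_y_par and drop_y_par_genes are hypothetical helpers, not API methods.

```perl
use strict;
use warnings;

# Y-chromosome PAR intervals, taken from the mapping above.
my @Y_PARS = ( [ 10001, 2781479 ], [ 56887903, 57217415 ] );

# True if a gene record lies on Y and entirely within one of the PARs.
sub in_y_par {
    my ($gene) = @_;
    return 0 unless $gene->{chr} eq 'Y';
    for my $par (@Y_PARS) {
        my ( $par_start, $par_end ) = @$par;
        return 1 if $gene->{start} >= $par_start && $gene->{end} <= $par_end;
    }
    return 0;
}

# Keep every gene except the Y copies that fall inside a PAR.
sub drop_y_par_genes {
    my @genes = @_;
    return grep { !in_y_par($_) } @genes;
}

# Example using the two genes from the original message.
my @genes = (
    { id => 'ENSG00000002586', chr => 'X', start => 2691179,   end => 2741309 },
    { id => 'ENSG00000002586', chr => 'Y', start => 2691179,   end => 2741309 },
    { id => 'ENSG00000124333', chr => 'X', start => 155881293, end => 155943769 },
    { id => 'ENSG00000124333', chr => 'Y', start => 57067813,  end => 57130289 },
);
my @unique = drop_y_par_genes(@genes);
print join( "\t", @{$_}{qw(id chr start end)} ), "\n" for @unique;
```

A gene that extends beyond a PAR into Y-specific sequence is deliberately kept by this check, since it is not a pure duplicate of an X gene.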

Let me know if you have further questions.

Best wishes,

Helen

Helen Sparrow
Ensembl Outreach Officer
European Bioinformatics Institute (EMBL-EBI) 
Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SD
UK

> On 7 Dec 2016, at 13:33, Julia Söllner <julia.f.soellner at gmail.com> wrote:
> 
> Dear Ensembl developers,
> 
> I access Ensembl data via the Perl API and retrieve information on genes, transcripts etc. I have noticed that when I get data from the database's gene table, some genes occur twice: once on the X and once on the Y chromosome. This affects 45 human genes; for 34 of the 45, the start and end positions on X and Y are identical.
> 
> Two examples:
> 
> geneID	biotype	chromosome	start	end
> ENSG00000002586	protein_coding	X	2691179	2741309
> ENSG00000002586	protein_coding	Y	2691179	2741309
> ENSG00000124333	protein_coding	X	155881293	155943769
> ENSG00000124333	protein_coding	Y	57067813	57130289
> 
> When querying some of these genes via the Ensembl website it turned out that they are mapped to pseudoautosomal regions (identical sequence on X and Y).
> 
> 
> Some more information on how I retrieve the data:
> 
> I use the API version 86.
> 
> To speed things up I iterate over chromosomes in parallel and retrieve all genes as follows:
> 
> my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr_name);
> my @genes = @{ $slice->get_all_Genes() };
> 
> So ENSG00000124333 is in @genes both when querying X and when querying Y. If, however, I fetch the gene directly, I only get the X chromosome:
> 
> my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
> 
> my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000124333');
> 
> print $gene->seq_region_name(); # => X
> 
> The post at http://lists.ensembl.org/pipermail/dev/2010-October/000214.html says that a gene might exceed a pseudoautosomal region and thus extend into a region unique to the Y chromosome. This could explain why a gene shows up for both X and Y. However, I checked this and there is no overlap between the unique regions of Y and the gene coordinates. PAR coordinates from http://www.ensembl.org/info/genome/genebuild/assembly.html were used.
> 
> Questions
> 
> How come the positions are identical for some of the genes?
> Why do I get these duplicate gene entries?
> How can I prevent this?
> 
> Thanks in advance and kind regards,
> Julia 
> 
