[ensembl-dev] Fwd: Ensembl API - pseudoautosomal regions
Helen Sparrow
hsparrow at ebi.ac.uk
Wed Dec 7 15:08:47 GMT 2016
Dear Julia,
I have just responded to your similar query on Biostars: https://www.biostars.org/p/225802/#226037
But to directly answer your three questions:
-How come the positions are identical for some of the genes?
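The coordinates are identical for genes in the first pseudoautosomal region (PAR1), which sits at the same position on both chromosomes; genes in the second region (PAR2) have different coordinates on X and Y. This is how the PARs are mapped between X & Y: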
Y:10001-2781479 to X:10001-2781479
and
Y:56887903-57217415 to X:155701383-156030895
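As a rough illustration of that mapping, the two ranges above can be turned into a small coordinate converter. This is only a sketch: the names @PAR_MAP and y_to_x are hypothetical and not part of the Ensembl API, and the only data used are the ranges quoted above.

```perl
use strict;
use warnings;

# Each entry: [ Y start, Y end, X start ], copied from the two ranges above.
my @PAR_MAP = (
    [ 10_001,     2_781_479,  10_001 ],
    [ 56_887_903, 57_217_415, 155_701_383 ],
);

# Hypothetical helper: convert a Y position inside a PAR to its X position.
# Returns undef for positions outside both PARs.
sub y_to_x {
    my ($pos) = @_;
    for my $par (@PAR_MAP) {
        my ( $y_start, $y_end, $x_start ) = @{$par};
        return $pos - $y_start + $x_start
            if $pos >= $y_start && $pos <= $y_end;
    }
    return;
}

# Start of ENSG00000124333 on Y, as reported in the original query.
print y_to_x(57_067_813), "\n";
```

For the second range the X position is simply the Y position shifted by a constant, which is why the Y and X coordinates reported for ENSG00000124333 differ while those for ENSG00000002586 do not.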
-Why do I get these duplicate gene entries?
You are seeing duplicates as these genes occur on both the X & Y chromosomes, in the PARs.
-How can I prevent this?
If you would rather see only one copy of each of these genes, for example the copy on the X chromosome, you can restrict your Y chromosome query to the regions outside the coordinates that map to the X.
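A minimal sketch of one way to do this after retrieval, assuming the genes have already been reduced to plain records with chr, start and end fields (that record layout and the helper names in_y_par and drop_y_par_copies are hypothetical, chosen only for illustration). It drops every Y chromosome entry that lies entirely inside one of the PAR ranges listed above:

```perl
use strict;
use warnings;

# PAR coordinates on Y, taken from the mapping listed above.
my @Y_PARS = ( [ 10_001, 2_781_479 ], [ 56_887_903, 57_217_415 ] );

# True if a Y chromosome feature lies entirely inside one of the PARs.
sub in_y_par {
    my ( $start, $end ) = @_;
    for my $par (@Y_PARS) {
        return 1 if $start >= $par->[0] && $end <= $par->[1];
    }
    return 0;
}

# Hypothetical helper: keep all genes except Y copies located in a PAR.
sub drop_y_par_copies {
    my @genes = @_;
    return grep { !( $_->{chr} eq 'Y' && in_y_par( $_->{start}, $_->{end} ) ) } @genes;
}

# The two example genes from the original query.
my @genes = (
    { id => 'ENSG00000002586', chr => 'X', start => 2_691_179,   end => 2_741_309 },
    { id => 'ENSG00000002586', chr => 'Y', start => 2_691_179,   end => 2_741_309 },
    { id => 'ENSG00000124333', chr => 'X', start => 155_881_293, end => 155_943_769 },
    { id => 'ENSG00000124333', chr => 'Y', start => 57_067_813,  end => 57_130_289 },
);

my @unique = drop_y_par_copies(@genes);
print join( "\t", @{$_}{qw(id chr start end)} ), "\n" for @unique;
```

Alternatively, since fetch_by_region also accepts start and end arguments after the region name, the Y chromosome slice itself could be limited to the stretch between the two PARs; whether that suits your pipeline depends on how you handle genes at the slice boundaries.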
Let me know if you have further questions.
Best wishes,
Helen
_ _
Helen Sparrow
Ensembl Outreach Officer
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SD
UK
> On 7 Dec 2016, at 13:33, Julia Söllner <julia.f.soellner at gmail.com> wrote:
>
> Dear Ensembl developers,
>
> I access Ensembl data via the Perl API and retrieve information on genes, transcripts etc. I have noticed that when I get data from the database's gene table, some genes occur twice: once on the X and once on the Y chromosome. This affects 45 human genes; for 34 of the 45, the start and end positions on X and Y are identical.
>
> Two examples:
>
> geneID biotype chromosome start end
> ENSG00000002586 protein_coding X 2691179 2741309
> ENSG00000002586 protein_coding Y 2691179 2741309
> ENSG00000124333 protein_coding X 155881293 155943769
> ENSG00000124333 protein_coding Y 57067813 57130289
>
> When querying some of these genes via the Ensembl website it turned out that they are mapped to pseudoautosomal regions (identical sequence on X and Y).
>
>
> Some more information on how I retrieve the data:
>
> I use the API version 86.
>
> To speed things up I iterate over chromosomes in parallel and retrieve all genes as follows:
>
> my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr_name);
> my @genes = @{ $slice->get_all_Genes() };
> So basically ENSG00000124333 is in @genes when querying information on X and when querying information on Y. If I, however, go via the gene I only get the X chromosome:
>
> my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
>
> my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000124333');
>
> print $gene->seq_region_name(); # => X
> On http://lists.ensembl.org/pipermail/dev/2010-October/000214.html they say that a gene might exceed a pseudoautosomal region and thus extend into a region unique to the Y chromosome. This could be a reason why a gene shows up for X and Y. However, I checked this and there is no overlap between unique regions of Y and the gene coordinates. PAR coordinates from http://www.ensembl.org/info/genome/genebuild/assembly.html were used.
>
> Questions
>
> How come the positions are identical for some of the genes?
> Why do I get these duplicate gene entries?
> How can I prevent this?
>
> Thanks in advance and kind regards,
> Julia
>