[ensembl-dev] Human transcript choice for Compara gene trees

Fri Feb 11 16:20:15 GMT 2011

Hi all,

I'm having trouble identifying the cause of a strange inconsistency in the
choice of human transcript for Compara's gene trees from release to release.
Maybe someone from the Compara team or elsewhere can shed some light.

Looking at the gene tree image for the human gene WDFY3 (
http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000163625;r=4:85590704-85887544),
it's obvious that the human sequence (along with cow) is strangely truncated
in comparison to the rest of the well-aligned and complete homologs in the
gene tree. This is understandably common in non-model organisms and
low-quality genomes, but surprising to see in human.

In the prior release (release 60, November 2010, here:
http://nov2010.archive.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000163625;r=4:85590704-85887544),
human is in the expected place in the tree and has a complete transcript!

It seems that a very short transcript (protein ID ENSP00000422256) was
chosen to be included in the Compara pipeline for release 61. This
transcript is neither the longest nor the CCDS transcript for the gene,
which were the criteria I thought were being used to choose transcripts for
Compara's pipeline.
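
To make that expectation concrete, here is a minimal Python sketch of the
selection rule as I had assumed it. It is not Compara's actual code: the
Transcript record, the pick_representative function, the placeholder IDs and
the lengths are all made up for illustration, and the ordering (CCDS first,
then longest translation) is itself an assumption.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    protein_id: str  # placeholder label, not a real Ensembl identifier
    length: int      # translation length in amino acids (made-up value)
    is_ccds: bool    # whether the transcript has a CCDS annotation


def pick_representative(transcripts):
    """Assumed rule: prefer CCDS transcripts, then the longest translation."""
    if not transcripts:
        raise ValueError("gene has no protein-coding transcripts")
    # Tuples compare element by element, so is_ccds dominates and
    # length only breaks ties among transcripts with the same CCDS status.
    return max(transcripts, key=lambda t: (t.is_ccds, t.length))


# Hypothetical candidates for a single gene.
candidates = [
    Transcript("P_short", 120, False),
    Transcript("P_long", 3500, False),
    Transcript("P_ccds", 3400, True),
]

chosen = pick_representative(candidates)
print(chosen.protein_id)
```

Under either criterion a very short, non-CCDS translation like "P_short"
would never be picked, which is why the release 61 choice looks surprising.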

Have there been any recent changes to the Compara pipeline that might have
caused this? Is this problem more widespread, or limited to isolated cases?

I'm fine using the old November 2010 (release 60) archive for now, but it would
give me more confidence in the pipeline if there weren't such drastic
changes in relatively well-behaved gene families and alignments (such as
this example) from one release to the next.

Cheers,
 Greg