[ensembl-dev] Human transcript choice for Compara gene trees

Fri Feb 18 17:33:17 GMT 2011

Hi Greg

Thank you for reporting this. We have been investigating this issue in detail. 
Indeed, there has been a glitch in the selection of the canonical transcript. 
As you mentioned, the choice depends on various factors, including the CCDS. 
It would appear that the script used to define the canonical transcript was run 
before the information on CCDS was uploaded in the database. This has created 
a few oddities in the selection of the canonical transcript and hence in the 
selection of the transcript used for the GeneTrees.
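
To illustrate why the order of these steps matters, here is a minimal sketch in Python. It is not the actual Ensembl script: the `Transcript` class, the `pick_canonical` function, the identifiers and the rule itself (prefer a CCDS transcript, then the longest protein) are assumptions made purely for illustration. It does not try to reproduce the WDFY3 case; it only shows that the same gene can get a different canonical transcript depending on whether the CCDS information is available when the selection runs.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Transcript:
    stable_id: str       # hypothetical identifier, not a real Ensembl ID
    protein_length: int  # length of the translation in amino acids


def pick_canonical(transcripts: Iterable[Transcript], ccds_ids: Set[str]) -> Transcript:
    """Assumed rule: CCDS transcripts win, ties are broken by protein length."""
    return max(
        transcripts,
        key=lambda t: (t.stable_id in ccds_ids, t.protein_length),
    )


transcripts = [
    Transcript("T_A", 3000),  # assumed to be the CCDS transcript
    Transcript("T_B", 3500),  # longer, but not in CCDS
]

# Selection run after the CCDS data has been loaded
with_ccds = pick_canonical(transcripts, ccds_ids={"T_A"})

# Selection run before the CCDS data has been loaded (empty set)
without_ccds = pick_canonical(transcripts, ccds_ids=set())

print(with_ccds.stable_id, without_ccds.stable_id)
```

With the CCDS set populated, the sketch picks `T_A`; with an empty set, it falls back to the longest transcript, `T_B`.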

We believe we have corrected the error by removing the dependency on the 
update of the CCDS in our production database. Starting with the next release, 
the CCDS data will be taken from our original copy of the CCDS database to 
avoid this problem.

I hope this hasn't caused you too much trouble, and thank you again for 
reporting this.

Cheers

Javier

On Friday 11 Feb 2011 16:20:15 Gregory Jordan wrote:
> Hi all,
> 
> I'm having trouble identifying the cause of a strange inconsistency in the
> choice of human transcript for Compara's gene trees from release to
> release. Maybe someone from the Compara team or elsewhere can shed some
> light.
> 
> Looking at the gene tree image for the human gene WDFY3
> (http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000163625;r=4:85590704-85887544),
> it's obvious that the human sequence (along
> with cow) is strangely truncated in comparison to the rest of the
> well-aligned and complete homologs in the gene tree. This is
> understandably common in non-model organisms and low-quality genomes, but
> surprising to see in human.
> 
> If you go to the prior release (v60, here:
> http://nov2010.archive.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000163625;r=4:85590704-85887544)
> human is in the correct place and
> has a complete transcript!
> 
> It seems that a very short transcript (protein ID ENSP00000422256) was
> chosen to be included in the Compara pipeline for release 61. This
> transcript is neither the longest nor the CCDS transcript for the gene,
> which were the criteria I thought were being used to choose transcripts for
> Compara's pipeline.
> 
> Have there been any recent changes to the Compara pipeline that might have
> caused this? Is this problem more widespread, or limited to isolated cases?
> 
> I'm fine using the old November 2010 (e60) release for now, but it would
> give me more confidence in the pipeline if there weren't such drastic
> changes in relatively well-behaved gene families and alignments (such as
> this example) from one release to the next.
> 
> Cheers,
>  Greg

-- 
Javier Herrero, PhD
Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK