[ensembl-dev] algorithm of FindSplitGenesOnTree

Pengcheng Yang pengchy at gmail.com
Wed Mar 6 13:20:36 GMT 2013

Dear Matthieu,

Thank you for your reply. It is clear to me now.

I noticed that 236 of the 20322 genes in the horse genome were split 
genes (Ensembl 2009 paper). Do you have a report of the results from the 
Find*OnTree modules for the Ensembl species? Another question: how do you 
treat the split genes? Do you correct them manually, or use another method?


On 2013/3/6 4:31, Matthieu Muffato wrote:
> Dear Pengcheng Yang,
> The FindSplitGenesOnTree module, as well as the FindCoreRegionLength, 
> FindPartialGenesOnTree and FindSingleGenesOnTree modules, is part of 
> an ongoing project to identify partial / split genes. These modules 
> are not yet used in our production pipelines.
> We are currently using 
> Bio::EnsEMBL::Compara::RunnableDB::ProteinTrees::FindContiguousSplitGenes 
> (which contains more comments).
> Right after the multiple alignment step, we search, in every tree, 
> for pairs of genes from the same species that satisfy one of the two 
> conditions:
>  - The two sequences do not overlap at all, and the genes are close to 
> each other (less than 1 Mb), with at most 1 gene in between
>  - The two sequences slightly overlap, and the genes are consecutive 
> in the genome and less than 500 kb apart
> All those pairs are grouped under "split_gene" nodes in the gene 
> trees, and tagged as "contiguous_gene_split" homologies.
> Hope this helps,
> Best regards,
> Matthieu
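
To check my reading of the two conditions above, here is a minimal Python sketch. It is not the code of FindContiguousSplitGenes. The Gene class, its fields, the "rank" used to count genes in between, the same-chromosome requirement and the 10-column cutoff for "slightly overlap" are all my own assumptions; only the 1 Mb and 500 kb limits and the "at most 1 gene in between" / "consecutive" rules come from the description above.

```python
from dataclasses import dataclass

# Limits quoted in the description above.
MAX_DIST_NO_OVERLAP = 1_000_000    # "less than 1 Mb"
MAX_DIST_SMALL_OVERLAP = 500_000   # "less than 500 kb"
# Assumption: "slightly overlap" is not quantified in the email.
SMALL_OVERLAP_MAX_COLUMNS = 10


@dataclass
class Gene:
    """Hypothetical record for one gene in a tree (not an Ensembl class)."""
    species: str
    chrom: str
    start: int
    end: int
    rank: int          # hypothetical: position of the gene in chromosome order
    aligned_seq: str   # its row of the multiple alignment, '-' for gaps


def alignment_overlap(a: str, b: str) -> int:
    """Count alignment columns where both sequences have a residue."""
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-")


def genomic_distance(g1: Gene, g2: Gene) -> int:
    """Gap in bp between the two genes (0 if their coordinates overlap)."""
    first, second = sorted((g1, g2), key=lambda g: g.start)
    return max(0, second.start - first.end)


def is_candidate_split_pair(g1: Gene, g2: Gene) -> bool:
    """Apply the two conditions to a pair of genes from one tree."""
    # Assumption: both genes must lie on the same chromosome.
    if g1.species != g2.species or g1.chrom != g2.chrom:
        return False
    if g1.rank == g2.rank:
        return False  # same gene
    overlap = alignment_overlap(g1.aligned_seq, g2.aligned_seq)
    dist = genomic_distance(g1, g2)
    genes_between = abs(g1.rank - g2.rank) - 1
    if overlap == 0:
        # Condition 1: no overlap, < 1 Mb apart, at most 1 gene in between.
        return dist < MAX_DIST_NO_OVERLAP and genes_between <= 1
    if overlap <= SMALL_OVERLAP_MAX_COLUMNS:
        # Condition 2: slight overlap, consecutive genes, < 500 kb apart.
        return dist < MAX_DIST_SMALL_OVERLAP and genes_between == 0
    return False
```

Pairs for which this returns True would be the ones grouped under a "split_gene" node, if I understand correctly.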
> On 05/03/13 17:49, Pengcheng Yang wrote:
>> Hi,
>> I want to know the algorithm of the FindSplitGenesOnTree class, so I
>> read the comments in the file
>> http://www.ensembl.org/info/docs/Doxygen/compara-api/FindSplitGenesOnTree_8pm_source.html. 
>> However, the background of the algorithm is still unclear to me.
>> My understanding is:
>> 1. For the genes in one family, do a multiple alignment and construct a
>> tree using TreeBeST.
>> 2. Find the gene ids with the shortest (A) and longest (B) length.
>> 3. Get the gene ids (C) that are next to gene A in the same branch of the 
>> tree.
>> 4. Check whether C and A overlap by more than x aa in the multiple
>> alignment. If not, they may be a split_gene pair.
>> Is that right? And where can I find documentation of the algorithm? I
>> know one way is to read the source code, but it would be quicker to
>> understand if there were documentation.
>> Thank you.
>> Best,
>> Pengcheng Yang
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: 
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
