[ensembl-dev] algorithm of FindSplitGenesOnTree

Tue Mar 5 20:31:54 GMT 2013

Dear Pengcheng Yang,

The FindSplitGenesOnTree module, as well as the FindCoreRegionLength,
FindPartialGenesOnTree and FindSingleGenesOnTree modules, is part of an
ongoing project to identify partial / split genes. These modules are not
yet used in our production pipelines.

In production we currently use
Bio::EnsEMBL::Compara::RunnableDB::ProteinTrees::FindContiguousSplitGenes
instead (its source contains more comments).

Right after the multiple alignment step, we search every tree for pairs
of genes from the same species that satisfy one of these two conditions:
  - The two sequences do not overlap at all, and the genes are close to
each other (less than 1 Mb apart), with at most 1 gene in between
  - The two sequences slightly overlap, and the genes are consecutive in
the genome and less than 500 kb apart

All those pairs are grouped under "split_gene" nodes in the gene trees, 
and tagged as "contiguous_gene_split" homologies.
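
As a rough illustration of the two conditions above, here is a minimal
Python sketch. It is not the code of FindContiguousSplitGenes (which is
Perl); the Gene fields, the function names and in particular
MAX_SMALL_OVERLAP are hypothetical, since "slightly overlap" is not
quantified here. It also assumes both genes lie on the same sequence
region and approximates each sequence's footprint in the alignment by a
single column range.

```python
from dataclasses import dataclass

MAX_DIST_NO_OVERLAP = 1_000_000    # "less than 1 Mb"
MAX_DIST_SMALL_OVERLAP = 500_000   # "less than 500 kb"
MAX_SMALL_OVERLAP = 10             # hypothetical cutoff for "slightly overlap"


@dataclass
class Gene:
    species: str
    seq_region: str   # chromosome / scaffold name (assumed to be required)
    rank: int         # index of the gene among the ordered genes on seq_region
    start: int        # genomic start
    end: int          # genomic end
    aln_start: int    # first alignment column covered (inclusive)
    aln_end: int      # last alignment column covered (inclusive)


def alignment_overlap(a: Gene, b: Gene) -> int:
    """Number of alignment columns shared by the two column ranges."""
    return max(0, min(a.aln_end, b.aln_end) - max(a.aln_start, b.aln_start) + 1)


def genomic_distance(a: Gene, b: Gene) -> int:
    """Gap in bp between the two genes (0 if they overlap on the genome)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_contiguous_split_candidate(a: Gene, b: Gene) -> bool:
    """Apply the two conditions described in the text to one pair of genes."""
    if a.species != b.species or a.seq_region != b.seq_region:
        return False
    if a.rank == b.rank:
        return False  # same gene, not a pair
    overlap = alignment_overlap(a, b)
    dist = genomic_distance(a, b)
    genes_between = abs(a.rank - b.rank) - 1
    if overlap == 0:
        # Condition 1: no overlap, < 1 Mb apart, at most 1 gene in between
        return dist < MAX_DIST_NO_OVERLAP and genes_between <= 1
    if overlap <= MAX_SMALL_OVERLAP:
        # Condition 2: slight overlap, consecutive genes, < 500 kb apart
        return dist < MAX_DIST_SMALL_OVERLAP and genes_between == 0
    return False
```

In a real alignment the overlap would more likely be counted as the
columns where both sequences carry a residue, so the range-based
alignment_overlap() above is only a simplification.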

Hope this helps,

Best regards,
Matthieu

On 05/03/13 17:49, Pengcheng Yang wrote:
> Hi,
>
> I want to know the algorithm of the FindSplitGenesOnTree class, so I
> read the comments in the file
> http://www.ensembl.org/info/docs/Doxygen/compara-api/FindSplitGenesOnTree_8pm_source.html.
>
> However, I am still unclear about the algorithm behind it.
> My understanding is:
> 1. for the genes in one family, do multiple alignment and construct tree
> using TreeBeST
> 2. find the gene ids with shortest (A) and longest (B) length.
> 3. get the gene ids (C) that are next to gene A in the same branch in the tree
> 4. check whether C and A have overlap greater than x aa in the multiple
> alignment. If not, they may be one split_gene pair.
>
> Is that right? And where can I find the documentation of the algorithm? I
> know one way is to read the source code, but it would be understood more
> quickly if there were documentation.
>
> Thank you.
>
> Best,
> Pengcheng Yang
>