[ensembl-dev] algorithm of FindSplitGenesOnTree

Wed Mar 6 13:20:36 GMT 2013

Dear Matthieu,

Thank you for your reply. It is clear to me now.

I have noted that 236 of the 20322 genes in the horse genome were
reported as split genes (Ensembl 2009 paper). Do you have a report of
the results from the Find*OnTree modules for the Ensembl species?
Another question: how do you treat the split genes? Do you correct them
manually, or use some other method?
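
For reference, the two pair conditions described in the quoted reply
below can be sketched roughly as follows. This is a minimal Python
sketch under stated assumptions, not the FindContiguousSplitGenes code
itself: the Gene record, the function names, the "rank" field (position
of a gene in the ordered gene list of its chromosome) and the
small_overlap threshold are all hypothetical. The reply only says that
the sequences "slightly overlap", so the default of 10 aligned columns
is an arbitrary placeholder.

```python
from dataclasses import dataclass

MAX_DIST_NO_OVERLAP = 1_000_000    # "less than 1 Mb"
MAX_DIST_SMALL_OVERLAP = 500_000   # "less than 500 kb"


@dataclass
class Gene:
    """Hypothetical record for one gene in a protein tree."""
    name: str
    species: str
    chrom: str
    start: int   # genomic start
    end: int     # genomic end
    rank: int    # index of the gene along its chromosome (assumed)
    aln: str     # aligned protein sequence, '-' marks a gap


def aligned_overlap(a: Gene, b: Gene) -> int:
    """Count alignment columns where both sequences have a residue."""
    return sum(1 for x, y in zip(a.aln, b.aln) if x != "-" and y != "-")


def genomic_gap(a: Gene, b: Gene) -> int:
    """Distance between the two genes, 0 if they overlap on the genome."""
    first, second = sorted((a, b), key=lambda g: g.start)
    return max(0, second.start - first.end)


def is_contiguous_split_pair(a: Gene, b: Gene, small_overlap: int = 10) -> bool:
    """Apply the two conditions from the reply to one pair of genes.

    small_overlap is an assumed cutoff for "slightly overlap".
    """
    if a.species != b.species or a.chrom != b.chrom:
        return False
    genes_between = abs(a.rank - b.rank) - 1
    if genes_between < 0:          # same gene given twice
        return False
    overlap = aligned_overlap(a, b)
    gap = genomic_gap(a, b)
    if overlap == 0:
        # Condition 1: no overlap, < 1 Mb apart, at most 1 gene in between
        return gap < MAX_DIST_NO_OVERLAP and genes_between <= 1
    if overlap <= small_overlap:
        # Condition 2: slight overlap, consecutive genes, < 500 kb apart
        return gap < MAX_DIST_SMALL_OVERLAP and genes_between == 0
    return False
```

In a real run the function would be called for every same-species pair
of genes in a tree, right after the multiple alignment step.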

Best,
Pengcheng

On 2013/3/6 4:31, Matthieu Muffato wrote:
> Dear Pengcheng Yang,
>
> The FindSplitGenesOnTree module, as well as the FindCoreRegionLength,
> FindPartialGenesOnTree and FindSingleGenesOnTree modules, is part of
> an ongoing project to identify partial / split genes. These modules
> are not yet used in our production pipelines.
>
> We are currently using 
> Bio::EnsEMBL::Compara::RunnableDB::ProteinTrees::FindContiguousSplitGenes 
> (which contains more comments).
>
> Right after the multiple alignment step, we search every tree for
> pairs of genes from the same species that satisfy one of the two
> conditions:
>  - The two sequences do not overlap at all, and the genes are close to 
> each other (less than 1 Mb), with at most 1 gene in between
>  - The two sequences slightly overlap, and the genes are consecutive 
> in the genome and less than 500 kb apart
>
> All those pairs are grouped under "split_gene" nodes in the gene 
> trees, and tagged as "contiguous_gene_split" homologies.
>
> Hope this helps,
>
> Best regards,
> Matthieu
>
> On 05/03/13 17:49, Pengcheng Yang wrote:
>> Hi,
>>
>> I want to know the algorithm of the FindSplitGenesOnTree class, so I
>> read the comments in the file
>> http://www.ensembl.org/info/docs/Doxygen/compara-api/FindSplitGenesOnTree_8pm_source.html. 
>>
>>
>> However, I am still unclear about the background of the algorithm.
>> My understanding is:
>> 1. For the genes in one family, do a multiple alignment and construct
>> a tree using TreeBeST.
>> 2. Find the gene ids with the shortest (A) and longest (B) length.
>> 3. Get the gene ids (C) that are next to gene A in the same branch of
>> the tree.
>> 4. Check whether C and A overlap by more than x aa in the multiple
>> alignment. If not, they may be one split_gene pair.
>>
>> Is that right? And where can I find the documentation of the
>> algorithm? I know one way is to read the source code, but it would be
>> quicker to understand if there were documentation.
>>
>> Thank you.
>>
>> Best,
>> Pengcheng Yang
>>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: 
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/