[ensembl-dev] algorithm of FindSplitGenesOnTree
Matthieu Muffato
muffato at ebi.ac.uk
Tue Mar 5 20:31:54 GMT 2013
Dear Pengcheng Yang,
The FindSplitGenesOnTree module, as well as the FindCoreRegionLength,
FindPartialGenesOnTree and FindSingleGenesOnTree modules, is part of an
ongoing project to identify partial / split genes. These modules are not
yet used in our production pipelines.
In production, we are currently using
Bio::EnsEMBL::Compara::RunnableDB::ProteinTrees::FindContiguousSplitGenes
(its source contains more comments).
Right after the multiple alignment step, we search every tree for pairs
of genes from the same species that satisfy one of these two conditions:
- The two sequences do not overlap at all, and the genes are close to
each other (less than 1 Mb apart), with at most one gene in between
- The two sequences slightly overlap, and the genes are consecutive in
the genome and less than 500 kb apart
All those pairs are grouped under "split_gene" nodes in the gene trees,
and tagged as "contiguous_gene_split" homologies.
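As a rough illustration of the two conditions above, here is a minimal
Python sketch. It is not the Perl code of FindContiguousSplitGenes: the
Gene class, the function names, the "same chromosome" requirement and
especially the SMALL_OVERLAP_COLUMNS cutoff (what "slightly overlap"
means numerically) are assumptions made for the example. Only the 1 Mb,
500 kb and "at most 1 gene in between" values come from the description.

```python
from dataclasses import dataclass
from itertools import combinations

# Thresholds stated in the description above.
MAX_DIST_NO_OVERLAP = 1_000_000    # 1 Mb
MAX_DIST_SMALL_OVERLAP = 500_000   # 500 kb
MAX_INTERVENING_GENES = 1
# Assumption: illustrative cutoff for a "slight" overlap, in alignment columns.
SMALL_OVERLAP_COLUMNS = 10


@dataclass(frozen=True)
class Gene:
    """Hypothetical record for one gene in a tree."""
    name: str
    species: str
    chrom: str
    start: int
    end: int
    rank: int      # index of the gene in the ordered gene list of its chromosome
    aligned: str   # its row in the multiple alignment, '-' marks a gap


def alignment_overlap(a: Gene, b: Gene) -> int:
    """Number of alignment columns where both sequences have a residue."""
    return sum(1 for x, y in zip(a.aligned, b.aligned) if x != "-" and y != "-")


def genomic_distance(a: Gene, b: Gene) -> int:
    """Gap between the two genes on the chromosome (0 if they touch or overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_split_pair(a: Gene, b: Gene) -> bool:
    # Assumption: genes must be on the same chromosome to be "close".
    if a.species != b.species or a.chrom != b.chrom:
        return False
    overlap = alignment_overlap(a, b)
    dist = genomic_distance(a, b)
    genes_between = abs(a.rank - b.rank) - 1
    if overlap == 0:
        # Condition 1: no overlap, < 1 Mb apart, at most 1 gene in between.
        return dist < MAX_DIST_NO_OVERLAP and genes_between <= MAX_INTERVENING_GENES
    if overlap <= SMALL_OVERLAP_COLUMNS:
        # Condition 2: slight overlap, consecutive genes, < 500 kb apart.
        return dist < MAX_DIST_SMALL_OVERLAP and genes_between == 0
    return False


def find_split_pairs(genes):
    """Return the names of all gene pairs that satisfy one of the conditions."""
    return [(a.name, b.name) for a, b in combinations(genes, 2) if is_split_pair(a, b)]
```

Calling find_split_pairs() on the genes of one tree returns the candidate
pairs; grouping them under "split_gene" nodes is not covered by this sketch.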
Hope this helps,
Best regards,
Matthieu
On 05/03/13 17:49, Pengcheng Yang wrote:
> Hi,
>
> I want to know the algorithm of the FindSplitGenesOnTree class, so I
> read the comments in the file
> http://www.ensembl.org/info/docs/Doxygen/compara-api/FindSplitGenesOnTree_8pm_source.html.
>
> However, I am still unclear about the background of the algorithm.
> My understanding is:
> 1. for the genes in one family, do multiple alignment and construct tree
> using TreeBeST
> 2. find the gene ids with shortest (A) and longest (B) length.
> 3. get the gene ids (C) that are next to gene A in the same branch in the tree
> 4. check whether C and A have an overlap greater than x aa in the multiple
> alignment. If not, they may be one split_gene pair.
>
> Is that right? And where can I find the documentation of the algorithm? I know
> one way is to read the source code, but it would be understood more quickly if
> there were documentation.
>
> Thank you.
>
> Best,
> Pengcheng Yang
>