[ensembl-dev] algorithm of FindSplitGenesOnTree
muffato at ebi.ac.uk
Tue Mar 5 20:31:54 GMT 2013
Dear Pengcheng Yang,
The FindSplitGenesOnTree module, as well as the FindCoreRegionLength,
FindPartialGenesOnTree and FindSingleGenesOnTree modules, is part of an
ongoing project to identify partial / split genes. These modules are not
yet used in our production pipelines.
We are currently using
contains more comments).
Right after the multiple alignment step, we search every tree for pairs
of genes from the same species that satisfy one of the following two
conditions:
- The two sequences do not overlap at all, and the genes are close to
each other (less than 1 Mb), with at most 1 gene in between
- The two sequences slightly overlap, and the genes are consecutive in
the genome and less than 500 kb apart
All those pairs are grouped under "split_gene" nodes in the gene trees,
and tagged as "contiguous_gene_split" homologies.
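
As a rough illustration only, the two conditions above could be sketched
in Python as follows. This is not the pipeline code: the Gene record, its
fields (in particular "rank", the position of the gene in genome order),
the same-chromosome check and the max_overlap threshold used for
"slightly overlap" are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gene:
    """Hypothetical record; not an Ensembl API object."""
    name: str
    species: str
    chrom: str
    start: int      # genomic start (bp)
    end: int        # genomic end (bp)
    rank: int       # assumed: index of the gene in genome order on its chromosome
    aln_start: int  # first alignment column covered by the sequence
    aln_end: int    # last alignment column covered (inclusive)


def alignment_overlap(a: Gene, b: Gene) -> int:
    """Number of alignment columns covered by both sequences."""
    return max(0, min(a.aln_end, b.aln_end) - max(a.aln_start, b.aln_start) + 1)


def genomic_gap(a: Gene, b: Gene) -> int:
    """Distance in bp between the two genes (0 if they overlap on the genome)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_split_pair(a: Gene, b: Gene, max_overlap: int = 10) -> bool:
    """Apply the two conditions; max_overlap is an assumed cut-off."""
    if a.species != b.species or a.chrom != b.chrom:
        return False
    overlap = alignment_overlap(a, b)
    gap = genomic_gap(a, b)
    genes_between = abs(a.rank - b.rank) - 1
    if overlap == 0:
        # Condition 1: no overlap, less than 1 Mb apart, at most 1 gene in between
        return gap < 1_000_000 and genes_between <= 1
    if overlap <= max_overlap:
        # Condition 2: slight overlap, consecutive genes, less than 500 kb apart
        return genes_between == 0 and gap < 500_000
    return False


def find_split_pairs(genes):
    """Return every pair of genes in one tree that satisfies either condition."""
    return [(a, b) for a, b in combinations(genes, 2) if is_split_pair(a, b)]
```

Each pair returned by find_split_pairs would correspond to one candidate
"contiguous_gene_split" relationship in that tree.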
Hope this helps,
On 05/03/13 17:49, Pengcheng Yang wrote:
> I want to know the algorithm of the FindSplitGenesOnTree class, so I
> read the comments in the file
> However, I am still unclear about the algorithmic background of it.
> My understanding is:
> 1. for the genes in one family, do a multiple alignment and construct a
> tree using TreeBeST
> 2. find the gene ids with the shortest (A) and longest (B) length.
> 3. get the gene ids (C) that are next to gene A in the same branch of
> the tree
> 4. check whether C and A have an overlap greater than x aa in the
> multiple alignment. If not, they may be a split_gene pair.
> Is that right? And where can I find the documentation of the algorithm?
> I know one way is to read the source code, but it would be understood
> more quickly if there were documentation.
> Thank you.
> Pengcheng Yang