[ensembl-dev] algorithm of FindSplitGenesOnTree

Matthieu Muffato muffato at ebi.ac.uk
Wed Mar 6 22:37:15 GMT 2013


Dear Pengcheng

Please find at the end of this email a summary for the version 70 data. 
For each species, it gives the total number of genes flagged as split 
genes and the number of groups (pairs, triplets, etc.) they form.

Be aware that those figures, although quite stable in practice, can in 
theory change every release as we recompute the multiple alignments.

There is no curation of the data / trees. The pipeline automatically:
  - detects the split genes
  - replaces each pair / triplet with a single fake gene whose sequence 
is the merge of all the sequences from that group
  - builds the tree
  - unpacks each fake gene into the underlying true gene models and 
creates "gene_split" nodes in the tree
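
The pack / unpack steps above can be illustrated with a minimal Python
sketch. It is not the Compara code. Several things here are assumptions
made only for illustration: the merge rule (keep the first non-gap
residue in each alignment column), the "fake_N" identifiers, and the
representation of a tree as nested tuples of leaf names. The tree
building itself is left out.

```python
def merge_aligned(rows):
    """Merge aligned rows column by column.

    ASSUMPTION: the first non-gap residue of each column is kept; the
    thread only says the fake sequence is "the merge" of the group.
    """
    merged = []
    for column in zip(*rows):
        residues = [c for c in column if c != '-']
        merged.append(residues[0] if residues else '-')
    return ''.join(merged)


def pack(alignment, groups):
    """Replace each group of split genes with one placeholder row.

    alignment: dict mapping gene id -> aligned sequence
    groups:    list of tuples of gene ids flagged as split genes
    Returns the packed alignment and a map fake id -> member gene ids.
    """
    packed = dict(alignment)
    members_of = {}
    for i, group in enumerate(groups):
        fake_id = f"fake_{i}"  # hypothetical naming scheme
        packed[fake_id] = merge_aligned([packed.pop(g) for g in group])
        members_of[fake_id] = list(group)
    return packed, members_of


def unpack(tree, members_of):
    """Expand placeholder leaves into ("gene_split", member, ...) nodes.

    The tree is a nested tuple of leaf names (hypothetical format).
    """
    if isinstance(tree, str):
        if tree in members_of:
            return ("gene_split", *members_of[tree])
        return tree
    return tuple(unpack(child, members_of) for child in tree)
```

For example, packing the group ("g1", "g2") yields a single placeholder
row for the tree builder, and unpack() then turns the placeholder leaf
of the resulting tree back into a node holding g1 and g2.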

Best,
Matthieu

+----------------------------+----------+---------+
| name                       | n_groups | n_genes |
+----------------------------+----------+---------+
| choloepus_hoffmanni        |        4 |       8 |
| tarsius_syrichta           |        6 |      12 |
| oreochromis_niloticus      |        7 |      14 |
| macropus_eugenii           |        8 |      16 |
| erinaceus_europaeus        |        8 |      16 |
| homo_sapiens               |        8 |      16 |
| dasypus_novemcinctus       |        8 |      17 |
| sorex_araneus              |        9 |      18 |
| saccharomyces_cerevisiae   |       11 |      25 |
| procavia_capensis          |       12 |      25 |
| mus_musculus               |       13 |      27 |
| pan_troglodytes            |       15 |      30 |
| dipodomys_ordii            |       17 |      34 |
| caenorhabditis_elegans     |       16 |      36 |
| vicugna_pacos              |       17 |      36 |
| echinops_telfairi          |       20 |      42 |
| ochotona_princeps          |       22 |      47 |
| tursiops_truncatus         |       26 |      54 |
| tupaia_belangeri           |       26 |      55 |
| microcebus_murinus         |       32 |      65 |
| pteropus_vampyrus          |       33 |      67 |
| xenopus_tropicalis         |       42 |      88 |
| gorilla_gorilla            |       44 |      89 |
| takifugu_rubripes          |       43 |      90 |
| petromyzon_marinus         |       44 |      90 |
| xiphophorus_maculatus      |       49 |     100 |
| pongo_abelii               |       49 |     101 |
| bos_taurus                 |       49 |     103 |
| otolemur_garnettii         |       52 |     106 |
| cavia_porcellus            |       60 |     125 |
| canis_familiaris           |       65 |     131 |
| oryctolagus_cuniculus      |       74 |     153 |
| myotis_lucifugus           |       78 |     159 |
| ailuropoda_melanoleuca     |       82 |     171 |
| callithrix_jacchus         |       89 |     181 |
| mustela_putorius_furo      |       84 |     182 |
| rattus_norvegicus          |       90 |     186 |
| gallus_gallus              |       98 |     205 |
| monodelphis_domestica      |       97 |     207 |
| danio_rerio                |      109 |     232 |
| gasterosteus_aculeatus     |      113 |     233 |
| gadus_morhua               |      114 |     236 |
| equus_caballus             |      119 |     248 |
| sarcophilus_harrisii       |      128 |     263 |
| nomascus_leucogenys        |      136 |     287 |
| oryzias_latipes            |      148 |     307 |
| loxodonta_africana         |      150 |     309 |
| pelodiscus_sinensis        |      167 |     343 |
| tetraodon_nigroviridis     |      175 |     360 |
| latimeria_chalumnae        |      181 |     371 |
| ciona_savignyi             |      188 |     393 |
| ciona_intestinalis         |      189 |     410 |
| meleagris_gallopavo        |      223 |     459 |
| taeniopygia_guttata        |      238 |     491 |
| macaca_mulatta             |      248 |     530 |
| felis_catus                |      243 |     576 |
| anolis_carolinensis        |      288 |     593 |
| ictidomys_tridecemlineatus |      300 |     606 |
| ornithorhynchus_anatinus   |      356 |     775 |
| sus_scrofa                 |      467 |    1024 |
+----------------------------+----------+---------+
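
A quick way to read the two columns together: every group holds at least
two genes, so n_genes - 2 * n_groups counts the genes that sit in groups
larger than a pair. The short Python sketch below only restates that
arithmetic on a few rows copied from the table; the function name is
made up for the example.

```python
def extra_members(n_groups, n_genes):
    """Genes beyond the two that every group has at minimum."""
    return n_genes - 2 * n_groups


# (n_groups, n_genes) copied from the table above
rows = {
    "homo_sapiens": (8, 16),
    "dasypus_novemcinctus": (8, 17),
    "felis_catus": (243, 576),
    "sus_scrofa": (467, 1024),
}

for name, (n_groups, n_genes) in rows.items():
    # Print the excess over pairs and the mean group size
    print(name, extra_members(n_groups, n_genes), round(n_genes / n_groups, 2))
```

A value of 0 (homo_sapiens) means all groups are pairs, and a value of 1
(dasypus_novemcinctus) means exactly one group is a triplet.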

On 06/03/13 13:20, Pengcheng Yang wrote:
> Dear Matthieu,
>
> Thank you for your reply. I am clear now.
>
> I have noted that 236 of 20322 genes were split genes in the horse genome
> (Ensembl 2009 paper). Do you have a report of the results from the
> Find*OnTree modules for Ensembl species? Another question is how you
> treat the split genes: do you correct them manually, or use another method?
>
> Best,
> Pengcheng
>
> On 2013/3/6 4:31, Matthieu Muffato wrote:
>> Dear Pengcheng Yang,
>>
>> The FindSplitGenesOnTree module, as well as the FindCoreRegionLength,
>> FindPartialGenesOnTree and FindSingleGenesOnTree modules, is part of
>> an ongoing project to identify partial / split genes. These modules
>> are not yet used in our production pipelines.
>>
>> We are currently using
>> Bio::EnsEMBL::Compara::RunnableDB::ProteinTrees::FindContiguousSplitGenes
>> (which contains more comments).
>>
>> Right after the multiple alignment step, we search, in every tree,
>> for pairs of genes from the same species that satisfy one of the two
>> conditions:
>>  - The two sequences do not overlap at all, and the genes are close to
>> each other (less than 1 Mb), with at most 1 gene in between
>>  - The two sequences slightly overlap, and the genes are consecutive
>> in the genome and less than 500 kb apart
>>
>> All those pairs are grouped under "split_gene" nodes in the gene
>> trees, and tagged as "contiguous_gene_split" homologies.
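
The two conditions above can be written as a single predicate. The
Python sketch below is only an illustration, not the
FindContiguousSplitGenes code. The distance limits come from the text;
everything else is an assumption: the Gene fields, the "rank" used to
count intervening genes, the requirement that both genes lie on the same
sequence region, and above all the cut-off for "slightly overlap", which
the thread does not quantify.

```python
from dataclasses import dataclass

MAX_DIST_NO_OVERLAP = 1_000_000    # "less than 1 Mb" (from the thread)
MAX_DIST_SMALL_OVERLAP = 500_000   # "less than 500 kb" (from the thread)
SMALL_OVERLAP_COLUMNS = 10         # ASSUMPTION: threshold is not given in the thread


@dataclass
class Gene:
    species: str
    seq_region: str  # chromosome / scaffold name
    start: int
    end: int
    rank: int        # ASSUMPTION: position of the gene in genome order on its seq_region
    aligned: str     # this gene's row of the multiple alignment, '-' for gaps


def overlap_columns(a, b):
    """Count alignment columns where both sequences have a residue."""
    return sum(1 for x, y in zip(a.aligned, b.aligned) if x != '-' and y != '-')


def genomic_distance(a, b):
    """Gap in bp between two genes on the same seq_region (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_contiguous_split_pair(a, b):
    """Return True if the pair satisfies one of the two conditions."""
    if a.species != b.species or a.seq_region != b.seq_region:
        return False
    genes_between = abs(a.rank - b.rank) - 1
    if genes_between < 0:  # same gene passed twice
        return False
    overlap = overlap_columns(a, b)
    dist = genomic_distance(a, b)
    if overlap == 0:
        # Condition 1: no overlap, < 1 Mb apart, at most 1 gene in between
        return dist < MAX_DIST_NO_OVERLAP and genes_between <= 1
    if overlap <= SMALL_OVERLAP_COLUMNS:
        # Condition 2: slight overlap, consecutive genes, < 500 kb apart
        return dist < MAX_DIST_SMALL_OVERLAP and genes_between == 0
    return False
```

Pairs for which this returns True would be the candidates to group under
one node; larger groups (triplets, etc.) would follow from chaining such
pairs, which is not shown here.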
>>
>> Hope this helps,
>>
>> Best regards,
>> Matthieu
>>
>> On 05/03/13 17:49, Pengcheng Yang wrote:
>>> Hi,
>>>
>>> I want to know the algorithm of the FindSplitGenesOnTree class, so I
>>> read the comments in the file
>>> http://www.ensembl.org/info/docs/Doxygen/compara-api/FindSplitGenesOnTree_8pm_source.html.
>>>
>>>
>>> However, I am still unclear about the algorithm behind it.
>>> My understanding is:
>>> 1. for the genes in one family, do a multiple alignment and construct
>>> a tree using TreeBeST
>>> 2. find the gene ids with the shortest (A) and longest (B) length.
>>> 3. get the gene ids (C) that are next to gene A in the same branch of
>>> the tree
>>> 4. check whether C and A overlap by more than x aa in the multiple
>>> alignment. If not, they may be one split_gene pair.
>>>
>>> Is that right? And where can I find documentation of the algorithm? I
>>> know one way is to read the source code, but it would be quicker to
>>> understand if there were documentation.
>>>
>>> Thank you.
>>>
>>> Best,
>>> Pengcheng Yang
>>>



