[ensembl-dev] Confused by the target/query core database in whole genome alignment based gene build

Javier Herrero jherrero at ebi.ac.uk
Thu Apr 19 06:09:41 BST 2012


Dear Zhang

InnoDB is much faster for updates in large tables, but inserts are a 
different matter.

It might be a configuration error: make sure you can connect to all the 
databases you have configured in your pipeline.
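
A quick way to run that check from a worker node is sketched below. It 
only tests that a TCP connection to each server can be opened; it does 
not verify MySQL user names, passwords or grants. The function names are 
illustrative and are not part of the Ensembl or eHive code.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def check_servers(servers, timeout=5.0):
    """Map each (host, port) pair to True/False reachability."""
    return {(host, port): can_connect(host, port, timeout)
            for host, port in servers}
```

For example, check_servers([("my-compara-host", 3306), ("my-core-host", 3306)]) 
returns a dictionary showing which servers are reachable. Both host names 
are placeholders; use the hosts and ports from your own pipeline 
configuration.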

If you believe your MySQL server cannot cope with the load, you can 
reduce the hive_capacity for the process. The hive_capacity tells eHive 
roughly how many jobs for that particular analysis you are happy to run 
concurrently. The value is stored in the analysis_stats table. You can 
either modify the value manually in the database or change it in the 
PairAligner_conf file if you are re-starting the pipeline.
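
The manual route could look roughly like the sketch below. It is only an 
illustration: it uses an in-memory SQLite table as a stand-in for the 
eHive MySQL database, and the column names analysis_id and hive_capacity 
are assumptions that should be checked against your own schema before 
running anything similar on the real database.

```python
import sqlite3


def set_hive_capacity(conn, analysis_id, capacity):
    """Set the concurrency cap for one analysis; return rows changed."""
    cur = conn.execute(
        # Assumed column names; verify them in your analysis_stats table.
        "UPDATE analysis_stats SET hive_capacity = ? WHERE analysis_id = ?",
        (capacity, analysis_id),
    )
    conn.commit()
    return cur.rowcount


# Stand-in table with made-up values, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis_stats ("
    "analysis_id INTEGER PRIMARY KEY, hive_capacity INTEGER)"
)
conn.executemany("INSERT INTO analysis_stats VALUES (?, ?)",
                 [(1, 100), (2, 200)])

# Lower the cap for the (hypothetical) analysis with id 2.
changed = set_hive_capacity(conn, 2, 20)
```

With a MySQL driver the placeholder style is usually %s rather than ?, 
but the UPDATE statement itself would have the same shape.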

Kind regards

Javier

On 19/04/12 05:18, Zhang Di wrote:
> Hi Javier
>
> Thank you for the explanation.
>
> I've tried the new ensembl_66 API.
>
> Compared to 64, it has changed a lot, both in the compara database 
> structure (for example, the MySQL storage engine changed from MyISAM 
> to InnoDB) and in the analysis workflow.
>
> When I ran it according to ensembl-compara/docs/README-pairaliger, 
> some workers failed in the lastz step, complaining: "cannot connect to 
> mysql server."
>
> This sounds like a MySQL timeout error, although concurrent writes to 
> InnoDB tables should be faster than to MyISAM tables.
>
> On Wed, Apr 18, 2012 at 9:32 PM, Javier Herrero <jherrero at ebi.ac.uk> 
> wrote:
>
>     Hi Zhang
>
>     In the PAIR_ALIGNER module, the reference_collection_name
>     corresponds to the target_collection_name. In the other two
>     modules (CHAIN_CONFIG and NET_CONFIG), the
>     reference_collection_name corresponds to the
>     query_collection_name. Part of the reason to change the name was
>     to be able to use the same name for all 3 modules.
>
>     In short, just use the same genome as the
>     reference_collection_name for all 3.
>
>     Note that we have changed the way we run this pipeline and this
>     configuration file is no longer supported. The documentation for
>     the new pipeline is available under
>     ensembl-compara/docs/README-pairaliger. Obviously, you can still
>     use ensembl_64 API and pipeline for as long as you want.
>
>     I hope this helps
>
>     Javier
>
>
>     On 18/04/12 13:33, Zhang Di wrote:
>>     Thank you, Javier
>>
>>     You mean I should set something like the below in the
>>     compara_2x.conf to prepare the compara_db for gene projection?
>>
>>         {
>>         TYPE => PAIR_ALIGNER,
>>         reference_collection_name => 'human',
>>         non_reference_collection_name => 'my_genome',
>>         }
>>         {
>>         TYPE => CHAIN_CONFIG,
>>         non_reference_collection_name => 'human',
>>         reference_collection_name => 'my_genome',
>>         }
>>         {
>>         TYPE => NET_CONFIG,
>>         non_reference_collection_name => 'human',
>>         reference_collection_name => 'my_genome',
>>         }
>>
>>     By the way, I'm using ensembl-compara version 64.
>>
>>     On Wed, Apr 18, 2012 at 4:57 PM, Javier Herrero
>>     <jherrero at ebi.ac.uk> wrote:
>>
>>         Dear Zhang
>>
>>         The query/target naming has always been quite confusing in
>>         the pairwise alignment pipeline. When running an alignment,
>>         you use a sequence (query) to look for similar regions in
>>         another sequence (target genome). The pairwise alignment
>>         pipeline has three major steps: (a) raw alignments; (b)
>>         chaining; and (c) netting. The chaining step tries to link
>>         all the raw alignments that are in the same order and
>>         orientation to create a longer structure called a chain. The
>>         netting step requires you to define a target genome, such that
>>         the so-called nets are the subset of chains that form the
>>         best-in-genome alignment. In other words, the final set will
>>         provide you, for each bp of the target genome, with the best
>>         match on the other genome.
>>
>>         As I said earlier, query and target are often confusing
>>         terms. To make the situation worse, we used to run the
>>         pipeline such that the query for the raw alignment step was the
>>         target for the chaining and netting steps and vice versa. We
>>         have now changed the way we refer to both sequences and call
>>         them reference and non-reference genomes. We find that
>>         nomenclature less confusing.
>>
>>         Kind regards
>>
>>         Javier
>>
>>
>>         On 18/04/12 09:43, Dan Barrell wrote:
>>>         Hi,
>>>
>>>         The document low_coverage_gene_build.txt is quite old and
>>>         possibly very out of date as we no longer build on low
>>>         coverage genomes in Ensembl. As far as I know, the reason
>>>         that the semantics of the reference and target terms got
>>>         swapped is to do with the importance of directionality in a
>>>         Net. When dealing with the low coverage genomes the idea was
>>>         that they wanted the species they were projecting onto as
>>>         the reference because it is important that each bp in the
>>>         target species aligns to at most one location in the
>>>         reference species.
>>>
>>>         I would suggest you also look at the Ensembl Compara
>>>         documentation which is maintained here:
>>>
>>>         ensembl-compara/docs/README-low-coverage-genome-aligner
>>>
>>>         Dan
>>>
>>>         On 17/04/12 12:47, Zhang Di wrote:
>>>>         Hi,
>>>>
>>>>         I'm using ensembl pipeline for projection genebuild.
>>>>
>>>>         When I read the doc low_coverage_gene_build.txt, I was
>>>>         confused by the target/query genome terms.
>>>>
>>>>         It calls our newly sequenced genome the target and calls the
>>>>         reference genome the query.
>>>>
>>>>         This is contrary to lastz terminology, where target means the
>>>>         reference and query means our sequences.
>>>>
>>>>         That is OK as long as I stick to this convention.
>>>>
>>>>         However, in the whole genome alignment section of the same
>>>>         doc, it says:
>>>>
>>>>             "each bp in the target genome should be represented at
>>>>         most once."
>>>>
>>>>         What does "target" mean here?
>>>>
>>>>         lastz-chain-net produces alignments in which the genome that
>>>>         lastz calls the "target genome" has this property.
>>>>
>>>>         Does it mean that I should set my genome as the reference
>>>>         genome, and the genome from Ensembl, such as "human", as
>>>>         the non-reference in the compara/hive pipeline?
>>>>
>>>>         Can I then project human genes onto my genome with this
>>>>         somewhat odd setting in the next wga2genes step?
>>>>
>>>>         Some slices of the human genome containing genes may appear
>>>>         several times in the compara_db; how can the gene projection
>>>>         be produced correctly in that case?
>>>>
>>>>
>>>>         Thanks
>>>>
>>>>         Best Regards
>>>>
>>>>         -- 
>>>>         Zhang Di
>>>>
>>>>
>>>>         _______________________________________________
>>>>         Dev mailing list    Dev at ensembl.org
>>>>         List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>>>         Ensembl Blog: http://www.ensembl.info/
>>>
>>
>>         -- 
>>         Javier Herrero, PhD
>>         Ensembl Coordinator and Ensembl Compara Project Leader
>>         European Bioinformatics Institute (EMBL-EBI)
>>         Wellcome Trust Genome Campus, Hinxton
>>         Cambridge - CB10 1SD - UK
>>
>>     -- 
>>     Zhang Di
>>
>
> -- 
> Zhang Di
>

-- 
Javier Herrero, PhD
Ensembl Coordinator and Ensembl Compara Project Leader
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK
