[ensembl-dev] Confused by the target/query core database in whole genome alignment based gene build

Zhang Di aureliano.jz at gmail.com
Thu Apr 19 06:33:27 BST 2012


Thank you for your kind help.

I'll try your suggestions.

Best Regards



On Thu, Apr 19, 2012 at 1:09 PM, Javier Herrero <jherrero at ebi.ac.uk> wrote:

>  Dear Zhang
>
> InnoDB is much faster for updates in large tables, but inserts are a
> different matter.
>
> It might be a configuration error: make sure you can connect to all the
> databases you have configured in your pipeline.
>
> If you believe your MySQL server cannot cope with the load, you can reduce
> the hive_capacity for the process. The hive_capacity tells eHive roughly
> how many jobs for that particular analysis you are happy to run
> concurrently. The value is in the analysis_stats table. You can either
> modify the value manually in the database or change it in the
> PairAligner_conf file if you are re-starting the pipeline.
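A minimal sketch of the manual change described above, for illustration only. The table name analysis_stats and the column hive_capacity come from the message; the analysis_id key column, the in-memory SQLite stand-in (the real pipeline database is MySQL), and the example values are assumptions.

```python
import sqlite3


def set_hive_capacity(conn, analysis_id, capacity):
    """Set hive_capacity for one analysis; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE analysis_stats SET hive_capacity = ? WHERE analysis_id = ?",
        (capacity, analysis_id),
    )
    conn.commit()
    return cur.rowcount


# In-memory stand-in for the pipeline database (assumption: the real one is
# MySQL and has many more columns than the two modelled here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis_stats ("
    "analysis_id INTEGER PRIMARY KEY, hive_capacity INTEGER)"
)
# Hypothetical analysis with id 1, standing in for the lastz step.
conn.execute("INSERT INTO analysis_stats VALUES (1, 100)")

# Lower the cap so that fewer workers hit the MySQL server at once.
set_hive_capacity(conn, analysis_id=1, capacity=10)
```

Against a real MySQL server the same UPDATE statement applies, but the placeholder syntax depends on the driver in use.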
>
> Kind regards
>
> Javier
>
>
> On 19/04/12 05:18, Zhang Di wrote:
>
> Hi Javier
>
>  Thank you for the explanation.
>
>  I've tried the new ensembl_66 API.
>
>  Compared to 64, a lot has changed, both in the compara database
> structure (for example, the MySQL storage engine changed from MyISAM to
> InnoDB) and in the analysis workflow.
>
>  When I ran it according to ensembl-compara/docs/README-pairaliger, some
> workers failed in the lastz step, complaining: "cannot connect to mysql
> server."
>
>  This sounds like a MySQL timeout error, although concurrent writes to
> InnoDB tables should be faster than to MyISAM tables.
>
>  On Wed, Apr 18, 2012 at 9:32 PM, Javier Herrero <jherrero at ebi.ac.uk> wrote:
>
>>  Hi Zhang
>>
>> In the PAIR_ALIGNER module, the reference_collection_name corresponds to
>> the target_collection_name. In the other two modules (CHAIN_CONFIG and
>> NET_CONFIG), the reference_collection_name corresponds to the
>> query_collection_name. Part of the reason to change the name was to be able
>> to use the same name for all 3 modules.
>>
>> In short, just use the same genome as the reference_collection_name for
>> all 3.
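The rule above can be sketched as a small consistency check. This is only an illustration: the real configuration file is Perl, the Python dictionaries below merely mirror its keys, and the helper names are hypothetical. The sketch does not decide which genome should be the reference; 'human' is used purely as an example.

```python
# Old name that reference_collection_name corresponds to in each module,
# as described in the message above.
OLD_NAME_FOR_REFERENCE = {
    "PAIR_ALIGNER": "target_collection_name",
    "CHAIN_CONFIG": "query_collection_name",
    "NET_CONFIG": "query_collection_name",
}


def reference_names(blocks):
    """Return the set of reference_collection_name values used in blocks."""
    return {block["reference_collection_name"] for block in blocks}


def is_consistent(blocks):
    """True when every block names the same reference genome."""
    return len(reference_names(blocks)) == 1


def with_reference(blocks, reference, non_reference):
    """Return copies of blocks that all use the same reference genome."""
    return [
        dict(
            block,
            reference_collection_name=reference,
            non_reference_collection_name=non_reference,
        )
        for block in blocks
    ]


# A mixed configuration: the reference differs between the modules.
mixed = [
    {"TYPE": "PAIR_ALIGNER", "reference_collection_name": "human",
     "non_reference_collection_name": "my_genome"},
    {"TYPE": "CHAIN_CONFIG", "reference_collection_name": "my_genome",
     "non_reference_collection_name": "human"},
    {"TYPE": "NET_CONFIG", "reference_collection_name": "my_genome",
     "non_reference_collection_name": "human"},
]

# Example choice only; which genome to use as reference is up to you.
fixed = with_reference(mixed, reference="human", non_reference="my_genome")
```

Here is_consistent(mixed) is false because the reference changes between modules, while is_consistent(fixed) is true.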
>>
>> Note that we have changed the way we run this pipeline and this
>> configuration file is no longer supported. The documentation for the new
>> pipeline is available under ensembl-compara/docs/README-pairaliger.
>> Obviously, you can still use ensembl_64 API and pipeline for as long as you
>> want.
>>
>> I hope this helps
>>
>> Javier
>>
>>
>> On 18/04/12 13:33, Zhang Di wrote:
>>
>> Thank you, Javier
>>
>>  You mean I should set something like the following in compara_2x.conf to
>> prepare the compara_db for gene projection?
>>
>>  {
>>    TYPE => PAIR_ALIGNER,
>>    reference_collection_name => 'human',
>>    non_reference_collection_name => 'my_genome',
>>  }
>>  {
>>    TYPE => CHAIN_CONFIG,
>>    non_reference_collection_name => 'human',
>>    reference_collection_name => 'my_genome',
>>  }
>>  {
>>    TYPE => NET_CONFIG,
>>    non_reference_collection_name => 'human',
>>    reference_collection_name => 'my_genome',
>>  }
>>
>>  By the way, I'm using ensembl-compara version 64.
>>
>> On Wed, Apr 18, 2012 at 4:57 PM, Javier Herrero <jherrero at ebi.ac.uk> wrote:
>>
>>>  Dear Zhang
>>>
>>> The query/target naming has always been quite confusing in the pairwise
>>> alignment pipeline. When running an alignment, you use a sequence (query)
>>> to look for similar regions in another sequence (target genome). The
>>> pairwise alignment pipeline has three major steps: (a) raw alignments; (b)
>>> chaining; and (c) netting. The chaining step tries to link all the raw
>>> alignments that are in the same order and orientation to create a longer
>>> structure called a chain. The netting step requires you to define a target
>>> genome, so that the so-called nets are the subset of chains that form the
>>> best-in-genome alignment. In other words, the final set will provide you,
>>> for each bp of the target genome, with the best match on the other genome.
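The "best match for each bp of the target genome" idea can be illustrated with a toy sketch. This is not the real netting algorithm used by the pipeline, which is considerably more involved; it is a greedy simplification that assumes each chain is just a (start, end, score) triple on the target genome with half-open coordinates. The function name and the example chains are made up.

```python
def toy_net(chains):
    """Assign each target position to at most one chain, best score first.

    chains: list of (start, end, score) tuples on the target genome,
            with half-open intervals [start, end).
    Returns a dict mapping target position -> index of the chain kept there.
    """
    best = {}
    # Visit chains from highest to lowest score.
    order = sorted(range(len(chains)), key=lambda i: chains[i][2], reverse=True)
    for i in order:
        start, end, _score = chains[i]
        for pos in range(start, end):
            # A position already claimed by a better chain is left alone.
            best.setdefault(pos, i)
    return best


# Three hypothetical chains; the first two overlap on positions 5 to 9.
chains = [(0, 10, 50), (5, 15, 80), (20, 25, 10)]
net = toy_net(chains)
```

In this example the overlapping positions go to the higher-scoring second chain, the first chain keeps only positions 0 to 4, and positions with no chain at all stay unassigned. No target position is ever covered twice, which is the property discussed in this thread.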
>>>
>>> As I said earlier, query and target are often confusing terms. To
>>> make the situation worse, we used to run the pipeline such that the query for
>>> the raw alignment step was the target for the chaining and netting steps
>>> and vice versa. We have now changed the way we refer to both sequences and
>>> call them reference and non-reference genomes. We find that nomenclature
>>> less confusing.
>>>
>>> Kind regards
>>>
>>> Javier
>>>
>>>
>>> On 18/04/12 09:43, Dan Barrell wrote:
>>>
>>> Hi,
>>>
>>> The document low_coverage_gene_build.txt is quite old and possibly very
>>> out of date as we no longer build on low coverage genomes in Ensembl. As
>>> far as I know, the reason that the semantics of the reference and target
>>> terms got swapped is to do with the importance of directionality in a Net.
>>> When dealing with the low coverage genomes the idea was that they wanted
>>> the species they were projecting onto as the reference because it is
>>> important that each bp in the target species aligns to at most one location
>>> in the reference species.
>>>
>>> I would suggest you also look at the Ensembl Compara documentation which
>>> is maintained here:
>>>
>>> ensembl-compara/docs/README-low-coverage-genome-aligner
>>>
>>> Dan
>>>
>>> On 17/04/12 12:47, Zhang Di wrote:
>>>
>>> Hi,
>>>
>>>  I'm using the Ensembl pipeline for a projection genebuild.
>>>
>>>  When I read the doc low_coverage_gene_build.txt, I was confused by the
>>> target/query genome terms.
>>>
>>>  It calls our newly sequenced genome the target and the reference
>>> genome the query.
>>>
>>>  This is contrary to lastz terminology, where target means the reference
>>> and query means our sequences.
>>>
>>>  That would be fine if I just stuck to this convention.
>>>
>>>  However, the whole genome alignment section of the same doc says:
>>>
>>>      "each bp in the target genome should be represented at most once."
>>>
>>>  What does "target" mean here?
>>>
>>>  lastz-chain-net gives this property to what lastz calls the "target
>>> genome".
>>>
>>>  Does it mean that I should set my genome as the reference genome, and
>>> the genome from Ensembl, such as "human", as the non-reference in the
>>> compara/hive pipeline?
>>>
>>>  Can I then project human genes onto my genome with this somewhat odd
>>> setting in the next wga2genes step?
>>>
>>>  Some slices of the human genome that contain genes may then occur
>>> several times in the compara_db. How can the gene projection be correct
>>> in that case?
>>>
>>>
>>>  Thanks
>>>
>>>  Best Regards
>>>
>>>  --
>>> Zhang Di
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>   --
>>> Javier Herrero, PhD
>>> Ensembl Coordinator and Ensembl Compara Project Leader
>>> European Bioinformatics Institute (EMBL-EBI)
>>> Wellcome Trust Genome Campus, Hinxton
>>> Cambridge - CB10 1SD - UK
>>>


-- 
Zhang Di