[ensembl-dev] Confused by the target/query core database in whole genome alignment based gene build

Zhang Di aureliano.jz at gmail.com
Wed Apr 18 13:33:56 BST 2012


Thank you, Javier

Do you mean that I should set something like the following in compara_2x.conf
to prepare the compara_db for gene projection?

{
    TYPE => PAIR_ALIGNER,
    reference_collection_name     => 'human',
    non_reference_collection_name => 'my_genome',
},
{
    TYPE => CHAIN_CONFIG,
    non_reference_collection_name => 'human',
    reference_collection_name     => 'my_genome',
},
{
    TYPE => NET_CONFIG,
    non_reference_collection_name => 'human',
    reference_collection_name     => 'my_genome',
},

By the way, I'm using ensembl-compara version 64.
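
As a toy illustration of the netting property discussed in this thread, here is
a minimal Python sketch. It is not Ensembl or lastz code, and real netting is
more involved; the `Chain` tuple, the `net_reference` function and the region
labels are names invented for this example. It assumes each chain is a scored
interval on the reference genome and assigns every reference base to the
best-scoring chain that covers it, so each reference bp is represented at most
once while a non-reference (here human) region may still be used several times.

```python
from typing import List, NamedTuple, Optional


class Chain(NamedTuple):
    """A simplified, hypothetical chained alignment block."""
    ref_start: int        # 0-based, inclusive, on the reference genome
    ref_end: int          # exclusive
    non_ref_region: str   # label of the aligned non-reference region
    score: int


def net_reference(chains: List[Chain], ref_length: int) -> List[Optional[str]]:
    """For each reference base, keep only the best-scoring chain covering it."""
    best_score: List[Optional[int]] = [None] * ref_length
    best_region: List[Optional[str]] = [None] * ref_length
    for chain in chains:
        start = max(chain.ref_start, 0)
        end = min(chain.ref_end, ref_length)
        for pos in range(start, end):
            if best_score[pos] is None or chain.score > best_score[pos]:
                best_score[pos] = chain.score
                best_region[pos] = chain.non_ref_region
    return best_region


chains = [
    Chain(0, 6, "human:chr1:100-106", 50),
    Chain(4, 10, "human:chr2:200-206", 80),   # overlaps the first chain on the reference
    Chain(12, 15, "human:chr1:100-103", 30),  # reuses part of the same human region
]
net = net_reference(chains, ref_length=15)
```

In this sketch every position of the reference (here standing in for
'my_genome') maps to at most one human region, whereas part of the human chr1
region is used by two different reference intervals.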

On Wed, Apr 18, 2012 at 4:57 PM, Javier Herrero <jherrero at ebi.ac.uk> wrote:

>  Dear Zhang
>
> The query/target naming has always been quite confusing in the pairwise
> alignment pipeline. When running an alignment, you use a sequence (query)
> to look for similar regions in another sequence (target genome). The
> pairwise alignment pipeline has three major steps: (a) raw alignments; (b)
> chaining; and (c) netting. The chaining step tries to link all the raw
> alignments that are in the same order and orientation to create a longer
> structure called a chain. The netting step requires you to define a target
> genome, such that the so-called nets are the subset of chains that form the
> best-in-genome alignment. In other words, the final set will provide you,
> for each bp of the target genome, with the best match on the other genome.
>
> As I said earlier, query and target are often confusing terms. To make
> the situation worse, we used to run the pipeline such that the query for
> the raw alignment step was the target for the chaining and netting steps
> and vice versa. We have now changed the way we refer to both sequences and
> call them reference and non-reference genomes. We find that nomenclature
> less confusing.
>
> Kind regards
>
> Javier
>
>
> On 18/04/12 09:43, Dan Barrell wrote:
>
> Hi,
>
> The document low_coverage_gene_build.txt is quite old and possibly very
> out of date as we no longer build on low coverage genomes in Ensembl. As
> far as I know, the reason that the semantics of the reference and target
> terms got swapped is to do with the importance of directionality in a Net.
> When dealing with the low coverage genomes the idea was that they wanted
> the species they were projecting onto as the reference because it is
> important that each bp in the target species aligns to at most one location
> in the reference species.
>
> I would suggest you also look at the Ensembl Compara documentation which
> is maintained here:
>
> ensembl-compara/docs/README-low-coverage-genome-aligner
>
> Dan
>
> On 17/04/12 12:47, Zhang Di wrote:
>
> Hi,
>
>  I'm using the Ensembl pipeline for a projection gene build.
>
>  When I read the doc low_coverage_gene_build.txt, I was confused by the
> target/query genome terms.
>
>  It calls our newly sequenced genome the target and the reference
> genome the query.
>
>  This is contrary to the lastz terms, where target means the reference and
> query means our own sequences.
>
>  That is fine as long as I stick to this convention.
>
>  However, in the whole genome alignment section of the same doc, it says:
>
>      "each bp in the target genome should be represented at most once."
>
>  Which genome does "target" refer to here?
>
>  In the lastz-chain-net output, it is the genome that lastz calls the
> "target genome" that has this property.
>
>  Does that mean I should set my genome as the reference genome, and the
> genome from Ensembl, such as "human", as the non-reference in the
> compara/hive pipeline?
>
>  Can I still project human genes onto my genome with this somewhat weird
> setting in the next wga2genes step?
>
>  Some slices of the human genome that contain genes may then appear several
> times in the compara_db. How can the gene projection work correctly in
> that case?
>
>
>  Thanks
>
>  Best regards
>
>  --
> Zhang Di
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
> --
> Javier Herrero, PhD
> Ensembl Coordinator and Ensembl Compara Project Leader
> European Bioinformatics Institute (EMBL-EBI)
> Wellcome Trust Genome Campus, Hinxton
> Cambridge - CB10 1SD - UK
>


-- 
Zhang Di