[ensembl-dev] accessing the tilepath entries programatically

Duarte Molha duartemolha at gmail.com
Wed Jul 8 15:43:22 BST 2015


Thank you for your clear explanation Thibaut , but running the risk that
you are probably going to hate me by the end of this thread, your script is
not working very well for me.

Here is how I implemented it (assume @input_queries is just an array of
clone names ):

 foreach my $query (@input_queries){ chomp $query; if
($parameters->{assembly} == "GRCh37"){ my $clones =
$mf_adaptor->fetch_all_by_attribute_type_value( 'clone_name', $query );
while ( my $clone = shift @{$clones} ) { foreach my $attribute
(@{$clone->get_all_Attributes}) { if ($attribute->name eq 'embl_acc'){ my
$slice = $slice_adaptor->fetch_by_region('clone', $attribute->value); # You
need to project on toplevel to get the proper start and end my $projection
= $slice->project('toplevel'); foreach my $segment (@$projection) { print
join("\t", $segment->to_Slice->seq_region_name, $segment->to_Slice->start,
$segment->to_Slice->end, $query."\n"); } } } } }else{ my $clones =
$mf_adaptor->fetch_all_by_attribute_type_value( 'name', $query ); while (
my $clone = shift @{$clones} ) { my $slice = $clone->feature_Slice; my
$projection = $slice->project('clone'); my $clone_slice; foreach my
$segment (@$projection) { #print join("\t", $segment->to_Slice->name,
$segment->to_Slice->start, $segment->to_Slice->end, $query."\n"); if
(!defined $clone_slice or $clone_slice->length <
$segment->to_Slice->length) { $clone_slice = $segment->to_Slice; # Take the
longest as the clone can overlap other clones } } if ($clone_slice){ my
$projection_clone = $clone_slice->project('toplevel'); foreach my
$segment_clone (@$projection_clone) {
print join("\t", "chr".$slice->seq_region_name,
$segment_clone->to_Slice->start, $segment_clone->to_Slice->end,
$query."\n"); #print join("\t", $segment_clone->to_Slice->name,
$segment_clone->to_Slice->start, $segment_clone->to_Slice->end,
$query."\n"); # $segment_clone->to_Slice->name does not give me just the
chr name I wish as bed format output } } } } }


The first issue is that for GRCh37 the script outputs nothing!

The second issue is that querying the GRCh38 dataset the script outputs
what I want, but outputs more than 1 entry per clone. I assumed that it
would only retain the largest one. But for example the clone RP11-109F5
outputs:

chr5    7492637     7622836     RP11-109F5
chr5    7492637     7622844     RP11-109F5

when it should really only output

chr5    7492637     7622844     RP11-109F5 (the largest)

Can you elaborate on the issues I am having ?

Many thanks and once again sorry for all the questions.

Duarte



=========================
     Duarte Miguel Paulo Molha
         http://about.me/duarte
=========================

On 8 July 2015 at 13:59, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:

> Hi Duarte,
> This need a little bit of magic. First a little explanation about the
> tilepath. It represents the "full length" clones in the assembly. = and +
> represents the clones in the assembly, - and * represents the tilepath
>    1              2                          3                    4
> +++++===========+++++++++++++==========
> ---------                    ---------------------------
>           *****************                      *******************
>
> When you use your code with the MiscFeatureAdaptor for clone number 3, it
> will give you the start and end of the tilepath. But as you can see above,
> it overlaps clone 2 and 3 in the assembly. You will need to get the INSDC
> accession so that you can simply ask for a slice with this accession and
> you will have the coordinate in the assembly of this clone.
> But, this is another data problem which will be fixed for release 82, you
> need 2 different way of doing if you are working on release < 76 (GRCh37)
> or release > 76 (GRCh38).
>
> In GRCh37, the INSDC accession was store in the attribute 'embl_acc'.
>  foreach my $attribute (@{$clone->get_all_Attributes}) {
>    if ($attribute->name eq 'embl_acc') {
>       my $slice = $slice_adaptor->fetch_by_region('clone',
> $attribute->value);
>       # You need to project on toplevel to get the proper start and end
>         my $projection = $slice->project('toplevel');
>         foreach my $segment (@$projection) {
>             print join("\t", $segment->to_Slice->seq_region_name,
> $segment->to_Slice->start, $segment->to_Slice->end, $query), "\n";
>         }
>    }
>  }
>
> In GRCh38, you will need to project your clone slice into the clone
> coordinate system:
> my $slice = $clone->feature_Slice;
> my $projection = $slice->project('clone');
> my $clone_slice;
> foreach my $segment (@$projection) {
>     print join("\t", $segment->to_Slice->name, $segment->to_Slice->start,
> $segment->to_Slice->end), "\n";
>     if (!defined $clone_slice or $clone_slice->length <
> $segment->to_Slice->length) {
>         $clone_slice = $segment->to_Slice; # Take the longest as the clone
> can overlap other clones
>     }
> }
> my $projection_clone = $clone_slice->project('toplevel');
> foreach my $segment_clone (@$projection_clone) {
>     print join("\t", $segment_clone->to_Slice->name,
> $segment_clone->to_Slice->start, $segment_clone->to_Slice->end), "\n";
> }
>
> I hope I replied to the question correctly this time
>
> Thanks
> Thibaut
>
> > On 7 Jul 2015, at 10:24, Duarte Molha <duartemolha at gmail.com> wrote:
> >
> > Thanks Thibaut
> >
> > what I want to know is:  "the genomic coordinates of the clones in
> assembly".
> >
> > Cheers
> >
> > Duarte
> >
> > =========================
> >      Duarte Miguel Paulo Molha
> >          http://about.me/duarte
> > =========================
> >
> > On 7 July 2015 at 09:55, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
> > Hi Duarte,
> > Sorry for the late reply. This is not a problem with the API but a data
> problem.
> > When you do the call $mf_adaptor->fetch_all_by_attribute_type_value(
> 'Name', $query ) you're asking for all the clones (misc features) which
> have the attribute 'Name'. So if we haven't store any data with this
> attribute, the API will return nothing but it will have done it's job. We
> used a different attribute in this case which was a error. It will be fixed
> for release 82. Sorry for the problem it's causing you.
> >
> > Before replying to your previous question about my code snippet, I may
> have misunderstood what you wanted to do. By printing the clone
> coordinates, do you want to know the coordinates of the clones in the
> assembly or do you want to know where the clones overlap the assembly (this
> is what the tilepath is)?
> >
> > Regards
> > Thibaut
> >
> >> On 7 Jul 2015, at 09:20, Duarte Molha <duartemolha at gmail.com> wrote:
> >>
> >> Anyone? Could you help me understand why the changing behavior of the
> same API between datasets?
> >>
> >> =========================
> >>      Duarte Miguel Paulo Molha
> >>          http://about.me/duarte
> >> =========================
> >>
> >> On 3 July 2015 at 17:01, Duarte Molha <duartemolha at gmail.com> wrote:
> >> Thanks Magali
> >>
> >>
> >> Can you explain something to me?
> >>
> >>
> >> You are now keeping the api compatible with both GRCH37 and GRCH38.
> This is great because I can use my scripts with the latest API and not
> worry about having to use an older API to query the older assembly. However
> I do not understand why, in this case changing ‘clone_name’ to ‘Name’ works
> when querying GRCh38 but fails when querying GRCh37.
> >>
> >>
> >>  Shouldn't the API calls be the same for both datasets. This means  I
> have to change my code depending on what database I am querying. Isn't this
> what the move to update the api for both datasets is trying to avoid?
> >>
> >>
> >>
> >> Best regards
> >>
> >>
> >>
> >> Duarte
> >>
> >>
> >>
> >>
> >> =========================
> >>      Duarte Miguel Paulo Molha
> >>          http://about.me/duarte
> >> =========================
> >>
> >> On 2 July 2015 at 16:40, mag <mr6 at ebi.ac.uk> wrote:
> >> Hi Duarte,
> >>
> >> Replacing 'clone_name' with 'Name' as Thibaut suggested works for me
> for GRCh38.
> >>
> >> my $clones =  $mf_adaptor->fetch_all_by_attribute_type_value( 'Name',
> $query );
> >>
> >> while ( my $clone = shift @{$clones} ) {
> >>   my $slice = $clone->slice();
> >>   print join "\t", ("chr".$slice->seq_region_name(), $clone->start(),
> $clone->end() , $query."\n");
> >> }
> >>
> >>
> >> Regards,
> >> Magali
> >>
> >>
> >> On 01/07/2015 18:15, Duarte Molha wrote:
> >>> I would still appreciate some help with this query. If possible.
> >>>
> >>> On 30 Jun 2015 16:29, "Duarte Molha" <duartemolha at gmail.com> wrote:
> >>> Thibaut... Could you expand on how I can change my script to make it
> work with the new assembly?
> >>> I have just realised that the reason I am no getting 60 BAC entries is
> because their are only present in GRCh38 and not on the GRCh37
> >>>
> >>> Can you tell me how I can modify my script to work with the new
> assembly?
> >>>
> >>> I don't seem to understand the projection method you are using.
> >>> Here is the relevant part of my script
> >>>
> >>> my $mf_adaptor         = $registry->get_adaptor( 'Human', 'Core',
> 'MiscFeature' );
> >>>
> >>> open (IN, ,"<", $options->{list})|| die "Could not open
> ".$options->{list}." for reading \n";
> >>> my @input_queries = <IN>;
> >>> close IN;
> >>>
> >>> foreach my $query (@input_queries){
> >>>  chomp $query;
> >>>  my $clones =  $mf_adaptor->fetch_all_by_attribute_type_value(
> 'clone_name', $query );
> >>>
> >>>  while ( my $clone = shift @{$clones} ) {
> >>>  my $slice = $clone->slice();
> >>>  print join "\t", ("chr".$slice->seq_region_name(), $clone->start(),
> $clone->end() , $query."\n");
> >>>  }
> >>> }
> >>>
> >>>
> >>> Best regards
> >>>
> >>> Duarte
> >>>
> >>> =========================
> >>>      Duarte Miguel Paulo Molha
> >>>          http://about.me/duarte
> >>> =========================
> >>>
> >>> On 30 June 2015 at 15:46, Duarte Molha <duartemolha at gmail.com> wrote:
> >>> no. That does not get anything.
> >>>
> >>>
> >>>
> >>> =========================
> >>>      Duarte Miguel Paulo Molha
> >>>          http://about.me/duarte
> >>> =========================
> >>>
> >>> On 30 June 2015 at 14:50, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
> >>> If you use name instead of clone_name, does it fetches the missing one?
> >>>
> >>> Cheers
> >>> Thibaut
> >>>
> >>>> On 30 Jun 2015, at 14:27, Duarte Molha <duartemolha at gmail.com> wrote:
> >>>>
> >>>> Yes I am using the GRCh37 Thibaut  ... so I am ok for now... but it
> is good to know this does not work with the latest assembly.
> >>>> However... can you please answer my question regarding the missing
> clones like  RP11-155D3 ... why can I not fetch this when it is clearly on
> the database?
> >>>>
> >>>> Thanks
> >>>>
> >>>> Duarte
> >>>>
> >>>>
> >>>>
> >>>> =========================
> >>>>      Duarte Miguel Paulo Molha
> >>>>          http://about.me/duarte
> >>>> =========================
> >>>>
> >>>> On 30 June 2015 at 14:12, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
> >>>> My first question should have been which assembly are you using...
> >>>>
> >>>> So yes this will work for GRCh37. Unfortunately it will not work for
> GRCh38 but this is something that we will fix for release 82.
> >>>>
> >>>> So in the case of GRCh38, it is still possible but more complicated.
> It should work by getting the slice then projecting on the clone coordinate
> system
> >>>>
> >>>> $subSlice = $misc_clone->feature_Slice;
> >>>> $projectionSegment = $subSlice->project('clone')
> >>>>
> >>>> Cheers
> >>>> Thibaut
> >>>>
> >>>>> On 30 Jun 2015, at 13:56, Duarte Molha <duartemolha at gmail.com>
> wrote:
> >>>>>
> >>>>> Nevermind... after searching for miscFeatures information I found
> the relevant part in the api tutorial
> >>>>>
> >>>>> Just for reference to anyone that has the same difficulties here is
> the relevant portion of the code I used:
> >>>>> (please let me know if there is something I did wrong Thibaut)
> >>>>>
> >>>>> my $mf_adaptor         = $registry->get_adaptor( 'Human', 'Core',
> 'MiscFeature' );
> >>>>>
> >>>>> open (IN, ,"<", $options->{list})|| die "Could not open
> ".$options->{list}." for reading \n";
> >>>>> my @input_queries = <IN>;
> >>>>> close IN;
> >>>>>
> >>>>> foreach my $query (@input_queries){
> >>>>>
> >>>>>
> >>>>> chomp $query;
> >>>>>
> >>>>>
> >>>>> my $clones =  $mf_adaptor->fetch_all_by_attribute_type_value(
> 'clone_name', $query );
> >>>>>
> >>>>>
> >>>>>
> >>>>> while ( my $clone = shift @{$clones} ) {
> >>>>>
> >>>>>
> >>>>> my $slice = $clone->slice();
> >>>>>
> >>>>>
> >>>>> print join "\t", ("chr".$slice->seq_region_name(), $clone->start(),
> $clone->end() , $query."\n");
> >>>>>
> >>>>>
> >>>>> }
> >>>>> }
> >>>>>
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> Duarte
> >>>>>
> >>>>> =========================
> >>>>>      Duarte Miguel Paulo Molha
> >>>>>          http://about.me/duarte
> >>>>> =========================
> >>>>>
> >>>>> On 30 June 2015 at 13:26, Duarte Molha <duartemolha at gmail.com>
> wrote:
> >>>>> Many thanks Thibaut
> >>>>>
> >>>>> So... in regards to your question...
> >>>>>
> >>>>> How can I query a specific clone and its correct coordinates if I
> know  the clone ID.
> >>>>>
> >>>>> For example
> >>>>>
> >>>>> assuming this clone:
> >>>>>  RP11-100N21
> >>>>>
> >>>>> In other words , how to I query the underlying clone dataset and
> output those clones in genomic coordinates?
> >>>>>
> >>>>> Many thanks
> >>>>>
> >>>>> Duarte
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> =========================
> >>>>>      Duarte Miguel Paulo Molha
> >>>>>          http://about.me/duarte
> >>>>> =========================
> >>>>>
> >>>>> On 30 June 2015 at 13:15, Thibaut Hourlier <thibaut at ebi.ac.uk>
> wrote:
> >>>>> Hi Duarte,
> >>>>> The clone names are stored in the misc_* tables. So you need to use
> the MiscFeatureAdaptor,
> http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1DBSQL_1_1MiscFeatureAdaptor.html
> :
> >>>>>
> >>>>> my $misc_clones = $mfa->fetch_all_by_Slice_and_set_code('tilepath');
> >>>>> foreach my $clone (@$misc_clones) {
> >>>>>  print join("\t", $clone->slice->seq_region_name, $clone->start,
> $clone->end, @{$clone->get_all_attribute_values('name')}), "\n";
> >>>>> }
> >>>>>
> >>>>> A warning though, this is the tilepath so the boundaries of the
> clones are different from the contigs/clones in the assembly as sometimes
> they didn't use the entire clone for the assembly
> >>>>>
> >>>>> Hope this help
> >>>>>
> >>>>> Thibaut
> >>>>>
> >>>>> > On 30 Jun 2015, at 11:50, Duarte Molha <duartemolha at gmail.com>
> wrote:
> >>>>> >
> >>>>> > I used this code to get all the gebnomic coordinates of your
> subcontigs:
> >>>>> >
> >>>>> >
> >>>>> > my @slices = @{ $slice_adaptor->fetch_all('clone') };
> >>>>> > foreach my $slice (@slices){
> >>>>> >       $progress->update();
> >>>>> >       my $clone_name =  $slice->seq_region_name();
> >>>>> >       my $projection = $slice->project('toplevel');
> >>>>> >       foreach my $segment ( @{$projection} ) {
> >>>>> >               my $to_slice = $segment->to_Slice();
> >>>>> >               print join "\t",
> ("chr".$to_slice->seq_region_name(), $to_slice->start(), $to_slice->end(),
> $clone_name."\n");
> >>>>> >       }
> >>>>> > }
> >>>>> >
> >>>>> > However, by doing this, the database does not fetch the original
> clone name
> >>>>> >
> >>>>> > for example.. using this script I get
> >>>>> > chr4    47567235        47733411        AC092597.1
> >>>>> >
> >>>>> > However I would like to get :
> >>>>> >
> >>>>> > chr4    47567235        47733411        RP11-100N21
> >>>>> >
> >>>>> > Can someone explain what I am doing wrong?
> >>>>> >
> >>>>> > Thanks
> >>>>> >
> >>>>> > Duarte
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > =========================
> >>>>> >      Duarte Miguel Paulo Molha
> >>>>> >          http://about.me/duarte
> >>>>> > =========================
> >>>>> >
> >>>>> > On 30 June 2015 at 09:45, Duarte Molha <duartemolha at gmail.com>
> wrote:
> >>>>> > Dear devs
> >>>>> >
> >>>>> > How can I search for a specific clone id present on your tilepath
> >>>>> >
> >>>>> > for example this: RP5-892C22
> >>>>> >
> >>>>> > I would like to use the perl API if possible
> >>>>> >
> >>>>> > Many thanks
> >>>>> >
> >>>>> > Duarte
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > =========================
> >>>>> >      Duarte Miguel Paulo Molha
> >>>>> >          http://about.me/duarte
> >>>>> > =========================
> >>>>> >
> >>>>> > _______________________________________________
> >>>>> > Dev mailing list    Dev at ensembl.org
> >>>>> > Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>> > Ensembl Blog: http://www.ensembl.info/
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Dev mailing list    Dev at ensembl.org
> >>>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>> Ensembl Blog: http://www.ensembl.info/
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Dev mailing list    Dev at ensembl.org
> >>>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>>> Ensembl Blog: http://www.ensembl.info/
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Dev mailing list    Dev at ensembl.org
> >>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>> Ensembl Blog: http://www.ensembl.info/
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Dev mailing list    Dev at ensembl.org
> >>>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>>> Ensembl Blog: http://www.ensembl.info/
> >>>
> >>>
> >>> _______________________________________________
> >>> Dev mailing list    Dev at ensembl.org
> >>> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >>> Ensembl Blog: http://www.ensembl.info/
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Dev mailing list
> >>> Dev at ensembl.org
> >>>
> >>> Posting guidelines and subscribe/unsubscribe info:
> >>> http://lists.ensembl.org/mailman/listinfo/dev
> >>>
> >>> Ensembl Blog:
> >>> http://www.ensembl.info/
> >>
> >>
> >> _______________________________________________
> >> Dev mailing list    Dev at ensembl.org
> >> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >> Ensembl Blog: http://www.ensembl.info/
> >>
> >>
> >>
> >> _______________________________________________
> >> Dev mailing list    Dev at ensembl.org
> >> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> >> Ensembl Blog: http://www.ensembl.info/
> >
> >
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> > Ensembl Blog: http://www.ensembl.info/
> >
> >
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> > Ensembl Blog: http://www.ensembl.info/
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150708/b3f486a4/attachment.html>


More information about the Dev mailing list