[ensembl-dev] Unexpected behaviour of fetch_all_by_outward_search

Asier Gonzalez gonzaleza at ebi.ac.uk
Thu Jan 23 12:18:12 GMT 2020


Hi all,

I am bringing this old issue up again because I have found another 
behaviour of the function that I cannot understand. Variant rs112156520 
<https://www.ensembl.org/Homo_sapiens/Variation/Explore?r=9:7873518-7874518;v=rs112156520;vdb=variation;vf=693077069> 
overlaps one of the three transcripts of DMAC (ENSG00000137038) but 
calling fetch_by_outward_search with the following parameters does not 
return ENSG00000137038:

my @gene_list_for_feature  = @{$gene_adaptor->fetch_all_by_outward_search(
                                                    -FEATURE => $var_feature,
                                                    -RANGE =>10000,
                                                    -MAX_RANGE =>500000,
                                                    -LIMIT =>40,
                                                    -FIVE_PRIME =>1)};

The list of genes that it returns is:

ENSG00000226376
ENSG00000231902
ENSG00000273056
ENSG00000178654
ENSG00000230207
ENSG00000153707

Strangely enough, if I remove the option to measure the distance to the 
5' end I get ENSG00000137038 in the response:

my @gene_list_for_feature  = @{$gene_adaptor->fetch_all_by_outward_search(
-FEATURE => $var_feature,
                                                            -RANGE => 10000,
-MAX_RANGE => 500000,
                                                            -LIMIT => 40)};

I have looked into the code, namely the fetch_all_nearest_by_Feature 
<https://github.com/Ensembl/ensembl/blob/release/99/modules/Bio/EnsEMBL/DBSQL/BaseFeatureAdaptor.pm#L1507-L1571> 
called by fetch_by_outward_search but I have not been able to identify 
why "FIVE_PRIME" would affect whether the overlapping gene is returned 
or not.

In addition, ENSG00000153707 is another case of a gene whose 5' end is 
outside the search range but is still returned. I think that when the 
"FIVE_PRIME" option is set to true this gene should not be returned 
because although the 3' end is within the 500 kb search window (440228 
bp) the 5' end is 2738705 bp away from the feature. Do you have any 
thoughts about this?

Thanks you,
Asier

On 09/08/2019 13:46, Asier Gonzalez wrote:
>
> Hi Brandon,
>
> Apologies for the delay but I have finally had time to make the 
> changes and open a PR (https://github.com/Ensembl/ensembl/pull/407).
>
> Please let me know should there be any issues. I can confirm that I 
> have used my version of the code and it now works as expected.
>
> Best wishes,
> Asier
>
> On 26/06/2019 11:35, Brandon Walts wrote:
>>
>> Hi Asier
>>
>> It's great to hear that you've devised a fix. We would welcome a PR 
>> against master, or to see your changes, whichever you'd prefer. We'll 
>> be able to look at it in the next few days.
>>
>> Best
>> -Brandon
>>
>> On 26/06/2019 11:03, Asier Gonzalez wrote:
>>>
>>> Hi Brandon,
>>>
>>> Thank you for your response. Do you have an idea of when it could be 
>>> fixed? I mean, are we talking about weeks or months? I use a tool 
>>> that calls this function at least every two months so I have amended 
>>> the code to do what I believe it is supposed to do. I could share it 
>>> with you if it would help you, or I could open a PR if you accept 
>>> them. I understand that you may have other priorities, but at least 
>>> I want to make sure that the future version will do what mine 
>>> already does.
>>>
>>> Best wishes,
>>> Asier
>>>
>>> On 26/06/2019 10:56, Brandon Walts wrote:
>>>>
>>>> Hi Asier
>>>>
>>>> We've had a chance to look into it and you are correct, this 
>>>> function is not working as described. As currently implemented, it 
>>>> will return more results than expected. It's on our list to fix, 
>>>> and we plan to get to it in the near future.
>>>>
>>>> Best
>>>> -Brandon
>>>>
>>>> On 26/06/2019 09:51, Asier Gonzalez wrote:
>>>>>
>>>>> Hi Brandon,
>>>>>
>>>>> Do you have any updates about this?
>>>>>
>>>>> Thanks,
>>>>> Asier
>>>>>
>>>>> On 07/06/2019 16:42, Brandon Walts wrote:
>>>>>>
>>>>>> Hi Asier
>>>>>>
>>>>>> Thanks for bringing this up. We will look into what's going on 
>>>>>> and see if there is a bug, if the documentation needs 
>>>>>> improvement, or both.
>>>>>>
>>>>>> Best
>>>>>> -Brandon
>>>>>>
>>>>>> On 07/06/2019 13:47, Asier Gonzalez wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm troubleshooting a Perl tool that calls the Ensembl API with 
>>>>>>> a variant id and tries to find the gene with the closest 5' end 
>>>>>>> within a 500 kb window. The tool was written by a colleague and 
>>>>>>> it uses 
>>>>>>> Bio::EnsEMBL::DBSQL::BaseFeatureAdaptor::fetch_all_by_outward_search() 
>>>>>>> like this:
>>>>>>>
>>>>>>> my @gene_list_for_feature  = @{$gene_adaptor->fetch_all_by_outward_search(
>>>>>>>                                                     -FEATURE => $var_feature,
>>>>>>>                                                     -RANGE =>10000,
>>>>>>>                                                     -MAX_RANGE =>500000,
>>>>>>>                                                     -LIMIT =>40,
>>>>>>>                                                     -FIVE_PRIME =>1)};
>>>>>>>
>>>>>>> According to the documentation of this function 
>>>>>>> (http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1DBSQL_1_1BaseFeatureAdaptor.html#a76a51bc70828aaccb9435eda9a44b20a), 
>>>>>>> it "Searches for features within the suggested -RANGE, and if it 
>>>>>>> finds none, expands the search area until it satisfies -LIMIT or 
>>>>>>> hits -MAX_RANGE". My understanding is that in my case it should 
>>>>>>> search first in a 10 kb window and, if there are no genes, 
>>>>>>> progressively expand it to up to 500 kb unless it finds 40 
>>>>>>> features before. However, this is not the behaviour I am seeing, 
>>>>>>> the search range grows like this: 10k, 20k, 60k, 240k and 1.20M. 
>>>>>>> Is this a bug or have I misundertood what it does?
>>>>>>>
>>>>>>> I have looked into the code of this subroutine 
>>>>>>> (https://github.com/Ensembl/ensembl/blob/release/96/modules/Bio/EnsEMBL/DBSQL/BaseFeatureAdaptor.pm#L1441-L1469) 
>>>>>>> and the search window growths exponentially because it 
>>>>>>> multiplies the previous value instead of the initial value:
>>>>>>>
>>>>>>> [L1452] $search_range = $search_range * $factor;
>>>>>>>
>>>>>>> In addition, it is not true that it only expands the range if it 
>>>>>>> does not find any features in the initial window, which is 
>>>>>>> obvious from looking into the while statement:
>>>>>>>
>>>>>>> [L1451] while (scalar @results < $limit && $search_range <= 
>>>>>>> $max_range) {
>>>>>>>
>>>>>>> I am also confused by the fact that, apparently, the found 
>>>>>>> features only need to be partially within the range. For 
>>>>>>> instance, ENSG00000150394 (CDH8) is found with the above 
>>>>>>> parameters although its 5' prime end is 1,338,771 bp away from 
>>>>>>> the variant according to the distance reported by the function. 
>>>>>>> So, it seems that the feature is found because its 3' end is 
>>>>>>> within the range although the 5' prime end, which is what I am 
>>>>>>> interested in, is not. This somehow contradicts what the 
>>>>>>> documentation says 
>>>>>>> (https://github.com/Ensembl/ensembl/blob/release/96/modules/Bio/EnsEMBL/DBSQL/BaseFeatureAdaptor.pm#L1490-L1491): 
>>>>>>> "When looking beyond the boundaries of the source Feature, the 
>>>>>>> distance is measured to the nearest end of that Feature to the 
>>>>>>> nearby Feature's nearest end."
>>>>>>>
>>>>>>> Any help will be much appreciated. I am happy to share code if 
>>>>>>> you think it would be useful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Asier
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Dev mailing listDev at ensembl.org
>>>>>>> Posting guidelines and subscribe/unsubscribe info:https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>>>>>> Ensembl Blog:http://www.ensembl.info/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20200123/21a83f39/attachment.html>


More information about the Dev mailing list