[ensembl-dev] Speeding up Bio::DB::Fasta::subseq (was Re: Thoughts on Speeding up the Variant Effect Predictor)
Rocky Bernstein
rocky.bernstein at gmail.com
Sat Jan 10 02:38:25 GMT 2015
On Fri, Jan 9, 2015 at 4:50 AM, Andrew Yates <ayates at ebi.ac.uk> wrote:
> That’s also my opinion on any C extensions that make their way into
> Ensembl that in all cases code can fall back to a pure Perl implementation.
>
Ok. I guess everyone is in agreement. :-)
> This technique is already in operation in the core API as a library we
> wrote (ensembl-xs*) reimplements a number of methods including rearrange.
>
> Andy
>
> * https://github.com/Ensembl/ensembl-xs
>
I looked at this the functions selected there were good ones to have in C
as they showed up as bottlenecks. I did a test run and got a 10% reduction
in time.
For the record, the places for biggest improvement in Perl code seem to be
Bio:BaseVariation::Feature::OverlapAllele and Bio::EnsEMBL::Feature which
are in different phases of the code.
As always, it would be great to give independent confirmation.
>
>
> On 8 Jan 2015, at 17:49, Fields, Christopher J <cjfields at illinois.edu>
> wrote:
>
> Speaking from the Bioperl end, I think any improvements like this are
> more than welcome. The main issue is ensuring pure perl fallback method
> work fine if Inline::C is not available.
>
> chris
>
> On Jan 8, 2015, at 10:56 AM, Sarah Hunt <seh at ebi.ac.uk> wrote:
>
>
> Hi Rocky,
>
> Thanks for this - it's good to see you are noticing significant
> improvements when switching these functions. As Will said, we have started
> discussing possible VEP optimisation strategies recently, so are very keen
> to hear your ideas. That said, half of our team - including Will who is the
> lead on VEP - are away at the moment, so we won't be planning any big
> changes until next month.
>
> All the best,
>
> Sarah
>
> On 07/01/2015 21:29, Rocky Bernstein wrote:
>
> Some corrections. Coding overlap in C is about a 100% improvement. (And
> the code looks about the same in C if you remove '$'s and change "and" to
> "&&").
>
> If I have this right, (and I may not so please double check with your
> tests), this reduced the time taken
> in Bio::EnsEMBL::Variation::Utils::VariationEffect from executing 8574029
> statements in 19.6s to executing 5631868 statements in 12.5s. The specific
> time in _intron_effects went from 7.87s to 5.11s, while the time in overlap
> went from 1.45s to .638ms. I don't understand how the speedup in overlap
> appears to cause a bigger speedup overall.
>
> Details of the benchmark changes are at
> https://gist.github.com/rocky/6083dedc752c197875ca
> while the overall run is
> http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/
>
> On Fri, Jan 2, 2015 at 11:39 PM, Rocky Bernstein <
> rocky.bernstein at gmail.com> wrote:
>
>> A big-picture question before some small nuts and bolts.
>>
>> SnpEff http://snpeff.sourceforge.net/ is about an order of magnitude
>> faster than VEP. Yes, I realize they work at different levels, but isn't
>> the level of difficulty and size data sizes that they work on roughly
>> equivalent? I've heard people express the feeling that because the problems
>> "feel" about the same in size and complexity they VEP should be running at
>> about competitive speed. Or with in a factor or so.
>>
>> I honestly don't know, and I'd like to understand this better. So I'd
>> appreciate thoughts and comments on this.
>>
>>
>> Okay. now to nuts and bolts. Occasionally I'll look at VEP performance
>> data mentioned before. And this has led me to look
>> at Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap . See:
>> http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531
>>
>> The overlap code is basically returning the "and" of two integer
>> comparisons. I tried coding overalp in C and got a 6% speedup - not that
>> great. But now consider
>> this code in
>> Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects that
>> calls overlap:
>>
>> if ( overlap($vf_start, $vf_end, $intron_start-3,
>> $intron_start-1) or
>> overlap($vf_start, $vf_end, $intron_start+2,
>> $intron_start+7) or
>> overlap($vf_start, $vf_end, $intron_end-7,
>> $intron_end-2 ) or
>> overlap($vf_start, $vf_end, $intron_end+1,
>> $intron_end+3 ) or
>> ($insertion && (
>> $vf_start == $intron_start ||
>> $vf_end == $intron_end ||
>> $vf_start == $intron_start+2 ||
>> $vf_end == $intron_end-2
>> ) )) {
>>
>> If you inline the overlap code here, you'd get basically
>> * 4 comparisons of $vf_start with $intron_start
>> * 3 comparisons of $vf_end with $intron_end
>> * 2 comparisons of $vf_start with $intron_end
>>
>> And since $intron_end is not less than $intron_start , it is possible
>> that if $vf_start is greater than $intron_end, it will also have to be
>> greater than $intron_start as well, eliminating possibly 4 comparisons.
>>
>> So the logic could be rewritten here. Having good tests of that
>> replaced logic is in my opinion crucial, as is keeping the old code above
>> around in case one wants to change or experiment with things.
>>
>> But what I might consider doing is coding it all in C and combine the 4
>> overlap calls into one. My guess is that C will also benefit from keeping
>> the values referred to above in registers.
>>
>> Again all of this is messy so I invite thoughts on this before
>> undertaking something this messy.
>>
>> In closing though I'll mention that the Inline C code has been merged
>> into bioperl_live.
>> https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611
>>
>> The way that works is that there is a recommended dependency on
>> C::Inline. If C::Inline is around, a Perl subroutine gets overwritten with
>> a routine of the same name implemented in C. I imagine that if this goes
>> forward, it would do likewise.
>>
>> Okay, enough babble. Time to hear from you all...
>>
>> But
>>
>>
>> On Tue, Dec 23, 2014 at 4:28 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>
>>> Thanks again Rocky, your work on this is really appreciated, and great
>>> to see such an improvement for such a minor change!
>>>
>>> If there's any other code you'd like to share, or any changes to ours,
>>> please feel free to send us more details or put in a pull request on GitHub.
>>>
>>> Thanks
>>>
>>> Will
>>>
>>> On 23 December 2014 at 03:26, Rocky Bernstein <
>>> rocky.bernstein at gmail.com> wrote:
>>>
>>>> Just a follow-up to my earlier post.
>>>>
>>>> I ran a Variant Effect Prediction run on a VCF file of 5000 entries
>>>> (which is what fits in one buffer read) with one small change. With that,
>>>> I was able to significantly significantly reduce the time bottleneck in the
>>>> Fasta code. The time spent here went from 7.76 seconds to 2.32 seconds.
>>>>
>>>> Compare the top line of:
>>>> http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html
>>>> with:
>>>>
>>>> http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html
>>>>
>>>> You get a 50% reduction just by the fact that one transformation is
>>>> needed to remove both \n and \r rather than two transformations. But even
>>>> beyond this, the C code for one run is still faster than the corresponding
>>>> Perl s///.
>>>>
>>>> The specific change that I made can be found at
>>>> https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff
>>>> You'll also see benchmarks for other variations of that code.
>>>>
>>>> But.... in order to see the effect in a run you need to have Perl
>>>> module Inline::C installed. Otherwise you get a lesser improvement outlined
>>>> in my original posting. Again this speeds things up by compiling once Perl
>>>> regular expressions used to match \n and \r.
>>>>
>>>> In the spirit of open scientific review, I am curious to learn of
>>>> others experience the same kind of improvement I saw.
>>>>
>>>> I have a pull request for this change to the bioperl-live repository.
>>>> See https://github.com/bioperl/bioperl-live/issues/95 . However I note
>>>> that the Bio::DB code used by Variant Effect Predictor is a different
>>>> (back-level) from the code in that git repository. The diff file in the
>>>> gist cited above is for the Fasta.pm code that is in Ensembl ; of course,
>>>> the pull request uses the current Bio::DB code.
>>>>
>>>>
>>>> Lastly http://dustyfeet.com:8001 has the profile results other kinds
>>>> of runs which I hope will clarify my other remarks about where things are
>>>> slow.
>>>>
>>>>
>>>> On Thu, Dec 18, 2014 at 12:48 AM, Rocky Bernstein <
>>>> rocky.bernstein at gmail.com> wrote:
>>>>>
>>>>> Running the Variant Effect Predictor on a Human Genome VCF file
>>>>> (130780 lines) with a local Fasta cache (--offline) takes about 50 minutes
>>>>> on a quad-core Ubuntu box.
>>>>>
>>>>> I could give more details, but I don't think they are that
>>>>> important.
>>>>>
>>>>> In looking at how to speed this up, it looks like VEP goes through
>>>>> the VCF file, is sorted by chromosome, and processes each
>>>>> Chromosome independently. The first obvious way to speed this up would
>>>>> be to do some sort of 24-way map/reduce.
>>>>> There is of course the --fork option on the
>>>>> variant_effect_predictor.pl program which is roughly the same idea,
>>>>> but it parallelizes only across the cores of a single computer rather than
>>>>> make use of multiple ones.
>>>>>
>>>>> To pinpoint the slowness better, I used Devel::NYTProf. For those of
>>>>> you who haven't used it recently, it now has flame graphs and it makes it
>>>>> very easy to see what's going on.
>>>>>
>>>>> The first thing that came out was a slowness in code to remove
>>>>> carriage returns and line feeds. This is in Bio::DB::Fasta ::subseq:
>>>>>
>>>>> $data =~ s/\n//g;
>>>>> $data =~ s/\r//g;
>>>>>
>>>>> Compiling the regexp, e.g:
>>>>>
>>>>> my $nl = qr/\n/;
>>>>> my $cr = qr/\r/;
>>>>>
>>>>> sub subseq {
>>>>> ....
>>>>> $data =~ s/$nl//g;
>>>>> $data =~ s/$cr//g;
>>>>> }
>>>>>
>>>>> Speeds up the subseq method by about 15%. I can elaborate more or
>>>>> describe the other methods I tried and how they fared, if there's interest.
>>>>> But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll
>>>>> try to work up a git pull request ont that repository.
>>>>>
>>>>> So now I come to the meat of what I have to say. I should have put
>>>>> this at the top -- I hope some of you are still with me.
>>>>>
>>>>> The NYTProf graphs seem to say that there is a *lot* of overhead in
>>>>> object lookup and type testing. I think some of this is already known as
>>>>> there already are calls to "weaken" and "new_fast" object creators. And
>>>>> there is this comment in
>>>>> Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:
>>>>>
>>>>>
>>>>> # this method is a major bottle neck in the effect calculation
>>>>> code so
>>>>> # we cache results and use local variables instead of method calls
>>>>> where
>>>>> # possible to speed things up - caveat bug-fixer!
>>>>>
>>>>> In the few cases guided by NYTProf, I've been able to make
>>>>> reasonable speed ups at the expense of eliminating the tests
>>>>> and object overhead.
>>>>>
>>>>> For example, in EnsEMBL::Variation::BaseTranscriptVariation
>>>>> changing:
>>>>>
>>>>>
>>>>> sub transcript {
>>>>> my ($self, $transcript) = @_;
>>>>> assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if
>>>>> $transcript;
>>>>> return $self->SUPER::feature($transcript, 'Transcript');
>>>>> }
>>>>>
>>>>> to:
>>>>>
>>>>> sub transcript {
>>>>> my ($self, $transcript) = @_;
>>>>> return $self->{feature};
>>>>>
>>>>> Gives a noticeable speed up. But you may ask: if that happens, then
>>>>> we lose type safety and there is a potential for bugs?
>>>>> And here's my take on how to address these valid concerns. First, I
>>>>> think there could be two sets of the Perl modules, such as for
>>>>> EnsEMBL::Variation::BaseTranscriptVariation - those with all of the
>>>>> checks and those that are fast. A configuration parameter might specify
>>>>> which version to use. In development or by default, one might use the ones
>>>>> that check types.
>>>>>
>>>>> Second and perhaps more import, there are the tests! If more need to
>>>>> be added, then let's add them. And one can always add a test to make sure
>>>>> the results of the two versions gives the same result.
>>>>>
>>>>> One last avenue of optimization that I'd like to explore is using
>>>>> say Inline::C or basically coding in C hot spots. In particular, consider
>>>>> Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap which looks
>>>>> like this:
>>>>>
>>>>> my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;
>>>>> return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );
>>>>>
>>>>> I haven't tried it on this hot spot, but this is something that
>>>>> might benefit from getting coded in C. Again the trade off for speed here
>>>>> is a dependency on compiling C. In my view anyone installing this locally
>>>>> or installing CPAN modules probably already does, but it does add
>>>>> complexity.
>>>>>
>>>>> Typically, this is handled in Perl by providing both versions,
>>>>> perhaps as separate modules.
>>>>>
>>>>> Thought or comments?
>>>>>
>>>>> Thanks,
>>>>> rocky
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Dev mailing list Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150109/4c67404c/attachment.html>
More information about the Dev
mailing list