[ensembl-dev] Speeding up Bio::DB::Fasta::subseq (was Re: Thoughts on Speeding up the Variant Effect Predictor)

Wed Jan 7 21:29:13 GMT 2015

Some corrections. Coding overlap in C is about a 100% improvement. (And the
code looks about the same in C if you remove '$'s and change "and" to
"&&").

If I have this right, (and I may not so please double check with your
tests), this reduced the time taken
in Bio::EnsEMBL::Variation::Utils::VariationEffect from  executing 8574029
statements in 19.6s to executing 5631868 statements in 12.5s. The specific
time in _intron_effects went from 7.87s to 5.11s, while the time in overlap
went from 1.45s to .638ms. I don't understand how the speedup in overlap
appears to cause a bigger speedup overall.

Details of the benchmark changes are at
https://gist.github.com/rocky/6083dedc752c197875ca
while the overall run is
http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/

On Fri, Jan 2, 2015 at 11:39 PM, Rocky Bernstein <rocky.bernstein at gmail.com>
wrote:

> A big-picture question before some small nuts and bolts.
>
> SnpEff  http://snpeff.sourceforge.net/ is about an order of magnitude
> faster than VEP. Yes, I realize they work at different levels, but isn't
> the level of difficulty and size data sizes that they work on roughly
> equivalent? I've heard people express the feeling that because the problems
> "feel" about the same in size and complexity they VEP should be running at
> about competitive speed. Or with in a factor or so.
>
> I honestly don't know, and I'd like to understand this better. So I'd
> appreciate thoughts and comments on this.
>
>
> Okay. now to nuts and bolts. Occasionally I'll look at VEP performance
> data mentioned before. And this has led me to look
> at Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap . See:
> http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531
>
> The overlap code is basically returning the "and" of two integer
> comparisons. I tried coding overalp in C and got a 6% speedup - not that
> great. But now consider
> this code in
> Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects that
> calls overlap:
>
>         if ( overlap($vf_start, $vf_end, $intron_start-3, $intron_start-1)
> or
>                  overlap($vf_start, $vf_end, $intron_start+2,
> $intron_start+7) or
>                  overlap($vf_start, $vf_end, $intron_end-7,
> $intron_end-2  ) or
>                  overlap($vf_start, $vf_end, $intron_end+1,
> $intron_end+3  ) or
>                  ($insertion && (
>                      $vf_start == $intron_start ||
>                      $vf_end == $intron_end ||
>                      $vf_start == $intron_start+2 ||
>                      $vf_end == $intron_end-2
>                     ) )) {
>
> If you inline the overlap code here, you'd get basically
> * 4 comparisons of $vf_start with $intron_start
> * 3 comparisons of $vf_end with $intron_end
> * 2 comparisons of $vf_start with $intron_end
>
> And since $intron_end is not less than $intron_start , it is possible that
> if $vf_start is greater than $intron_end, it will also have to be greater
> than $intron_start as well, eliminating possibly 4 comparisons.
>
> So the logic could be rewritten here. Having good tests of that replaced
> logic is in my opinion crucial, as is keeping the old code above around in
> case one wants to change or experiment with things.
>
> But what I might consider doing is coding it all in C and combine the 4
> overlap calls into one. My guess is that C will also benefit from keeping
> the values referred to above in registers.
>
> Again all of this is messy so I invite thoughts on this before undertaking
> something this messy.
>
> In closing though I'll mention that the Inline C code has been merged into
> bioperl_live.
> https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611
>
> The way that works is that there is a recommended dependency on C::Inline.
> If C::Inline is around, a Perl subroutine gets overwritten with a routine
> of the same name implemented in C. I imagine that if this goes forward, it
> would do likewise.
>
> Okay, enough babble. Time to hear from you all...
>
> But
>
>
> On Tue, Dec 23, 2014 at 4:28 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>
>> Thanks again Rocky, your work on this is really appreciated, and great to
>> see such an improvement for such a minor change!
>>
>> If there's any other code you'd like to share, or any changes to ours,
>> please feel free to send us more details or put in a pull request on GitHub.
>>
>> Thanks
>>
>> Will
>>
>> On 23 December 2014 at 03:26, Rocky Bernstein <rocky.bernstein at gmail.com>
>> wrote:
>>
>>> Just a follow-up to my earlier post.
>>>
>>> I ran a Variant Effect Prediction  run on a VCF file of 5000 entries
>>> (which is what fits in one buffer read)  with one small change. With that,
>>> I was able to significantly significantly reduce the time bottleneck in the
>>> Fasta code. The time spent here went from 7.76 seconds to 2.32 seconds.
>>>
>>> Compare the top line of:
>>> http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html
>>> with:
>>>
>>> http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html
>>>
>>> You get a 50% reduction just by the fact that one transformation is
>>> needed to remove both \n and \r rather than two transformations. But even
>>> beyond this, the C code for one run is still faster than the corresponding
>>> Perl s///.
>>>
>>> The specific change that I made can be found at
>>> https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff
>>> You'll also see benchmarks for other variations of that code.
>>>
>>> But.... in order to see the effect in a run you need to have Perl module
>>> Inline::C installed. Otherwise you get a lesser improvement outlined in my
>>> original posting.  Again this speeds things up by compiling once Perl
>>> regular expressions used to match \n and \r.
>>>
>>> In the spirit of open scientific review, I am curious to learn of others
>>> experience the same kind of improvement I saw.
>>>
>>> I have a pull request for this change to the bioperl-live repository.
>>> See https://github.com/bioperl/bioperl-live/issues/95 . However I note
>>> that the Bio::DB code used by  Variant Effect Predictor is a different
>>> (back-level) from the code in that git repository. The diff file in the
>>> gist cited above is for the Fasta.pm code that is in Ensembl ; of course,
>>> the pull request uses the current Bio::DB code.
>>>
>>>
>>> Lastly http://dustyfeet.com:8001 has the profile results other kinds of
>>> runs which I hope will clarify my other remarks about where things are
>>> slow.
>>>
>>>
>>> On Thu, Dec 18, 2014 at 12:48 AM, Rocky Bernstein <
>>> rocky.bernstein at gmail.com> wrote:
>>>>
>>>> Running the Variant Effect Predictor on a Human Genome VCF file (130780
>>>> lines)  with a local Fasta cache (--offline) takes about 50 minutes on a
>>>> quad-core Ubuntu box.
>>>>
>>>> I could give more details, but I don't think they are that important.
>>>>
>>>> In looking at how to speed this up, it looks like VEP goes through the
>>>> VCF file,  is sorted by chromosome, and processes each
>>>> Chromosome independently. The first obvious way to speed this up would
>>>> be to do some sort of 24-way map/reduce.
>>>> There is of course the --fork option on the variant_effect_predictor.pl
>>>> program which is roughly the same idea, but it parallelizes only across the
>>>> cores of a single computer rather than make use of multiple ones.
>>>>
>>>> To pinpoint the slowness better, I used Devel::NYTProf. For those of
>>>> you who haven't used it recently, it now has flame graphs and it makes it
>>>> very easy to see what's going on.
>>>>
>>>> The first thing that came out was a slowness in code to remove carriage
>>>> returns and line feeds. This is in Bio::DB::Fasta ::subseq:
>>>>
>>>>      $data =~ s/\n//g;
>>>>      $data =~ s/\r//g;
>>>>
>>>> Compiling the regexp, e.g:
>>>>
>>>>      my $nl = qr/\n/;
>>>>      my $cr = qr/\r/;
>>>>
>>>>      sub subseq {
>>>>          ....
>>>>         $data =~ s/$nl//g;
>>>>         $data =~ s/$cr//g;
>>>>      }
>>>>
>>>> Speeds up the subseq method by about 15%. I can elaborate more or
>>>> describe the other methods I tried and how they fared, if there's interest.
>>>> But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll
>>>> try to work up a git pull request ont that repository.
>>>>
>>>> So now I come to the meat of what I have to say. I should have put this
>>>> at the top -- I hope some of you are still with me.
>>>>
>>>> The NYTProf graphs seem to say that there is a *lot* of overhead in
>>>> object lookup and type testing. I think some of this is already known as
>>>> there already are calls to "weaken" and "new_fast" object creators. And
>>>> there is this comment in
>>>>  Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:
>>>>
>>>>
>>>>     # this method is a major bottle neck in the effect calculation code
>>>> so
>>>>     # we cache results and use local variables instead of method calls
>>>> where
>>>>     # possible to speed things up - caveat bug-fixer!
>>>>
>>>> In the few cases guided by NYTProf, I've been able to make reasonable
>>>> speed ups at the expense of eliminating the tests
>>>> and object overhead.
>>>>
>>>> For example, in EnsEMBL::Variation::BaseTranscriptVariation changing:
>>>>
>>>>
>>>>  sub transcript {
>>>>      my ($self, $transcript) = @_;
>>>>      assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if $transcript;
>>>>      return $self->SUPER::feature($transcript, 'Transcript');
>>>> }
>>>>
>>>> to:
>>>>
>>>>      sub transcript {
>>>>          my ($self, $transcript) = @_;
>>>>         return $self->{feature};
>>>>
>>>> Gives a noticeable speed up. But you may ask: if that happens, then we
>>>> lose type safety and there is a potential for bugs?
>>>> And here's my take on how to address these valid concerns. First, I
>>>> think there could be two sets of the Perl modules, such as for
>>>> EnsEMBL::Variation::BaseTranscriptVariation - those with all of the
>>>> checks and those that are fast.  A configuration parameter might specify
>>>> which version to use. In development or by default, one might use the ones
>>>> that check types.
>>>>
>>>> Second and perhaps more import, there are the tests! If more need to be
>>>> added, then let's add them. And one can always add a test to make sure the
>>>> results of the two versions gives the same result.
>>>>
>>>> One last avenue of optimization that I'd like to explore is using say
>>>> Inline::C or basically coding in C hot spots. In particular, consider
>>>> Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap which looks
>>>> like this:
>>>>
>>>>          my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;
>>>>          return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );
>>>>
>>>> I haven't tried it on this hot spot, but this is something that might
>>>> benefit from getting coded in C. Again the trade off for speed here is a
>>>> dependency on compiling C. In my view anyone installing this locally or
>>>> installing CPAN modules probably already does, but it does add complexity.
>>>>
>>>> Typically, this is handled in Perl by providing both versions, perhaps
>>>> as separate modules.
>>>>
>>>> Thought or comments?
>>>>
>>>> Thanks,
>>>>    rocky
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150107/23d5886b/attachment.html>