[ensembl-dev] Speeding up Bio::DB::Fasta::subseq (was Re: Thoughts on Speeding up the Variant Effect Predictor)

Sat Jan 10 02:29:32 GMT 2015

On Thu, Jan 8, 2015 at 12:49 PM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

>  Speaking from the Bioperl end, I think any improvements like this are
> more than welcome.
>

My understanding of where we left off on this is that you had Perl bindings
that someone gave you had for  the FASTA/FASTQ Parser in C.
http://lh3lh3.users.sourceforge.net/parsefastq.shtml .

Since you hadn't gotten back, I assumed that you couldn't' find it. But it
really doesn't matter because I looked at this and it doesn't look like the
API that the C code provides covers much (if any) of the API that is in
Bioperl for FASTA.

More importantly and more broadly, as far as I can tell none of the rest of
Bioperl is using that much time. Oddly, something as simple and stupid as
removing carriage returns and line feeds was taking even more time that was
spent in reading a FASTA file!

Of course, there are no doubt many applications using Bioperl other than
VEP. But unless those are slow and because of something in Bioperl, best to
leave things as they are with respect to speed.

By the way, Francisco J. Ossandon did a fantastic job integrating the code
in.

> The main issue is ensuring pure perl fallback method work fine if
> Inline::C is not available.
>
>  chris
>
>  On Jan 8, 2015, at 10:56 AM, Sarah Hunt <seh at ebi.ac.uk> wrote:
>
>
> Hi Rocky,
>
> Thanks for this - it's good to see you are noticing significant
> improvements when switching these functions. As Will said, we have started
> discussing possible VEP optimisation strategies recently, so are very keen
> to hear your ideas. That said, half of our team - including Will who is the
> lead on VEP - are away at the moment, so we won't be planning any big
> changes until next month.
>
> All the best,
>
> Sarah
>
> On 07/01/2015 21:29, Rocky Bernstein wrote:
>
> Some corrections. Coding overlap in C is about a 100% improvement. (And
> the code looks about the same in C if you remove '$'s and change "and" to
> "&&").
>
>  If I have this right, (and I may not so please double check with your
> tests), this reduced the time taken
> in Bio::EnsEMBL::Variation::Utils::VariationEffect from  executing 8574029
> statements in 19.6s to executing 5631868 statements in 12.5s. The specific
> time in _intron_effects went from 7.87s to 5.11s, while the time in overlap
> went from 1.45s to .638ms. I don't understand how the speedup in overlap
> appears to cause a bigger speedup overall.
>
>  Details of the benchmark changes are at
> https://gist.github.com/rocky/6083dedc752c197875ca
> while the overall run is
> http://dustyfeet.com:8001/VEP-prof-5000-Inline-Overlap-C/
>
> On Fri, Jan 2, 2015 at 11:39 PM, Rocky Bernstein <
> rocky.bernstein at gmail.com> wrote:
>
>>  A big-picture question before some small nuts and bolts.
>>
>>  SnpEff  http://snpeff.sourceforge.net/ is about an order of magnitude
>> faster than VEP. Yes, I realize they work at different levels, but isn't
>> the level of difficulty and size data sizes that they work on roughly
>> equivalent? I've heard people express the feeling that because the problems
>> "feel" about the same in size and complexity they VEP should be running at
>> about competitive speed. Or with in a factor or so.
>>
>>  I honestly don't know, and I'd like to understand this better. So I'd
>> appreciate thoughts and comments on this.
>>
>>
>>  Okay. now to nuts and bolts. Occasionally I'll look at VEP performance
>> data mentioned before. And this has led me to look
>> at Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap . See:
>> http://dustyfeet.com:8001/VEP-prof-chrom1/Bio-EnsEMBL-Variation-BaseTranscriptVariation-pm-218-line.html#531
>>
>>  The overlap code is basically returning the "and" of two integer
>> comparisons. I tried coding overalp in C and got a 6% speedup - not that
>> great. But now consider
>> this code in
>> Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects that
>> calls overlap:
>>
>>          if ( overlap($vf_start, $vf_end, $intron_start-3,
>> $intron_start-1) or
>>                  overlap($vf_start, $vf_end, $intron_start+2,
>> $intron_start+7) or
>>                  overlap($vf_start, $vf_end, $intron_end-7,
>> $intron_end-2  ) or
>>                  overlap($vf_start, $vf_end, $intron_end+1,
>> $intron_end+3  ) or
>>                  ($insertion && (
>>                      $vf_start == $intron_start ||
>>                      $vf_end == $intron_end ||
>>                      $vf_start == $intron_start+2 ||
>>                      $vf_end == $intron_end-2
>>                     ) )) {
>>
>>  If you inline the overlap code here, you'd get basically
>> * 4 comparisons of $vf_start with $intron_start
>> * 3 comparisons of $vf_end with $intron_end
>> * 2 comparisons of $vf_start with $intron_end
>>
>>  And since $intron_end is not less than $intron_start , it is possible
>> that if $vf_start is greater than $intron_end, it will also have to be
>> greater than $intron_start as well, eliminating possibly 4 comparisons.
>>
>>  So the logic could be rewritten here. Having good tests of that
>> replaced logic is in my opinion crucial, as is keeping the old code above
>> around in case one wants to change or experiment with things.
>>
>>  But what I might consider doing is coding it all in C and combine the 4
>> overlap calls into one. My guess is that C will also benefit from keeping
>> the values referred to above in registers.
>>
>>  Again all of this is messy so I invite thoughts on this before
>> undertaking something this messy.
>>
>>  In closing though I'll mention that the Inline C code has been merged
>> into bioperl_live.
>> https://github.com/bioperl/bioperl-live/commit/01ec10dda23b6c5ed7592808cff4ae0d34278611
>>
>>  The way that works is that there is a recommended dependency on
>> C::Inline. If C::Inline is around, a Perl subroutine gets overwritten with
>> a routine of the same name implemented in C. I imagine that if this goes
>> forward, it would do likewise.
>>
>>  Okay, enough babble. Time to hear from you all...
>>
>>  But
>>
>>
>> On Tue, Dec 23, 2014 at 4:28 AM, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>
>>> Thanks again Rocky, your work on this is really appreciated, and great
>>> to see such an improvement for such a minor change!
>>>
>>>  If there's any other code you'd like to share, or any changes to ours,
>>> please feel free to send us more details or put in a pull request on GitHub.
>>>
>>>  Thanks
>>>
>>>  Will
>>>
>>>  On 23 December 2014 at 03:26, Rocky Bernstein <
>>> rocky.bernstein at gmail.com> wrote:
>>>
>>>>  Just a follow-up to my earlier post.
>>>>
>>>>  I ran a Variant Effect Prediction  run on a VCF file of 5000 entries
>>>> (which is what fits in one buffer read)  with one small change. With that,
>>>> I was able to significantly significantly reduce the time bottleneck in the
>>>> Fasta code. The time spent here went from 7.76 seconds to 2.32 seconds.
>>>>
>>>>  Compare the top line of:
>>>> http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html
>>>> with:
>>>>
>>>> http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html
>>>>
>>>>  You get a 50% reduction just by the fact that one transformation is
>>>> needed to remove both \n and \r rather than two transformations. But even
>>>> beyond this, the C code for one run is still faster than the corresponding
>>>> Perl s///.
>>>>
>>>>  The specific change that I made can be found at
>>>> https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff
>>>> You'll also see benchmarks for other variations of that code.
>>>>
>>>>  But.... in order to see the effect in a run you need to have Perl
>>>> module Inline::C installed. Otherwise you get a lesser improvement outlined
>>>> in my original posting.  Again this speeds things up by compiling once Perl
>>>> regular expressions used to match \n and \r.
>>>>
>>>>  In the spirit of open scientific review, I am curious to learn of
>>>> others experience the same kind of improvement I saw.
>>>>
>>>>  I have a pull request for this change to the bioperl-live repository.
>>>> See https://github.com/bioperl/bioperl-live/issues/95 . However I note
>>>> that the Bio::DB code used by  Variant Effect Predictor is a different
>>>> (back-level) from the code in that git repository. The diff file in the
>>>> gist cited above is for the Fasta.pm code that is in Ensembl ; of course,
>>>> the pull request uses the current Bio::DB code.
>>>>
>>>>
>>>>  Lastly http://dustyfeet.com:8001 has the profile results other kinds
>>>> of runs which I hope will clarify my other remarks about where things are
>>>> slow.
>>>>
>>>>
>>>> On Thu, Dec 18, 2014 at 12:48 AM, Rocky Bernstein <
>>>> rocky.bernstein at gmail.com> wrote:
>>>>>
>>>>> Running the Variant Effect Predictor on a Human Genome VCF file
>>>>> (130780 lines)  with a local Fasta cache (--offline) takes about 50 minutes
>>>>> on a quad-core Ubuntu box.
>>>>>
>>>>>  I could give more details, but I don't think they are that
>>>>> important.
>>>>>
>>>>>  In looking at how to speed this up, it looks like VEP goes through
>>>>> the VCF file,  is sorted by chromosome, and processes each
>>>>> Chromosome independently. The first obvious way to speed this up would
>>>>> be to do some sort of 24-way map/reduce.
>>>>> There is of course the --fork option on the
>>>>> variant_effect_predictor.pl program which is roughly the same idea,
>>>>> but it parallelizes only across the cores of a single computer rather than
>>>>> make use of multiple ones.
>>>>>
>>>>>  To pinpoint the slowness better, I used Devel::NYTProf. For those of
>>>>> you who haven't used it recently, it now has flame graphs and it makes it
>>>>> very easy to see what's going on.
>>>>>
>>>>>  The first thing that came out was a slowness in code to remove
>>>>> carriage returns and line feeds. This is in Bio::DB::Fasta ::subseq:
>>>>>
>>>>>       $data =~ s/\n//g;
>>>>>      $data =~ s/\r//g;
>>>>>
>>>>>  Compiling the regexp, e.g:
>>>>>
>>>>>       my $nl = qr/\n/;
>>>>>      my $cr = qr/\r/;
>>>>>
>>>>>       sub subseq {
>>>>>          ....
>>>>>         $data =~ s/$nl//g;
>>>>>         $data =~ s/$cr//g;
>>>>>      }
>>>>>
>>>>>  Speeds up the subseq method by about 15%. I can elaborate more or
>>>>> describe the other methods I tried and how they fared, if there's interest.
>>>>> But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll
>>>>> try to work up a git pull request ont that repository.
>>>>>
>>>>>  So now I come to the meat of what I have to say. I should have put
>>>>> this at the top -- I hope some of you are still with me.
>>>>>
>>>>>  The NYTProf graphs seem to say that there is a *lot* of overhead in
>>>>> object lookup and type testing. I think some of this is already known as
>>>>> there already are calls to "weaken" and "new_fast" object creators. And
>>>>> there is this comment in
>>>>>  Bio::EnsEMBL::Variation::BaseTranscriptVariation:_intron_effects:
>>>>>
>>>>>
>>>>>      # this method is a major bottle neck in the effect calculation
>>>>> code so
>>>>>     # we cache results and use local variables instead of method calls
>>>>> where
>>>>>     # possible to speed things up - caveat bug-fixer!
>>>>>
>>>>>  In the few cases guided by NYTProf, I've been able to make
>>>>> reasonable speed ups at the expense of eliminating the tests
>>>>> and object overhead.
>>>>>
>>>>>  For example, in EnsEMBL::Variation::BaseTranscriptVariation
>>>>> changing:
>>>>>
>>>>>
>>>>>   sub transcript {
>>>>>      my ($self, $transcript) = @_;
>>>>>      assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if
>>>>> $transcript;
>>>>>      return $self->SUPER::feature($transcript, 'Transcript');
>>>>> }
>>>>>
>>>>>  to:
>>>>>
>>>>>       sub transcript {
>>>>>          my ($self, $transcript) = @_;
>>>>>         return $self->{feature};
>>>>>
>>>>>  Gives a noticeable speed up. But you may ask: if that happens, then
>>>>> we lose type safety and there is a potential for bugs?
>>>>> And here's my take on how to address these valid concerns. First, I
>>>>> think there could be two sets of the Perl modules, such as for
>>>>> EnsEMBL::Variation::BaseTranscriptVariation - those with all of the
>>>>> checks and those that are fast.  A configuration parameter might specify
>>>>> which version to use. In development or by default, one might use the ones
>>>>> that check types.
>>>>>
>>>>>  Second and perhaps more import, there are the tests! If more need to
>>>>> be added, then let's add them. And one can always add a test to make sure
>>>>> the results of the two versions gives the same result.
>>>>>
>>>>>  One last avenue of optimization that I'd like to explore is using
>>>>> say Inline::C or basically coding in C hot spots. In particular, consider
>>>>> Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap which looks
>>>>> like this:
>>>>>
>>>>>           my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;
>>>>>          return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );
>>>>>
>>>>>  I haven't tried it on this hot spot, but this is something that
>>>>> might benefit from getting coded in C. Again the trade off for speed here
>>>>> is a dependency on compiling C. In my view anyone installing this locally
>>>>> or installing CPAN modules probably already does, but it does add
>>>>> complexity.
>>>>>
>>>>>  Typically, this is handled in Perl by providing both versions,
>>>>> perhaps as separate modules.
>>>>>
>>>>>  Thought or comments?
>>>>>
>>>>>  Thanks,
>>>>>    rocky
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>  _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info:
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
>  _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20150109/405d4478/attachment.html>