[ensembl-dev] Speeding up Bio::DB::Fasta::subseq (was Re: Thoughts on Speeding up the Variant Effect Predictor)

Will McLaren wm2 at ebi.ac.uk
Tue Dec 23 09:28:14 GMT 2014


Thanks again Rocky, your work on this is really appreciated, and great to
see such an improvement for such a minor change!

If there's any other code you'd like to share, or any changes to ours,
please feel free to send us more details or put in a pull request on GitHub.

Thanks

Will

On 23 December 2014 at 03:26, Rocky Bernstein <rocky.bernstein at gmail.com>
wrote:

> Just a follow-up to my earlier post.
>
> I ran a Variant Effect Predictor run on a VCF file of 5000 entries
> (which is what fits in one buffer read) with one small change. With that,
> I was able to significantly reduce the time bottleneck in the
> Fasta code. The time spent here went from 7.76 seconds to 2.32 seconds.
>
> Compare the top line of:
> http://dustyfeet.com:8001/VEP-prof-5000/Bio-DB-Fasta-pm-323-line.html
> with:
>
> http://dustyfeet.com:8001/VEP-prof-5000-Inline-C/Bio-DB-Fasta-pm-323-line.html
>
> You get a 50% reduction just by the fact that one transformation is needed
> to remove both \n and \r rather than two transformations. But even beyond
> this, the C code for one run is still faster than the corresponding Perl
> s///.
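
To illustrate the "one transformation instead of two" point in pure Perl, here is a minimal sketch using tr///d. This is not the Inline::C change from the gist, just a stand-in that also removes both characters in a single pass; the helper names are hypothetical and not part of Bio::DB::Fasta.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of Bio::DB::Fasta.

# Two passes over the data, as in the existing subseq code.
sub strip_newlines_two_pass {
    my ($data) = @_;
    $data =~ s/\n//g;
    $data =~ s/\r//g;
    return $data;
}

# One pass: tr///d deletes every \n and \r it finds.
sub strip_newlines_one_pass {
    my ($data) = @_;
    $data =~ tr/\n\r//d;
    return $data;
}

my $raw = "ACGT\r\nTTGA\nCC\r\n";
print strip_newlines_one_pass($raw), "\n";    # prints ACGTTTGACC
```

Both helpers should return the same string for any input; only the number of passes differs.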
>
> The specific change that I made can be found at
> https://gist.github.com/rocky/61f929d58a286189a758#file-fasta-pm-diff
> You'll also see benchmarks for other variations of that code.
>
> But... in order to see the effect in a run you need to have the Perl module
> Inline::C installed. Otherwise you get the lesser improvement outlined in my
> original posting. Again, that one speeds things up by compiling once the Perl
> regular expressions used to match \n and \r.
>
> In the spirit of open scientific review, I am curious to learn whether others
> experience the same kind of improvement I saw.
>
> I have a pull request for this change to the bioperl-live repository. See
> https://github.com/bioperl/bioperl-live/issues/95 . However, I note that
> the Bio::DB code used by Variant Effect Predictor is a different
> (back-level) version from the code in that git repository. The diff file in
> the gist cited above is for the Fasta.pm code that is in Ensembl; of course,
> the pull request uses the current Bio::DB code.
>
>
> Lastly, http://dustyfeet.com:8001 has the profile results of other kinds of
> runs, which I hope will clarify my other remarks about where things are
> slow.
>
>
> On Thu, Dec 18, 2014 at 12:48 AM, Rocky Bernstein <
> rocky.bernstein at gmail.com> wrote:
>>
>> Running the Variant Effect Predictor on a Human Genome VCF file (130780
>> lines)  with a local Fasta cache (--offline) takes about 50 minutes on a
>> quad-core Ubuntu box.
>>
>> I could give more details, but I don't think they are that important.
>>
>> In looking at how to speed this up, it looks like VEP goes through the
>> VCF file, which is sorted by chromosome, and processes each
>> chromosome independently. The first obvious way to speed this up would be
>> to do some sort of 24-way map/reduce.
>> There is of course the --fork option on the variant_effect_predictor.pl
>> program, which is roughly the same idea, but it parallelizes only across the
>> cores of a single computer rather than making use of multiple ones.
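
The "map" half of such a map/reduce could start by grouping VCF records by chromosome. A minimal sketch, assuming tab-separated records with CHROM in the first column and header lines starting with '#'; the function name and sample records are made up for illustration. Writing each group to its own file, dispatching it to a separate machine, and merging the outputs are not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: group VCF lines by their CHROM column.
# Returns (\@header_lines, \%records_by_chromosome).
sub split_by_chromosome {
    my (@lines) = @_;
    my (@header, %by_chrom);
    for my $line (@lines) {
        if ($line =~ /^#/) {          # meta-information and column header lines
            push @header, $line;
            next;
        }
        my ($chrom) = split /\t/, $line, 2;
        push @{ $by_chrom{$chrom} }, $line;
    }
    return (\@header, \%by_chrom);
}

# Made-up sample records, for illustration only.
my @vcf = (
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\t.\tA\tG",
    "1\t2000\t.\tC\tT",
    "2\t1500\t.\tG\tA",
    "X\t300\t.\tT\tC",
);

my ($header, $by_chrom) = split_by_chromosome(@vcf);
for my $chrom (sort keys %$by_chrom) {
    printf "%s: %d record(s)\n", $chrom, scalar @{ $by_chrom->{$chrom} };
}
```

Each group, with the header lines prepended, could then be handed to an independent VEP run.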
>>
>> To pinpoint the slowness better, I used Devel::NYTProf. For those of you
>> who haven't used it recently, it now has flame graphs and it makes it very
>> easy to see what's going on.
>>
>> The first thing that came out was a slowness in code to remove carriage
>> returns and line feeds. This is in Bio::DB::Fasta::subseq:
>>
>>      $data =~ s/\n//g;
>>      $data =~ s/\r//g;
>>
>> Compiling the regexp once, e.g.:
>>
>>      my $nl = qr/\n/;
>>      my $cr = qr/\r/;
>>
>>      sub subseq {
>>          ....
>>          $data =~ s/$nl//g;
>>          $data =~ s/$cr//g;
>>      }
>>
>> speeds up the subseq method by about 15%. I can elaborate more or
>> describe the other methods I tried and how they fared, if there's interest.
>> But since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll
>> try to work up a git pull request on that repository.
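
For anyone who wants to measure the two variants on their own machine, here is a minimal sketch using the core Benchmark module. The sample data is synthetic (200 lines of 60 bases, an assumption about typical FASTA line width) and the sub names are hypothetical. No timings are claimed here; results will depend on the Perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Regexps compiled once, outside the subs, as in the snippet above.
my $nl = qr/\n/;
my $cr = qr/\r/;

# Synthetic input (assumption): 200 lines of 60 bases each.
my $sample = join '', map { ('ACGT' x 15) . "\n" } 1 .. 200;

# Literal patterns, as in the existing subseq code.
sub strip_literal {
    my ($data) = @_;
    $data =~ s/\n//g;
    $data =~ s/\r//g;
    return $data;
}

# Precompiled qr// objects interpolated into the substitution.
sub strip_precompiled {
    my ($data) = @_;
    $data =~ s/$nl//g;
    $data =~ s/$cr//g;
    return $data;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    literal     => sub { strip_literal($sample) },
    precompiled => sub { strip_precompiled($sample) },
});
```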
>>
>> So now I come to the meat of what I have to say. I should have put this
>> at the top -- I hope some of you are still with me.
>>
>> The NYTProf graphs seem to say that there is a *lot* of overhead in
>> object lookup and type testing. I think some of this is already known as
>> there already are calls to "weaken" and "new_fast" object creators. And
>> there is this comment in
>> Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects:
>>
>>     # this method is a major bottle neck in the effect calculation code so
>>     # we cache results and use local variables instead of method calls where
>>     # possible to speed things up - caveat bug-fixer!
>>
>> In the few cases guided by NYTProf, I've been able to make reasonable
>> speed ups at the expense of eliminating the tests
>> and object overhead.
>>
>> For example, in Bio::EnsEMBL::Variation::BaseTranscriptVariation changing:
>>
>>      sub transcript {
>>          my ($self, $transcript) = @_;
>>          assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if $transcript;
>>          return $self->SUPER::feature($transcript, 'Transcript');
>>      }
>>
>> to:
>>
>>      sub transcript {
>>          my ($self, $transcript) = @_;
>>          return $self->{feature};
>>      }
>>
>> gives a noticeable speed up. But you may ask: if we do that, don't we
>> lose type safety and introduce a potential for bugs?
>>
>> Here's my take on how to address these valid concerns. First, I think
>> there could be two sets of the Perl modules, such as for
>> Bio::EnsEMBL::Variation::BaseTranscriptVariation: those with all of the
>> checks and those that are fast. A configuration parameter might specify
>> which version to use. In development or by default, one might use the ones
>> that check types.
>>
>> Second and perhaps more important, there are the tests! If more need to be
>> added, then let's add them. And one can always add a test to make sure
>> the two versions give the same result.
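
One way the "checked by default, fast on request" idea could look, as a minimal sketch. The package names, the MY_CHECK_TYPES environment variable, and the hand-written type check are all hypothetical stand-ins; the real Ensembl classes and assert_ref are not used here.

```perl
use strict;
use warnings;

{
    package My::Transcript;              # hypothetical stand-in for Bio::EnsEMBL::Transcript
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
}

{
    package My::TranscriptVariation;     # hypothetical stand-in class
    use Scalar::Util qw(blessed);

    # Hypothetical switch: type checks are on unless explicitly disabled.
    our $CHECK_TYPES = $ENV{MY_CHECK_TYPES} // 1;

    sub new {
        my ($class, %args) = @_;
        return bless { feature => $args{transcript} }, $class;
    }

    # Checked accessor: validates the argument before storing it.
    sub transcript_checked {
        my ($self, $transcript) = @_;
        if ($transcript) {
            die "expected a My::Transcript\n"
                unless blessed($transcript) && $transcript->isa('My::Transcript');
            $self->{feature} = $transcript;
        }
        return $self->{feature};
    }

    # Fast accessor: a plain hash lookup with no checks.
    sub transcript_fast { return $_[0]{feature} }

    # Install one implementation as "transcript", chosen once at load time.
    no warnings 'once';
    *transcript = $CHECK_TYPES ? \&transcript_checked : \&transcript_fast;
}

my $t  = My::Transcript->new(stable_id => 'T1');
my $tv = My::TranscriptVariation->new(transcript => $t);
print "same object\n" if $tv->transcript == $t;
```

Because both accessors stay available under their own names, a test can call each on the same object and compare the results, which is the kind of equivalence check suggested above.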
>>
>> One last avenue of optimization that I'd like to explore is using, say,
>> Inline::C, or basically coding hot spots in C. In particular, consider
>> Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap, which looks like
>> this:
>>
>>          my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;
>>          return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );
>>
>> I haven't tried it on this hot spot, but this is something that might
>> benefit from getting coded in C. Again the trade off for speed here is a
>> dependency on compiling C. In my view anyone installing this locally or
>> installing CPAN modules probably already does, but it does add complexity.
>>
>> Typically, this is handled in Perl by providing both versions, perhaps as
>> separate modules.
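
The "both versions" pattern can be sketched as an optional fast module with a pure-Perl fallback. My::Overlap::Fast is a hypothetical module name standing in for a C-backed implementation (it does not exist); the pure-Perl sub repeats the overlap test quoted above.

```perl
use strict;
use warnings;

# Pure-Perl version: the same test as the overlap code quoted above.
sub overlap_pp {
    my ($f1_start, $f1_end, $f2_start, $f2_end) = @_;
    return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );
}

# Use the compiled version if the (hypothetical) module can be loaded and
# provides an overlap function; otherwise fall back to pure Perl.
my $overlap = eval {
    require My::Overlap::Fast;
    My::Overlap::Fast->can('overlap');
} || \&overlap_pp;

print $overlap->(10, 20, 15, 30) ? "overlap\n" : "no overlap\n";    # prints overlap
```

Callers only ever see $overlap, so a machine without a C compiler still runs, just more slowly.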
>>
>> Thoughts or comments?
>>
>> Thanks,
>>    rocky
>>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>