[ensembl-dev] Thoughts on Speeding up the Variant Effect Predictor

Thu Dec 18 15:51:41 GMT 2014

Running the Variant Effect Predictor on a Human Genome VCF file (130780
lines) with a local Fasta cache (--offline) takes about 50 minutes on a
quad-core Ubuntu box.

I could give more details, but I don't think they are that important.

In looking at how to speed this up, it appears that VEP goes through the VCF
file, which is sorted by chromosome, and processes each chromosome
independently. A simple and obvious way to speed this up would be to do some
sort of 24-way map/reduce (one job per chromosome: 1-22, X and Y).
There is of course the --fork option on the variant_effect_predictor.pl program,
which is roughly the same idea, but it parallelizes only across the cores
of a single computer rather than making use of multiple machines.
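
As a rough sketch of the "map" half of that idea, the VCF could be partitioned
on its first column so that each piece can be handed to a separate VEP run on
a different machine. The helper name split_vcf_by_chrom and the tiny in-memory
VCF are made up for illustration; nothing here is part of VEP itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of VEP): partition VCF lines by chromosome.
# Header lines (starting with '#') are kept separately so they can be copied
# into every partition, making each piece a valid VCF on its own.
sub split_vcf_by_chrom {
    my ($in_fh) = @_;
    my (@header, %by_chrom, @order);
    while (my $line = <$in_fh>) {
        if ($line =~ /^#/) {
            push @header, $line;
            next;
        }
        next unless $line =~ /\S/;            # skip blank lines
        my ($chrom) = split /\t/, $line, 2;   # CHROM is the first column
        push @order, $chrom unless exists $by_chrom{$chrom};
        push @{ $by_chrom{$chrom} }, $line;
    }
    return (\@header, \%by_chrom, \@order);
}

# Tiny in-memory example standing in for a real VCF file.
my $vcf = join '',
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t.\t.\t.\n",
    "1\t200\t.\tC\tT\t.\t.\t.\n",
    "2\t150\t.\tG\tA\t.\t.\t.\n";

open my $in, '<', \$vcf or die "cannot open in-memory VCF: $!";
my ($header, $by_chrom, $order) = split_vcf_by_chrom($in);
close $in;

for my $chrom (@$order) {
    my $n = scalar @{ $by_chrom->{$chrom} };
    print "chromosome $chrom: $n variant line(s)\n";
    # In a real run one would write @$header plus these lines to a
    # per-chromosome file and start one VEP job per file.
}
```

The "reduce" half would then amount to concatenating the per-chromosome
outputs in chromosome order, keeping only one copy of the output header.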

To pinpoint the slowness better, I used Devel::NYTProf. For those of you
who haven't used it recently, it now has flame graphs and it makes it very
easy to see what's going on.

The first thing that came out was a slowness in code to remove carriage
returns and line feeds. This is in Bio::DB::Fasta::subseq:

     $data =~ s/\n//g;
     $data =~ s/\r//g;

Compiling the regexps, e.g.:

     my $nl = qr/\n/;
     my $cr = qr/\r/;

     sub subseq {
         ....
         $data =~ s/$nl//g;
         $data =~ s/$cr//g;
     }

speeds up the subseq method by about 15%. I can elaborate more or describe
the other methods I tried and how they fared, if there's interest. But
since this portion is really part of BioPerl and not Bio::EnsEMBL, I'll try
to work up a git pull request on that repository.
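
For anyone who wants to measure this kind of change in isolation, here is a
minimal benchmark sketch using the core Benchmark module on synthetic data.
The sub names are made up, and the tr///d variant is simply one more
alternative worth timing; it is included as an assumption about what might be
competitive, not as a result. Timings will vary with the Perl version and
with the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic FASTA-like data: 60-column lines, each ending in "\r\n".
my $raw = join '', map { ('ACGT' x 15) . "\r\n" } 1 .. 1000;

my $nl = qr/\n/;
my $cr = qr/\r/;

# Three ways of stripping line endings (names are illustrative only).
sub strip_literal  { my $d = shift; $d =~ s/\n//g;  $d =~ s/\r//g;  return $d }
sub strip_compiled { my $d = shift; $d =~ s/$nl//g; $d =~ s/$cr//g; return $d }
sub strip_tr       { my $d = shift; $d =~ tr/\r\n//d;               return $d }

# All variants must agree before the timings mean anything.
my $expected = 'ACGT' x (15 * 1000);
for my $f (\&strip_literal, \&strip_compiled, \&strip_tr) {
    die "implementations disagree" unless $f->($raw) eq $expected;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    literal  => sub { strip_literal($raw) },
    compiled => sub { strip_compiled($raw) },
    tr       => sub { strip_tr($raw) },
});
```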

So now I come to the meat of what I have to say. I should have put this at
the top -- I hope some of you are still with me.

The NYTProf graphs seem to say that there is a *lot* of overhead in object
lookup and type testing. I think some of this is already known, as there
are already calls to "weaken" and to "new_fast" object creators. And there is
this comment in
Bio::EnsEMBL::Variation::BaseTranscriptVariation::_intron_effects:

    # this method is a major bottle neck in the effect calculation code so
    # we cache results and use local variables instead of method calls where
    # possible to speed things up - caveat bug-fixer!

In the few cases guided by NYTProf that I have looked at, I've been able to
get reasonable speedups by eliminating object overhead, at the expense of
dropping the type tests.

For example, in Bio::EnsEMBL::Variation::BaseTranscriptVariation changing:

     sub transcript {
         my ($self, $transcript) = @_;
         assert_ref($transcript, 'Bio::EnsEMBL::Transcript') if $transcript;
         return $self->SUPER::feature($transcript, 'Transcript');
     }

to:

     sub transcript {
         my ($self, $transcript) = @_;
         # getter only: the $transcript argument is ignored here
         return $self->{feature};
     }

gives a noticeable speedup. But you may ask: if we do that, don't we lose
type safety and introduce a potential for bugs?
Here is how I would address these valid concerns.

First, I think there could be two sets of the Perl modules, for example for
Bio::EnsEMBL::Variation::BaseTranscriptVariation: one set with all of the
checks, and another, faster set without them. A configuration parameter might
specify which version to use. In development or by default, one might use the
ones that check types.
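
Here is a minimal sketch of what such a switch could look like. All of the
package names (Sketch::...) and the SKETCH_FAST environment variable are
hypothetical, chosen only for illustration; this is not how the Ensembl
modules are actually organized.

```perl
use strict;
use warnings;

package Sketch::Transcript;      # hypothetical stand-in for a transcript class
sub new { return bless {}, shift }

package Sketch::TV::Checked;     # hypothetical: accessor with type checks
use Carp qw(croak);
use Scalar::Util qw(blessed);

sub new {
    my ($class, %args) = @_;
    return bless { feature => $args{feature} }, $class;
}

sub transcript {
    my ($self, $transcript) = @_;
    if ($transcript) {
        croak "expected a Sketch::Transcript"
            unless blessed($transcript) && $transcript->isa('Sketch::Transcript');
        $self->{feature} = $transcript;
    }
    return $self->{feature};
}

package Sketch::TV::Fast;        # hypothetical: same interface, no checks
our @ISA = ('Sketch::TV::Checked');

sub transcript {
    my ($self, $transcript) = @_;
    $self->{feature} = $transcript if $transcript;
    return $self->{feature};
}

package main;

# Hypothetical switch: the checked class is the default.
my $class = $ENV{SKETCH_FAST} ? 'Sketch::TV::Fast' : 'Sketch::TV::Checked';

my $t  = Sketch::Transcript->new;
my $tv = $class->new(feature => $t);
print "using $class\n";
print "same transcript returned\n" if $tv->transcript == $t;
```

Because the fast class only overrides the hot accessor and inherits the rest,
the two sets would not need to be maintained as full copies of each other.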

Second, and perhaps more important, there are the tests! If more need to be
added, then let's add them. And one can always add a test to make sure that
the two versions give the same result.
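
Such an equivalence test could be as small as the following sketch, which uses
the core Test::More module. The two subs are hypothetical stand-ins for a
checked and an unchecked version of the same routine.

```perl
use strict;
use warnings;
use Test::More;
use Scalar::Util qw(looks_like_number);

# Hypothetical "checked" version: validates its arguments first.
sub overlap_checked {
    my @coords = @_;
    die "expected four numeric coordinates\n"
        unless @coords == 4 && !grep { !looks_like_number($_) } @coords;
    my ($f1_start, $f1_end, $f2_start, $f2_end) = @coords;
    return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) ) ? 1 : 0;
}

# Hypothetical "fast" version: same logic, no validation.
sub overlap_fast {
    return ( ($_[1] >= $_[2]) and ($_[0] <= $_[3]) ) ? 1 : 0;
}

my @cases = (
    [  1, 10, 5, 20 ],   # partial overlap
    [  1,  4, 5, 20 ],   # disjoint, first before second
    [ 30, 40, 5, 20 ],   # disjoint, first after second
    [  5, 20, 5, 20 ],   # identical
    [  1,  5, 5,  9 ],   # touching at a single base
);

# The fast version must give the same answer as the checked one.
for my $c (@cases) {
    is( overlap_fast(@$c), overlap_checked(@$c), "same answer for (@$c)" );
}

done_testing();
```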

One last avenue of optimization that I'd like to explore is using, say,
Inline::C, or more generally coding hot spots in C. In particular, consider
Bio::EnsEMBL::Variation::Utils::VariationEffect::overlap which looks like
this:

     sub overlap {
         my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;
         return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) );
     }

I haven't tried it on this hot spot, but this is something that might
benefit from getting coded in C. Again, the trade-off for speed here is a
dependency on compiling C. In my view anyone installing this locally or
installing CPAN modules probably already has a C compiler, but it does add
complexity.

Typically, this is handled in Perl by providing both versions, perhaps as
separate modules.
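
A sketch of that "both versions" pattern, under the assumption that Inline::C
may or may not be installed: the C version is compiled at run time with
Inline->bind, and the code falls back to pure Perl if that fails for any
reason. The names overlap_pp and overlap_c are made up here. Whether the C
version is actually faster for a function this small would need measuring,
since the cost of the call itself may well dominate.

```perl
use strict;
use warnings;

# Pure-Perl version, always available.
sub overlap_pp {
    my ( $f1_start, $f1_end, $f2_start, $f2_end ) = @_;
    return ( ($f1_end >= $f2_start) and ($f1_start <= $f2_end) ) ? 1 : 0;
}

# C version, compiled at run time only if Inline::C and a C compiler are
# present. The function name overlap_c is hypothetical.
my $c_source = <<'END_C';
int overlap_c(long f1_start, long f1_end, long f2_start, long f2_end) {
    return (f1_end >= f2_start) && (f1_start <= f2_end);
}
END_C

my $have_c = eval {
    require Inline;
    Inline->bind( C => $c_source );   # defines main::overlap_c on success
    1;
};

# Pick one implementation behind a single code reference.
my $overlap = $have_c ? \&overlap_c : \&overlap_pp;

print "using ", ( $have_c ? "C" : "pure-Perl" ), " overlap\n";
print "overlap(1, 10, 5, 20) = ", $overlap->( 1, 10, 5, 20 ), "\n";
```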

Thoughts or comments?

Thanks,
   rocky