[ensembl-dev] Split a vcf site by transcript

Tue Apr 5 10:07:08 BST 2016

Hi Mateo,

I don't think this solution solves all of your requirements, but it may
come close enough.

You can use the filter_vep.pl script [1] to filter on a list of transcripts
(provided in a file or a comma-separated list on the command line), and to
remove non-matching "blocks" of annotation from the VCF line.

After producing normal VEP output in my_vep_output.vcf (don't use --pick),
the command would look something like:

perl filter_vep.pl -i my_vep_output.vcf -f "Feature in transcript_list.txt"
-only_matched > my_vep_output_filtered.vcf

This will work to an extent, but it will still produce some lines of VCF
with more than one transcript annotated if your transcript_list.txt
contains sets of transcripts that overlap (i.e. from the same gene).

If you really want VCF with one line per transcript, you could write a
simple perl or awk script to post-process the output, perhaps something
like:

perl -e'while(<>) { chomp; if(m/(.+?)(CSQ|ANN)\=([^;^\s]+)(.*)/) { foreach
my $s(split ",", $3) { print "$1$2\=$s\;$4\n"}}}'  my_vep_output.vcf >
my_vep_output_split.vcf

You could then use the filter script as above to reduce to your transcript
set of interest, or even simply use "grep -f transcript_list.txt"

Regards

Will McLaren
Ensembl Variation

[1] http://www.ensembl.org/info/docs/tools/vep/script/vep_filter.html

On 5 April 2016 at 09:26, Matteo <matteodeg at gmail.com> wrote:

> Hi there,
>
> I'm using variant effect predictor on local, release 83. I have a question
> for you.
> This is my command line:
>
> perl variant_effect_predictor.pl -i $path/20160317_Cardio_GATK_Filter.vcf
> -o $path/20160317_ANN_Cardio.vcf --stats_file
> $path/20160317_ANN_Cardio.html --cache --assembly GRCh37 --offline
> --force_overwrite -v --variant_class --sift b --poly b --vcf_info_field ANN
> --hgvs --protein --canonical --check_existing --gmaf --pubmed --species
> homo_sapiens --failed 1 --vcf --plugin LoFtool
>
> I'm trying to get a vcf output with one line for each transcript, in fact,
> I'd like to filter after the annotation using a list of transcripts like
> this:
>
> ENST00000299421
> ENST00000372980
> ENST00000310706
> ENST00000252321
> ENST00000315987
>
> Using --pick option, I get only one trascript for each position but I'm
> looking for all transcripts splitted on different lines, like the .txt
> output but in a vcf format like this example:
>
> chr1    25890050    .    A    G    20396.77    PASS
> AC=15;AF=0.625;AN=24;BaseQRankSum=3.48;ClippingRankSum=0.973;DP=909;ExcessHet=5.5287;FS=0.000;InbreedingCoeff=-0.2456;MLEAC=15;MLEAF=0.625;MQ=60.00;MQRankSum=0.167;QD=23.05;ReadPosRankSum=0.394;SOR=0.709;ANN=G|upstream_gene_variant|MODIFIER|LDLRAP1|ENSG00000157978|Transcript|ENST00000470950|processed_transcript||||||||||rs6661159|1|1032|1|SNV|HGNC|18640||||||G:0.4393||||
> GT:AD:DP:GQ:PL    0/1:34,54:88:99:1817,0,946
>
>
> chr1    25890050    .    A    G    20396.77    PASS
> AC=15;AF=0.625;AN=24;BaseQRankSum=3.48;ClippingRankSum=0.973;DP=909;ExcessHet=5.5287;FS=0.000;InbreedingCoeff=-0.2456;MLEAC=15;MLEAF=0.625;MQ=60.00;MQRankSum=0.167;QD=23.05;ReadPosRankSum=0.394;SOR=0.709;ANN=G|intron_variant&non_coding_transcript_variant|MODIFIER|LDLRAP1|ENSG00000157978|Transcript|ENST00000484476|processed_transcript||1/3|ENST00000484476.1:n.339-102A>G|||||||rs6661159|1||1|SNV|HGNC|18640||||||G:0.4393||||
> GT:AD:DP:GQ:PL    0/1:34,54:88:99:1817,0,946
>
> Last question: is it possible to split also multiallelic sites annotation
> or do you suggest to normalize them before?
>
> Thank you for your help,
>
> Matteo
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20160405/43e1b2cb/attachment.html>