how to know INDELs length ?
3
0
Entering edit mode
6.5 years ago
chengyi31000 ▴ 10

I have several vcf files. Can I want to know the indels length? I had saw the following words from paper, and wanting to know how is it in my data. "They ranged from 1 bp to 10,000 bp in length and followed a size distribution in which the majority of INDELs were <100 bp in length"

indels • 3.9k views
ADD COMMENT
1
Entering edit mode
6.5 years ago

Via BEDOPS vcf2bed, bedops, and bedmap, one can build a BED file of insertions and deletions, and map their sizes to some reference genome, piping a list of sizes to a histogram/statistical summary tool, such as the hist tool in bashplotlib.

First, to build a list of INDELs:

$ vcf2bed --insertions < variants.vcf > insertions.bed
$ vcf2bed --deletions < variants.vcf > deletions.bed
$ bedops --everything insertions.bed deletions.bed > union.bed

Or via a one-liner:

$ bedops --everything <(vcf2bed --insertions < variants.vcf) <(vcf2bed --deletions < variants.vcf) > union.bed

To get chromosome bounds for hg38 (adjust for your reference genome), you can use Kent Utilities fetchChromSizes:

$ fetchChromSizes hg38 | awk -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - > hg38.bed

To get a quick size distribution summary, one can pipe the output of bedmap --echo-map-size to the hist tool in bashplotlib.

Here's an example of output from a random subset of 1000Genome SNPs off of chromosome X:

$ bedmap --echo-map-size --multidelim '\n' --skip-unmapped hg38.bed union.bed | hist

 29|  o         
 28|  o         
 26|  o         
 25|  o         
 23|  o         
 22|  o         
 20|  o         
 19|  o         
 17|  o         
 16|  o         
 14|  o         
 13|  o         
 11| oo         
 10| oo         
  8| oo         
  7| oo         
  5| oo         
  4| oo         
  2| oo         
  1| ooo       o
    -----------

----------------------------------
|            Summary             |
----------------------------------
|        observations: 44        |
|      min value: 1.000000       |
|        mean : 84.522727        |
|     max value: 3105.000000     |
----------------------------------

Such a list of sizes could instead be piped into a Python or R script (for example) to generate a summary in any other desired format.

ADD COMMENT
0
Entering edit mode
6.5 years ago

Can I want to know the indels length?

yes length(ALT)-length(REF) or the INFO/SVLEN attribute

ADD COMMENT
0
Entering edit mode

what is this command's tools? vcftools?

ADD REPLY
0
Entering edit mode
6.5 years ago

post an example here. In awk, it would be (make sure that VCF file is flat):

$ awk -v OFS="\t" ' BEGIN {print "Sequence","Length"}!/#/ {if(length($5) > length($4)) print $5, length($5)}' test.vcf
ADD COMMENT
0
Entering edit mode

my result as following (list few), can I summary this?

A,G 3 C,G 3 T,TT 4 TTA 3 GTTTTTT,GTTTTTTA,GTTTTT 23 A,T 3 TCTTCTTCTC 10 CA 2 AAT 3 TTGTA 5 T,TT 4 GAG,GAGG,GACG,GA 16 C,A 3 T,C 3 G,C 3 CA 2 C,A 3 T,TT 4 CA 2 GCGTA,GCATA 11 T,TG 4 C,CCCA 6 G,A 3 C,T 3 GCACACACACACACACACA,ACACACACACACACACACACA,GCACACACACACACACAC 60 CA,CAC 6 T,TT 4 A,T 3 T,A 3 GCC,GCCA 8 T,C 3 G,A 3 AAAG,GAA 8 G,T 3 T,A 3 T,AAATT 7 T,TT,TTATT 10 T,TATT,TATTA 12 T,TT 4 T,TA 4 TG 2

ADD REPLY
0
Entering edit mode

No, VCF is not flat. use bcftools norm for indels on VCF. Then run the command.

ADD REPLY
0
Entering edit mode

using following command? But it is not work

bcftools norm -f chr4.vcf -O z > out.vcf.gz

ADD REPLY

Login before adding your answer.

Traffic: 1730 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6