Question

how to know INDELs length ?

0

Entering edit mode

7.3 years ago

chengyi31000 ▴ 10

I have several vcf files. Can I want to know the indels length? I had saw the following words from paper, and wanting to know how is it in my data. "They ranged from 1 bp to 10,000 bp in length and followed a size distribution in which the majority of INDELs were <100 bp in length"

indels • 4.6k views

ADD COMMENT • link updated 7.3 years ago by Alex Reynolds 36k • written 7.3 years ago by chengyi31000 ▴ 10

score 1 · Answer 1 · 2018-06-04

Via BEDOPS vcf2bed, bedops, and bedmap, one can build a BED file of insertions and deletions, and map their sizes to some reference genome, piping a list of sizes to a histogram/statistical summary tool, such as the hist tool in bashplotlib.

First, to build a list of INDELs:

$ vcf2bed --insertions < variants.vcf > insertions.bed
$ vcf2bed --deletions < variants.vcf > deletions.bed
$ bedops --everything insertions.bed deletions.bed > union.bed

Or via a one-liner:

$ bedops --everything <(vcf2bed --insertions < variants.vcf) <(vcf2bed --deletions < variants.vcf) > union.bed

To get chromosome bounds for hg38 (adjust for your reference genome), you can use Kent Utilities fetchChromSizes:

$ fetchChromSizes hg38 | awk -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - > hg38.bed

To get a quick size distribution summary, one can pipe the output of bedmap --echo-map-size to the hist tool in bashplotlib.

Here's an example of output from a random subset of 1000Genome SNPs off of chromosome X:

$ bedmap --echo-map-size --multidelim '\n' --skip-unmapped hg38.bed union.bed | hist

 29|  o         
 28|  o         
 26|  o         
 25|  o         
 23|  o         
 22|  o         
 20|  o         
 19|  o         
 17|  o         
 16|  o         
 14|  o         
 13|  o         
 11| oo         
 10| oo         
  8| oo         
  7| oo         
  5| oo         
  4| oo         
  2| oo         
  1| ooo       o
    -----------

----------------------------------
|            Summary             |
----------------------------------
|        observations: 44        |
|      min value: 1.000000       |
|        mean : 84.522727        |
|     max value: 3105.000000     |
----------------------------------

Such a list of sizes could instead be piped into a Python or R script (for example) to generate a summary in any other desired format.

score 0 · Answer 2 · 2018-06-04

0

Entering edit mode

7.3 years ago

Pierre Lindenbaum 166k

Can I want to know the indels length?

yes length(ALT)-length(REF) or the INFO/SVLEN attribute

ADD COMMENT • link 7.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

what is this command's tools? vcftools?

ADD REPLY • link 7.3 years ago by chengyi31000 ▴ 10

score 0 · Answer 3 · 2018-06-04

0

Entering edit mode

7.3 years ago

cpad0112 21k

post an example here. In awk, it would be (make sure that VCF file is flat):

$ awk -v OFS="\t" ' BEGIN {print "Sequence","Length"}!/#/ {if(length($5) > length($4)) print $5, length($5)}' test.vcf

ADD COMMENT • link 7.3 years ago by cpad0112 21k

0

Entering edit mode

my result as following (list few), can I summary this?

A,G 3 C,G 3 T,TT 4 TTA 3 GTTTTTT,GTTTTTTA,GTTTTT 23 A,T 3 TCTTCTTCTC 10 CA 2 AAT 3 TTGTA 5 T,TT 4 GAG,GAGG,GACG,GA 16 C,A 3 T,C 3 G,C 3 CA 2 C,A 3 T,TT 4 CA 2 GCGTA,GCATA 11 T,TG 4 C,CCCA 6 G,A 3 C,T 3 GCACACACACACACACACA,ACACACACACACACACACACA,GCACACACACACACACAC 60 CA,CAC 6 T,TT 4 A,T 3 T,A 3 GCC,GCCA 8 T,C 3 G,A 3 AAAG,GAA 8 G,T 3 T,A 3 T,AAATT 7 T,TT,TTATT 10 T,TATT,TATTA 12 T,TT 4 T,TA 4 TG 2

ADD REPLY • link 7.3 years ago by chengyi31000 ▴ 10

0

Entering edit mode

No, VCF is not flat. use bcftools norm for indels on VCF. Then run the command.

ADD REPLY • link 7.3 years ago by cpad0112 21k

0

Entering edit mode

using following command? But it is not work

bcftools norm -f chr4.vcf -O z > out.vcf.gz

ADD REPLY • link 7.3 years ago by chengyi31000 ▴ 10