Question

Filter VCF based on INFO column values in R

5 months ago

Stavroula • 0

Hello all,

I was wondering if anyone can help me. I have a table with the following format:

FORMAT                                                  
GT:DP:HF:CILOW:CIUP:SDP      (column V9)          


Info
0/1:4282:0.001:0.0:0.003:5;0.   (column V10)

and I want to filter in R for rows where HF<0.1 or HF>0.99, without caring about the rest of the info.

Is there a way to do that?

I have been trying with this command:

Control1MTL1_Filtered2<-filter(Control1MTL1_Filtered, V10= c(::<=0.1::)) but it doesn't recognise the format.

Any ideas would be more than appreciated.

Best, Stavroula

R vcf • 641 views

ADD COMMENT • link updated 5 months ago by Michael 55k • written 5 months ago by Stavroula • 0

Why do you want to use R for something that is better addressed by purpose-built utilities such as bcftools?

On second thought, it doesn't look like you have the VCF, just a tab-delimited file with VCF columns. You're going to need to do some wrangling.

First off, V10 is not Info. INFO is a completely separate column, probably V8. Call V10 "sample" or something. Split V9 and V10 using : as the delimiter and then create a key-value pair with split V9 as the keys and split V10 as the values. It's going to take some serious dplyr/tidyr gymnastics to do this, so rpolicastro is probably the person that can help you there.
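The key-value idea can be sketched in base R without dplyr/tidyr. This is only a minimal sketch under assumptions: `format_to_named` is a hypothetical helper name, the two-row table is made-up toy data modelled on the question, and V9/V10 are assumed to be plain character columns.

```r
# Hypothetical helper (not from the thread): pair the FORMAT keys of one row
# with the sample values of the same row and return a named character vector.
format_to_named <- function(format_str, sample_str) {
  keys   <- strsplit(format_str, ":", fixed = TRUE)[[1]]
  values <- strsplit(sample_str, ":", fixed = TRUE)[[1]]
  # VCF allows trailing fields to be dropped from the sample column,
  # so only name as many keys as there are values.
  setNames(values, keys[seq_along(values)])
}

# Toy data modelled on the question (assumed, not the poster's real table)
d <- data.frame(
  V9  = "GT:DP:HF:CILOW:CIUP:SDP",
  V10 = c("0/1:4282:0.001:0.0:0.003:5", "1/1:310:0.995:0.98:1.0:4"),
  stringsAsFactors = FALSE
)

# Look up HF by key for every row; rows without an HF field become NA
d$HF <- as.numeric(mapply(
  function(f, s) format_to_named(f, s)["HF"],
  d$V9, d$V10, USE.NAMES = FALSE
))
```

Looking the value up by key rather than by position means the result stays correct even if the FORMAT string differs between rows.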

ADD REPLY • link 5 months ago by Ram 44k

Indeed! That looks like a genotype column.

ADD REPLY • link 5 months ago by Michael 55k

I agree with the others on using bcftools and defining a proper filter, especially if you want to export and use the VCF file later on.

ADD REPLY • link 5 months ago by Michael 55k

score 2 · Answer 1 · 2024-06-18

Use a dedicated tool for the job: bcftools.

But if you must use R, re-read the delimited column with a new separator, then subset as usual. See the example:

#example data
d <- data.frame(
  V9 = c("GT:DP:HF:CILOW:CIUP:SDP"),
  V10 =  c("0/1:1:0.001:0.0:0.003:5", 
           "1/1:2:1:0.0:0.003:5", 
           "1/0:3:1:0.0:0.003:5", 
           "0/0:4:0.005:0.0:0.003:5"), 
  V11 = 1:4,
  V12 = 5:8)
#                        V9                     V10 V11 V12
# 1 GT:DP:HF:CILOW:CIUP:SDP 0/1:1:0.001:0.0:0.003:5   1   5
# 2 GT:DP:HF:CILOW:CIUP:SDP     1/1:2:1:0.0:0.003:5   2   6
# 3 GT:DP:HF:CILOW:CIUP:SDP     1/0:3:1:0.0:0.003:5   3   7
# 4 GT:DP:HF:CILOW:CIUP:SDP 0/0:4:0.005:0.0:0.003:5   4   8

# Re-read V10 with ":" as the delimiter, take the column names from V9
# (assumes every row has the same FORMAT), and cbind back to the other columns.
x <- cbind(
  read.table(text = as.character(d$V10), sep = ":",
             col.names = strsplit(as.character(d$V9[1]), ":")[[1]]),
  d[, c("V11", "V12")])

# then subset as usual
x[ x$HF > 0.1,  ]
#    GT DP HF CILOW  CIUP SDP V11 V12
# 2 1/1  2  1     0 0.003   5   2   6
# 3 1/0  3  1     0 0.003   5   3   7
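For the thresholds in the question itself, a variant that keeps the original columns intact might look like the sketch below. It reads the question's "HF<0.1 and HF>0.99" as an either/or condition, assumes every row has the same FORMAT string, and uses the toy data from above with row 3 changed to HF = 0.5 so that one row is actually dropped. The names `hf_pos`, `hf`, `keep` and `d_filtered` are illustrative only.

```r
# Toy data as above, except row 3 now has HF = 0.5 (changed for illustration)
d <- data.frame(
  V9  = "GT:DP:HF:CILOW:CIUP:SDP",
  V10 = c("0/1:1:0.001:0.0:0.003:5",
          "1/1:2:1:0.0:0.003:5",
          "1/0:3:0.5:0.0:0.003:5",
          "0/0:4:0.005:0.0:0.003:5"),
  V11 = 1:4,
  V12 = 5:8,
  stringsAsFactors = FALSE)

# Find the position of HF in the FORMAT string (assumed identical for all rows)
keys   <- strsplit(d$V9[1], ":", fixed = TRUE)[[1]]
hf_pos <- match("HF", keys)

# Pull the HF field out of every sample string and convert it to numeric
hf <- as.numeric(vapply(strsplit(d$V10, ":", fixed = TRUE),
                        `[`, character(1), hf_pos))

# Keep rows with HF below 0.1 or above 0.99; rows with a missing HF are dropped
keep <- !is.na(hf) & (hf < 0.1 | hf > 0.99)
d_filtered <- d[keep, ]
```

Because only a logical index is built, `d_filtered` still carries the untouched V9 and V10 strings, which helps if the table has to be written back out in its original layout.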