Converting VCF file to table format
0
0
Entering edit mode
2.4 years ago

Hello,

I am trying to perform a selection scan analysis following the scripts from https://github.com/ngarud/SelectionHapStats

According to the guidelines, I have converted my VCF files for each chromosome into a text file separately. I used the following R-codes for file conversion:

library(pegas)
library(ape)
vcf <- read.vcf("file.vcf", from=1, to=3000000)
vcf.t <- t(as.matrix(vcf))
# transform homozygous sites
vcf.t <- sub("A/A", "A", vcf.t, perl=TRUE)
vcf.t <- sub("T/T", "T", vcf.t, perl=TRUE)
vcf.t <- sub("C/C", "C", vcf.t, perl=TRUE)
vcf.t <- sub("G/G", "G", vcf.t, perl=TRUE)
# transform missing sites
vcf.t <- sub("\\./\\.", "N", vcf.t, perl=TRUE)
# transform heterozygous sites
vcf.t <- sub("[A/T/G/C]/[A/T/G/C]", ".", vcf.t, perl=TRUE)
# get positions
pos <- system(paste("grep -v '#'", "file.vcf", "| awk '{print$2}'", sep=" "),
              intern=T)
rownames(vcf.t) <- pos
write.table(vcf.t, "chr12", quote=F, sep=",", col.names=F)

After this conversion, I ran the script for the selection analysis and it threw me an error. Later, I checked the converted file and it looks like this:

8304,C|T,N,C,N,T|T,T|T,C,C|T,C,C,C|T,C,C,C,C,C,C,N,C,N,N,C,C,C,C,N,C,N,C,N,C,C,N,N,N,C,C,N,C,C,N,C,C,N,C,N,N,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,N,C,C,C,C,C,C,C,C,C,C,C,C,C,N,C,N,N,C,N,N,N,N,N,C,N,C,C,C,C,C,C,C,N,C,C
8325,C,N,C,C,C,C,C,C,T|T,N,C,N,N,N,N,C,C,N,C,N,N,C,C,C,C,N,C,C,C,C,C,.,C,N,N,C,C|T,N,N,N,N,C,C,N,C,N,N,N,N,T,N,N,C,N,C,C,C,N,C,C,C,C,C,N,C,C,C,.,.,.,C,C,C,N,C,C,C,N,C,C,N,C,N,N,N,N,N,C,N,C,C,N,C,C,C,C,N,C,C

The problem here is the | symbol for the heterozygous SNPs. I have transformed those in my R-script but still, some persist. The selection scan script does not recognize this | symbol and throws an error.

Can you please suggest to me how to correct these | symbols and obtain a clean output? My VCF file is unphased.

Thank you.

SNP bash VCF R genome • 1.7k views
ADD COMMENT
1
Entering edit mode

Switch to gsub from sub where appropriate. Also, what is this regex supposed to do - it seems unnecessarily convoluted: "[A/T/G/C]/[A/T/G/C]"

Following up, I really think you should be doing this using bcftools and other shell utilities, not R. Even with R, your code is quite inefficient. If you're using regex, do it efficiently: sub("([ATGC])[/]\\1)", "\\1", vcf.t, perl=TRUE) or sub("A/A", "A", vcf.t, fixed=TRUE); sub("G/G", "G", vcf.t, fixed=TRUE);.... Don't use perl regex unless you need it, especially when you're replacing fixed strings.

On second thought, stop using R and use grep -E/grep -F on the shell.

ADD REPLY
0
Entering edit mode

Ok, thank you very much for your comments and suggestions. I will do it accordingly and see if it works.

ADD REPLY
0
Entering edit mode

Hi Ram, I ran this code as you mentioned. But when I use the sub("([ATGC])[/]\\1)", "\\1", vcf.t, perl=TRUE) I get an error mentioning incorrect use of Regular Expression. Moreover, the same problem still persists where I have quite a few | in the output file. Although I have found another solution using a bash script to convert the data file, it would be nice to get a solution for this code.

ADD REPLY
1
Entering edit mode

Oops, I made a mistake with the regex (easy enough to correct if the error message was read properly - there's an extra ) at the end of the regex. See corrected and tested expression below:

> gsub("([ATGC])[/]\\1)", "\\1", "A/A", perl=TRUE)
Error in gsub("([ATGC])[/]\\1)", "\\1", "A/A", perl = TRUE) : 
  invalid regular expression '([ATGC])[/]\1)'
In addition: Warning message:
In gsub("([ATGC])[/]\\1)", "\\1", "A/A", perl = TRUE) :
  PCRE pattern compilation error
    'unmatched closing parenthesis'
    at ')'

> sub("([ATGC])[/]\\1", "\\1", "A/A", perl=TRUE)
[1] "A"
> sub("([ATGC])[/]\\1", "\\1", c("A/A", "G/G", "A/G"), perl=TRUE)
[1] "A"   "G"   "A/G"
ADD REPLY
0
Entering edit mode

Oh, I am embarrassed right now. Sorry for overlooking this tiny issue. However, now the original problem escalated. Previously, I just had the pipe | symbol in my output file. But after using vcf.t <- gsub("([ATGC])[/]\\1", "\\1", vcf.t, perl=TRUE) I have both the pipe | and / in the output file.

ADD REPLY
0
Entering edit mode

I think there's a mix up between your requirements and your expectations of the code. The regex we worked on will only reduce the 4 lines in your "transform homozygous sites" part of the code to a single line. It does not address missing GTs or heterozygous sites, let alone multi-allelic sites.

The | character is quite odd. Are you sure your VCF is unphased? Can you check if there are any phased GT calls and also if you have any multi-allelic sites?

Also, don't use gsub in the regex above, use sub. You know you only want to replace the first [ATGC]/[ATGC] with [ATGC], using gsub might result in bugs where the data is unpredictable (multi-allelic sites, for example).

ADD REPLY
0
Entering edit mode

Hi Ram, finally it worked. Indeed, there were some phased genotypes in my VCF. So, I converted all the | into /. Then the code worked fine. Thanks a lot for your solutions.

ADD REPLY

Login before adding your answer.

Traffic: 1655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6