Grep Position List
3
0
Entering edit mode
7.5 years ago
BAGeno ▴ 190

Hi,

I have VCF file and list of position file. I want to grep those records from VCF which have position corresponding to position present in position file. For this purpose I used this command.

grep -wfF Position.txt myVCF > insec.vcf

here is the list of position.

`POS
10248
10321
10327
10492
10583
12783
13116
13118
13273
13302
14354
14542
14907
14930
15029
15118
15190
15208
15211
15274
`

Here is the result

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LP6005441-DNA_G11
chr1    10492   rs55998931      C       T       182.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=2.356;ClippingRankSum=0;DB;DP=4
chr1    10583   rs58108140      G       A       170.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-3.006;ClippingRankSum=0;DB;DP=
chr1    12783   rs62635284      G       A       522.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=4.292;ClippingRankSum=0;DB;DP=2
chr1    13116   rs62635286      T       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.301;ClippingRankSum=0;DB;DP=
chr1    13118   rs62028691      A       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.009;ClippingRankSum=0;DB;DP=
chr1    13273   rs531730856     G       C       675.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.577;ClippingRankSum=0;DB;DP=
chr1    13302   rs180734498     C       T       135.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=1.88;ClippingRankSum=0;DB;DP=41
chr1    13417   rs777038595     C       CGAGA   606.73  PASS   AC=1;AF=0.5;AN=2;BaseQRankSum=-2.053;ClippingRankSum=0;DB;DP=
chr1    13896   rs201696125     C       A       195.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.19;ClippingRankSum=0;DB;DP=5
chr1    14699   rs62635298      C       G       160.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=1.278;ClippingRankSum=0;DB;DP=3
chr1    14907   rs6682375       A       G       922.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.114;ClippingRankSum=0;DB;DP=

But it is giving some extra records as well like record containing 13415,13896. These positions are not present in list of positions. What shall I do to overcome this problem?

grep VCF • 5.7k views
ADD COMMENT
0
Entering edit mode

Hi,

Use same command (without -w) by including <tab>YOUR_POSITION<tab> in 'Position.txt' file. This might work.

Example to add tabs:

sed -i 's/^/\t/' Position.txt

sed -i 's/$/\t/' Position.txt
ADD REPLY
0
Entering edit mode

It worked :) Thank you

ADD REPLY
0
Entering edit mode
7.5 years ago

Grep is not all that well suited for filtering VCF files, or in general, column-oriented files where you'd want to apply the filter on just one column. Perhaps you could extend the pattern to include the leading and trailing tabs.

See bcftools for a whole range of options for filtering that initially may look a little more complicated yet will serve you well as soon as you want to apply more diverse filters:

https://samtools.github.io/bcftools/bcftools.html

ADD COMMENT
0
Entering edit mode

I had also used bcftools isec but it gave empty intersection files. I want to compare two VCF. These files my have same position but different rsIDs on that position.

ADD REPLY
0
Entering edit mode

Have you tried vcf-compare from vcftools?

ADD REPLY
0
Entering edit mode

On the theme of filtering column-oriented files, if you are prepared to code stuff up yourself, consider awk within the shell, or modules like csv within python etc.

ADD REPLY
0
Entering edit mode
7.5 years ago
dschika ▴ 320

What about vcftools:

--positions

Include or exclude a set of sites on the basis of a list of positions in a file. Each line of the input file should contain a (tab-separated) chromosome and position. The file can have comment lines that start with a "#", they will be ignored.

ADD COMMENT
0
Entering edit mode

I tried this command but it did not work.

 vcftools --vcf myNew.vcf --positions Chr.pos.tsv --out Filtered_File
ADD REPLY
0
Entering edit mode

Try adding:

--recode
ADD REPLY
0
Entering edit mode

This is giving VCF file with just header.

ADD REPLY
0
Entering edit mode

Hm...could you please double check:

How does Chr.pos.tsv look like? Have you included chromosomes? Are chromosome and position tab-separated?

How does myNew.vcf look like? Is there more than just the header-section? Are there chromosome-position pairs matching your Chr.pos.tsv file?

ADD REPLY
0
Entering edit mode

This is what myNew.vcf look like.

##bcftools_filterCommand=filter -i DP>=20 myNew.raw30.vcf
##bcftools_filterCommand=filter -i GQ>=30 New.DP20.vcf
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverTime=April01,2017
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LP6005441-DNA_G11
chr1    10492   rs55998931      C       T       182.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=2.356;ClippingRankSum=0;DB;DP=42;ExcessHet=3.0103;FS=0;MLEAC=1;
chr1    10583   rs58108140      G       A       170.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-3.006;ClippingRankSum=0;DB;DP=37;ExcessHet=3.0103;FS=0;MLEAC=1
chr1    12783   rs62635284      G       A       522.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=4.292;ClippingRankSum=0;DB;DP=29;ExcessHet=3.0103;FS=0;MLEAC=1;
chr1    13116   rs62635286      T       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.301;ClippingRankSum=0;DB;DP=28;ExcessHet=3.0103;FS=8.671;MLE
chr1    13118   rs62028691      A       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.009;ClippingRankSum=0;DB;DP=27;ExcessHet=3.0103;FS=5.521;MLE
chr1    13273   rs531730856     G       C       675.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.577;ClippingRankSum=0;DB;DP=46;ExcessHet=3.0103;FS=6.947;MLE

This is what Chr.pos.tsv look like.

##reference=file:///groups/reich/reference-genomes/hs37d5/unzipped/hs37d5.fa
##bcftools_viewVersion=1.2+htslib-1.2.1
##bcftools_viewCommand=view -h /n/data1/hms/genetics/reich/1000Genomes/cteam_remap/A-samples/LP6005441-DNA_G11/annotvcf/LP6005441-DNA_G11.annotated.vcf.bgz
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverFile=/data1/Sarah/database/hg19ToHg38.over.chain.gz
##new_reference_genome=/data1/Sarah/database/new_hg38.fa
##liftOverTime=May22,2017
#CHROM  POS
1       10248
1       10321
1       10327
1       10492
1       10583
1       12783
1       13116
1       13118
1       13273
ADD REPLY
0
Entering edit mode

Well, "chr1" is not equal "1" ...

ADD REPLY
0
Entering edit mode
7.5 years ago

Why not use R with the merge command?

data1 <- read.table("VCF File", header=F, sep="\t")

data2 <- read.table("Position File", header=F, sep="\t") 
merge.results <- merge(data2, data1, by.x="V1", by.y="V2", all=F)

write.table(merge.results, "VCF Subset", col.names=F, row.names=F, quote=F, sep="\t")
ADD COMMENT
0
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY

Login before adding your answer.

Traffic: 2590 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6