vcf files - extract SNPs
1
0
Entering edit mode
6.7 years ago
paraskevopou ▴ 20

Hi there all! I have 2 vcf files with filtered SNPs and now I want to extract a. The genes that are shared between the 2 vcf files b. The SNPs that are different on the shared genes between these two vcf files. for example 1.vcf

TRINITY_DN17985_c2_g1    539            0      G     A
TRINITY_DN17985_c2_g1   576         0   G   A 
TRINITY_DN17985_c2_g1   710         0   A   T
TRINITY_DN17985_c2_g1   722         0   T   C
TRINITY_DN17985_c2_g1   822         0   A   G
TRINITY_DN17985_c2_g1   1608    0   T   G
TRINITY_DN17985_c2_g1   2353    0   T   A
TRINITY_DN17985_c2_g1   4371    0   T   A
TRINITY_DN17985_c2_g1   6200    0   C   T
TRINITY_DN17985_c2_g1   7444    0   G   T
TRINITY_DN17985_c2_g1   7456    0   A   T
TRINITY_DN17985_c2_g1   2434    0   G   T
TRINITY_DN17985_c2_g1   4190    0   G   A
TRINITY_DN17985_c2_g1   4275    0   A   G
TRINITY_DN15509_c1_g1   626         0   A   C
TRINITY_DN15509_c1_g1   1647    0   G   A
TRINITY_DN15425_c1_g4   741         0   G   A
TRINITY_DN12988_c0_g1   1553    0   T   A
TRINITY_DN14521_c2_g1   826         0   T   A

2.vcf

TRINITY_DN17985_c2_g1   539         0      G     T
TRINITY_DN17985_c2_g1   576         0   G   T 
TRINITY_DN17985_c2_g1   710         0   A   G
TRINITY_DN17985_c2_g1   722         0   T   C
TRINITY_DN17985_c2_g1   822         0   A   G
TRINITY_DN17985_c2_g1   1608    0   T   G
TRINITY_DN17985_c2_g1   2353    0   T   A
TRINITY_DN17985_c2_g1   4371    0   T   A
TRINITY_DN17985_c2_g1   6200    0   C   T
TRINITY_DN17985_c2_g1   7444    0   G   T
TRINITY_DN17985_c2_g1   7456    0   A   T
TRINITY_DN16743_c0_g1   5739    0   G   A
TRINITY_DN16743_c0_g1   5764    0   T   A
TRINITY_DN16743_c0_g2   199         0   A   T
TRINITY_DN16295_c0_g1   982         0   A   G

and I want to extract only the following for which the ALT allele is different between 1.vcf and 2.vcf files.

TRINITY_DN17985_c2_g1   539         0      G     A
TRINITY_DN17985_c2_g1   576         0   G   A 
TRINITY_DN17985_c2_g1   710         0   A   T

TRINITY_DN17985_c2_g1   539         0      G     T
TRINITY_DN17985_c2_g1   576         0   G   T 
TRINITY_DN17985_c2_g1   710         0   A   G

I know that I can use bcftools isec for getting the SNPs that are shared between the two or the SNPs that are unique in the first and second file respectively. But how I can get first the genes that are present in both files?

Thanks a lot in advance for any help! Sofia

RNA-Seq SNP • 1.4k views
ADD COMMENT
0
Entering edit mode

I suppose this is not the format that the post should have. The VCF format doesn't look like a VCF format. Can you please wrap it around the code flag? (little button with 101010)

ADD REPLY
0
Entering edit mode

I have just done that (wrapped it in code format)

ADD REPLY
3
Entering edit mode
6.7 years ago

If the first column is the "genes", then:

bedtools intersect -wa -wb -a 1.vcf -b 2.vcf > shared.vcf
grep -v '^#' shared.vcf | cut -f1 -d' ' shared.vcf | sort | uniq  # list of unique genes

Then, assuming it's the same reference (ie ref is the same on the same position) and that both vcf files actually have 10 columns:

grep -v '^#' shared.vcf | awk '$5!=$15' > diff.vcf   # differing positions, no header

Of course, depending on what you want to do with that file afterwards, you may want to restore the header and remove columns 11-20 (or add them as new rows etc).

ADD COMMENT

Login before adding your answer.

Traffic: 2388 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6