Question

Extract columns from a vcf file using identifiers from a second file

0

Entering edit mode

7.0 years ago

paraskevopou ▴ 20

Dears people Maybe I am too naive but I am pretty new to bioinformatics. I have two files. One is a normal vcf file with column1 having the CHROMOSOME information and column2 the POSITION information. My second file is a txt file that contains the "CHROM" of interest. However in the second file there is only one entry per chromosome. Data are from denovo RNA assemblies, so instead of 1,2,3, chromosoms I have thousands of contigs in the #CHROM entry.
e.g. file1

  #CHROM               POS  ID  REF ALT QUAL    FILTER
TRINITY_DN4621_c0_g1    45  .   G   T   6641.77 PASS
TRINITY_DN4621_c0_g1    304 .   T   A   9057.77 PASS
TRINITY_DN12351_c0_g1   34  .   G   T   131.03  PASS
TRINITY_DN12351_c0_g1   328 .   T   C   1795.77 PASS
TRINITY_DN12351_c0_g1   774 .   C   T   1649.77 PASS
TRINITY_DN12351_c0_g1   942 .   G   A   2202.77 PASS
TRINITY_DN12351_c0_g1   1035    .   T   A   4024.77 PASS
TRINITY_DN12351_c0_g1   1224    .   A   T   7691.77 PASS
TRINITY_DN12351_c0_g1   1821    .   A   T   4930.77 PASS
TRINITY_DN12351_c0_g1   2133    .   T   A   4647.77 PASS
TRINITY_DN12351_c0_g1   2160    .   G   A   2677.77 PASS
TRINITY_DN12351_c0_g1   2241    .   A   G   2563.77 PASS
TRINITY_DN12351_c0_g1   2631    .   A   C   5120.77 PASS
TRINITY_DN11255_c4_g2   212 .   T   C   200.84  PASS
TRINITY_DN11255_c4_g2   491 .   G   A   3052.77 PASS
TRINITY_DN11255_c4_g2   581 .   C   T   3994.77 PASS
TRINITY_DN11255_c4_g2   639 .   A   G   3725.77 PASS
TRINITY_DN12185_c0_g1   713 .   A   T   4053.77 PASS
TRINITY_DN12185_c0_g1   733 .   T   A   3150.77 PASS
TRINITY_DN576_c0_g1 1389    .   T   A   160.8   PASS
TRINITY_DN7282_c0_g1    127 .   A   G   94.28   PASS
TRINITY_DN11386_c5_g2   109 .   A   G   79.28   PASS
TRINITY_DN11386_c5_g2   157 .   T   A   54.74   PASS
TRINITY_DN11386_c5_g1   660 .   G   A   18333.8 PASS
TRINITY_DN11386_c5_g1   1002    .   A   C   23923.8 PASS
TRINITY_DN11386_c5_g1   1341    .   C   A   18387.8 PASS
TRINITY_DN12464_c8_g1   417 .   G   A   8615.77 PASS

file2

TRINITY_DN4621_c0_g1    
TRINITY_DN12351_c0_g1   
TRINITY_DN11255_c4_g2   
TRINITY_DN12185_c0_g1   
TRINITY_DN576_c0_g1 
TRINITY_DN7282_c0_g1    
TRINITY_DN11386_c5_g2   
TRINITY_DN11386_c5_g1   
TRINITY_DN12464_c8_g1   
TRINITY_DN12481_c4_g1   
TRINITY_DN12018_c2_g1   
TRINITY_DN12013_c0_g1   
TRINITY_DN2189_c0_g1    
TRINITY_DN739_c0_g1 
TRINITY_DN11060_c0_g1   
TRINITY_DN11770_c2_g1

My question is how can I extract the information present in file2 but for all the positions in file1? I need both columns the #CHROM column and the #POS column

Thanks a lot for any help

snp • 5.7k views

ADD COMMENT • link updated 3.5 years ago by GenoMax 151k • written 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

Title of this thread should be changed to "Extract columns from a vcf file using identifiers from a second file".

"Extract columns from a vcf file " is a different requirement and will lead to a different answer (e.g. Extract several fields from vcf file )

ADD REPLY • link 7.0 years ago by GenoMax 151k

0

Entering edit mode

Thanks a lot for the comment. I changed the title accordingly.

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

Thanks advance, I have problem in comparison of numbers in my files, because some numbers included in another number. therefore it is getting more matching lines. I am getting the results from "awk -v OFS="\t" 'FNR==NR {a[$1];next} $1 in a' file2.txt file1.txt |cut -f1,2" this code. So, I could not solve the problem tough so many efforts.

Thank your interests

ADD REPLY • link 3.5 years ago by karciharun42 • 0

0

Entering edit mode

for example;

file1 chr01 168062 168062 B C C T G

chr01 180433 180433 B C C A G

chr01 183888 183888 B C C T A

chr01 201158 201158 B G G C T

file2 vcf format chr01 21165 . C T . . . GT:CLCAD2:DP 1:0,16:23

chr01 180433 . T G . . . GT:CLCAD2:DP 1:0,17:23

chr01 201158 . G C . . . GT:CLCAD2:DP 0/1:14,13:27

chr01 212233 . A T . . . GT:CLCAD2:DP 0/1:17,16:33

expected output

chr01 180433 . T G . . . GT:CLCAD2:DP 1:0,17:23

chr01 201158 . G C . . . GT:CLCAD2:DP 0/1:14,13:27

thank you very much for your responce

ADD REPLY • link 3.5 years ago by karciharun42 • 0

0

Entering edit mode

Please do not add new questions as answers to existing posts. You should probably ask this as a new question.

ADD REPLY • link 3.5 years ago by GenoMax 151k

1

Entering edit mode

7.0 years ago

Candah ▴ 80

grep -f file2 file1 | awk '{print $1, '\t', $2}'

the first part will grep -f will use the pattens in the first file and extract those of the second file. Whereas awk '{print $1, '\t', $2}' will print the first column and second column tab seperated.

ADD COMMENT • link 7.0 years ago by Candah ▴ 80

0

Entering edit mode

Hi!! Thanks a lot for the comment. I tried the above command but it doesn't work. Number of rows between the two files are not equal.

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

strange it works for me on your example.

ADD REPLY • link 7.0 years ago by Candah ▴ 80

0

Entering edit mode

paraskevopou : Are you using grep on macOS?

ADD REPLY • link 7.0 years ago by GenoMax 151k

0

Entering edit mode

No I work on regular linux. I tried with the example and it works but it doesn't for the full dataset. Do you have any idea where might be the problem?

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

You will have to clarify in what way the command does not work. Did you try @cpad0112's version?

ADD REPLY • link 7.0 years ago by GenoMax 151k

0

Entering edit mode

I think the problem is that file1 is tab-delimited file

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

it should not matter. I guess file2 records might have white space at the end.

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

the command works but I get an empty file at the end. I tried to remove the white spaces with the following command but still my output is an empty file.

tr -d " \t" < file2.txt > file2_rs.txt

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

score 2 · Accepted Answer · 2018-05-08

2

Entering edit mode

7.0 years ago

cpad0112 21k

`$ grep -f file2 file1 | cut -f1,2`

(assuming that records are tab separated and entries in file2 have no white spaces at the end)

ADD COMMENT • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

you can also try this:

 awk -v OFS="\t" 'FNR==NR {a[$1];next} $1 in a' file2.txt file1.txt  |cut -f1,2

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

The command works but I get an empty file at the end. I tried to remove the white spaces with the following command but still my output is an empty file.

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

could you please upload file some where (file1) and share the link if the data is not private?

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

Here is the link https://drive.google.com/open?id=1nWRK531_Mf39REjq8TcC3aQR0X8AP0aQ Thanks a lot

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

1

Entering edit mode

run following commands:

$ sudo apt-get install dos2unix -y
$ dos2unix file1 file2
$ grep -f file2 file1

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

I was just thinking the same... encoding issue.

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

1

Entering edit mode

You can also just use the following (independent of dos2unix):

tr -d '\r' < file1.txt > file1.new.txt
tr -d '\r' < file2.txt > file2.new.txt
awk 'FNR==NR {id[$1]; next} {if ($1 in id) print}' file2.new.txt file1.new.txt

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks a lot. The problem was solved.

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

score 2 · Accepted Answer · 2018-05-08

2

Entering edit mode

7.0 years ago

Biojl ★ 1.7k

Assuming that the separating character is '\t' I think that join is faster for larger files. In order to use the join command the files have to be sorted by the common field (in this case in both files is the first field, but it could be different).

join -t $'\t' <(sort -k1,1 file1) <(sort -k1,1 file2)

From here you can extract whatever you want and redirect it to a file

join -t $'\t' <(sort -k1,1 file1) <(sort -k1,1 file2) | cut -f1,2 > newfile.txt

ADD COMMENT • link 7.0 years ago by Biojl ★ 1.7k

0

Entering edit mode

Thanks a lot for the reply. I cannot understand why my output file is empty. I tried to remove all the gaps from both files with the following command, but still my output is an empty file.

tr -d " \t" < file2.txt > file2_rs.txt

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

You should always provide the original files so people can run tests. Also this tr command you are using is removing all tabs preceded by a space... which is not useful at all.

If your separating character is one space just try join -t $' ' <(sort -k1,1 file1) <(sort -k1,1 file2)

ADD REPLY • link 7.0 years ago by Biojl ★ 1.7k

0

Entering edit mode

I am so sorry for this. Here is a link with the original files https://drive.google.com/open?id=1nWRK531_Mf39REjq8TcC3aQR0X8AP0aQ

ADD REPLY • link 7.0 years ago by paraskevopou ▴ 20

0

Entering edit mode

You are working with MAC right? You have a weird file structure, it has \r (mac) and \n (unix). To fix them you have to remove the \r:

[jl@gaia drive-download-20180509T103851Z-001]$ tr -d '\r' < file1.txt > unixfile1.txt 
[jl@gaia drive-download-20180509T103851Z-001]$ tr -d '\r' < file2.txt > unixfile2.txt 
[jl@gaia drive-download-20180509T103851Z-001]$ join -t $'\t' <(sort -k1,1 unixfile1.txt) <(sort -k1,1 unixfile2.txt) | head
TRINITY_DN10000_c0_g1   1000
TRINITY_DN10000_c0_g1   920
TRINITY_DN10000_c0_g1   931
TRINITY_DN10000_c0_g1   957
TRINITY_DN10001_c0_g1   1010
TRINITY_DN10001_c0_g1   1028
TRINITY_DN10001_c0_g1   107
TRINITY_DN10001_c0_g1   1139
TRINITY_DN10001_c0_g1   1154
TRINITY_DN10001_c0_g1   1175

ADD REPLY • link 7.0 years ago by Biojl ★ 1.7k

score 2 · Accepted Answer · 2018-05-09

cat file1
#CHROM               POS  ID  REF ALT QUAL    FILTER
TRINITY_DN4621_c0_g1    45  .   G   T   6641.77 PASS
TRINITY_DN4621_c0_g1    304 .   T   A   9057.77 PASS
TRINITY_DN12351_c0_g1   34  .   G   T   131.03  PASS
TRINITY_DN12351_c0_g1   328 .   T   C   1795.77 PASS
TRINITY_DN12351_c0_g1   774 .   C   T   1649.77 PASS
TRINITY_DN12351_c0_g1   942 .   G   A   2202.77 PASS
TRINITY_DN12351_c0_g1   1035    .   T   A   4024.77 PASS
TRINITY_DN12351_c0_g1   1224    .   A   T   7691.77 PASS
TRINITY_DN12351_c0_g1   1821    .   A   T   4930.77 PASS
TRINITY_DN12351_c0_g1   2133    .   T   A   4647.77 PASS
TRINITY_DN12351_c0_g1   2160    .   G   A   2677.77 PASS
TRINITY_DN12351_c0_g1   2241    .   A   G   2563.77 PASS
TRINITY_DN12351_c0_g1   2631    .   A   C   5120.77 PASS
TRINITY_DN11255_c4_g2   212 .   T   C   200.84  PASS
TRINITY_DN11255_c4_g2   491 .   G   A   3052.77 PASS
TRINITY_DN11255_c4_g2   581 .   C   T   3994.77 PASS
TRINITY_DN11255_c4_g2   639 .   A   G   3725.77 PASS
TRINITY_DN12185_c0_g1   713 .   A   T   4053.77 PASS
TRINITY_DN12185_c0_g1   733 .   T   A   3150.77 PASS
TRINITY_DN576_c0_g1 1389    .   T   A   160.8   PASS
TRINITY_DN7282_c0_g1    127 .   A   G   94.28   PASS
TRINITY_DN11386_c5_g2   109 .   A   G   79.28   PASS
TRINITY_DN11386_c5_g2   157 .   T   A   54.74   PASS
TRINITY_DN11386_c5_g1   660 .   G   A   18333.8 PASS
TRINITY_DN11386_c5_g1   1002    .   A   C   23923.8 PASS
TRINITY_DN11386_c5_g1   1341    .   C   A   18387.8 PASS
TRINITY_DN12464_c8_g1   417 .   G   A   8615.77 PASS

.

cat file2
TRINITY_DN4621_c0_g1
TRINITY_DN11255_c4_g2
TRINITY_DN12185_c0_g1
TRINITY_DN576_c0_g1

.

awk 'FNR==NR {id[$1]; next} /^#/ {print}; !/^#/ {if ($1 in id) {print}}' file2 file1
#CHROM               POS  ID  REF ALT QUAL    FILTER
TRINITY_DN4621_c0_g1    45  .   G   T   6641.77 PASS
TRINITY_DN4621_c0_g1    304 .   T   A   9057.77 PASS
TRINITY_DN11255_c4_g2   212 .   T   C   200.84  PASS
TRINITY_DN11255_c4_g2   491 .   G   A   3052.77 PASS
TRINITY_DN11255_c4_g2   581 .   C   T   3994.77 PASS
TRINITY_DN11255_c4_g2   639 .   A   G   3725.77 PASS
TRINITY_DN12185_c0_g1   713 .   A   T   4053.77 PASS
TRINITY_DN12185_c0_g1   733 .   T   A   3150.77 PASS
TRINITY_DN576_c0_g1 1389    .   T   A   160.8   PASS

.

awk 'FNR==NR {id[$1]; next} /^#/ {print $1"\t"$2}; !/^#/ {if ($1 in id) {print $1"\t"$2}}' file2 file1
#CHROM  POS
TRINITY_DN4621_c0_g1    45
TRINITY_DN4621_c0_g1    304
TRINITY_DN11255_c4_g2   212
TRINITY_DN11255_c4_g2   491
TRINITY_DN11255_c4_g2   581
TRINITY_DN11255_c4_g2   639
TRINITY_DN12185_c0_g1   713
TRINITY_DN12185_c0_g1   733
TRINITY_DN576_c0_g1 1389