Dear all,
I may be naive, as I am fairly new to bioinformatics. I have two files. The first is a standard vcf file, with column 1 holding the chromosome (#CHROM) and column 2 the position (POS). The second is a txt file listing the "CHROM" values of interest, with only one entry per chromosome. The data come from de novo RNA assemblies, so instead of chromosomes 1, 2, 3, ... the #CHROM column contains thousands of contigs.
e.g.
file1
#CHROM POS ID REF ALT QUAL FILTER
TRINITY_DN4621_c0_g1 45 . G T 6641.77 PASS
TRINITY_DN4621_c0_g1 304 . T A 9057.77 PASS
TRINITY_DN12351_c0_g1 34 . G T 131.03 PASS
TRINITY_DN12351_c0_g1 328 . T C 1795.77 PASS
TRINITY_DN12351_c0_g1 774 . C T 1649.77 PASS
TRINITY_DN12351_c0_g1 942 . G A 2202.77 PASS
TRINITY_DN12351_c0_g1 1035 . T A 4024.77 PASS
TRINITY_DN12351_c0_g1 1224 . A T 7691.77 PASS
TRINITY_DN12351_c0_g1 1821 . A T 4930.77 PASS
TRINITY_DN12351_c0_g1 2133 . T A 4647.77 PASS
TRINITY_DN12351_c0_g1 2160 . G A 2677.77 PASS
TRINITY_DN12351_c0_g1 2241 . A G 2563.77 PASS
TRINITY_DN12351_c0_g1 2631 . A C 5120.77 PASS
TRINITY_DN11255_c4_g2 212 . T C 200.84 PASS
TRINITY_DN11255_c4_g2 491 . G A 3052.77 PASS
TRINITY_DN11255_c4_g2 581 . C T 3994.77 PASS
TRINITY_DN11255_c4_g2 639 . A G 3725.77 PASS
TRINITY_DN12185_c0_g1 713 . A T 4053.77 PASS
TRINITY_DN12185_c0_g1 733 . T A 3150.77 PASS
TRINITY_DN576_c0_g1 1389 . T A 160.8 PASS
TRINITY_DN7282_c0_g1 127 . A G 94.28 PASS
TRINITY_DN11386_c5_g2 109 . A G 79.28 PASS
TRINITY_DN11386_c5_g2 157 . T A 54.74 PASS
TRINITY_DN11386_c5_g1 660 . G A 18333.8 PASS
TRINITY_DN11386_c5_g1 1002 . A C 23923.8 PASS
TRINITY_DN11386_c5_g1 1341 . C A 18387.8 PASS
TRINITY_DN12464_c8_g1 417 . G A 8615.77 PASS
file2
TRINITY_DN4621_c0_g1
TRINITY_DN12351_c0_g1
TRINITY_DN11255_c4_g2
TRINITY_DN12185_c0_g1
TRINITY_DN576_c0_g1
TRINITY_DN7282_c0_g1
TRINITY_DN11386_c5_g2
TRINITY_DN11386_c5_g1
TRINITY_DN12464_c8_g1
TRINITY_DN12481_c4_g1
TRINITY_DN12018_c2_g1
TRINITY_DN12013_c0_g1
TRINITY_DN2189_c0_g1
TRINITY_DN739_c0_g1
TRINITY_DN11060_c0_g1
TRINITY_DN11770_c2_g1
My question is: how can I extract, for every contig listed in file2, all of its positions in file1? I need both the #CHROM column and the POS column.
Thanks a lot for any help
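One possible approach, sketched with awk: read file2 first and store each contig name, then print the first two columns of every file1 row whose #CHROM was stored. This is only a sketch under a few assumptions: the file names file1.vcf and file2.txt are placeholders, fields are whitespace-separated, header lines start with "#", and the DN99999 row is a made-up contig added purely to show a row being dropped.

```shell
# Sample inputs built from a few rows of the post; file names are placeholders.
workdir=$(mktemp -d)
printf '%s\n' \
  '#CHROM POS ID REF ALT QUAL FILTER' \
  'TRINITY_DN4621_c0_g1 45 . G T 6641.77 PASS' \
  'TRINITY_DN4621_c0_g1 304 . T A 9057.77 PASS' \
  'TRINITY_DN99999_c0_g1 10 . A C 50.00 PASS' \
  'TRINITY_DN576_c0_g1 1389 . T A 160.8 PASS' > "$workdir/file1.vcf"
# (the DN99999 row above is hypothetical, not from the post)
printf '%s\n' 'TRINITY_DN4621_c0_g1' 'TRINITY_DN576_c0_g1' > "$workdir/file2.txt"

# First file (file2): remember every contig name as an array key.
# Second file (file1): skip header lines, keep rows whose column 1 is a
# stored key, and print only CHROM and POS, tab-separated.
awk 'NR==FNR { keep[$1]; next }
     !/^#/ && ($1 in keep) { print $1 "\t" $2 }' \
    "$workdir/file2.txt" "$workdir/file1.vcf" > "$workdir/chrom_pos.tsv"

cat "$workdir/chrom_pos.tsv"
```

On real data only the awk command is needed, pointed at your own two files. Note that an empty file2 would make awk treat file1 as the first file, so it is worth checking that file2 has content.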
The title of this thread should be changed to "Extract columns from a vcf file using identifiers from a second file".
"Extract columns from a vcf file" is a different requirement and will lead to a different answer (e.g. "Extract several fields from vcf file").
Thanks a lot for the comment. I changed the title accordingly.
Thanks in advance. I have a problem comparing numbers between my files: some numbers are contained within other numbers, so I get more matching lines than expected. I am getting the results from this code: "awk -v OFS="\t" 'FNR==NR {a[$1];next} $1 in a' file2.txt file1.txt |cut -f1,2". Despite many attempts, I could not solve the problem.
Thank you for your interest.
For example:
file1
chr01 168062 168062 B C C T G
chr01 180433 180433 B C C A G
chr01 183888 183888 B C C T A
chr01 201158 201158 B G G C T
file2 (vcf format)
chr01 21165 . C T . . . GT:CLCAD2:DP 1:0,16:23
chr01 180433 . T G . . . GT:CLCAD2:DP 1:0,17:23
chr01 201158 . G C . . . GT:CLCAD2:DP 0/1:14,13:27
chr01 212233 . A T . . . GT:CLCAD2:DP 0/1:17,16:33
expected output
chr01 180433 . T G . . . GT:CLCAD2:DP 1:0,17:23
chr01 201158 . G C . . . GT:CLCAD2:DP 0/1:14,13:27
Thank you very much for your response.
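For what it is worth, the awk command quoted above keys only on column 1, so every row sharing a chromosome name matches. A sketch of one way to require both chromosome and position to match as whole fields is shown here. The file names positions.txt (the list, "file1" in the example) and variants.vcf (the vcf, "file2" in the example) are placeholders, and whitespace-separated fields are assumed.

```shell
# Sample inputs copied from the example in the post; file names are placeholders.
workdir=$(mktemp -d)
printf '%s\n' \
  'chr01 168062 168062 B C C T G' \
  'chr01 180433 180433 B C C A G' \
  'chr01 183888 183888 B C C T A' \
  'chr01 201158 201158 B G G C T' > "$workdir/positions.txt"
printf '%s\n' \
  'chr01 21165 . C T . . . GT:CLCAD2:DP 1:0,16:23' \
  'chr01 180433 . T G . . . GT:CLCAD2:DP 1:0,17:23' \
  'chr01 201158 . G C . . . GT:CLCAD2:DP 0/1:14,13:27' \
  'chr01 212233 . A T . . . GT:CLCAD2:DP 0/1:17,16:33' > "$workdir/variants.vcf"

# First file (the list): store each chromosome/position pair as a compound key.
# Second file (the vcf): skip header lines and print rows whose
# (column 1, column 2) pair was stored. Whole fields are compared, so a
# position is never matched as a substring of a longer number.
awk 'NR==FNR { want[$1, $2]; next }
     !/^#/ && (($1, $2) in want)' \
    "$workdir/positions.txt" "$workdir/variants.vcf" > "$workdir/matched.vcf"

cat "$workdir/matched.vcf"
```

With the example data this keeps the rows at positions 180433 and 201158, which is the expected output listed in the post.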
Please do not add new questions as answers to existing posts. You should probably ask this as a new question.