Dear all,
Assuming that I have 2 large files, file 1 looks like this:
TRINITY_DN100000_c1_g1
TRINITY_DN100000_c1_g3
TRINITY_DN100000_c1_g6
TRINITY_DN100000_c2_g1
TRINITY_DN100001_c6_g1
TRINITY_DN100001_c6_g2
TRINITY_DN100006_c1_g1
TRINITY_DN100006_c6_g3
TRINITY_DN100006_c6_g4
TRINITY_DN100006_c6_g5
TRINITY_DN100008_c3_g2
TRINITY_DN100010_c4_g4
TRINITY_DN100011_c5_g3
TRINITY_DN100011_c6_g1
TRINITY_DN100012_c2_g1
TRINITY_DN100013_c0_g1
TRINITY_DN100014_c3_g1
TRINITY_DN100014_c7_g1
TRINITY_DN100014_c8_g1
And file 2 looks like this, the first column of file 2 is similar with file 1, but not totally same:
TRINITY_DN100000_c1_g1_i3 ref|XP_017307867.1| PREDICTED: A disintegrin and metalloproteinase with thrombospondin motifs 3 [Ictalurus punctatus] 108255965 XP_017307867 87.591 137 14 5628 5218 27 160 6.59E-67 250
TRINITY_DN100000_c1_g3_i1 ref|XP_017307867.1| PREDICTED: A disintegrin and metalloproteinase with thrombospondin motifs 3 [Ictalurus punctatus] 108255965 XP_017307867 90.385 52 5 1678 1523 158 209 3.76E-21 101
TRINITY_DN100000_c1_g6_i2 ref|XP_017343211.1| PREDICTED: G0/G1 switch protein 2 [Ictalurus punctatus] 108276227 XP_017343211 88.288 111 13 714 382 1 111 2.34E-65 204
TRINITY_DN100000_c2_g1_i1 ref|XP_017307867.1| PREDICTED: A disintegrin and metalloproteinase with thrombospondin motifs 3 [Ictalurus punctatus] 108255965 XP_017307867 100 86 0 2 259 250 335 6.23E-56 186
TRINITY_DN100001_c6_g1_i4 ref|XP_017314334.1| PREDICTED: complement C3-like isoform X1 [Ictalurus punctatus] 108259386 XP_017314334 71.053 76 22 1037 810 44 119 1.3E-25 112
TRINITY_DN100001_c6_g2_i3 ref|XP_017314335.1| PREDICTED: complement C3-like isoform X2 [Ictalurus punctatus] 108259386 XP_017314335 90.909 55 5 218 54 173 227 7.23E-25 105
TRINITY_DN100003_c7_g3_i1 ref|XP_017346976.1| PREDICTED: uncharacterized protein LOC108278261 isoform X2 [Ictalurus punctatus] 108278261 XP_017346976 40.678 59 32 224 48 496 551 8.74E-08 48.1
TRINITY_DN100004_c2_g1_i4 ref|XP_017348024.1| PREDICTED: transcription factor 7 isoform X11 [Ictalurus punctatus] 108278866 XP_017348024 95.238 63 3 177 365 1 63 8.65E-31 125
TRINITY_DN100004_c2_g4_i2 ref|XP_017323173.1| PREDICTED: transcription factor 7-like 1-B [Ictalurus punctatus] 108265410 XP_017323173 92.265 181 14 412 954 402 582 2.37E-46 168
TRINITY_DN100004_c2_g6_i1 ref|XP_017323173.1| PREDICTED: transcription factor 7-like 1-B [Ictalurus punctatus] 108265410 XP_017323173 100 38 0 7 120 364 401 2.04E-19 84
TRINITY_DN100004_c5_g1_i1 ref|XP_017326725.1| PREDICTED: uncharacterized protein LOC108267243 [Ictalurus punctatus] 108267243 XP_017326725 53.086 81 30 37 255 447 527 1.79E-18 81.6
TRINITY_DN100006_c1_g1_i1 ref|XP_017331360.1| PREDICTED: zinc finger MYM-type protein 4-like isoform X3 [Ictalurus punctatus] 108269816 XP_017331360 51.111 45 22 354 220 28 72 0.00000501 45.1
TRINITY_DN100006_c6_g3_i4 ref|XP_017332404.1| PREDICTED: bifunctional epoxide hydrolase 2 [Ictalurus punctatus] 108270338 XP_017332404 90.714 140 13 3270 2851 327 466 1.33E-77 269
TRINITY_DN100006_c6_g4_i3 ref|XP_017332404.1| PREDICTED: bifunctional epoxide hydrolase 2 [Ictalurus punctatus] 108270338 XP_017332404 79.07 43 9 2044 1916 284 326 1.08E-13 75.5
TRINITY_DN100006_c6_g5_i1 ref|XP_017346197.1| PREDICTED: sortilin-related receptor isoform X2 [Ictalurus punctatus] 108277790 XP_017346197 100 25 0 78 4 499 523 3.53E-09 55.1
TRINITY_DN100008_c3_g2_i8 ref|XP_017331889.1| PREDICTED: prolactin regulatory element-binding protein isoform X1 [Ictalurus punctatus] 108270080 XP_017331889 87.5 264 29 539 1318 19 282 1.03E-134 429
TRINITY_DN100010_c4_g4_i2 ref|XP_017339405.1| PREDICTED: voltage-dependent T-type calcium channel subunit alpha-1G isoform X7 [Ictalurus punctatus] 108274069 XP_017339405 91.892 37 3 1360 1250 1975 2011 4.63E-12 70.5
TRINITY_DN100011_c5_g3_i1 ref|XP_017321691.1| PREDICTED: lactase-like protein [Ictalurus punctatus] 108264568 XP_017321691 96.552 29 1 87 1 527 555 1.63E-11 62
TRINITY_DN100011_c6_g1_i1 ref|XP_017310205.1| PREDICTED: engulfment and cell motility protein 1 isoform X1 [Ictalurus punctatus] 108257187 XP_017310205 100 22 0 67 2 362 383 0.00000161 50.8
TRINITY_DN100012_c2_g1_i2 ref|XP_017351172.1| PREDICTED: disks large homolog 1 isoform X16 [Ictalurus punctatus] 108280564 XP_017351172 96.078 153 6 642 184 139 291 4.57E-87 298
TRINITY_DN100013_c0_g1_i3 ref|XP_017329849.1| PREDICTED: sorbin and SH3 domain-containing protein 2-like isoform X17 [Ictalurus punctatus] 108268954 XP_017329849 76.687 163 34 2448 1972 104 266 1.61E-57 218
TRINITY_DN100014_c3_g1_i1 sp|O00370|LORF2_HUMAN LINE-1 retrotransposable element ORF2 protein OS=Homo sapiens OX=9606 PE=1 SV=1 O00370 0 43.59 78 42 455 228 768 845 6.15E-13 68.6
TRINITY_DN100014_c7_g1_i4 ref|XP_017349364.1| PREDICTED: uncharacterized protein LOC108279543 [Ictalurus punctatus] 108279543 XP_017349364 40.984 61 36 1204 1386 103 163 4.35E-14 60.5
TRINITY_DN100014_c8_g1_i1 ref|XP_017326732.1| PREDICTED: uncharacterized protein LOC108267251 [Ictalurus punctatus] 108267251 XP_017326732 73.438 64 17 478 669 281 344 1.24E-18 85.1
TRINITY_DN100014_c8_g2_i1 ref|XP_017317911.1| PREDICTED: uncharacterized protein LOC108261586 [Ictalurus punctatus] 108261586 XP_017317911 62.281 114 41 1324 1659 475 588 1.1E-57 103
Now, I want to extracted the line in file 2 according the first column that matched with file 1, I have two scripts:
grep -Fwf file1.txt file2.txt > results
awk 'NR==FNR{x[$0];next}{for(i in x)if($0~i)print}' file1.txt file2.txt
But the results only contain one line:
TRINITY_DN100014_c8_g1_i1 ref|XP_017326732.1| PREDICTED: uncharacterized protein LOC108267251 [Ictalurus punctatus] 108267251 XP_017326732 73.438 64 17 478 669281 344 1.24E-18 85.1
It should not be correct, because thiere is 19 line in file 2 that can match with file1.
So if you know how should I correct the script..?
Thank you!
Your
awk
command works perfectly on my system (default Gawk). Yourgrep
command returns nothing for me. I'm running Ubuntu 16.04Be sure that your files are tab-delimited. Also, if you copied them from a Windows system, then they may have extraneous characters that can be removed with
dos2unix
Hi, thanks for your kindly reply! I am using MACbook, and the txt file is copied from xlsx file, if I will do anything to make it tab-delimited?
Perhaps the tab-delimited part is not so key due to this section of your code
if($0~i)
, which ultimately searches for the pattern from file1 in the entire line ofrom file2How did you copy the data from Excel to TXT?
Hi, I just used "command+c" and "command+v" to copy the data...It worked!! at first I copied directly from xlsx, now I paste it first to byword and then paste to txt file, then it worked!!
Thank you!!
Okay, it is probably related to what Ram said, and something to do with the endline characters.
Just remove the -w option in the grep command. It works for me on my macbookpro!
Hi, this answer has already been given. I'm moving this post to a comment.