Hi everyone,
I am having trouble with awk and can't get the result I'm looking for, so I'm asking for your help. I have a big file which contains more than a billion lines. This file comes from a sequencing run and looks like this:
@K00114:439:HF27YBBXX:2:1101:28209:1209 1:N:0:NGAGGCTG_NTGTAGAT
NGATGGAAGAGCCCAACAGTGAATAACATCAGTAGAGGAGGTCCTGTCT
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJ
@K00114:439:HF27YBBXX:2:1101:28229:1209 1:N:0:NGACTCCT_NTGTAGAT
NAACAAATCAGTGTTCTGTTGTTTGTCAAAATTTTGAACAAGCCTTGCG
+
#AAAAJJJAFJ7FJJFFJJJFJJJJJAJJJJJJJJJJJFJAJJF<A7<F
....
So every four lines form a new read. I would like to read this file once and test, on the first line of each read (lines 1, 5, 9, ...), whether one of the barcodes from a list matches the barcode in that line. My list of barcodes is in a separate file, which in this example could contain NGAGGCTG, AAAACCCC, AAAATTTT, etc. If there is a match, I would like to save the read in a new file. Here, the expected output would be the following, because NGAGGCTG is present both in my list and in the line starting with '@':
@K00114:439:HF27YBBXX:2:1101:28209:1209 1:N:0:NGAGGCTG_NTGTAGAT
NGATGGAAGAGCCCAACAGTGAATAACATCAGTAGAGGAGGTCCTGTCT
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJ
I should add that my reads file is gzipped, so I start with gunzip -c read_filename or zcat. Note also that '1:N:0:NGAGGCTG_NTGTAGAT' is in column 2 ($2). I tried many things, but I don't know how to read the file just once, print only the reads that match my list of patterns, and ignore the reads that don't match.
I tried something like this:
gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz | head | grep -A3 NGACTCCT | sed -n '/NGACTCCT/ {N;N;N;p;}'
But I can't manage to loop over the patterns so that NGACTCCT is replaced in turn by each barcode from my list. I also tried the awk structure awk '$2 ~ /pattern/ {for(i=1; i<=4; i++) {getline; print}}' but that failed too.
Thanks for your help!
Hi Pierre, thanks for the answer. In fact, I also did it with
But do you know how I can loop over this command to replace the pattern "NGACTCCT" with other patterns stored in a file? I have more than 50 barcodes to test.
If your read SEQUENCE (not the name) contains this DNA, you're going to mess up your input.
Use a loop like
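As an alternative to running one command per barcode, here is a minimal single-pass awk sketch. It makes several assumptions that are not confirmed in this thread: the barcode list holds one barcode per line and is not empty, only the first index (the part before the '_' in $2) has to match, and every read spans exactly four lines. The function name filter_fastq_by_barcode and the file names in the usage line are made up for the illustration.

```shell
# Hypothetical helper: print only the reads whose first index barcode is listed.
#   $1 = barcode list, one barcode per line (assumed layout, must not be empty)
#   $2 = gzipped FASTQ file
filter_fastq_by_barcode() {
    gunzip -c "$2" | awk '
        # First input (the barcode list): remember every non-empty barcode.
        NR == FNR { if ($1 != "") wanted[$1]; next }

        # Second input (the FASTQ stream): header lines are lines 1, 5, 9, ...
        FNR % 4 == 1 {
            n = split($2, part, ":")     # e.g. 1:N:0:NGAGGCTG_NTGTAGAT
            split(part[n], idx, "_")     # idx[1] = NGAGGCTG, idx[2] = NTGTAGAT
            keep = (idx[1] in wanted)    # decision holds for the next 3 lines too
        }

        # Print the line while the current read is flagged as wanted.
        keep
    ' "$1" -
}
```

Usage would look like: filter_fastq_by_barcode barcodes.txt FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz > matched.fq

Because the barcode is tested only on header lines, a sequence line that happens to contain a barcode cannot trigger a match, which avoids the problem mentioned above. The file is decompressed and read once, and the lookup is a hash test per read, so the cost does not grow with the number of barcodes.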
Thanks again, this is exactly what I was trying to obtain. Thanks a lot.