Dear all,
I have a large fq.gz file.
I know this file is the result of the concatenation of mutiple fq.gz files (and has not been shuffled / modified since the concatenation).
Thing is, I don't have those previous "multiple fq.gz" files anymore. All I have is the name of the first and last seq IDs in those deleted fq.gz files (and the number of total lines in files).
I am trying to recreate those files (weird edge case, plz don't ask me why :-p).
This is where I am at, using the seqs IDs :
awk '/seq_id_start/{f=1} /seq_id_end/ {x=NR+3}(NR<=x) {f=0;print} f' <(zcat large_file.fq.gz);
# With actual seqs IDs (dummy exemple for the first 3 seqs in file (all between seq_1_id & seq_3_id))
awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {x=NR+3}(NR<=x) {f=0;print} f' <(zcat large_file.fq.gz);
@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1
ATGNAAGTCAAAAAAGTCATCGA...
+
FFF!EFEFFFFFFFFFFFFGEF*DEF...
This is working quite well, since it capture all sequences from "seq_id_start" to "seq_id_end", plus an additional 3 lines after seq_id_end (avoid breaking the last seq entry). But given that the input file is large, this cmd is hanging for long before finishing (which is sad, esp. for the dummy exemple targeting the 3 first seqs). I am trying to handle this using an exit statement, but I can't seem to use it well.
Below my attempt :
awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {x=NR+3}(NR<=x) {f=0;print;exit} f' <(zcat large_file.fq.gz);
@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1
This is not yielding the additional 3 lines after "seq_id_end" match anymore.
Any tips / pointers ?
Dear Pierre,
I did not manage to launch your cmd successfully after providing the actual filename, but the trick with paste and tr was good enough for me, thanks a lot
Currently using :
ah yes I see, didn't put the '@' in the
$1>="V300049672L4C001R0010000029/1"
...Oh I should have catched this, sorry