Question

Extract all seqs bewteen two seqs ID in Fastq

0

Entering edit mode

4.4 years ago

erwan.scaon ▴ 950

Dear all,

I have a large fq.gz file.
I know this file is the result of the concatenation of mutiple fq.gz files (and has not been shuffled / modified since the concatenation).
Thing is, I don't have those previous "multiple fq.gz" files anymore. All I have is the name of the first and last seq IDs in those deleted fq.gz files (and the number of total lines in files).
I am trying to recreate those files (weird edge case, plz don't ask me why :-p).

This is where I am at, using the seqs IDs :

awk '/seq_id_start/{f=1} /seq_id_end/ {x=NR+3}(NR<=x) {f=0;print} f' <(zcat large_file.fq.gz);
# With actual seqs IDs (dummy exemple for the first 3 seqs in file (all between seq_1_id & seq_3_id))
awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {x=NR+3}(NR<=x) {f=0;print} f' <(zcat large_file.fq.gz);

@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1
ATGNAAGTCAAAAAAGTCATCGA...
+
FFF!EFEFFFFFFFFFFFFGEF*DEF...

This is working quite well, since it capture all sequences from "seq_id_start" to "seq_id_end", plus an additional 3 lines after seq_id_end (avoid breaking the last seq entry). But given that the input file is large, this cmd is hanging for long before finishing (which is sad, esp. for the dummy exemple targeting the 3 first seqs). I am trying to handle this using an exit statement, but I can't seem to use it well.

Below my attempt :

awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {x=NR+3}(NR<=x) {f=0;print;exit} f' <(zcat large_file.fq.gz);

@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1

This is not yielding the additional 3 lines after "seq_id_end" match anymore.

Any tips / pointers ?

fastq awk sed • 1.0k views

ADD COMMENT • link updated 4.4 years ago by Pierre Lindenbaum 164k • written 4.4 years ago by erwan.scaon ▴ 950

score 0 · Answer 1 · 2020-07-17

0

Entering edit mode

4.4 years ago

Pierre Lindenbaum 164k

Hello Erwan

How about this:

gunzip -c input.fq.gz | paste - - - - |\
awk -F '\t' '($1>="V300049672L4C001R0010000029/1" && $1<="@V300049672L4C001R0010000097/1") {print; if($1=="@V300049672L4C001R0010000097/1") exit;} ' |\
tr "\t" " \n"

ADD COMMENT • link 4.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Dear Pierre,

I did not manage to launch your cmd successfully after providing the actual filename, but the trick with paste and tr was good enough for me, thanks a lot

Currently using :

time unpigz -c large_file.fq.gz |\
paste -d "\t" - - - - |\
awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {f=0;print;exit} f' |\
tr "\t" "\n";

@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1
ATGNAAGTCAAAAAAGTCATCGA...
+
FFF!EFEFFFFFFFFFFFFGEF*DEF...

real 0m0.008s
user 0m0.008s
sys 0m0.014s

ADD REPLY • link 4.4 years ago by erwan.scaon ▴ 950

0

Entering edit mode

ah yes I see, didn't put the '@' in the $1>="V300049672L4C001R0010000029/1"...