Extract all seqs bewteen two seqs ID in Fastq
1
0
Entering edit mode
4.4 years ago
erwan.scaon ▴ 950

Dear all,

I have a large fq.gz file.
I know this file is the result of the concatenation of mutiple fq.gz files (and has not been shuffled / modified since the concatenation).
Thing is, I don't have those previous "multiple fq.gz" files anymore. All I have is the name of the first and last seq IDs in those deleted fq.gz files (and the number of total lines in files).
I am trying to recreate those files (weird edge case, plz don't ask me why :-p).

This is where I am at, using the seqs IDs :

awk '/seq_id_start/{f=1} /seq_id_end/ {x=NR+3}(NR<=x) {f=0;print} f' <(zcat large_file.fq.gz);
# With actual seqs IDs (dummy exemple for the first 3 seqs in file (all between seq_1_id & seq_3_id))
awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {x=NR+3}(NR<=x) {f=0;print} f' <(zcat large_file.fq.gz);

@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1
ATGNAAGTCAAAAAAGTCATCGA...
+
FFF!EFEFFFFFFFFFFFFGEF*DEF...

This is working quite well, since it capture all sequences from "seq_id_start" to "seq_id_end", plus an additional 3 lines after seq_id_end (avoid breaking the last seq entry). But given that the input file is large, this cmd is hanging for long before finishing (which is sad, esp. for the dummy exemple targeting the 3 first seqs). I am trying to handle this using an exit statement, but I can't seem to use it well.

Below my attempt :

awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {x=NR+3}(NR<=x) {f=0;print;exit} f' <(zcat large_file.fq.gz);

@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1

This is not yielding the additional 3 lines after "seq_id_end" match anymore.

Any tips / pointers ?

fastq awk sed • 1.0k views
ADD COMMENT
0
Entering edit mode
4.4 years ago

Hello Erwan

How about this:

gunzip -c input.fq.gz | paste - - - - |\
awk -F '\t' '($1>="V300049672L4C001R0010000029/1" && $1<="@V300049672L4C001R0010000097/1") {print; if($1=="@V300049672L4C001R0010000097/1") exit;} ' |\
tr "\t" " \n"
ADD COMMENT
0
Entering edit mode

Dear Pierre,

I did not manage to launch your cmd successfully after providing the actual filename, but the trick with paste and tr was good enough for me, thanks a lot

Currently using :

time unpigz -c large_file.fq.gz |\
paste -d "\t" - - - - |\
awk '/@V300049672L4C001R0010000029\/1/{f=1} /@V300049672L4C001R0010000097\/1/ {f=0;print;exit} f' |\
tr "\t" "\n";


@V300049672L4C001R0010000029/1
GACNCCCAGGCAGGACAAAACCA...
+
FFF!@GECEGGEEFGGEGGEEFBF...
@V300049672L4C001R0010000070/1
TCCNTTGCGGTCATTTCTATAGCC...
+
FDF!6FFECCFFEFFFFFBEFFBFF4...
@V300049672L4C001R0010000097/1
ATGNAAGTCAAAAAAGTCATCGA...
+
FFF!EFEFFFFFFFFFFFFGEF*DEF...


real 0m0.008s
user 0m0.008s
sys 0m0.014s

ADD REPLY
0
Entering edit mode

ah yes I see, didn't put the '@' in the $1>="V300049672L4C001R0010000029/1"...

ADD REPLY
0
Entering edit mode

Oh I should have catched this, sorry

ADD REPLY

Login before adding your answer.

Traffic: 2647 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6