Question

Max length reads from fastq

0

Entering edit mode

8 months ago

marco.barr ▴ 150

Hello everyone,

I want to extract reads with maximum length from a fastq file, I've opted for these commands and I'm not sure if my syntax is correct.

zcat file.fastq.gz | awk 'BEGIN {max_len=0} {if(NR%4==2) {len=length($0); if(len>max_len) {max_len=len; max_seq=$0; header=prev_header}}; prev_header=$0} END {print header; print max_seq; getline; getline; print; getline; print}' | gzip > reads_max_length.fastq.gz

Also, when I align with minimap2, I don't get any results. Maybe I did something wrong? Thanks

fastq • 1.0k views

ADD COMMENT • link updated 8 months ago by Michael 55k • written 8 months ago by marco.barr ▴ 150

score 1 · Accepted Answer · 2024-03-05

1

Entering edit mode

8 months ago

Michael 55k

This code gets you the first maximum length entry, but only half of it. You should test your code

$ cat test.fastq
@Areadlength 4
ACGT
+
!!!!
@Areadlength 3
ACG
+
!!!
@Areadlength 5
ACGTt
+
!!!!!
@Anotherreadlength 5
NNNNN
+
!!!!!



$ cat test.fastq | awk 'BEGIN {max_len=0} {if(NR%4==2) {len=length($0); if(len>max_len) {max_len=len; max_seq=$0; header=prev_header}}; prev_header=$0} END {print header; print max_seq; getline; getline; print; getline; print}'

@Areadlength 5
ACGTt

END {print header; print max_seq; getline; getline; print; getline; print}: At the END is at the end of the file, so getline is empty.

If you want the complete record, you need to save the + and quality line as well. If you want all records of maximum length you likely have to do two passes through your file.

ADD COMMENT • link 8 months ago by Michael 55k

0

Entering edit mode

You're right, so you think a modification to my code could be this:

awk 'BEGIN {max_len=0; delete max_indices} {if(NR%4==2) {len=length($0); if(len>max_len) {max_len=len; delete max_indices; max_indices[NR]}}}; {if(length($0)==max_len && (NR-2) in max_indices) {max_indices[NR-2]}}; END {for(i in max_indices) {start=i*4-3; end=i*4; for(j=start; j<=end; j++) {print $(j)}}}' test.fastq > output.fastq

In the first step, find the maximum length and store the indices of the rows corresponding to this length. In the second step, use these indices to print all the records that have the maximum length. Do you think it's correct? Thank you very much

ADD REPLY • link 8 months ago by marco.barr ▴ 150

1

Entering edit mode

The output from my test case is this, so no.

37: 40:

Honestly, I am not a big fan of awk. I am assuming it is possible but awk tends to become unreadable with increasing complexity.

ADD REPLY • link 8 months ago by Michael 55k

0

Entering edit mode

Me too.. What can I do instead of awk?

ADD REPLY • link 8 months ago by marco.barr ▴ 150

0

Entering edit mode

I think seqkit should be safe. In a first round determine max size, in a second call extract all sequences of that length. If you MUST automate it:

M=$(seqkit stats -T test.fastq | tail -n1 | cut -f 8); seqkit seq -m $M test.fastq # edit: sorry, missing semicolon

@Areadlength 5
ACGTt
+
!!!!!
@Anotherreadlength 5
NNNNN
+
!!!!!

Unlike an awk hack, this is more fool-proof (not necessarily faster).

ADD REPLY • link 8 months ago by Michael 55k

0

Entering edit mode

Thanks! runnig this commands give me an error: Error: invalid argument "/home/PI4KA_ATG_50R_RP.fastq" for "-m, --min-len" flag: strconv.ParseInt: parsing "/home/PI4KA_ATG_50R_RP.fastq": invalid syntaxe

Maybe Seqkit seq interprets the values for the minimum length limit as integers, but instead, it's getting the FASTQ file path. To solve this issue, do I need to provide an integer for the minimum length limit?"

ADD REPLY • link 8 months ago by marco.barr ▴ 150

2

Entering edit mode

M=$(seqkit stats -T test.fastq | tail -n1 | cut -f 8); seqkit seq -m $M test.fastq -o result.fastq.gz

or

seqkit seq -m $(seqkit stats -T test.fastq | tail -n1 | cut -f 8) test.fastq -o result.fastq.gz