Max length reads from fastq
1
0
Entering edit mode
8 months ago
marco.barr ▴ 150

Hello everyone,

I want to extract reads with maximum length from a fastq file, I've opted for these commands and I'm not sure if my syntax is correct.

zcat file.fastq.gz | awk 'BEGIN {max_len=0} {if(NR%4==2) {len=length($0); if(len>max_len) {max_len=len; max_seq=$0; header=prev_header}}; prev_header=$0} END {print header; print max_seq; getline; getline; print; getline; print}' | gzip > reads_max_length.fastq.gz

Also, when I align with minimap2, I don't get any results. Maybe I did something wrong? Thanks

fastq • 1.0k views
ADD COMMENT
1
Entering edit mode
8 months ago
Michael 55k

This code gets you the first maximum length entry, but only half of it. You should test your code

$ cat test.fastq
@Areadlength 4
ACGT
+
!!!!
@Areadlength 3
ACG
+
!!!
@Areadlength 5
ACGTt
+
!!!!!
@Anotherreadlength 5
NNNNN
+
!!!!!



$ cat test.fastq | awk 'BEGIN {max_len=0} {if(NR%4==2) {len=length($0); if(len>max_len) {max_len=len; max_seq=$0; header=prev_header}}; prev_header=$0} END {print header; print max_seq; getline; getline; print; getline; print}'

@Areadlength 5
ACGTt

END {print header; print max_seq; getline; getline; print; getline; print}: At the END is at the end of the file, so getline is empty.

If you want the complete record, you need to save the + and quality line as well. If you want all records of maximum length you likely have to do two passes through your file.

ADD COMMENT
0
Entering edit mode

You're right, so you think a modification to my code could be this:

awk 'BEGIN {max_len=0; delete max_indices} {if(NR%4==2) {len=length($0); if(len>max_len) {max_len=len; delete max_indices; max_indices[NR]}}}; {if(length($0)==max_len && (NR-2) in max_indices) {max_indices[NR-2]}}; END {for(i in max_indices) {start=i*4-3; end=i*4; for(j=start; j<=end; j++) {print $(j)}}}' test.fastq > output.fastq

In the first step, find the maximum length and store the indices of the rows corresponding to this length. In the second step, use these indices to print all the records that have the maximum length. Do you think it's correct? Thank you very much

ADD REPLY
1
Entering edit mode

The output from my test case is this, so no.

37: 40:

Honestly, I am not a big fan of awk. I am assuming it is possible but awk tends to become unreadable with increasing complexity.

ADD REPLY
0
Entering edit mode

Me too.. What can I do instead of awk?

ADD REPLY
0
Entering edit mode

I think seqkit should be safe. In a first round determine max size, in a second call extract all sequences of that length. If you MUST automate it:

M=$(seqkit stats -T test.fastq | tail -n1 | cut -f 8); seqkit seq -m $M test.fastq # edit: sorry, missing semicolon

@Areadlength 5
ACGTt
+
!!!!!
@Anotherreadlength 5
NNNNN
+
!!!!!

Unlike an awk hack, this is more fool-proof (not necessarily faster).

ADD REPLY
0
Entering edit mode

Thanks! runnig this commands give me an error: Error: invalid argument "/home/PI4KA_ATG_50R_RP.fastq" for "-m, --min-len" flag: strconv.ParseInt: parsing "/home/PI4KA_ATG_50R_RP.fastq": invalid syntaxe

Maybe Seqkit seq interprets the values for the minimum length limit as integers, but instead, it's getting the FASTQ file path. To solve this issue, do I need to provide an integer for the minimum length limit?"

ADD REPLY
2
Entering edit mode
M=$(seqkit stats -T test.fastq | tail -n1 | cut -f 8); seqkit seq -m $M test.fastq -o result.fastq.gz

or

seqkit seq -m $(seqkit stats -T test.fastq | tail -n1 | cut -f 8) test.fastq -o result.fastq.gz
ADD REPLY
0
Entering edit mode

Thanks! It worked

ADD REPLY

Login before adding your answer.

Traffic: 1682 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6