Hello everyone,
I want to extract reads with maximum length from a fastq file, I've opted for these commands and I'm not sure if my syntax is correct.
zcat file.fastq.gz | awk 'BEGIN {max_len=0} {if(NR%4==2) {len=length($0); if(len>max_len) {max_len=len; max_seq=$0; header=prev_header}}; prev_header=$0} END {print header; print max_seq; getline; getline; print; getline; print}' | gzip > reads_max_length.fastq.gz
Also, when I align with minimap2, I don't get any results. Maybe I did something wrong? Thanks
You're right, so you think a modification to my code could be this:
In the first step, find the maximum length and store the indices of the rows corresponding to this length. In the second step, use these indices to print all the records that have the maximum length. Do you think it's correct? Thank you very much
The output from my test case is this, so no.
Honestly, I am not a big fan of awk. I am assuming it is possible but awk tends to become unreadable with increasing complexity.
Me too.. What can I do instead of awk?
I think seqkit should be safe. In a first round determine max size, in a second call extract all sequences of that length. If you MUST automate it:
Unlike an awk hack, this is more fool-proof (not necessarily faster).
Thanks! runnig this commands give me an error:
Error: invalid argument "/home/PI4KA_ATG_50R_RP.fastq" for "-m, --min-len" flag: strconv.ParseInt: parsing "/home/PI4KA_ATG_50R_RP.fastq": invalid syntaxe
Maybe Seqkit seq interprets the values for the minimum length limit as integers, but instead, it's getting the FASTQ file path. To solve this issue, do I need to provide an integer for the minimum length limit?"
or
Thanks! It worked