Hi, guys.
I want to keep the reads in a FASTQ file that contain a specific sequence and discard the reads that do not. However, I am new to programming and know very little about writing code, so I am wondering whether there is an existing tool or script that can perform this task.
Thanks
Yang
Actually, awk splits fields on whitespace (spaces and tabs) by default, so if the FASTQ contains spaces (potentially in the sequence ID), it will see more than four columns of data. To split on tabs only, you should add
awk -F $'\t' 'BEGIN {OFS = FS} ...'
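To make the suggestion concrete, here is a minimal sketch. It assumes the answer being discussed joins each four-line record into one tab-separated row with paste - - - - and filters on the sequence column. The function name filter_fastq, the motif TTAGG, and the two sample reads are made up for illustration. Plain '\t' is used instead of $'\t' so that it also runs in a POSIX shell; awk interprets the escape itself.

```shell
# Join every four lines into one tab-separated row, keep the rows whose
# second column (the sequence) contains the motif, then restore the
# original four-line layout.
filter_fastq() {
  paste - - - - \
    | awk -F '\t' -v motif="$1" 'index($2, motif) > 0' \
    | tr '\t' '\n'
}

# Hypothetical input: the first read ID contains spaces on purpose.
kept=$(printf '%s\n' \
  '@read1 extra info' 'ACGTTAGGCA' '+' 'IIIIIIIIII' \
  '@read2'            'CCCCGGGGAA' '+' 'IIIIIIIIII' \
  | filter_fastq TTAGG)

printf '%s\n' "$kept"
```

In this sketch the row is printed unmodified, so OFS does not come into play; it only matters when a field is reassigned and awk rebuilds the line.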
Nice paste trick! Is that documented somewhere?
This is also one of my favourite tricks. Is it documented? Yes, all the commands Pierre is using behave as per documentation, including paste (from man pages: If `-' is specified for one or more of the input files, the standard input is used). It's amazing how simple commands can be stitched together to accomplish complex jobs.
I didn't mean to imply that I don't believe the answer. I just wanted to figure out how I missed this option for so long.
Yes, it says "-" means STDIN, but to me that means that there would be 4 columns of STDIN since it's repeated 4 times. The part that's surprising to me is that the next time it's used, it skips to the next line, so it's the same STDIN, not just another iteration of it. It makes sense in retrospect, but it's not necessarily obvious.
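A quick way to see this behaviour is to feed paste some numbered lines. This is only an illustrative sketch; the variable name rows is arbitrary.

```shell
# Each "-" reads the next line from the same standard input stream,
# so four dashes turn every four input lines into one tab-separated row.
rows=$(printf '%s\n' 1 2 3 4 5 6 7 8 | paste - - - -)

printf '%s\n' "$rows"
# Row 1 holds lines 1-4 and row 2 holds lines 5-8, separated by tabs.
```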
Hi igor-
It wasn't obvious to me at all either! I had seen that trick used somewhere else, but I would never have guessed it just from reading the docs.
I learned it on biostars.org, years ago. https://www.google.com/search?q=paste+fastq+site%3Abiostars.org
Thanks, a very simple and smart solution. Can this be applied to a BAM file?
In SAM, yes, but you must take the SAM header into account too.
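As a rough sketch of what handling the header could look like: SAM is already one tab-separated record per line, so no paste step is needed; header lines start with @ and the read sequence is column 10. The names make_sam and filter_sam, the motif, and the sample records are hypothetical. In practice the SAM text would come from something like samtools view -h, which is assumed here and not run.

```shell
# Hypothetical SAM text standing in for the output of `samtools view -h`.
make_sam() {
  printf '@HD\tVN:1.6\tSO:unsorted\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTTAGGCA\tIIIIIIIIII\n'
  printf 'read2\t0\tchr1\t200\t60\t10M\t*\t0\t0\tCCCCGGGGAA\tIIIIIIIIII\n'
}

# Pass header lines (starting with "@") through untouched and keep only
# the alignment records whose SEQ field (column 10) contains the motif.
filter_sam() {
  awk -F '\t' -v motif="$1" '/^@/ || index($10, motif) > 0'
}

filtered=$(make_sam | filter_sam TTAGG)
printf '%s\n' "$filtered"
```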
Dear Pierre, does this code need awk -F $'\t' 'BEGIN {OFS = FS} ...' added, as suggested by igor, or does it work without it?
Thank you Bishnu