I want to extract the reads that contain a specific sequence from a FASTQ file and discard the reads that do not. However, I am new to programming and know very little about writing code, so I am wondering whether there is a tool or script someone has written that can perform this task.
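using OpenGene
# Open the input FASTQ for reading and the output FASTQ for writing
instream = fastq_open("your.fq.gz")
outstream = fastq_open("out.fq.gz", "w")
# Read one record at a time and keep only the reads containing the pattern
while (fq = fastq_read(instream)) != false
    if contains(fq.sequence.seq, "Your pattern")
        fastq_write(outstream, fq)
    end
end
close(outstream)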
If you are not familiar with Julia, just do the following:
1. Run sudo apt-get install julia on Ubuntu.
2. Run julia once the installation has finished.
3. Change the file names and the pattern in the code.
4. Paste the code into the Julia interactive command line and press ENTER.
Actually, awk splits fields on whitespace by default, so if the FASTQ contains spaces (potentially in the sequence ID), it will see more than four columns of data. You should add
awk -F $'\t' 'BEGIN {OFS = FS} ...'
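To make the point about the field separator concrete, here is a minimal sketch of the whole paste/awk filter. The original command is elided above ("..."), so this is not necessarily the exact code from the answer. The function name fastq_grep, the toy reads, and the pattern are all made up for illustration, and it assumes plain four-line FASTQ records with no tab characters in the headers.

```shell
# fastq_grep is a hypothetical helper: it reads FASTQ on stdin and prints
# only the records whose sequence line matches the regular expression in $1.
fastq_grep() {
    # paste joins every four lines into one tab-separated row;
    # awk -F '\t' makes the sequence field 2 even if the header has spaces;
    # tr turns the surviving rows back into four-line records.
    paste - - - - |
        awk -F '\t' -v pat="$1" '$2 ~ pat' |
        tr '\t' '\n'
}

# Toy input (made-up data): note the space inside each header line.
printf '%s\n' '@read1 sample' 'ACGTTGCA' '+' 'IIIIIIII' \
              '@read2 sample' 'GGGGCCCC' '+' 'IIIIIIII' |
    fastq_grep 'GTTG'
```

With the default whitespace splitting, field 2 of the first row would be "sample" instead of the sequence, which is exactly the problem described above. For a compressed file, something like gzip -dc your.fq.gz | fastq_grep 'GTTG' | gzip > out.fq.gz should work the same way.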
Nice paste trick! Is that documented somewhere?
This is also one of my favourite tricks. Is it documented? Yes, all the commands Pierre is using behave as per documentation, including paste (from man pages: If `-' is specified for one or more of the input files, the standard input is used). It's amazing how simple commands can be stitched together to accomplish complex jobs.
I didn't mean to imply that I don't believe the answer. I just wanted to figure out how I missed this option for so long.
Yes, it says "-" means STDIN, but to me that means that there would be 4 columns of STDIN since it's repeated 4 times. The part that's surprising to me is that the next time it's used, it skips to the next line, so it's the same STDIN, not just another iteration of it. It makes sense in retrospect, but it's not necessarily obvious.
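A quick way to see this behaviour for yourself is to run paste on a tiny made-up FASTQ. This is only an illustrative sketch; toy_fastq and its two reads are invented for the example.

```shell
# toy_fastq is a hypothetical helper that prints two made-up FASTQ records.
toy_fastq() {
    printf '%s\n' '@read1' 'ACGTTGCA' '+' 'IIIIIIII' \
                  '@read2' 'GGGGCCCC' '+' 'IIIIIIII'
}

# All four "-" arguments refer to the same stdin stream, so paste takes
# the next unread line for each "-" in turn. Eight input lines therefore
# come out as two rows of four tab-separated columns.
toy_fastq | paste - - - -

# For comparison, a single "-" leaves the stream as it was: one column.
toy_fastq | paste -
```

The first command prints one row per read (header, sequence, plus line, qualities), which is what makes the per-read filtering with awk possible.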
Hi igor-
It wasn't obvious to me at all either! I saw that trick used somewhere else, but just from reading the docs I would never have guessed it.
I learned it on biostars.org, years ago. https://www.google.com/search?q=paste+fastq+site%3Abiostars.org
Thanks. Very simple and smart solution. Can this be applied to a BAM file?
For SAM, yes, but you must handle the SAM header too.
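As a rough sketch of what handling the header could look like: SAM is already one tab-separated line per read, so no paste step is needed. Header lines start with "@" and the read sequence is column 10. The names sam_grep and sam_toy, the toy records, and the pattern below are made up for illustration.

```shell
# sam_grep is a hypothetical helper: it reads SAM on stdin, passes header
# lines (starting with "@") through untouched, and keeps alignment lines
# whose SEQ field (column 10) matches the regular expression in $1.
sam_grep() {
    awk -F '\t' -v pat="$1" '/^@/ || $10 ~ pat'
}

# sam_toy prints a made-up header line plus two unmapped toy reads.
sam_toy() {
    printf '@HD\tVN:1.6\tSO:unsorted\n'
    printf 'r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGTTGCA\tIIIIIIII\n'
    printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII\n'
}

sam_toy | sam_grep 'GTTG'
```

For a BAM file, assuming samtools is installed, the same function could sit in the middle of a pipe such as samtools view -h in.bam | sam_grep 'GTTG' | samtools view -b -o out.bam - (file names are placeholders). Keep in mind that reads aligned to the reverse strand are stored reverse-complemented in SAM/BAM, so you may need to search for the reverse complement of your pattern as well.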
Dear Pierre, does this code need awk -F $'\t' 'BEGIN {OFS = FS} ...' added, as suggested by igor, or does it work without it?
Thank you, Bishnu