Question

Sort specific reads from fastq

1

Entering edit mode

8.6 years ago

xuyang_dami ▴ 10

Hi, guys.

I wanna sort out reads containing specific sequences and discard reads without from fastq. However, I am new to computer programming with little knowledge about writing any codes.So I am wondering if there is any tool or codes that someone wrote which can perform this task.

Thanks

Yang

next-gen RNA-Seq sequencing • 2.6k views

ADD COMMENT • link updated 8.6 years ago by chen ★ 2.5k • written 8.6 years ago by xuyang_dami ▴ 10

score 5 · Answer 1 · 2016-03-26

5

Entering edit mode

8.6 years ago

Pierre Lindenbaum 164k

Using a simple shell script. If you're running windows, install cygwin.

gunzip -c fastq.gz |\
paste - - - - |\
awk '(!($2 ~ "ATGC"))' |\
tr "\t" "\n" |\
gzip > out.fastq.gz

gunzip uncompress the source if needed.
paste: convert 4 consecutive lines into one line with four columns *awk filter out column n°2 containing ATGC
tr: convert tabulation to 'return' = convert back to fastq
compress the output if needed and redirect to the output file "out.fastq.gz"

ADD COMMENT • link 8.6 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Actually, awk is space separated, so if the FASTQ contains spaces (in the sequence ID potentially), it will see more than four columns of data. You should add awk -F $'\t' 'BEGIN {OFS = FS} ...'

ADD REPLY • link 8.6 years ago by igor 13k

0

Entering edit mode

Nice paste trick! Is that documented somewhere?

ADD REPLY • link 8.6 years ago by igor 13k

1

Entering edit mode

This is also one of my favourite tricks. Is it documented? Yes, all the commands Pierre is using behave as per documentation, including paste (from man pages: If `-' is specified for one or more of the input files, the standard input is used). It's amazing how simple commands can be stitched together to accomplish complex jobs.

ADD REPLY • link 8.6 years ago by dariober 15k

1

Entering edit mode

I didn't mean to imply that I don't believe the answer. I just wanted to figure out how I missed this option for so long.

Yes, it says "-" means STDIN, but to me that means that there would be 4 columns of STDIN since it's repeated 4 times. The part that's surprising to me is that the next time it's used, it skips to the next line, so it's the same STDIN, not just another iteration of it. It makes sense in retrospect, but it's not necessarily obvious.

ADD REPLY • link 8.6 years ago by igor 13k

1

Entering edit mode

Hi igor-

It makes sense in retrospect, but it's not necessarily obvious

It wasn't obvious to me at all as well! I saw that trick used somewhere else but just by reading the docs I would have never guessed it.

ADD REPLY • link 8.6 years ago by dariober 15k

0

Entering edit mode

I learned it on biostars.org, years ago. https://www.google.com/search?q=paste+fastq+site%3Abiostars.org

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks. very simple and smart solution. Can this be applied to bam file?

ADD REPLY • link 8.6 years ago by xuyang_dami ▴ 10

0

Entering edit mode

In SAM, yes but you must consider the SAM header too.

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Dear Pierre Does this code needs to add awk -F $'\t' 'BEGIN {OFS = FS} ...' as suggested by igor or works without adding.

Thank you Bishnu

ADD REPLY • link 7.3 years ago by vetbishnu_tfeed • 0

score 2 · Answer 2 · 2016-03-26

2

Entering edit mode

8.6 years ago

WouterDeCoster 47k

Python, using Biopython:

import sys
from Bio import SeqIO
sequences_of_interest = ["ATG", "GGCACAC", "GTGACCC"] #To be adapted
handle = open(sys.argv[1], "rU")
filtered = []
for record in SeqIO.parse(handle, "fastq")
    for seq in sequences_of_interest:        
        if record.seq.count(seq) > 0:
             filtered.append(record)
             break
output_handle = open("seqlected.fasta", "w")
SeqIO.write(filtered, output_handle, "fastq")
output_handle.close()
handle.close()

Save script (e.g. selectSEQ.py) and execute as python selectSEQ.py yourfile.fastq

Let me know if you need additional help.

ADD COMMENT • link 8.6 years ago by WouterDeCoster 47k

0

Entering edit mode

are you putting a whole fastq in memory ?

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hmm, you are right. Depending on the size of the fastq and the memory available, putting the file entirely in memory might not be the best idea.

ADD REPLY • link 8.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks. One question. whether tolerance to mismatch and deletion is counted?

ADD REPLY • link 8.6 years ago by xuyang_dami ▴ 10

score 1 · Answer 3 · 2016-03-26

1

Entering edit mode

8.6 years ago

igor 13k

There is FASTQ-GREP for filtering for sequences matching a pattern: http://homes.cs.washington.edu/~dcjones/fastq-tools/fastq-grep.html

ADD COMMENT • link 8.6 years ago by igor 13k

score 1 · Answer 4 · 2016-03-27

If you are able to use Julia, you can use OpenGene(https://github.com/OpenGene/OpenGene.jl) to do this very easily

using OpenGene

instream = fastq_open("your.fq.gz")
outstream = fastq_open("out.fq.gz", "w")

while (fq = fastq_read(instream)) != false
    if contains(fq.sequence.seq, "Your pattern")
        fastq_write(outstream, fq)
    end
end

close(outstream)

If you have no idea about Julia, just do
1, run sudo apt-get install julia in ubuntu
2, run julia after the installation is finished
3, change the filename and your pattern in the code
4, paste your code into the Julia interactive command line and press ENTER