Sort specific reads from fastq
4
1
Entering edit mode
8.6 years ago
xuyang_dami ▴ 10

Hi, guys.

I wanna sort out reads containing specific sequences and discard reads without from fastq. However, I am new to computer programming with little knowledge about writing any codes.So I am wondering if there is any tool or codes that someone wrote which can perform this task.

Thanks

Yang

next-gen RNA-Seq sequencing • 2.6k views
ADD COMMENT
5
Entering edit mode
8.6 years ago

Using a simple shell script. If you're running windows, install cygwin.

gunzip -c fastq.gz |\
paste - - - - |\
awk '(!($2 ~ "ATGC"))' |\
tr "\t" "\n" |\
gzip > out.fastq.gz
  • gunzip uncompress the source if needed.
  • paste: convert 4 consecutive lines into one line with four columns *awk filter out column n°2 containing ATGC
  • tr: convert tabulation to 'return' = convert back to fastq
  • compress the output if needed and redirect to the output file "out.fastq.gz"
ADD COMMENT
1
Entering edit mode

Actually, awk is space separated, so if the FASTQ contains spaces (in the sequence ID potentially), it will see more than four columns of data. You should add awk -F $'\t' 'BEGIN {OFS = FS} ...'

ADD REPLY
0
Entering edit mode

Nice paste trick! Is that documented somewhere?

ADD REPLY
1
Entering edit mode

This is also one of my favourite tricks. Is it documented? Yes, all the commands Pierre is using behave as per documentation, including paste (from man pages: If `-' is specified for one or more of the input files, the standard input is used). It's amazing how simple commands can be stitched together to accomplish complex jobs.

ADD REPLY
1
Entering edit mode

I didn't mean to imply that I don't believe the answer. I just wanted to figure out how I missed this option for so long.

Yes, it says "-" means STDIN, but to me that means that there would be 4 columns of STDIN since it's repeated 4 times. The part that's surprising to me is that the next time it's used, it skips to the next line, so it's the same STDIN, not just another iteration of it. It makes sense in retrospect, but it's not necessarily obvious.

ADD REPLY
1
Entering edit mode

Hi igor-

It makes sense in retrospect, but it's not necessarily obvious

It wasn't obvious to me at all as well! I saw that trick used somewhere else but just by reading the docs I would have never guessed it.

ADD REPLY
0
Entering edit mode

I learned it on biostars.org, years ago. https://www.google.com/search?q=paste+fastq+site%3Abiostars.org

ADD REPLY
0
Entering edit mode

Thanks. very simple and smart solution. Can this be applied to bam file?

ADD REPLY
0
Entering edit mode

In SAM, yes but you must consider the SAM header too.

ADD REPLY
0
Entering edit mode

Dear Pierre Does this code needs to add awk -F $'\t' 'BEGIN {OFS = FS} ...' as suggested by igor or works without adding.

Thank you Bishnu

ADD REPLY
2
Entering edit mode
8.6 years ago

Python, using Biopython:

import sys
from Bio import SeqIO
sequences_of_interest = ["ATG", "GGCACAC", "GTGACCC"] #To be adapted
handle = open(sys.argv[1], "rU")
filtered = []
for record in SeqIO.parse(handle, "fastq")
    for seq in sequences_of_interest:        
        if record.seq.count(seq) > 0:
             filtered.append(record)
             break
output_handle = open("seqlected.fasta", "w")
SeqIO.write(filtered, output_handle, "fastq")
output_handle.close()
handle.close()

Save script (e.g. selectSEQ.py) and execute as python selectSEQ.py yourfile.fastq

Let me know if you need additional help.

ADD COMMENT
0
Entering edit mode

are you putting a whole fastq in memory ?

ADD REPLY
0
Entering edit mode

Hmm, you are right. Depending on the size of the fastq and the memory available, putting the file entirely in memory might not be the best idea.

ADD REPLY
0
Entering edit mode

Thanks. One question. whether tolerance to mismatch and deletion is counted?

ADD REPLY
1
Entering edit mode
8.6 years ago
igor 13k

There is FASTQ-GREP for filtering for sequences matching a pattern: http://homes.cs.washington.edu/~dcjones/fastq-tools/fastq-grep.html

ADD COMMENT
1
Entering edit mode
8.6 years ago
chen ★ 2.5k

If you are able to use Julia, you can use OpenGene(https://github.com/OpenGene/OpenGene.jl) to do this very easily

using OpenGene

instream = fastq_open("your.fq.gz")
outstream = fastq_open("out.fq.gz", "w")

while (fq = fastq_read(instream)) != false
    if contains(fq.sequence.seq, "Your pattern")
        fastq_write(outstream, fq)
    end
end

close(outstream)

If you have no idea about Julia, just do
1, run sudo apt-get install julia in ubuntu
2, run julia after the installation is finished
3, change the filename and your pattern in the code
4, paste your code into the Julia interactive command line and press ENTER

ADD COMMENT
0
Entering edit mode

Nice! thank you so much

ADD REPLY

Login before adding your answer.

Traffic: 932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6