Removing a subset of fasta reads from a huge fastq file
0
0
Entering edit mode
4.5 years ago
resug ▴ 40

Hi Biostars,

I need to remove a subset of reads from a large fastq file. The subset of reads are listed in a fasta file. I would appreciate it if you could help me with this.

What I have is:

A large 'reads.fastq.gz' file containing millions of reads in this format:

@HISEQ105:173:H3MNNCCX2:3:2224:23906:72789 1:N:0:TGGTGTTT
GTCTTGCATCTATAGTATTTTTACCCTGGTGGATATCTCTATCATTTCAAAAAAGTCTGGAATCTTGGGTTACTGATTCGTGGAATATTAAACAATCTGCAACTTTTTTGAATGATATTCTAGAAACTAGTCTTCTAGAAAAATTCAACGA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#J#JJJJJ#JJJJJJJJJJJFJJJJJ#JJJJJJJJ##JJJJJJJJJJJJJJJJJJJJJJ#JJJAJJJJJJJJF
@HISEQ105:173:H3MNNCCX2:3:2224:24008:72789 1:N:0:TGGTGTTT
AGGTCCAACCAAGCCAACCATAAGTTTTCTAATTCATAATATTAATCTGAATCTCCCTAAATCCAAAAAATGGATCTAAATGCATTTCACGCTCCAAACTTTTGATGATTCAATGAATCTTTCTTCGGCGAAAGGGAGGATATGTCAATCC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJ7FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#J#JJJJJ#JJJJJJJJJJJJJJJJJ#JJJJJJFJ##FJJJJJJJJJJJJJJJJJJJJJ#JJJJJJJJJJJJJ

And a 'subset.fasta' file containing reads in this format (all these reads are present in the 'reads.fastq.gz' file'):

>419759601/1
GTCTTGCATCTATAGTATTTTTACCCTGGTGGATATCTCTATCATTTCAAAAAAGTCTGGAATCTTGGGTTACTGATTCGTGGAATATTAAACAATCTGCAACTTTTTTGAATGATATTCTAGAAACTAGTCTTCTAGAAAAATTCAACGA
>347851844/1
AGGTCCAACCAAGCCAACCATAAGTTTTCTAATTCATAATATTAATCTGAATCTCCCTAAATCCAAAAAATGGATCTAAATGCATTTCACGCTCCAAACTTTTGATGATTCAATGAATCTTTCTTCGGCGAAAGGGAGGATATGTCAATCC

What I want is: Create a fastq file with reads unique to 'reads.fastq.gz' (without the subset reads), including header, + and quality lines. I tried this using 'dedupe.sh' from BBMap, without success. Instead 'dedupe.sh' interleaved reads from both files; maybe due to the difference in the formatting. It would be greatly appreciate it to have your help.

fastq fasta remove extract deduplicate • 1.9k views
ADD COMMENT
2
Entering edit mode

You can try this with seqkit:

seqkit grep -sf <(seqkit seq -sw 0 test.fa) test.fq

seqkit supports both bgzip and gzip for searching.

ADD REPLY
0
Entering edit mode

Your syntax works great! I added the -v flag to extract the non-matching reads only, which is what I need:

seqkit grep -v -sf <(seqkit seq -sw 0 test.fa) test.fq

Thanks!

ADD REPLY
0
Entering edit mode

Unfortunately the seqkit command above is too slow when working with huge fastq files (human genome size).

ADD REPLY
1
Entering edit mode

Why can't you align, and convert the mapped reads in the bam back to fastq?

ADD REPLY
0
Entering edit mode

I guess I could. Although I find that 'seqkit grep' suggested by cpad0112 above would be more efficient. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2597 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6