Hi Biostars,
I need to remove a subset of reads from a large fastq file. The subset of reads are listed in a fasta file. I would appreciate it if you could help me with this.
What I have is:
A large 'reads.fastq.gz' file containing millions of reads in this format:
@HISEQ105:173:H3MNNCCX2:3:2224:23906:72789 1:N:0:TGGTGTTT
GTCTTGCATCTATAGTATTTTTACCCTGGTGGATATCTCTATCATTTCAAAAAAGTCTGGAATCTTGGGTTACTGATTCGTGGAATATTAAACAATCTGCAACTTTTTTGAATGATATTCTAGAAACTAGTCTTCTAGAAAAATTCAACGA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#J#JJJJJ#JJJJJJJJJJJFJJJJJ#JJJJJJJJ##JJJJJJJJJJJJJJJJJJJJJJ#JJJAJJJJJJJJF
@HISEQ105:173:H3MNNCCX2:3:2224:24008:72789 1:N:0:TGGTGTTT
AGGTCCAACCAAGCCAACCATAAGTTTTCTAATTCATAATATTAATCTGAATCTCCCTAAATCCAAAAAATGGATCTAAATGCATTTCACGCTCCAAACTTTTGATGATTCAATGAATCTTTCTTCGGCGAAAGGGAGGATATGTCAATCC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJ7FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#J#JJJJJ#JJJJJJJJJJJJJJJJJ#JJJJJJFJ##FJJJJJJJJJJJJJJJJJJJJJ#JJJJJJJJJJJJJ
And a 'subset.fasta' file containing reads in this format (all these reads are present in the 'reads.fastq.gz' file'):
>419759601/1
GTCTTGCATCTATAGTATTTTTACCCTGGTGGATATCTCTATCATTTCAAAAAAGTCTGGAATCTTGGGTTACTGATTCGTGGAATATTAAACAATCTGCAACTTTTTTGAATGATATTCTAGAAACTAGTCTTCTAGAAAAATTCAACGA
>347851844/1
AGGTCCAACCAAGCCAACCATAAGTTTTCTAATTCATAATATTAATCTGAATCTCCCTAAATCCAAAAAATGGATCTAAATGCATTTCACGCTCCAAACTTTTGATGATTCAATGAATCTTTCTTCGGCGAAAGGGAGGATATGTCAATCC
What I want is: Create a fastq file with reads unique to 'reads.fastq.gz' (without the subset reads), including header, + and quality lines. I tried this using 'dedupe.sh' from BBMap, without success. Instead 'dedupe.sh' interleaved reads from both files; maybe due to the difference in the formatting. It would be greatly appreciate it to have your help.
You can try this with seqkit:
seqkit supports both bgzip and gzip for searching.
Your syntax works great! I added the -v flag to extract the non-matching reads only, which is what I need:
seqkit grep -v -sf <(seqkit seq -sw 0 test.fa) test.fq
Thanks!
Unfortunately the seqkit command above is too slow when working with huge fastq files (human genome size).
Why can't you align, and convert the mapped reads in the bam back to fastq?
I guess I could. Although I find that 'seqkit grep' suggested by cpad0112 above would be more efficient. Thanks!