Hi all, I have two fastaq files and I want to subtract the reads of one fastaq file from the other. What command line or software can I use to do that?
gunzip -c f1.fq.gz f2.fq.gz | paste - - - - | sort |uniq |tr "\t" "\n" > f3.fq
But let's say I have two fastq files and I only want to remove from one file the reads that are also present in the other, rather than make a combined file of unique reads. What should I do then?
If you are referring to actual sequence identity (and not the full fastq record being duplicated), then the only way to do that is by using clumpify.sh from the BBMap suite. See this thread: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
I think that will only remove duplicate reads. It will not do what the question above asks. Let's say I have
one fastq file like:
@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7
@HISEQ:230:C6G45ANXX:3:1101:1498:2162 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF
@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7
##################################
another fastq file like:
@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7
@HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
TGACGGCACTTTCTCTTCCCAACCACGTGTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
+
BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFBFFFBB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7
So now let's say I want to compare these two files so that, if a read from the 2nd fastq file is found in the 1st fastq file, that read is removed from the 1st file; otherwise the 1st file is kept as it is.
Do you think clumpify.sh will do that?
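For the exact-match case described above, here is a minimal sketch in plain awk. It is not taken from any tool mentioned in this thread. The function name subtract_fastq is made up for illustration, and it assumes strict 4-line fastq records and that "same read" means an identical sequence line (no mismatches, no reverse complement).

```shell
#!/usr/bin/env bash
# subtract_fastq FILE1 FILE2 (hypothetical helper name):
# print the records of FILE1 whose sequence line does not occur in FILE2.
# Assumes 4-line FASTQ records and exact string matching of sequences.
subtract_fastq() {
  awk '
    # Lines of FILE2 (passed first): remember every sequence line, print nothing.
    FILENAME == ARGV[1] { if (FNR % 4 == 2) seen[$0] = 1; next }
    # Lines of FILE1: buffer the current 4-line record.
    { rec[FNR % 4] = $0 }
    # At the end of each record, print it only if its sequence was not seen.
    FNR % 4 == 0 {
      if (!(rec[2] in seen)) print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
    }
  ' "$2" "$1"
}
```

Usage would look like `subtract_fastq first.fq second.fq > cleaned.fq` (file names are placeholders). If I read the two example files correctly, only the third record of the first file would be kept, because the first two share their sequence with the first record of the second file. All sequences of the second file are held in memory, so this may not scale to a very large second file.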
I think someone has already asked about this, in a slightly modified form: "I have a fastq file that seems to be contaminated by some sequences contaminating my reagents during library preparation. If I know the reads that came from reagents and I have them in a fastaq format, do you think I can eliminate those reads from my fastq file? I want to remove any reads contaminating my fastq file. How can I work this out?"
If you need to remove reads that are contaminants (do you have a reference for both, or at least for the species of interest?), then you can use bbsplit.sh from the BBMap suite like this: C: BBSplit syntax for generating builds for the reference genome and how to call di
Sorry, it's not very clear to me yet. I have merged the forward and reverse files, which were generated from the positive sample. I also have another fastq file, which is a negative control for reagent contamination. Now I would like to remove from the positive sample the reads that are present in the negative control, so that I finally get a clean fastq file. Can I use clumpify.sh for such a job?
I have merged both forward and reverse file
I think the best solution is to align file 1 against file 2 and then keep only those reads that do not map.
If you don't know how to do this then let me know.
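A rough sketch of that alignment route, assuming bbmap.sh from the BBMap suite is on your PATH. The two function names are made up for illustration. The control reads are first converted to fasta so they can serve as the reference, and outu= collects the reads that did not map. I have not tested how well this behaves when the "reference" is a large set of short reads, so treat it as a starting point only.

```shell
#!/usr/bin/env bash
# fastq_to_fasta FILE (hypothetical helper): convert 4-line FASTQ records to FASTA.
fastq_to_fasta() {
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # "@name" becomes ">name"
       NR % 4 == 2 { print }                      # sequence line is copied unchanged
      ' "$1"
}

# remove_mapped SAMPLE CONTROL OUT (hypothetical wrapper, needs bbmap.sh):
# align SAMPLE against the CONTROL reads and write the unmapped reads to OUT.
remove_mapped() {
  local dir
  dir=$(mktemp -d) || return 1
  fastq_to_fasta "$2" > "$dir/control.fa"
  # ref=   reference built from the control reads
  # outu=  reads that did NOT map, i.e. the ones to keep
  # nodisk keeps the index in memory instead of writing it to disk
  bbmap.sh ref="$dir/control.fa" in="$1" outu="$3" nodisk
  rm -rf "$dir"
}
```

It would be called as `remove_mapped sample.fq control.fq clean.fq` (placeholder file names). Unlike an exact string comparison, this also drops reads that match the control with a few mismatches, which is probably what you want for contaminant removal.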
You probably mean a fastq file since, as far as I know, a 'fastaq' format does not exist.