Subtracting the reads of one FASTAq file from another FASTAq file
6.9 years ago
aftabahmad • 0

Hi all, I have two fastaq files and I want to subtract the reads of one file from the other. What command line or software can I use to do that?

alignment sequence next-gen • 4.2k views

You probably mean a fastq file; as far as I know, a 'fastaq' format does not exist.

6.9 years ago
    gunzip -c f1.fq.gz f2.fq.gz | paste - - - - | sort | uniq | tr "\t" "\n" > f3.fq

I think this will produce the unique reads across the two files. But say I have two fastq files and I want to remove only the reads that are also present in the other file, rather than build a combined file of unique reads. What should I do then?


Use comm.


@Pierre Lindenbaum So should the command look like this? gunzip -c f1.fq.gz f2.fq.gz | paste - - - - | sort | comm | tr "\t" "\n" > f3.fq
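The pipeline above would not run as intended, because comm compares two separate, sorted inputs and cannot work on a single merged stream. A minimal sketch of a comm-based subtraction follows. It assumes both inputs are gzipped FASTQ files with exactly four lines per record, and that two reads count as the same only when header, sequence, and quality line all match. The function name subtract_fastq and the temporary file names are made up for illustration.

```shell
# Hypothetical helper: print the records of gzipped FASTQ $1 that do not
# occur as identical 4-line records in gzipped FASTQ $2.
subtract_fastq() {
    tmpdir=$(mktemp -d) || return 1
    # Flatten each 4-line record into one tab-separated line, then sort.
    # comm needs sorted input; sort -u also collapses exact duplicate records.
    gunzip -c "$1" | paste - - - - | LC_ALL=C sort -u > "$tmpdir/a.tsv"
    gunzip -c "$2" | paste - - - - | LC_ALL=C sort -u > "$tmpdir/b.tsv"
    # comm -23 keeps only the lines that are unique to the first input.
    LC_ALL=C comm -23 "$tmpdir/a.tsv" "$tmpdir/b.tsv" | tr '\t' '\n'
    rm -rf "$tmpdir"
}
```

It could be called as `subtract_fastq f1.fq.gz f2.fq.gz > f3.fq`. Note that the output comes out in sorted order rather than in the original read order, and that a read whose sequence matches but whose header or quality differs is not removed.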


But say I have two fastq files and I want to remove only the reads that are also present in the other file, rather than build a combined file of unique reads. What should I do then?

If you are referring to actual sequence identity (and not the full fastq record being duplicated), then one way to do that is with clumpify.sh from the BBMap suite. See this thread: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files


I think it will remove duplicate reads, so it will not work for the question above. Let's say I have

one fastq file like:

    @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7
    @HISEQ:230:C6G45ANXX:3:1101:1498:2162 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BBB<B<F<FFFFFFFBFFFFFFBFFFFBFF/F<FFFFBBFFFFFFFFFFBFB/BFFFFFFFFFFFBFFB/<<<FFFFFFFFFFFFFFBFFFF

    @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7


##################################
another fastq file like:

    @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGGCTGCAGACTTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFFFFFFFFFFFBFFFBFB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7

    @HISEQ:230:C6G45ANXX:3:1101:1395:2141 1:N:0:ACAGTGGTTGAACCTT
    TGACGGCACTTTCTCTTCCCAACCACGTGTCTTGCTCTCAAGTTGTCCTGACATGCTCTGAGAGCACACA
    +
    BB//<<BFBFFF<FFFFBBB<<<F/FBBB<FF/B<FFFFBFFFBB/FBFFB//F//B<FFF</</BF<BBBFFFFF//B<FBFF/7

So now, say I want to compare these two files such that any read from the 2nd fastq file that is also found in the 1st fastq file is removed from the 1st file, and everything else in the 1st file is kept as it is.

Do you think clumpify.sh will do that?
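If "the same read" means the same sequence regardless of header or quality string, the comparison can be made on the sequence line alone. Here is a minimal sketch with awk. It assumes uncompressed FASTQ with exactly four lines per record and no blank lines between records, and a removal file small enough for its sequences to fit in memory. The name filter_by_sequence is a hypothetical helper, not part of any tool mentioned in this thread.

```shell
# Hypothetical helper: print the records of FASTQ $1 whose sequence line
# does not occur as a sequence line anywhere in FASTQ $2.
filter_by_sequence() {
    awk -v remove="$2" '
        BEGIN {
            # Load the sequence line (2nd of every 4) of each read to remove.
            while ((getline line < remove) > 0) {
                n++
                if (n % 4 == 2) seen[line] = 1
            }
            close(remove)
        }
        # Buffer the current 4-line record of the file being cleaned.
        { rec[NR % 4] = $0 }
        # At the end of each record, print it only if its sequence was not seen.
        NR % 4 == 0 && !(rec[2] in seen) {
            print rec[1]; print rec[2]; print rec[3]; print rec[0]
        }
    ' "$1"
}
```

A call such as `filter_by_sequence first.fq second.fq > cleaned.fq` keeps the original read order. With the two example files above (blank lines removed), both reads in the first file that carry the sequence shared with the second file would be dropped, and only the third read would remain. Only exact sequence matches are removed; reads that differ by a single base are kept.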


I think someone has already asked about this topic, slightly modified here: "I have a fastq file that seems to be contaminated by some sequences contaminating my reagents during library preparation. If I know the reads that came from reagents and I have them in a fastaq format, do you think I can eliminate those reads from my fastq file? I want to remove any reads contaminating my fastq file. How can I work this out?"


If you need to remove reads that are contaminants (do you have a reference for both, or at least for the species of interest?), then you can use bbsplit.sh from the BBMap suite like this: C: BBSplit syntax for generating builds for the reference genome and how to call di


For this particular application, if you know for sure that reads from file 1 are NOT present in file 2, then merge the two files and use clumpify.sh to remove ALL duplicates. That should get rid of the reads that are duplicated.


Sorry, it's not very clear to me yet. I have merged the forward and reverse files, which were generated from the positive sample. I also have another fastq file, which is a negative control for reagent contamination. Now I would like to remove from the positive sample those reads that are present in the negative control, so that I finally get a clean fastq file. Can I use clumpify.sh for such a job?


I have merged both forward and reverse file

  1. Have you merged R1/R2 reads to get a longer single read in place of two reads? OR
  2. Just copied the two files together for both positive/negative samples?

I think the best solution is to align file 1 against file 2 and then keep only those reads that do not map.

If you don't know how to do this then let me know.
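Aligning one read file against another usually means turning the second file into a reference first, and many aligners expect that reference in FASTA format. Below is a minimal sketch of that preparatory step only, assuming an uncompressed FASTQ file with exactly four lines per record. The name fastq_to_fasta is a hypothetical helper. The alignment itself and the extraction of unmapped reads depend on the aligner chosen and are not shown.

```shell
# Hypothetical helper: convert FASTQ $1 to FASTA on standard output.
fastq_to_fasta() {
    awk '
        # Header line: replace the leading "@" with ">".
        NR % 4 == 1 { print ">" substr($0, 2) }
        # Sequence line: copy unchanged; the "+" and quality lines are dropped.
        NR % 4 == 2 { print }
    ' "$1"
}
```

It could be used as `fastq_to_fasta negative_control.fq > negative_control.fa`, where both file names are placeholders. If the control file contains repeated read names, they may need to be made unique before an aligner will accept the file as a reference.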


I have changed the thread now.
