Question

How to remove contamination reads from a FASTQ file

0

5.7 years ago

jeccy.J ▴ 60

I have a FASTQ file (a negative control generated by shotgun metagenome sequencing) that appears to contain sequences from reagents that contaminated the library preparation. A few of those reads are also present in the positive samples. I would like to exclude only those contaminant reads from the final analysis. How can I remove them?

sequencing • 4.3k views

ADD COMMENT • link 5.7 years ago by jeccy.J ▴ 60

score 1 · Answer 1 · 2019-03-23

1

5.7 years ago

GenoMax 147k

See a potential answer already provided: C: Subtracting one FASTAq file Reads from other FASTAq reads.
You could also try this: C: How to remove reads from fastq flle that match to a set of reads in my fasta fil
The contaminant file can be converted to FASTA with:

reformat.sh in=file2.fq.gz out=file2.fa

ADD COMMENT • link 5.7 years ago by GenoMax 147k

0

Could you please share the "reformat.sh" code? Also, my files are laid out like this: the negative control is in two FASTQ files (fq1, fq2) and the true positive sample is in another two FASTQ files (fqq1, fqq2). I would like to subtract the negative control reads from the positive reads. Do you think it will work?

ADD REPLY • link 5.7 years ago by jeccy.J ▴ 60

0

Let us stick to this thread (don't respond to your posts in other threads) so we are not creating unnecessary cross links. reformat.sh is part of the BBMap suite, which you can download using the link included.

I think the best solution is to align file 1 against file 2 and then keep only those reads from file 1 that do not map. I hope your contaminant is very different from the species of interest; otherwise none of this is going to work.

If you don't know how to do this then let me know.

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Yes, I have merged the positive control reads using: $ cat R1.fastq R2.fastq > merge_R1_R2.fastq and I have merged all the negative control reads in the same way.
Could you please let me know what my next step should be? How can I map one against the other?

ADD REPLY • link 5.7 years ago by jeccy.J ▴ 60

0

From the species point of view, I think it will not change. I have shotgun metagenome data and I am removing only reads from lab reagent contamination, so do you think it could be a problem? Moreover, I have rarely seen people care about those reads, but from my point of view they should be considered carefully.

ADD REPLY • link 5.7 years ago by jeccy.J ▴ 60

0

If the contamination is not from a single species and the contaminants are similar to your own data, then you will lose a lot of reads. If the contamination could have been avoided in the first place, then you should really be repeating this experiment. The following is an experiment and may not work at all. Use at your own risk.

 1. Convert sample 2 fastq reads to fasta.
    reformat.sh in1=file2.R1.fq.gz in2=file2.R2.fq.gz out1=file_R1.fa out2=file_R2.fa

 2. Merge the two FASTA files to make a reference
    cat file_R1.fa file_R2.fa > file2.fa

 3. Do the alignment
    bbmap.sh -Xmx10g in1=file1_R1.fq.gz in2=file1_R2.fq.gz out=aligned.bam outu1=not_align_R1.fq.gz outu2=not_align_R2.fq.gz ref=file2.fa perfectmode

 4. Unmapped reads should be in the two files specified by outu1= and outu2=. If the two outu directives do not work, use a single outu=not_aligned.fq.gz instead. This will create an interleaved file. You can then separate the R1/R2 reads with:
    reformat.sh in=not_aligned.fq.gz out1=not_align_R1.fq.gz out2=not_align_R2.fq.gz
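
If BBMap is not at hand, a much cruder check is to drop read pairs in which either mate has exactly the same sequence as a negative-control read. The sketch below is not what bbmap.sh perfectmode does: there is no alignment, so reads that only partially overlap a contaminant read, or that match its reverse complement, are kept. The function name filter_exact_pairs and its argument order are made up for illustration, and it assumes uncompressed 4-line FASTQ records with R1 and R2 in the same order.

```shell
# Hypothetical helper (not part of BBMap).
# Usage: filter_exact_pairs contaminants.fq sample_R1.fq sample_R2.fq out_R1.fq out_R2.fq
# Assumes uncompressed FASTQ, 4 lines per record, R1/R2 records in the same order.
filter_exact_pairs() {
    contam="$1"; r1="$2"; r2="$3"; out1="$4"; out2="$5"

    # Create empty outputs so they exist even if every pair is removed.
    : > "$out1"
    : > "$out2"

    awk -v r2="$r2" -v out1="$out1" -v out2="$out2" '
        # First file: remember every contaminant sequence (line 2 of each record).
        FILENAME == ARGV[1] { if (FNR % 4 == 2) bad[$0] = 1; next }

        # Second file (R1): the current line is a header; read the remaining
        # three lines of the R1 record and the matching four lines of R2.
        {
            h1 = $0
            if ((getline s1) <= 0 || (getline p1) <= 0 || (getline q1) <= 0) exit 1
            if ((getline h2 < r2) <= 0 || (getline s2 < r2) <= 0 ||
                (getline p2 < r2) <= 0 || (getline q2 < r2) <= 0) exit 1

            # Keep the pair only if neither mate is an exact contaminant match.
            if (!(s1 in bad) && !(s2 in bad)) {
                printf "%s\n%s\n%s\n%s\n", h1, s1, p1, q1 > out1
                printf "%s\n%s\n%s\n%s\n", h2, s2, p2, q2 > out2
            }
        }
    ' "$contam" "$r1"
}
```

An example call, assuming the negative-control reads were decompressed and concatenated into a file named neg_all.fq (a made-up name): filter_exact_pairs neg_all.fq file1_R1.fq file1_R2.fq clean_R1.fq clean_R2.fq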

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Thanks for your reply. Repeating the experiment is not possible because the patient samples are not easy to obtain. I agree that if the contaminant reads and the true signal reads are exactly the same, there is a high chance of losing reads. But let's say the contaminant and the true signal are from the same species; while sequencing, there must still be a difference in the number of reads. And when identifying the species of those reads with Kraken or Kaiju, you can obtain some true signals that come from the positive samples. On the other hand, if I keep the reads that come from contaminants and are also present in the positive samples, a particular species will be over-represented, which is not the true picture.

ADD REPLY • link 5.7 years ago by jeccy.J ▴ 60

0

With patient samples you have to do the best you can. Did you try the method I posted above?

But let's say the contaminant and the true signal are from the same species; while sequencing, there must still be a difference in the number of reads.

That does not sound promising. Even if there is a difference in number, how would you incorporate that information in your analysis?

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

No, I have not tried it yet. I will give it a try. My guess was that the reads coming from the negative control (mainly from the kit and reagents) must be present in the positive control as well. So if I remove those particular reads, the remaining ones will be from the positive control. But I am not sure whether I am on the right track.

ADD REPLY • link 5.7 years ago by jeccy.J ▴ 60

0

@genomax I tried your suggestion but could not see any improvement in the reads for the negative control. Do you have any other suggestions or ideas?
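
To put a number on whether the filtering changed anything, the read counts before and after can be compared. A minimal sketch, assuming 4-line FASTQ records; count_reads is a hypothetical helper name, not a BBMap tool:

```shell
# Hypothetical helper: count records in 4-line FASTQ input.
# Reads the files given as arguments, or standard input if none are given.
count_reads() {
    awk 'END { print int(NR / 4) }' "$@"
}
```

For gzipped files the data can be piped in, for example: gzip -dc file1_R1.fq.gz | count_reads, and likewise for not_align_R1.fq.gz. If the two numbers are the same, no reads were removed by the filtering step.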

ADD REPLY • link 5.7 years ago by jeccy.J ▴ 60