I have a FASTQ file (a negative control generated by shotgun metagenome sequencing) that seems to be contaminated by sequences from my reagents, introduced during library preparation. A few of those reads are also present in the positive samples. I would like to subtract only those contaminating reads from the final analysis. How can I remove them?
Could you please share the "reformat.sh" code? Also, my files are laid out like this: the negative control is in two FASTQ files (fq1, fq2) and the true positive sample is in another two FASTQ files (fqq1, fqq2). I would like to subtract the negative control reads from the positive reads. Do you think it will work?
Let us stick to this thread (don't respond to your posts in other threads) so we are not creating unnecessary cross links.
reformat.sh is part of the BBMap suite, which you can download using the link included. I think the best solution is to align file 1 against file 2 and then keep only those reads from file 1 that do not map. I hope your contaminant is very different from the species of interest, otherwise none of this is going to work.
If you don't know how to do this then let me know.
Yes, I have merged the positive control reads using: $ cat R1.fastq R2.fastq > merge_R1_R2.fastq and I have merged all the negative control reads in the same way.
Could you please let me know what my next step should be? How can I map one against the other?
From the species point of view I think it will not change. I have shotgun metagenome data and I am removing only reads that come from lab reagent contamination, so do you think it could be a problem? Moreover, I have rarely seen people care about those reads, but from my point of view one should consider them carefully.
If the contamination is not from a single species and if the contaminants are similar to your own data then you will lose a lot of reads. If the contamination could have been avoided in the first place then you should really be repeating this experiment. The following is an experiment and may not work at all. Use at your own risk.
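A minimal sketch of the mapping approach, assuming the BBMap suite is installed and on the PATH. The input names (fq1, fq2, fqq1, fqq2) are taken from the post above; neg_control.fa, clean_R1.fq and clean_R2.fq are placeholder names. The idea is to turn the negative control reads into a FASTA reference with reformat.sh and then let bbmap.sh write out only the sample reads that do not map to it. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch only: subtract negative-control reads from a sample by mapping.
# Assumes reformat.sh and bbmap.sh (BBMap suite) are on the PATH when DRY_RUN=0.

NEG_R1=fq1               # negative control, read 1 (name from the thread)
NEG_R2=fq2               # negative control, read 2
POS_R1=fqq1              # positive sample, read 1
POS_R2=fqq2              # positive sample, read 2
NEG_REF=neg_control.fa   # placeholder name for the FASTA reference

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: convert the negative-control reads to FASTA to use as a reference.
run reformat.sh in1="$NEG_R1" in2="$NEG_R2" out="$NEG_REF"

# Step 2: map the sample against that reference and keep the unmapped pairs.
# nodisk builds the index in memory instead of writing it to disk.
run bbmap.sh ref="$NEG_REF" nodisk in1="$POS_R1" in2="$POS_R2" \
    outu1=clean_R1.fq outu2=clean_R2.fq
```

Keeping the two read files separate, rather than concatenating R1 and R2, preserves the pairing. Mapping parameters are left at their defaults here and would likely need tuning.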
Thanks for your reply. Repeating the experiment is not possible, as patient samples are not easy to obtain. I agree with you that if the contaminant reads and the true signal reads are exactly the same, there is a high chance of losing reads. But let's say the contaminant and the true signal come from the same species; there must still be a difference in the number of reads after sequencing. And when identifying the species of those reads with a tool such as Kraken or Kaiju, you can still obtain some true signal that comes from the positive samples. On the other hand, if I keep the reads that come from contaminants and are also present in the positive samples, a particular species will be over-represented, which is not a true picture.
With patient samples you have to do the best you can. Did you try the method I posted above?
That does not sound promising. Even if there is a difference in number, how would you incorporate that information in your analysis?
No, I have not tried it yet. I will give it a try. My guess was that the reads coming from the negative control (mainly from the kit and reagents) must be present in the positive control as well. So if I remove those particular reads, what remains will be from the positive control. But I am not sure whether I am on the right track.
@genomax I tried your suggestion but could not get any improvement in the reads for the negative control. Do you have any other suggestions or ideas?
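The idea in this post can be illustrated with a crude exact-match filter: drop every sample read whose sequence is identical to a read in the negative control. This is only a sketch, not an alignment. It assumes plain 4-line FASTQ records and a non-empty negative control file, it ignores mismatches and reverse complements, and it treats a single file at a time, so pairing is not preserved. The function name subtract_exact is made up for the illustration.

```shell
#!/usr/bin/env bash
# Sketch: drop sample reads whose sequence exactly matches a negative-control read.
# Usage: subtract_exact NEG.fastq SAMPLE.fastq > SAMPLE.clean.fastq
# Assumes 4-line FASTQ records and a non-empty NEG.fastq.
subtract_exact() {
    awk '
        # First file (negative control): remember every sequence line.
        NR == FNR { if (FNR % 4 == 2) seen[$0] = 1; next }
        # Second file (sample): buffer the current 4-line record.
        { rec[(FNR - 1) % 4] = $0 }
        # At the end of a record, print it unless its sequence was seen.
        FNR % 4 == 0 && !(rec[1] in seen) {
            print rec[0]; print rec[1]; print rec[2]; print rec[3]
        }
    ' "$1" "$2"
}
```

Comparing the read count before and after the filter gives a rough idea of how much of the sample overlaps with the negative control.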