Hi Everyone, I've been digging around the web trying to find a tool that would allow me to clean up my paired-end Illumina data before mapping. My pipeline thus far has been:
1) FASTQC - my R1 file had a bit of adaptor contamination; the R2 file was fine.
2) fastx_collapser - I had a lot of data and am just mapping to a closely related species' genome to see how broad our coverage is before other analysis begins. I ran this on R1 and R2 separately, which left the two files with different numbers of sequences (the difference was <1% of the total number of sequences).
3) fastx_clipper - run only on the file with the adaptor contamination; it removed sequences containing the adaptor.
4) fix pairing data - ? tool
I saw there was some tool referred to as rePair, but I have not been able to track it down. I thought for sure that fastx or picard would have something to filter out unpaired reads, but I'm just not seeing it. I'm hoping there is an easy answer here. I am planning to use bowtie2 for the alignment. Thanks in advance!
Thanks dpryan79. In the end I decided that I could concatenate the collapsed files and map them as though they were single-end reads. This will work for just looking at coverage of a closely related genome, but wouldn't work for any solid, in-depth analysis. Since I am only double-checking that the sequencing protocol gives sufficient coverage of the genome (breadth, not depth), this should work fine. If anyone else was considering using the pipeline I described above, don't do it. The problem is that collapsing the reads with the fastx tools discards the original read headers, so the mates can no longer be matched up afterwards. Better to do as dpryan79 suggests: map all the reads first, then collapse/remove redundant reads after the fact. I believe samtools and picard both have tools for reducing redundancy in sam/bam files (e.g. samtools rmdup and Picard MarkDuplicates).
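For anyone who lands here with files whose headers are still intact (for example, after running only fastx_clipper and not fastx_collapser), re-pairing by read ID can be sketched in a few lines of Python. This is only a minimal sketch, not the rePair tool mentioned above: the function names read_fastq, read_id and repair are made up for illustration, it assumes plain 4-line FASTQ records whose mates share the same ID (optionally ending in /1 and /2), and it holds all of R2 in memory, so it may not suit very large files.

```python
def read_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from 4-line FASTQ records.

    `lines` can be an open file handle or any iterable of lines.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between or after records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        yield header, seq, plus, qual


def read_id(header):
    """Reduce a FASTQ header to the ID shared by both mates (assumed layout)."""
    name = header[1:].split()[0]  # drop '@' and any comment after whitespace
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]  # drop old-style mate suffix
    return name


def repair(r1_records, r2_records):
    """Split records into properly paired R1/R2 lists plus orphans.

    Paired reads come out in R1 order, so the two lists stay in sync.
    """
    r2_by_id = {read_id(rec[0]): rec for rec in r2_records}
    paired_r1, paired_r2, orphans = [], [], []
    for rec in r1_records:
        mate = r2_by_id.pop(read_id(rec[0]), None)
        if mate is None:
            orphans.append(rec)  # R1 read whose mate was removed
        else:
            paired_r1.append(rec)
            paired_r2.append(mate)
    orphans.extend(r2_by_id.values())  # R2 reads whose mate was removed
    return paired_r1, paired_r2, orphans
```

Each record can be written back out as its four fields joined by newlines. The two paired files could then go to bowtie2 via -1 and -2, and the orphans via -U if you want to keep them as unpaired reads.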