Tool Recommendation Wanted For Cleaning Fasta/Fastq Files To Remove Unpaired Reads Following Pre-Processing
11.1 years ago
Moss ▴ 20

Hi everyone, I've been digging around the web trying to find a tool that would let me clean up my paired-end Illumina data before mapping. My pipeline thus far has been:

1) FastQC: my R1 file had a bit of adaptor contamination; the R2 file was fine.
2) fastx_collapser: I had a lot of data and am only mapping to see how broadly the reads cover the genome of a closely related species before other analysis begins. I ran it on R1 and R2 separately, which left the files with different numbers of sequences (the difference was <1% of the total number of sequences).
3) fastx_clipper: run only on the file with the adaptor contamination, to remove sequences containing the adaptor.
4) Fix the pairing: which tool?

I saw a tool referred to as rePair, but I have not been able to track it down. I thought for sure that fastx or picard would have something to filter out unpaired reads, but I'm just not seeing it. I'm hoping there is an easy answer here. I am planning to use bowtie2 for the alignment. Thanks in advance!

paired-end • 12k views

Thanks dpryan79. In the end I decided that I could concatenate the collapsed files and map them as though they were single-end reads. This will work for just looking at coverage of a closely related genome, but wouldn't work for any solid, in-depth analysis. Since I am only double-checking that the sequencing protocol gives sufficient coverage (breadth, not depth) of the genome, this should work fine. If anyone else was considering using the pipeline I described above, don't do it. The problem is that you lose the headers by collapsing the reads using the fastx tools, so the mates can no longer be matched up. Better to do as dpryan79 suggests: map all the reads, then collapse/remove redundant reads after the fact. I believe samtools and picard both have tools for reducing redundancy in sam/bam files.

10.3 years ago

This script outputs pairs and solo reads separately.

So, either use Trimmomatic, which keeps reads paired, or use your favorite software, which will leave you with unequal numbers of sequences, and then fix the pairing with this script (written by Eric Normandeau).
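
For anyone who wants to see the idea behind fixing the pairing, here is a minimal Python sketch. It is not Eric Normandeau's script, just an illustration of the same kind of pairs/solo split. It assumes plain 4-line FASTQ records, mate names that match once a trailing /1 or /2 and any header comment are stripped, unique read names within each file, and enough memory to hold the R2 records. The function names (`read_fastq`, `read_id`, `split_pairs`) are made up for this example.

```python
def read_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from 4-line FASTQ records.

    `lines` can be an open file handle or any iterable of lines.
    An incomplete trailing record (e.g. a stray blank line) is dropped.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself groups consecutive lines in fours.
    yield from zip(stripped, stripped, stripped, stripped)


def read_id(header):
    """Return the read name without '@', header comment, or /1 and /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def split_pairs(r1_records, r2_records):
    """Return (paired_r1, paired_r2, solo_r1, solo_r2), keeping R1 order.

    Assumption: read names are unique within each file. R2 is indexed in
    memory, so this sketch is only suitable for files that fit in RAM.
    """
    r2_by_id = {read_id(rec[0]): rec for rec in r2_records}
    paired_r1, paired_r2, solo_r1 = [], [], []
    for rec in r1_records:
        mate = r2_by_id.pop(read_id(rec[0]), None)
        if mate is None:
            solo_r1.append(rec)
        else:
            paired_r1.append(rec)
            paired_r2.append(mate)
    # Whatever is left in the index never found an R1 mate.
    solo_r2 = list(r2_by_id.values())
    return paired_r1, paired_r2, solo_r1, solo_r2
```

To use it on real files, open both FASTQ files, pass the handles through `read_fastq`, and write each returned list back out with `"\n".join(record)`. Note that this only helps if the headers survived pre-processing; it cannot re-pair reads whose names were replaced.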


The script is still available and multiple people are reporting using it with success.


Dear Eric, it works perfectly as described, I confirm. Thanks!

11.1 years ago

Have a look here (How to sort two mate pair (fastq) files so that the order of the identifiers is the same?) or here (Combining the paired reads from Illumina run) for solutions to resyncing fastq files. In general, it's probably faster to simply map those reads rather than collapsing them and then needing to resync your files.
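
Before resyncing, it can help to check whether the two files are actually out of sync and where. The following is a rough Python sketch of such a check, not taken from either linked thread. It assumes 4-line FASTQ records and that mates share a name once a trailing /1 or /2 and any header comment are removed; `mate_name` and `first_mismatch` are hypothetical names chosen for this example.

```python
from itertools import zip_longest


def mate_name(header):
    """Return the read name without '@', header comment, or /1 and /2 suffix."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def headers(lines):
    """Yield the header line of each 4-line FASTQ record, skipping blank ones."""
    for index, line in enumerate(lines):
        if index % 4 == 0 and line.strip():
            yield line


def first_mismatch(r1_lines, r2_lines):
    """Return the 0-based record index of the first out-of-sync pair.

    Returns None when both inputs have the same read names in the same
    order. A file that ends early counts as a mismatch at that record.
    """
    pairs = zip_longest(headers(r1_lines), headers(r2_lines))
    for index, (h1, h2) in enumerate(pairs):
        if h1 is None or h2 is None or mate_name(h1) != mate_name(h2):
            return index
    return None
```

Both arguments can be open file handles, so the check streams through the files without loading them into memory.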

11.1 years ago
Ian 6.1k

I would recommend Trimmomatic as it performs read filtering/trimming, etc, and maintains paired filtered reads whilst removing singletons.
