Question

Filtering Overlapping Paired-end Reads

0

7.9 years ago

ScubaChris ▴ 10

Greetings,

I have a metagenomics dataset of Illumina overlapping paired-end reads in which the quality deteriorates rapidly. The sequences are already joined. When I filter by sliding window I end up losing 85% of the dataset.

http://imgur.com/a/8GO1T

Would separating the sequences, doing trailing-end filtering and then using them as separate files be a viable approach? However, I'm currently using Diamond, which only seems to accept one input file.

Any ideas welcome, and thank you for your time.

fastqc ngs illumina • 3.2k views

ADD COMMENT • link 7.8 years ago by ScubaChris ▴ 10

0

If you map to a reference, do you see an overlap of the paired ends? What is your insert size distribution (after mapping)?

ADD REPLY • link 7.9 years ago by Gabriel R. ★ 2.9k

1

Hi, I'm not using a reference genome, just blasting against a list of specific protein sequences.

ADD REPLY • link 7.9 years ago by ScubaChris ▴ 10

0

DNA to Protein blast?

ADD REPLY • link 7.9 years ago by Gabriel R. ★ 2.9k

0

Yes, like I said, I'm using Diamond.

ADD REPLY • link 7.9 years ago by ScubaChris ▴ 10

0

7.9 years ago

sridhar56 ▴ 110

For your dataset, assess the two reads separately in FastQC; that should show you where the quality deteriorates, and you can then trim them with PRINSEQ. Merge them after quality trimming and map them to your reference sequence.

ADD COMMENT • link 7.9 years ago by sridhar56 ▴ 110

0

Hi, thank you for your reply. If I split them in order to trim, should I merge them by adding "N"s to the gaps? Is this normally a good practice? Thanks.

ADD REPLY • link 7.9 years ago by ScubaChris ▴ 10

score 1 · Accepted Answer · 2017-02-10

1

7.9 years ago

Brian Bushnell 20k

If the sequences are already successfully merged using a pair-merging tool, the low-quality overlapping ends should have already been error-corrected into consensus sequence and they won't need trimming. Why are you trying to trim or filter them? And anyway, you can't separate them once they are merged since it's a lossy process. What exactly do you mean by "joined", anyway - how was the procedure performed? And what command line are you trying to use for filtering?

ADD COMMENT • link 7.9 years ago by Brian Bushnell 20k

0

Thank you for your reply.

The sequences are part of a public dataset I got from iMicrobe. The study paper I got the link from said, and I quote, "sequences are QC’d fasta files of joined paired-end reads, also with internal standards and rRNA sequences (metatranscriptomes only) removed." However, the files were actually FASTQ, and this is what the FastQC output looks like. I can easily trim the trailing end, but I can't clean up the middle of the joined sequences (I am using a sliding window) without losing 85% of my dataset. I am also in the process of contacting the original researcher directly. My main goal is to blastx them (Diamond) against some specific protein databases.
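To illustrate why a sliding window is so harsh on joined reads with a quality dip in the middle, here is a minimal Python sketch. It is not the trimmer actually used in this thread: the function name, the window size of 4, the mean-quality cutoff of 20 and the Phred+33 offset are all assumptions chosen for illustration.

```python
def first_failing_window(quality_string, window=4, min_mean_q=20, offset=33):
    """Return the start index of the first window whose mean Phred score
    falls below min_mean_q, or None if every window passes.

    Assumes Phred+33 encoding; window and min_mean_q are illustrative defaults.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    for start in range(len(scores) - window + 1):
        window_mean = sum(scores[start:start + window]) / window
        if window_mean < min_mean_q:
            return start
    return None


# Hypothetical joined read: good ends (Q40, 'I') with a poor middle (Q2, '#').
example_quality = "IIII" + "####" + "IIII"
cut_at = first_failing_window(example_quality)
```

A trimmer that cuts at the first failing window would discard everything from `cut_at` onward, including the good bases after the dip, so reads with a low-quality middle either shrink drastically or get dropped by a minimum-length filter.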

ADD REPLY • link 7.9 years ago by ScubaChris ▴ 10

0

Right - you have to do quality trimming on the paired reads, if you want to do it. You can't effectively quality-trim merged (joined) reads, and you can't separate them once they have been merged. So, just use them as-is. You can filter out the very low quality ones if you want, though.
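As a rough sketch of what filtering out only the very low quality reads could look like, the Python below keeps or drops whole reads based on their mean per-base error probability, without trimming anything. The function names, the 0.05 cutoff and the Phred+33 offset are assumptions for illustration, not settings recommended anywhere in this thread.

```python
def mean_error_probability(quality_string, offset=33):
    """Average per-base error probability of a Phred-encoded quality string.

    Assumes Phred+33 encoding (offset is an assumption, adjust if needed).
    """
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return sum(probs) / len(probs)


def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records.

    Assumes one record per four lines; a trailing incomplete record is ignored.
    """
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def filter_fastq_records(records, max_mean_error=0.05):
    """Yield only the records whose mean error probability is at or below
    max_mean_error (an illustrative cutoff). Reads are kept whole, not trimmed."""
    for header, seq, qual in records:
        if qual and mean_error_probability(qual) <= max_mean_error:
            yield header, seq, qual
```

Usage would be along the lines of `filter_fastq_records(read_fastq(open("reads.fastq")))`, where `reads.fastq` is a placeholder file name. Averaging error probabilities rather than raw Phred scores is a design choice here: a few very bad bases weigh more heavily than they would in a plain mean of quality scores.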

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

0

Thank you for your answer, I will use them as-is then.

ADD REPLY • link 7.9 years ago by ScubaChris ▴ 10

score 1 · Accepted Answer · 2017-02-24

1

7.8 years ago

ScubaChris ▴ 10

To close this post myself, it turned out that the sequences I wanted to use had been incorrectly deposited, and should not have been like that in the first place. Thanks for all the answers!

ADD COMMENT • link 7.8 years ago by ScubaChris ▴ 10