Hi everyone,
First post here - this forum has been an immense help in my bioinformatics journey so far.
Brief explanation: I have metagenomes sequenced as Illumina MiSeq 2x300 bp reads. I used BBduk + Trimmomatic to remove adapters and quality-trim the sequences. I have four output files: forward paired, forward unpaired, reverse paired and reverse unpaired. I ran FastQC on all of them, and the quality of the unpaired output is slightly worse than that of the paired output.
Input Read Pairs: 3163058 Both Surviving: 2631476 (83.19%) Forward Only Surviving: 363260 (11.48%) Reverse Only Surviving: 48940 (1.55%) Dropped: 119382 (3.77%)
Forward only surviving: % of forward reads that were high quality but couldn't be kept because the paired reverse read was low quality.
Reverse only surviving: % of reverse reads that were high quality but couldn't be kept because the paired forward read was low quality.
Dropped: sequences which have been dropped because BOTH forward and reverse were bad quality.
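As a quick sanity check on the summary line above, the four categories should add up to the number of input read pairs. Here is a minimal Python sketch that parses the line and recomputes the percentages. The function name `parse_trimmomatic_summary` and the regular expression are my own illustration, not part of Trimmomatic, and the parser assumes the summary is formatted exactly as pasted here.

```python
import re

# Summary line copied from the Trimmomatic output above.
summary = (
    "Input Read Pairs: 3163058 Both Surviving: 2631476 (83.19%) "
    "Forward Only Surviving: 363260 (11.48%) "
    "Reverse Only Surviving: 48940 (1.55%) Dropped: 119382 (3.77%)"
)


def parse_trimmomatic_summary(line):
    """Return a dict mapping each category label to its read-pair count."""
    # Each field looks like "<label>: <count>"; the percentages are skipped.
    fields = re.findall(r"([A-Za-z ]+?):\s*(\d+)", line)
    return {label.strip(): int(count) for label, count in fields}


counts = parse_trimmomatic_summary(summary)
total = counts["Input Read Pairs"]
categories = {k: v for k, v in counts.items() if k != "Input Read Pairs"}

# Every input pair must land in exactly one of the four categories.
assert sum(categories.values()) == total

for label, n in categories.items():
    print(f"{label}: {n} ({100 * n / total:.2f}%)")
```

If the assertion passes, every input pair is accounted for by one of the four categories.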
From what I understood, Trimmomatic drops both forward and reverse reads when one or both of the reads do not pass the quality threshold. The whole pair will be dropped and will end up in the unpaired output.
First question: why is that? Even though the quality is not high enough, the sequences will BOTH be dropped and end up in the UNPAIRED output, but they will still be paired, right? So why is it called unpaired?
Second question: should I include my unpaired output in the assembly process? In my opinion, it would add a lot of information. I am going to use Megahit, and I would run it in a way that includes both paired and unpaired sequences. Like this:
megahit -1 1_R1_paired, 1_R1_unpaired -2 1_R2_paired, 1_R2_unpaired -o output/
Is Megahit going to recognize the R1_unpaired and the R2_unpaired as still paired?
Sorry for any confusion I might have created; if necessary, I will try to explain it better. Thanks in advance for your help.
Stefano
Thanks,
I got it now! I thought both reads were included in the unpaired output if one or both were not of high enough quality.
In reality, when one read of a pair is bad quality, it is dropped; the other one is kept, but in the unpaired file. At this point, I am wondering: can I treat the unpaired output as single-read output, given that those reads have lost their mates?
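To make the routing concrete, here is a toy Python model of where the reads of one pair end up, depending on which of them pass trimming. This is only an illustration of the behaviour described in this thread, not Trimmomatic's actual code; the function name `route_pair` and the output labels are made up for the example.

```python
def route_pair(forward_ok, reverse_ok):
    """Return the output file(s) that receive the surviving reads of one pair."""
    if forward_ok and reverse_ok:
        # Both mates survive: they stay together in the paired outputs.
        return ["forward_paired", "reverse_paired"]
    if forward_ok:
        # Only the forward read survives: it has lost its mate.
        return ["forward_unpaired"]
    if reverse_ok:
        # Only the reverse read survives: it has lost its mate.
        return ["reverse_unpaired"]
    # Neither read survives: the pair counts as "Dropped".
    return []


for fwd, rev in [(True, True), (True, False), (False, True), (False, False)]:
    print(fwd, rev, route_pair(fwd, rev))
```

So a read in forward_unpaired never has its mate in reverse_unpaired, which is why the two unpaired files cannot be read as pairs.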
If this is the case, Megahit has an option (-r) to list single-end files and use them as input. As for whether I should use them or not, I think I will run the assembler twice, with and without the unpaired reads, and compare the output.
Thanks again, you have been extremely clear and helpful.
You're welcome. And yes, you can treat the unpaired reads as single-end.
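For anyone finding this thread later, here is a minimal Python sketch of how the two Megahit runs could be put together, with the unpaired reads passed through -r instead of -1/-2. The helper `build_megahit_command` is hypothetical, and the `.fastq` file names are assumptions based on the names in the question, so adjust them to your real files. Note that Megahit expects comma-separated lists with no spaces after the commas.

```python
import shlex


def build_megahit_command(r1_paired, r2_paired, unpaired, out_dir):
    """Build a megahit argument list; unpaired reads go to -r as single-end."""
    cmd = ["megahit", "-1", r1_paired, "-2", r2_paired]
    if unpaired:
        # -r takes a comma-separated list of single-end files (no spaces).
        cmd += ["-r", ",".join(unpaired)]
    cmd += ["-o", out_dir]
    return cmd


# Run 1: paired reads plus the unpaired survivors as single-end input.
# File names are assumed for illustration.
with_unpaired = build_megahit_command(
    "1_R1_paired.fastq",
    "1_R2_paired.fastq",
    ["1_R1_unpaired.fastq", "1_R2_unpaired.fastq"],
    "output_with_unpaired",
)

# Run 2: paired reads only, for comparison.
paired_only = build_megahit_command(
    "1_R1_paired.fastq", "1_R2_paired.fastq", [], "output_paired_only"
)

print(shlex.join(with_unpaired))
print(shlex.join(paired_only))
```

The printed lines can be pasted into a shell, or the lists can be passed to `subprocess.run` directly. Comparing the two assemblies is then a matter of looking at the usual assembly statistics for each output directory.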