I am a beginner at RNA-seq and am analysing paired-end data. To filter the raw data I use Trimmomatic, which splits the output into two groups: paired reads and unpaired reads.
I don't understand what "unpaired" means. Every read should have a partner, and I always get the same number of reads for Read 1 and Read 2. So why do unpaired reads exist?
When a trimming program trims paired-end data, one read of a pair may become too short and fail a criterion you have set (e.g. minimum length 25 bp). At that point the read is removed by the trimming program. Let us say it came from the R2 file. When the R2 read is dropped, a trimming program should also remove the corresponding read from the R1 file (even though that read may have passed). If it does not, you are left with an unpaired read in the R1 file.
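To make the mechanism concrete, here is a minimal Python sketch of a minimum-length filter applied to each file independently, which is what a trimmer that is not paired-end aware effectively does. It is only an illustration, not how any particular trimmer is implemented; the read names, sequences, and the `filter_by_length` helper are made up.

```python
# Toy reads as (name, sequence) tuples; names and sequences are invented.
r1 = [("read1", "A" * 40), ("read2", "C" * 30), ("read3", "G" * 50)]
r2 = [("read1", "T" * 40), ("read2", "G" * 10), ("read3", "C" * 50)]

MIN_LEN = 25  # the example threshold mentioned above


def filter_by_length(reads, min_len):
    """Keep reads at least min_len long, ignoring the other file entirely."""
    return [(name, seq) for name, seq in reads if len(seq) >= min_len]


r1_kept = filter_by_length(r1, MIN_LEN)
r2_kept = filter_by_length(r2, MIN_LEN)

# read2 survives in R1 but its mate was dropped from R2, so it is now unpaired.
r2_names = {name for name, _ in r2_kept}
orphans_in_r1 = [name for name, _ in r1_kept if name not in r2_names]

print(len(r1_kept), len(r2_kept))  # 3 2
print(orphans_in_r1)               # ['read2']
```

The two files now hold different numbers of reads, and every read after the dropped one sits at a different position in R1 than its mate does in R2.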
Aligners expect reads to be in the same order in the R1/R2 files. If they are not, you can get strange results (e.g. discordant mapping). The presence of unpaired reads in the main sequence files (as @mastal points out, Trimmomatic should collect them in separate files) generally signifies improper use of a trimming program, or use of a trimmer that is not paired-end aware. If that happens, the repair.sh tool from the BBMap suite can be used to remove those unpaired reads and bring the R1/R2 files back in sync.
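Conceptually, bringing the files back in sync means keeping only reads whose name appears in both files and writing the mates out in the same order. The Python sketch below illustrates that idea on toy data. It is not how repair.sh is implemented; the `resync` helper and the read names are hypothetical, and it assumes mate names are identical once any /1 or /2 suffix has been stripped.

```python
def resync(r1, r2):
    """Split two lists of (name, seq) reads into synced pairs and singletons.

    Assumes read names are unique within each file and identical between mates.
    """
    r2_by_name = dict(r2)
    r1_names = {name for name, _ in r1}

    paired_r1, paired_r2, singles = [], [], []
    for name, seq in r1:
        if name in r2_by_name:
            # Emit both mates at the same position so the outputs stay in order.
            paired_r1.append((name, seq))
            paired_r2.append((name, r2_by_name[name]))
        else:
            singles.append((name, seq))
    # R2 reads whose mate is missing from R1.
    singles.extend((name, seq) for name, seq in r2 if name not in r1_names)
    return paired_r1, paired_r2, singles


# Out-of-sync toy input: readB has no mate in R2, readD has no mate in R1.
r1 = [("readA", "ACGT"), ("readB", "GGCC"), ("readC", "TTAA")]
r2 = [("readC", "AATT"), ("readA", "TGCA"), ("readD", "CCGG")]

p1, p2, singles = resync(r1, r2)
print([n for n, _ in p1])       # ['readA', 'readC']
print([n for n, _ in p2])       # ['readA', 'readC']
print([n for n, _ in singles])  # ['readB', 'readD']
```

For real FASTQ files you would use an existing tool such as repair.sh rather than hold everything in memory like this.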
Hi Genomax, I saw your reply to this question, but I still do not understand what unpaired reads, or the unpaired.fastq file, are. Based on your answer, my understanding is that for paired-end sequencing the set of sequences in the R1 file generally matches the set in the R2 file (ignoring how many copies of each sequence there are). I see two possible interpretations. First: if a sequence fails QC (as set in Trimmomatic) in the R1 file but passes in the R2 file, the sequence is moved to the unpaired reads/.fastq file from both R1 and R2, which means all copies in both files are classified as unpaired. Second: a sequence exists in both the R1 and R2 files, but one copy, in either R1 or R2, fails QC, and this copy is classified into the unpaired fastq file. (I think the second view might be right.) If so, some people have also mentioned using the unpaired reads for alignment/mapping with BWA. Should these reads that did not pass QC be run through Trimmomatic again with stricter settings? Are they useful? Finally, what kind of sequences are in the unpaired.fastq file? Please give some examples. My email is wangqk198738@163.com. Thanks a lot!
After trimming with Trimmomatic, there will be some read pairs where Read1 doesn't pass the quality conditions but Read2 does, so Read2 of that pair ends up in the Read2_Unpaired file. Similarly, if Read2 doesn't pass the quality conditions but Read1 does, then Read1 of that pair ends up in Trimmomatic's Read1_Unpaired output file.
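The routing described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Trimmomatic's code; the `split_pairs` helper, the pair names, and the length-36 cutoff standing in for "quality conditions" are all assumptions.

```python
def split_pairs(pairs, passes):
    """Route each (name, r1_seq, r2_seq) pair into four outputs.

    `passes` is any function returning True if a trimmed read is kept.
    """
    out = {"r1_paired": [], "r2_paired": [], "r1_unpaired": [], "r2_unpaired": []}
    for name, seq1, seq2 in pairs:
        ok1, ok2 = passes(seq1), passes(seq2)
        if ok1 and ok2:
            # Both mates survive: they stay together in the paired outputs.
            out["r1_paired"].append(name)
            out["r2_paired"].append(name)
        elif ok1:
            out["r1_unpaired"].append(name)  # only Read1 survived
        elif ok2:
            out["r2_unpaired"].append(name)  # only Read2 survived
        # If neither mate passes, the whole pair is dropped.
    return out


pairs = [
    ("pair1", "A" * 50, "T" * 50),  # both pass
    ("pair2", "A" * 50, "T" * 10),  # only Read1 passes
    ("pair3", "A" * 10, "T" * 50),  # only Read2 passes
    ("pair4", "A" * 10, "T" * 10),  # neither passes
]

result = split_pairs(pairs, lambda seq: len(seq) >= 36)
for key, names in result.items():
    print(key, names)
```

The two paired outputs always contain the same read names in the same order, which is why the paired Read 1 and Read 2 counts match. The unpaired outputs hold only the surviving mate, never the read that failed.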
Thanks for your answer, but when I run the data through TopHat without trimming with Trimmomatic, unpaired reads are still produced. As I understand it, discordant reads are the ones that did not map, but I don't understand the unpaired reads.
OK, I don't really understand why you get unpaired reads either, if you started with only paired reads.
However, both Bowtie and TopHat seem to filter out some reads before alignment. I'm not sure why; perhaps they are removing reads with N bases, which could be why you end up with some unpaired reads.
Nice! Very detailed, thanks a lot.