Hi all, this may be a very basic question, but I am confused. I have an R1.fastq file and an R2.fastq file from paired-end RNA-seq. As far as I know, the read order in the R1 and R2 files should be the same; that is, the two reads of a pair should occupy the same position in R1 and R2. However, when I count the initial reads in the R1 and R2 files, the numbers differ: R1 has 1878678 reads, while R2 has 1800352. Does this mean that the extra reads in R1 (1878678 - 1800352 = 78326 reads) are unpaired, while all the other reads in R1 and R2 are paired and in the same order? What confuses me even more is that, after trimming R1 and R2 with Trimmomatic (PE mode), the trimmed PAIRED R1 and R2 files still have different read numbers (R1: 1397878, R2: 1402966). Does this mean that the extra reads in R2 this time (1402966 - 1397878 = 5088 reads) are unpaired and the rest are paired with R1? But Trimmomatic wrote these reads to the PAIRED output files, and the unpaired reads should already have gone to the separate unpaired FASTQ output files. Could anyone explain this? Thank you so much.
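One way to test the "same position in R1 and R2" assumption directly is to walk both files record by record and compare read IDs. The following is only a minimal sketch: the function names `read_ids` and `first_mismatch` are made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequences, no blank lines) with the read ID as the first whitespace-separated token of the header, optionally ending in /1 or /2.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID of every record, assuming strict 4-line FASTQ records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each record
                name = line[1:].split()[0]  # drop '@' and any comment field
                if name.endswith(("/1", "/2")):
                    name = name[:-2]  # drop old-style mate suffix
                yield name


def first_mismatch(r1_path, r2_path):
    """Return (record_number, id_in_r1, id_in_r2) for the first record where
    the files disagree, or None if they are in sync to the end.
    A missing read (one file is shorter) shows up as None for that file."""
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for number, (id1, id2) in enumerate(pairs, start=1):
        if id1 != id2:
            return number, id1, id2
    return None
```

Calling `first_mismatch("R1.fastq", "R2.fastq")` would then report where the two files first go out of step. If that happens early in the file, the extra reads are probably not just unpaired leftovers at the end.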
Be careful when downloading FASTQ files. Prefer fastq-dump or prefetch from the SRA Toolkit; do not fetch R1 and R2 separately as direct downloads. It is also worth contacting the data provider.
Where did you obtain the files? Did you download them from ENA / SRA? Did a sequencing facility sequence your samples? Were you given these files by a collaborator?
Did you run FastQC on them? Seems like they may have been trimmed already. Some quick and dirty sanity checks - what is the output of:
This may sound stupid, but can you tell us how you counted the reads? If you simply used grep with the "@" symbol, it may count quality lines as well as sequence headers, because "@" is also a valid quality character (in Illumina Phred+33 encoding it represents a quality value of 31) and can appear in the fourth line of a record. That would make the PE counts unequal.
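To make the grep pitfall concrete, here is a small sketch on made-up records (the reads and the helper names `count_at_lines` and `count_records` are purely illustrative). It compares counting lines that contain "@" with counting lines and dividing by 4.

```python
def count_at_lines(lines):
    """Mimic `grep -c "@"`: count every line that contains '@'."""
    return sum(1 for line in lines if "@" in line)


def count_records(lines):
    """Count records as (number of lines) / 4."""
    return len(lines) // 4


# Two made-up records; the first quality string happens to start with '@'.
example = [
    "@read1", "ACGT", "+", "@IIH",
    "@read2", "TTGA", "+", "IIII",
]

print(ord("@") - 33)            # 31, the Phred+33 quality that '@' encodes
print(count_at_lines(example))  # 3, one more than the real number of reads
print(count_records(example))   # 2
```

Since the overcount depends on how many quality strings happen to contain "@", it would differ between R1 and R2 even when both files hold the same number of reads.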
Thank you. I used wc -l to get the total number of lines and then divided it by 4.
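The line-count approach is sound as long as every record is exactly 4 lines. A rough equivalent that also flags a line count not divisible by 4 (which could point to a truncated or damaged file) might look like this. The name `count_fastq_reads` is hypothetical, and note that, unlike wc -l, Python also counts a final line that lacks a trailing newline.

```python
import gzip


def count_fastq_reads(path):
    """Count reads as total lines / 4, assuming strict 4-line FASTQ records.
    Raises ValueError if the line count is not a multiple of 4."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    reads, leftover = divmod(n_lines, 4)
    if leftover:
        raise ValueError(
            f"{path}: {n_lines} lines is not a multiple of 4; "
            "the file may be truncated or contain multi-line records"
        )
    return reads
```

Running it on both R1.fastq and R2.fastq should give the same number for an intact paired-end data set.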