Question

Trim Paired-end Fastq Files

0

6.4 years ago

yuabrahamliu ▴ 60

Hi all, maybe this is too basic a question, but I am really confused. I have an R1.fastq file and an R2.fastq file from paired-end RNA-seq. As far as I know, the read order in the R1 and R2 files should be the same; that is, the two reads of a pair should have the same rank in R1 and R2 respectively. However, when I count the initial number of reads in R1 and R2, they differ: R1 has 1878678 reads, while R2 has 1800352 reads. Does this mean that the additional reads in R1 (1878678 - 1800352 = 78326 reads) are unpaired, and that all the other reads in R1 and R2 are paired and have the same rank?

What confuses me even more is that after trimming R1 and R2 with Trimmomatic (PE mode), the trimmed PAIRED R1 and R2 files still have different read numbers (R1: 1397878; R2: 1402966). So does this mean that the additional reads in R2 this time (1402966 - 1397878 = 5088 reads) are unpaired and the others are paired with R1? But Trimmomatic assigned these reads to the PAIRED output files, and the unpaired reads should already have been moved to the separate unpaired FASTQ output files.

Could anyone help explain this? Thank you so much.

RNA-Seq trimmomatic paired-end • 5.4k views

ADD COMMENT • link updated 6.4 years ago by h.mon 35k • written 6.4 years ago by yuabrahamliu ▴ 60

0

Be careful when downloading FASTQ files. Always prefer fastq-dump or prefetch. Do not download R1 and R2 separately via direct download. Also contact the data provider.

ADD REPLY • link 6.4 years ago by k.kathirvel93 ▴ 310

0

Where did you obtain the files? Did you download them from ENA / SRA? Did a sequencing facility sequence your samples? Were you given these files by a collaborator?

Did you run FastQC on them? It seems they may have been trimmed already. Some quick and dirty sanity checks: what is the output of the following?

head -n1 R1.fastq
head -n1 R2.fastq
tail -n4 R1.fastq
tail -n4 R2.fastq

ADD REPLY • link 6.4 years ago by h.mon 35k

0

This may sound stupid, but can you tell us how you counted the reads? If you simply used grep with the "@" symbol, it may count not only the sequence headers but also any quality lines (the fourth line of each record) that begin with "@". In the Phred+33 encoding used by Illumina 1.8+, a quality value of 31 is represented by "@". That would produce unequal PE counts.
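To illustrate the pitfall, here is a minimal sketch on made-up demo data (the two reads and the temporary file are invented for illustration, and strict 4-line FASTQ records are assumed). It compares a naive grep count with a count of record header lines only; on this demo input the grep count is 3 while the file holds 2 reads.

```shell
# Build a tiny two-read FASTQ whose second quality line starts with "@"
# (hypothetical demo data, not taken from the files in this thread).
demo=$(mktemp)
printf '%s\n' \
  '@read1/1' 'ACGT' '+' 'IIII' \
  '@read2/1' 'ACGT' '+' '@III' > "$demo"

# Naive count: every line starting with "@", which includes quality lines.
naive=$(grep -c '^@' "$demo")

# Record-based count: only line 1 of each 4-line record is a header.
# Assumes sequences are not wrapped over multiple lines.
by_record=$(awk 'NR % 4 == 1 { n++ } END { print n + 0 }' "$demo")

echo "grep -c '^@': $naive"
echo "records:      $by_record"
rm -f "$demo"
```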

ADD REPLY • link 6.4 years ago by Tm ★ 1.1k

0

Thank you. I used wc -l to count the total number of lines and then divided by 4.

ADD REPLY • link 6.4 years ago by yuabrahamliu ▴ 60

score 1 · Answer 1 · 2018-07-20

1

6.4 years ago

swbarnes2 14k

Step one: ask the person who gave you the FASTQ files how they were filtered. The FASTQ files that came off the instrument should all be paired and in order. You might have files where some reads were purged for quality reasons while their mates were left in, or one file may have been truncated.
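One way to see where the two files stop agreeing is to compare read IDs record by record. The sketch below assumes strict 4-line FASTQ records and headers where the mate is marked either by a trailing /1 or /2 or by a comment after the first space; the function names `read_ids` and `check_pairing` are hypothetical helpers, not part of any tool mentioned here.

```shell
# Print read IDs (header lines only), dropping the leading "@", any comment
# after the first whitespace, and a trailing /1 or /2 mate suffix.
read_ids() {
  awk 'NR % 4 == 1 { id = $1; sub(/^@/, "", id); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Report the first record at which the two files disagree (this also catches
# one file being longer, since the missing ID is then empty).
# Prints "OK" when every record pairs up.
check_pairing() {
  ids1=$(mktemp)
  ids2=$(mktemp)
  read_ids "$1" > "$ids1"
  read_ids "$2" > "$ids2"
  paste "$ids1" "$ids2" |
    awk -F '\t' '
      $1 != $2 { print "mismatch at record " NR ": " $1 " vs " $2; bad = 1; exit }
      END { if (!bad) print "OK" }'
  rm -f "$ids1" "$ids2"
}

# Example usage:
#   check_pairing R1.fastq R2.fastq
```

A mismatch early in the files suggests reads were removed from one side only; a mismatch only at the very end points more towards truncation.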

ADD COMMENT • link 6.4 years ago by swbarnes2 14k

score 1 · Answer 2 · 2018-07-20

1

6.4 years ago

Dattatray Mongad ▴ 380

This happens often; I have run into the same problem myself. What I did was:

1. Trim and filter the forward and reverse reads (I used NGSQCToolkit).
2. Use fastq-pair to keep only those reads that have mates in both the forward and reverse FASTQ files.
3. Check what percentage of the data you lost. If enough data is retained, proceed to the next step.

If you lose a large amount of data, contact the data provider.
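The "how much did I lose" check can be sketched as below. This is only an illustration: `count_reads` and `pct_retained` are hypothetical helper names, the file names in the usage comment are placeholders, and strict 4-line FASTQ records are assumed.

```shell
# Number of reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Percentage of reads retained, to one decimal place.
# $1 = reads before pairing, $2 = reads after pairing.
pct_retained() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before <= 0) { print "NA"; exit }
    printf "%.1f\n", 100 * after / before
  }'
}

# Example usage (file names are placeholders):
#   before=$(count_reads R1.fastq)
#   after=$(count_reads R1.paired.fastq)
#   echo "retained: $(pct_retained "$before" "$after")%"
```

What counts as an acceptable loss is a judgment call that depends on the experiment; the script only reports the number.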

ADD COMMENT • link 6.4 years ago by Dattatray Mongad ▴ 380

0

Awesome. I think it is a very useful tool, fastq-pair.

ADD REPLY • link 6.4 years ago by yuabrahamliu ▴ 60

score 0 · Answer 3 · 2018-07-20

"Awesome. I think it is a very useful tool, fastq-pair."

You are "fixing" something without knowing how it got broken in the first place; at least, if you do know, you didn't tell us. You didn't tell us the source of the data, and you didn't follow up on some of our questions. Again, what is the output of:

head -n1 R1.fastq
head -n1 R2.fastq
tail -n4 R1.fastq
tail -n4 R2.fastq

For all we know, it is even possible that you are treating two files from different samples as a pair. This can happen, see for example this post. So before fixing anything, try to find out how things got broken in the first place, or you may end up with some really nonsensical results downstream.