Question

Read pairing issues detected in Tophat run


8.4 years ago

tunl ▴ 90

I am running Tophat (v2.0.10) as follows:

./tophat2 -p 8 -G genes.gtf --b2-very-sensitive --library-type fr-firststrand -o ./result/CTR Bowtie2index/genome CTR-0.fastq CTR-1.fastq

At the step “Preparing reads” ( prep_reads v2.0.9 (3067M) ), I got:

WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: HWI-ST1133R:6:1101:1060:2144#AGGCAGCTCTCT/1 vs HWI-ST1133R:7:1101:1166:2068#NGGCAG/1
4266 out of 12827817 reads have been filtered out; 7461 out of 12827817 read mates have been filtered out

I had two other subsequent Tophat runs on two other samples, and also got the read pairing issues (name mismatch) as follows:

WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: HWI-ST1133R:7:1101:1445:2216#CTCTCTCTCTCT/1 vs HWI-S3R:2:1101:1053:2168#CTCTCTCTCTCT/1
9331 out of 23151044 reads have been filtered out; 138 out of 23151044 read mates have been filtered out

And:

WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: HWI-ST1133R:6:1101:1392:2151#GGACTCCTCTCT/1 vs HWI-ST1133R:7:1101:1172:2165#NGACTC/1
4210 out of 12330176 reads have been filtered out; 7200 out of 12330176 read mates have been filtered out

Is this read pairing name mismatch a serious problem in running Tophat? What impact does it have?

What could I possibly do to fix this problem?

I’d greatly appreciate any ideas and suggestions.

Thank you very much!

RNA-Seq Tophat • 3.5k views

ADD COMMENT • link 8.4 years ago by tunl ▴ 90


Did you trim your paired-end data files independently (or with a trimming program that is not PE-aware)? That is a likely cause of the reads being out of order in your data files. You can use repair.sh from BBMap to fix the read pairing like so:

repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz

Note: You are running an old version of TopHat (almost 2.5 years old). That is not a good idea. You should upgrade to the latest version (v2.1.1) if you are able to.

In terms of impact on alignments, if your reads are out of order in the two files, you could get discordant or strange alignments that will not make sense.
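
Before re-pairing, it can help to confirm where the two files go out of sync. The following is a minimal sketch, not part of TopHat or BBMap. It assumes uncompressed FASTQ files with exactly four lines per record and mate suffixes of the form /1 and /2; the function names are made up for illustration.

```python
import itertools


def read_names(path):
    """Yield the read name from each FASTQ record (first line of every 4)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                fields = line[1:].split()  # drop the leading '@'
                yield fields[0] if fields else ""


def base_name(name):
    """Drop a trailing /1 or /2 mate suffix so mates can be compared."""
    if name.endswith(("/1", "/2")):
        return name[:-2]
    return name


def first_mismatch(r1_path, r2_path):
    """Return (pair_number, name1, name2) for the first out-of-sync pair.

    Returns None when every record in both files lines up by name.
    A missing mate (files of different length) is reported as None.
    """
    pairs = itertools.zip_longest(read_names(r1_path), read_names(r2_path))
    for number, (name1, name2) in enumerate(pairs, start=1):
        if name1 is None or name2 is None:
            return number, name1, name2
        if base_name(name1) != base_name(name2):
            return number, name1, name2
    return None
```

Calling first_mismatch("CTR-0.fastq", "CTR-1.fastq") would report the first pair whose names disagree. In the warnings quoted in the question, both names of pair #1 end in /1 and come from different lanes, so it may also be worth checking that the two files really are mates of each other; that is only a guess from the log.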

ADD REPLY • link 8.4 years ago by GenoMax 147k


Thank you so much for your advice!

Actually we got the fastq files from other people, so I’m not sure if they were trimmed or not. Is there an easy way to find out whether a fastq file has been trimmed?

If the fastq files are not trimmed, would it also cause this read pairing name mismatch problem?

So when we trim the fastq files, we should not trim the paired-end data files independently, right?

Is Trimmomatic a good tool to trim paired-end fastq files? How about FastQC?

Thank you very much for your help!

ADD REPLY • link 8.4 years ago by tunl ▴ 90


If the reads are not all the same length, that is an indication that the data has been trimmed. You should be able to see that in the FastQC report (general stats at the top). FastQC only does QC; it does not change the data in any way.
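
If running FastQC is not convenient, the same length check can be sketched in a few lines. This is an illustrative helper, not a FastQC feature. It assumes an uncompressed FASTQ file with four lines per record, and the function names are hypothetical. Note that trimming every read to the same fixed length would not be detected this way.

```python
from collections import Counter


def read_length_counts(path):
    """Count sequence lengths in a 4-line-per-record FASTQ file."""
    counts = Counter()
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the second line of each record
                counts[len(line.strip())] += 1
    return counts


def looks_trimmed(path):
    """Return True when more than one read length is present."""
    return len(read_length_counts(path)) > 1
```

For example, looks_trimmed("CTR-0.fastq") returning False would match the "identical length" case, although it cannot rule out other kinds of processing.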

Improper trimming is a surefire way of getting reads out of sync. I can think of a few additional ways this could happen, but they all have low probability (e.g. corruption during transfer).

Trimmomatic is a PE-aware trimmer. If you downloaded the BBMap suite, you could use BBDuk for trimming/scanning your data.

ADD REPLY • link 8.4 years ago by GenoMax 147k


Thank you very much for your further help!

So repair.sh can fix the read pairing issues no matter what caused the mismatch, right?

What is the singletons.fq.gz file in the repair.sh command line?

Another question is, when Tophat says some reads and read mates have been filtered out, does this mean the mismatched parts are not aligned at all?

Thank you very much!

ADD REPLY • link 8.4 years ago by tunl ▴ 90


Yes, repair.sh can fix the read order so that the two files (R1/R2) are in sync again. singletons.fq.gz will contain reads whose mate was completely trimmed out, eliminated, or otherwise absent.
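
To make the idea of re-pairing and singletons concrete, here is a small in-memory sketch of matching mates by name. It is not how repair.sh is implemented. It assumes well-formed four-line FASTQ records, /1 and /2 mate suffixes, unique read names, and files small enough to hold in memory; all names are hypothetical.

```python
def parse_fastq(lines):
    """Group FASTQ lines into (base_name, record) tuples, 4 lines per record."""
    lines = [line.rstrip("\n") for line in lines]
    records = []
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        fields = record[0][1:].split()  # drop the leading '@'
        name = fields[0] if fields else ""
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # strip the mate suffix
        records.append((name, record))
    return records


def repair(r1_lines, r2_lines):
    """Re-pair reads by name and return (fixed1, fixed2, singletons)."""
    r1 = parse_fastq(r1_lines)
    r2 = dict(parse_fastq(r2_lines))  # name -> record lookup for the mates
    fixed1, fixed2, singletons = [], [], []
    for name, record in r1:
        mate = r2.pop(name, None)
        if mate is None:
            singletons.append(record)  # R1 read whose mate is absent
        else:
            fixed1.append(record)
            fixed2.append(mate)
    singletons.extend(r2.values())  # R2 reads that were never matched
    return fixed1, fixed2, singletons
```

After this, fixed1 and fixed2 hold the same reads in the same order, and everything without a partner ends up in singletons, which mirrors the role of the out1, out2, and outsingle outputs in the command above.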

Are you referring to this filtering by TopHat: "What are the cut-offs during read quality filtering in Bowtie/TopHat before mapping?"

ADD REPLY • link 8.4 years ago by GenoMax 147k


I was referring to the messages in prep_reads.log (as quoted in the blue box in my posting): 4266 out of 12827817 reads have been filtered out; 7461 out of 12827817 read mates have been filtered out.

Thank you for pointing me to the previous posting. So this “filtering-out” is also a quality control to skip the bad reads. In this case, I am just wondering if Tophat also filtered out the name-mismatched parts so that the name-mismatched parts are not aligned at all?

I ran Cuffdiff on the BAM files created by Tophat, and for some reason, the step “Testing for differential expression and regulation in locus” became extremely slow (only 10% was done after 3 days). So I am just wondering if the name mismatch has anything to do with this slowness. If the name-mismatched parts are filtered out by Tophat (no alignment), can they still appear in the output BAM files and affect Cuffdiff in some way?

I found that our fastq data are actually not trimmed (identical length). Could untrimmed paired-end reads also have name mismatch (if data not corrupted during transfer)?

Thank you very much for your help!

ADD REPLY • link 8.4 years ago by tunl ▴ 90


"I found that our fastq data are actually not trimmed (identical length). Could untrimmed paired-end reads also have name mismatch (if data not corrupted during transfer)?"

There is no reason they should be mismatched (unless someone did something to the files).

I am going to point out again that, unless you are using the latest TopHat, some of these issues may be known problems that have since been fixed in the latest version. Did you upgrade to the latest TopHat?

ADD REPLY • link 8.4 years ago by GenoMax 147k


Thank you very much for the advice!

I’ll try to upgrade to Tophat 2.1.1 now.

The people who provided the fastq files have just told us that the data may be ATAC-seq instead of RNA-seq (originally we were told the data was RNA-seq).

So I’m just wondering if untrimmed ATAC-seq paired-end reads could have name mismatch?

Does Tophat process ATAC-seq data any differently from RNA-seq data?

Thanks a lot for your help!

ADD REPLY • link 8.4 years ago by tunl ▴ 90


If this is ATAC-seq data then you should not use TopHat for analysis. See this thread for options.

ADD REPLY • link 8.4 years ago by GenoMax 147k


Thanks a lot for the information!

It looks like they use Bowtie to map ATAC-seq data.

So does Tophat have issues with mapping the ATAC-seq data?

I thought Tophat uses Bowtie as its alignment engine and Bowtie cannot align reads that span introns...

ADD REPLY • link 8.4 years ago by tunl ▴ 90