Question

tophat2 mapping merged PE and nonmerged PE

1

Entering edit mode

9.7 years ago

apelin20 ▴ 480

Hello,

So I issued the following command:

tophat2 -p 32 -o fur1 -G NC_000000.gtf NC_000000 ../Trimmed_Reads/merged/fur1_unmerged_BB_PE1.fastq,../Trimmed_Reads/merged/fur1_unmerged_BB_PE2.fastq ../Trimmed_Reads/merged/fur1_merged_BB.fastq

Thinking that first feeding the non_merged PE and then the merged PE should be no problem. Note, merged PE should be treated as SE.

I get the following log:

[2015-04-04 21:15:24] Beginning TopHat run (v2.0.13)
-----------------------------------------------
[2015-04-04 21:15:24] Checking for Bowtie
          Bowtie version:     2.2.4.0
[2015-04-04 21:15:24] Checking for Bowtie index files (genome)..
[2015-04-04 21:15:24] Checking for reference FASTA file
[2015-04-04 21:15:24] Generating SAM header for NC_000000
[2015-04-04 21:15:24] Reading known junctions from GTF file
[2015-04-04 21:15:24] Preparing reads
WARNING: read pairing issues detected (check prep_reads.log) !

     left reads: min. length=50, max. length=100, 914334 kept reads (177 discarded)
    right reads: min. length=35, max. length=188, 914511 kept reads (0 discarded)
[2015-04-04 21:17:48] Building transcriptome data files fur1/tmp/NC_002163

...

So the warning got me to check prep_reads.log:

prep_reads v2.0.13 (4310)
---------------------------
WARNING: read pairing issues detected (check prep_reads.log) !
 Pair #1 name mismatch: DD63XKN1:325:C41PAACXX:7:1101:1374:34829/1 vs DD63XKN1:325:C41PAACXX:7:1101:1372:99014/1
177 out of 914511 reads have been filtered out
0 out of 914511 read mates have been filtered out

So DD63XKN1:325:C41PAACXX:7:1101:1372:99014 is part of the merged file while DD63XKN1:325:C41PAACXX:7:1101:1374:34829 is an unmerged paired-end read. I'm wondering what I am doing wrong, it's not supposed to try and merge those.

Any thoughts?

Thanks,
Adrian

merged NGS tophat2 bowtie2 • 3.6k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.7 years ago by apelin20 ▴ 480

Ram · Answer 1 · 2015-04-05

2

Entering edit mode

9.7 years ago

Devon Ryan 104k

The command you wanted to run is:

tophat2 -p 32 -o fur1 -G NC_000000.gtf NC_000000 ../Trimmed_Reads/merged/fur1_unmerged_BB_PE1.fastq,../Trimmed_Reads/merged/fur1_merged_BB.fastq ../Trimmed_Reads/merged/fur1_unmerged_BB_PE2.fastq

Read 1 and 2 shouldn't be separated by commas. I won't comment on the use of merged reads too, presumably you have a good reason for that.

ADD COMMENT • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Well, the overlap rate between samples ranged from 72% to 80% of reads that were merged into one. From what I understand, merging reads increases their quality confidence at the 3' end. That's really the only reason I did it. Would you advise against doing that?

ADD REPLY • link 9.7 years ago by Adrian Pelin ★ 2.6k

2

Entering edit mode

Hi Adrian,

The latest version of BBTools (34.79) has a greatly improved version of BBMerge. Aside from the increased accuracy and merge rate, it now has a "ecc" flag. If you enable that, overlapping reads will be error-corrected, but not merged. Specifically -

bmerge.sh in=r#.fq out=corrected#.fq ecc mix

The "mix" flag puts the corrected and uncorrected reads in the same files; otherwise they get split into corrected reads going to "out" and uncorrected reads (those for which an overlap was not found) going to "outu".

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you, will try it out. I still don't get why merging reads when counting FPKMs is a bad idea... The merged read is a much better representation of the sequenced DNA fragment. Although I can see cases where 5' errors in the R2 or R1 reda may cause misalignment or no alignment at all.

ADD REPLY • link 9.6 years ago by apelin20 ▴ 480

0

Entering edit mode

It becomes a question of how reliably you can merge them prealignment. If your fragment size distribution allows you to do this with high certainty then you should be fine doing so, this is just typically not the situation for the overwhelming majority of those doing RNAseq.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

Hello Brian, this tool certainly provides a brand new capability to NGS data, thank you for building this. I tried it, and indeed quality values are way higher, which is great, I am going to try read mapping next. I am however wondering now if it is best to do this read correction before or after trimming reads for quality. If I do it before, I made save some sequence data that would otherwise be trimmed, however, some reads due to bad 3' quality won't merge if I do not trim.

ADD REPLY • link 9.6 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

BBMerge is pretty tolerant of low quality, and quality-trimming shortens the reads so fewer of them overlap, so I recommend running BBMerge before quality-trimming. My testing has not shown quality-trimming to significantly increase merge rates. But you can, if you want, do it both ways, by first doing overlap-correction, then trimming the reads that didn't overlap, then correcting those, and then concatenating the resulting files:

bbmerge.sh in=r#.fq out=corrected#.fq outu=uncorrected#.fq ecc
bbduk.sh in=uncorrected#.fq out=trimmed#.fq qtrim=r trimq=15
bbmerge.sh in=trimmed#.fq out=trimmed_ecc#.fq ecc mix
cat corrected1.fq trimmed_ecc1.fq > all1.fq
cat corrected2.fq trimmed_ecc2.fq > all2.fq

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Merging is only useful for assembly. Unless you're doing that, then don't bother.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

Ram · Answer 2 · 2015-04-04

1

Entering edit mode

9.7 years ago

Brian Bushnell 20k

Hi Adrian,

I can't really troubleshoot TopHat problems, but if I were you, I would try doing 2 separate TopHat runs - one with the unmerged paired reads, and one with the merged reads. Then merge the sam files.

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.7 years ago by Brian Bushnell 20k

Ram · Answer 3 · 2015-04-05

1

Entering edit mode

9.7 years ago

Brian Bushnell 20k

I wouldn't say only useful for assembly, but having some merged and some unmerged reads can make RNA-seq quantification more difficult so I would also skip merging for quantification analysis. However, merging can improve your ability to distinguish alternatively-spliced isoforms, so it depends on the goal of the experiment.

I'm adding a mode to BBMerge to error-correct the read tails by overlap without actually merging the reads, so hopefully that will be useful for RNA-seq quantification in the future.

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Interesting, was it essentially just the fixed fragment size of the merged sequences that helped discriminate isoforms? I've not seen this, but I guess I've also never looked.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

The longer fragment length better distinguishes isoforms, because it sometimes allows longer "anchors" in exons. For example, consider this scenario, where the merged read maps to exon X and exon Z, skipping exon Y, like this:

XXXXXXXXXXXXXXXXZZZZZZZZZZZZZZZ

If the individual reads mapped like this:

XXXXXXXXXXXXXXXXZZ
XXZZZZZZZZZZZZZZZZ

...then the terminal bases will probably be soft-clipped rather than spanning the intron since there are only a few.

With BBMap, for example, with default settings (K=13), it will not map a read spanning an intron unless there are at least 13 bases on the next exon. So if each read only went 10bp into the next exon, then neither would span it. But in that case the reads would have a 20bp overlap, which would allow a highly confident merge; and the merged read would be mapped spanning the intron.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Is that the point of 2-pass algorithms (e.g., with STAR or tophat)?

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

I'm not really sure, but I imagine it's similar. If were to write a 2-pass mode for genome-alignment, I'd have pass 1 map to the genome, then identify exon-exon junctions, and in pass 2 map the unmapped and imperfectly mapped reads to the junction set. The net result would be that reads with very short overhangs on exons would get mapped correctly in pass 2.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.7 years ago by Brian Bushnell 20k