Hello,
So I issued the following command:
tophat2 -p 32 -o fur1 -G NC_000000.gtf NC_000000 ../Trimmed_Reads/merged/fur1_unmerged_BB_PE1.fastq,../Trimmed_Reads/merged/fur1_unmerged_BB_PE2.fastq ../Trimmed_Reads/merged/fur1_merged_BB.fastq
I assumed that feeding the unmerged PE files first and then the merged PE file would be no problem. Note that the merged PE reads should be treated as SE.
I get the following log:
[2015-04-04 21:15:24] Beginning TopHat run (v2.0.13)
-----------------------------------------------
[2015-04-04 21:15:24] Checking for Bowtie
Bowtie version: 2.2.4.0
[2015-04-04 21:15:24] Checking for Bowtie index files (genome)..
[2015-04-04 21:15:24] Checking for reference FASTA file
[2015-04-04 21:15:24] Generating SAM header for NC_000000
[2015-04-04 21:15:24] Reading known junctions from GTF file
[2015-04-04 21:15:24] Preparing reads
WARNING: read pairing issues detected (check prep_reads.log) !
left reads: min. length=50, max. length=100, 914334 kept reads (177 discarded)
right reads: min. length=35, max. length=188, 914511 kept reads (0 discarded)
[2015-04-04 21:17:48] Building transcriptome data files fur1/tmp/NC_002163
...
So the warning got me to check prep_reads.log:
prep_reads v2.0.13 (4310)
---------------------------
WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: DD63XKN1:325:C41PAACXX:7:1101:1374:34829/1 vs DD63XKN1:325:C41PAACXX:7:1101:1372:99014/1
177 out of 914511 reads have been filtered out
0 out of 914511 read mates have been filtered out
So DD63XKN1:325:C41PAACXX:7:1101:1372:99014 is part of the merged file, while DD63XKN1:325:C41PAACXX:7:1101:1374:34829 is an unmerged paired-end read. I'm wondering what I am doing wrong; it's not supposed to try to pair those.
Any thoughts?
Thanks,
Adrian
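For what it's worth, the TopHat usage line reads `tophat [options] <index> <reads1[,reads2,...]> [reads1[,reads2,...]]`: files joined by commas are treated as the same side, and the space separates the left mates from the right mates. Read that way, the command above hands both unmerged files to TopHat as left reads and the merged file as right reads, which would be consistent with the name mismatch in the log. Below is a minimal Python sketch of that interpretation, together with a name check similar in spirit to what prep_reads reports. `split_read_args`, `base_name` and `first_name_mismatch` are hypothetical helpers written for illustration; they are not part of TopHat.

```python
def split_read_args(positional):
    """Interpret the read arguments that follow the index name.

    Assumption: first argument = left reads, optional second = right reads,
    each a comma-separated list of files (as in the TopHat usage line).
    """
    left = positional[0].split(",")
    right = positional[1].split(",") if len(positional) > 1 else []
    return left, right


def base_name(header):
    """Drop the leading '@', any comment, and a trailing /1 or /2."""
    name = header.strip().lstrip("@").split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_name_mismatch(left_headers, right_headers):
    """Return (pair_number, left, right) for the first mismatched pair, or None."""
    for i, (l, r) in enumerate(zip(left_headers, right_headers), start=1):
        if base_name(l) != base_name(r):
            return i, l, r
    return None


# The two positional read arguments from the command in this post.
args = [
    "../Trimmed_Reads/merged/fur1_unmerged_BB_PE1.fastq,"
    "../Trimmed_Reads/merged/fur1_unmerged_BB_PE2.fastq",
    "../Trimmed_Reads/merged/fur1_merged_BB.fastq",
]
left, right = split_read_args(args)
print("left :", left)
print("right:", right)

# The two read names from prep_reads.log, compared as a "pair".
print(first_name_mismatch(
    ["@DD63XKN1:325:C41PAACXX:7:1101:1374:34829/1"],
    ["@DD63XKN1:325:C41PAACXX:7:1101:1372:99014/1"],
))
```

If the merged reads are meant to be aligned as single-end, one option (assuming I read the manual correctly) is to align them in a separate run, with the merged file as the only read argument, rather than mixing them with the paired files.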
Well, across samples, between 72% and 80% of read pairs overlapped and were merged into one. From what I understand, merging reads increases their quality confidence at the 3' end. That's really the only reason I did it. Would you advise against doing that?
Hi Adrian,
The latest version of BBTools (34.79) has a greatly improved version of BBMerge. Aside from the increased accuracy and merge rate, it now has an "ecc" flag. If you enable that, overlapping reads will be error-corrected, but not merged. Specifically -
The "mix" flag puts the corrected and uncorrected reads in the same files; otherwise they get split into corrected reads going to "out" and uncorrected reads (those for which an overlap was not found) going to "outu".
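The routing described here can be modelled in a few lines. This is only a toy Python model of the behaviour as stated in this post, not BBMerge's code; `route_pair` is a hypothetical name, and it assumes that with "mix" enabled the shared destination is the "out" stream.

```python
def route_pair(overlap_found, mix):
    """Name the output stream a read pair would go to.

    overlap_found: True if an overlap was detected (pair gets corrected).
    mix: True if corrected and uncorrected pairs share the same output.
    Assumption: the shared output under mix is "out".
    """
    if mix:
        return "out"
    return "out" if overlap_found else "outu"


for overlap_found in (True, False):
    for mix in (True, False):
        print(f"overlap={overlap_found!s:5} mix={mix!s:5} -> "
              f"{route_pair(overlap_found, mix)}")
```

With "mix" on, the pair count in the output matches the input, which is convenient when the files go straight to an aligner that expects the two mate files to stay in sync.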
Thank you, will try it out. I still don't get why merging reads when counting FPKMs is a bad idea... The merged read is a much better representation of the sequenced DNA fragment, although I can see cases where 5' errors in the R1 or R2 read may cause misalignment or no alignment at all.
It becomes a question of how reliably you can merge them before alignment. If your fragment size distribution allows you to do this with high certainty, then you should be fine doing so. That is just typically not the situation for the overwhelming majority of those doing RNA-seq.
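To make the dependence on fragment size concrete: for two mates of equal length, the overlap is roughly twice the read length minus the fragment length, capped by the fragment length itself when the fragment is shorter than a read. The Python sketch below applies that arithmetic to a list of fragment lengths. The fragment lengths, the read length and the `min_overlap` threshold are made-up illustration values, not BBMerge defaults, and the function names are hypothetical.

```python
def expected_overlap(read_len, fragment_len):
    """Bases covered by both mates of a pair with equal read length.

    Returns 0 when the mates do not reach each other.
    """
    return max(0, min(2 * read_len - fragment_len, fragment_len, read_len))


def mergeable_fraction(fragment_lengths, read_len, min_overlap):
    """Fraction of fragments whose mates overlap by at least min_overlap bases."""
    if not fragment_lengths:
        return 0.0
    hits = sum(
        1 for f in fragment_lengths
        if expected_overlap(read_len, f) >= min_overlap
    )
    return hits / len(fragment_lengths)


# Hypothetical fragment lengths for 2 x 100 bp reads.
fragments = [120, 150, 180, 220, 300]
for f in fragments:
    print(f, expected_overlap(100, f))
print(mergeable_fraction(fragments, read_len=100, min_overlap=12))
```

The shorter the overlap, the more likely a spurious match becomes, so a library whose fragments sit close to twice the read length leaves little room for confident merging.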
Hello Brian, this tool certainly provides a brand new capability for NGS data; thank you for building it. I tried it, and quality values are indeed much higher, which is great. I am going to try read mapping next. However, I am now wondering whether it is best to do this read correction before or after trimming reads for quality. If I do it before, I may save some sequence data that would otherwise be trimmed; however, some reads with bad 3' quality won't merge if I do not trim.
BBMerge is pretty tolerant of low quality, and quality-trimming shortens the reads so fewer of them overlap, so I recommend running BBMerge before quality-trimming. My testing has not shown quality-trimming to significantly increase merge rates. But you can, if you want, do it both ways, by first doing overlap-correction, then trimming the reads that didn't overlap, then correcting those, and then concatenating the resulting files:
Merging is only useful for assembly. Unless you're doing that, don't bother.