I'm totally new to all NGS tools so please bear with me.
I've been trying to analyze data some paired reads with the tools on Galaxy. After "grooming" the FASTQ files, I attempted to map the reads with Tophat2, following the tutorial here. The alignment summary is pasted below:
Left reads:
Input: 36728489
Mapped: 35783336 (97.4% of input)
of these: 10212030 (28.5%) have multiple alignments (4222 have >20)
Right reads:
Input: 35892294
Mapped: 35027752 (97.6% of input)
of these: 10007714 (28.6%) have multiple alignments (4283 have >20)
97.5% overall read alignment rate.
Aligned pairs: 34152758
of these: 2784707 ( 8.2%) have multiple alignments
and: 34025873 (99.6%) are discordant alignments
0.4% concordant pair alignment rate.
Why are there so many "discordant alignments"? And is this why I'm getting such a low % of "properly paired" reads according to flagstat (pasted below)?
167038160 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
167038160 + 0 mapped (100.00%:-nan%)
167038160 + 0 paired in sequencing
84138107 + 0 read1
82900053 + 0 read2
16240 + 0 properly paired (0.01%:-nan%)
162899512 + 0 with itself and mate mapped
4138648 + 0 singletons (2.48%:-nan%)
154361318 + 0 with mate mapped to a different chr
42840562 + 0 with mate mapped to a different chr (mapQ>=5)
And just for good measure, here is output from Picard Alignment summary
CATEGORY FIRST_OF_PAIR
TOTAL_READS 35796836
PF_READS 35796836
PCT_PF_READS 1
PF_NOISE_READS 8
PF_READS_ALIGNED 35796828
PCT_PF_READS_ALIGNED 1
PF_ALIGNED_BASES 1532498561
PF_HQ_ALIGNED_READS 24005633
PF_HQ_ALIGNED_BASES 1029373773
PF_HQ_ALIGNED_Q20_BASES 1029373773
PF_HQ_MEDIAN_MISMATCHES 0
PF_MISMATCH_RATE 0.282276
PF_HQ_ERROR_RATE 0.284893
PF_INDEL_RATE 0.000134
MEAN_READ_LENGTH 51
READS_ALIGNED_IN_PAIRS 34152750
PCT_READS_ALIGNED_IN_PAIRS 0.954072
BAD_CYCLES 0
STRAND_BALANCE 0.500655
PCT_CHIMERAS 0.999132
PCT_ADAPTER 0
I expect in order to answer these questions, one would need to know more about the quality of my reads, but as I said, I'm really flying blind here, so I can't even anticipate what additional info, you might need. I can say that according to my FastQC results on 223 and 224 (history #234 and #235), both look like they have lengths of 51bp. Also, in the FastQc results I'm seeing per base sequence qualities that are initially around 33, but then are steady >35 for the rest of the read. I apologize for what is probably a stupid question. Hey, I gotta start somewhere.
How did you groom the reads? It's likely that the files became out of sync at that step.
The initial file format was fastqillumina
My question was actually more whether you did the "NGS: QC and manipulation > FASTX-TOOLKIT FOR FASTQ DATA" step. Fastx is known to be problematic with paired-end reads. As a rule if you have paired-end reads, don't use any tool that can't accept both fastq files at once (except FastQC). I assume there's a trimmer (trimmomatic, Trim Galore!, etc.) available through galaxy that can properly handle paired-end reads. If not, you can trim locally (trimmomatic is in Java, so it'll even run on Windows) and then upload the resulting 2 files.
I did NGS: QC and manipulation > Filter FASTQ and set a min quality to 20. I see that there's a checkbox for "This is paired end data" and I'm pretty sure I checked that. But I can run this again, and be certain to check it. I don't actually think this filtering is really necessary because my reads seem good: quality scores > 33 through read and sizes are all 51 bp. I am getting a problem with sequence duplication levels >=73%.
The duplication level (I assume you're getting this from FastQC) is fairly common for RNAseq, so I wouldn't worry about it. That's only really useful for looking at DNA sequencing, where you expect to more uniformly cover things (in RNAseq, anything really highly expressed will throw the duplication level metric off).
Try aligning the untrimmed/ungroomed reads and see if you get the same results. If so, perhaps try to post the fastq files somewhere and one of us can have a look at what might have gone wrong (please note the species you're working on if you do that!).
What's your reference genome? I see where you said you're following a tutorial, but if you're mapping against a draft reference or a bacterial/viral reference with really short chromosomes, then it's not at all surprising that you'd see reads mapping to different chromosomes/contigs.
The result for PCT_CHIMERAS is really interesting, too. Not yet sure what to make of that.
Actually after thinking about it a little more, the PCT_CHIMERAS has to be associated with your discordant alignment %. Tell me more about your samples. What kind of prep was used to make them, and what's the source organism? Did you see overrepresented sequences in your FASTQ report?
I see PCT_CHIMERAS is, "The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes" so that sounds the same as discordant alignments.
These are Qiagen prepped RNA samples from mouse ECs. The mapping is with mm9. I'm not sure if I see overrepresented sequences, or do I? I am getting a problem with sequence duplication levels >=73%.
This one is a doozie. Any chance you could make your accepted_hits.bam available for download via Dropbox, Google Drive, or otherwise?
It would also be helpful to know what settings you used. Perhaps you can post a screenshot of your Galaxy settings? The only thing I can think of right now is that the strandedness specification is wrong...