Hi Biostars Community,
These are RNA-seq data from rheumatoid arthritis patient samples.
My percent aligned was about 85% for all samples, and mapped reads (M aligned) were roughly 6-9 million for most samples. However, one sample and all of its technical replicates had mapped reads in the 16-18 million range. The same pattern was present in the raw read counts, before mapping to the transcriptome with salmon. What could explain this?
This picture is before mapping.
This picture is after mapping:
Thank you in advance. I will pay it forward!
You know, I found something interesting during the analysis. The 4 processed samples (SRR3350551 through SRR3350554) with roughly double the mapped reads of the other samples are the only ones with just 4 technical replicates, while all the other samples have 8 technical replicates! I think that's the problem... maybe the replicates somehow got combined, resulting in almost double the mapped reads?
As I said, this may just reflect how the samples were sequenced. A HiSeq has 8 lanes, so some samples may have been split over an entire flowcell while others were not. This is normal; I see things like that all the time, both in public data and in our own. It is not always logistically possible to ensure very similar read depth across all samples, especially if some samples enter library prep much later than others, or some samples failed initially and had to be repeated. Normalization will take care of that.
Does it matter to Salmon whether lanes are merged prior to alignment? I remember that back in the day it was better to merge the raw FastQ files for tools like TopHat, as it could affect how the initial transcript model was built.
I just cat the FASTQs per sample, keeping R1 and R2 separate.
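For example, a minimal sketch (the sample and lane file names here are hypothetical):

    # Concatenate lane-split gzipped FASTQs, keeping R1 and R2 separate
    cat sampleA_L001_R1.fastq.gz sampleA_L002_R1.fastq.gz > sampleA_R1.fastq.gz
    cat sampleA_L001_R2.fastq.gz sampleA_L002_R2.fastq.gz > sampleA_R2.fastq.gz

    # ...or loop over several samples with a glob
    for s in sampleA sampleB; do
        cat "${s}"_L00*_R1.fastq.gz > "${s}"_R1.fastq.gz
        cat "${s}"_L00*_R2.fastq.gz > "${s}"_R2.fastq.gz
    done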
Thank you for responding so quickly, GenoMax. I am using salmon.
Hmmm... maybe you're right (they have more data). It could potentially be that the patient had a rheumatoid arthritis flare-up (this is a knee sample), so there were more cells in the synovial tissue -> more mRNA?
The others are knee or hip samples, but maybe they weren't flaring up?
I'm curious to find out how to check for secondary alignments. Googling now.
EDIT: So there is the --validateMappings flag in salmon, which improves sensitivity and specificity. That I did use, but what I didn't use was -z / --writeMappings, which, I think, writes the mappings with alignment information to a SAM file? I think that's what I need to look at to determine whether they are secondary alignments. I'm going to run salmon again with this flag...
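Roughly what I'm planning to run (a sketch; the index and FASTQ paths are hypothetical, and I'm assuming samtools is available and that salmon marks non-primary records with the secondary-alignment flag):

    # Quantify and stream the mappings as SAM to stdout, compressing to BAM
    salmon quant -i salmon_index -l A \
        -1 sampleA_R1.fastq.gz -2 sampleA_R2.fastq.gz \
        --validateMappings -p 8 -o quant_sampleA -z \
        | samtools view -b -o sampleA_mappings.bam -

    # Count records flagged as secondary alignments (SAM flag 0x100 = 256)
    samtools view -c -f 256 sampleA_mappings.bam

    # Overall mapping summary, for comparison across samples
    samtools flagstat sampleA_mappings.bam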
More RNA is a possible cause but could also just be more sequencing. Balancing sequencing pools perfectly is tough - variations in extraction efficiency, concentration measurements, primer efficiencies and pipetting errors can all lead to variations in sequencing depth for subsets of samples.
This isn't usually a big problem as long as the difference is not vast (which it doesn't appear to be). Most common quantification / analysis methods should normalise for variations in sequencing depth.
I would definitely run some pre-alignment QC (e.g. FastQC) to check the raw data and raw counts. If you're worried you can always subsample the data to get equal read counts (e.g. using seqtk).
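For instance, a rough sketch of both steps (file names are hypothetical; use the same seqtk seed for R1 and R2 so the mates stay paired):

    # Raw-data QC
    mkdir -p fastqc_out
    fastqc sampleA_R1.fastq.gz sampleA_R2.fastq.gz -o fastqc_out

    # Downsample to ~6 million read pairs; identical -s seeds keep mates in sync
    seqtk sample -s100 sampleA_R1.fastq.gz 6000000 | gzip > sampleA_sub_R1.fastq.gz
    seqtk sample -s100 sampleA_R2.fastq.gz 6000000 | gzip > sampleA_sub_R2.fastq.gz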
Adding to this, it is not even guaranteed that all these samples were sequenced on the same flow cell at the same time. Maybe some samples came in (for whatever reason) on a different flow cell together with other samples, so the entire pooling strategy was different. There is no way to tell, and I think it does not matter too much. Raw read numbers are not really a good QC criterion unless they are unexpectedly low or vastly different. As Phil Ewels says, the difference here is not large and normalization should take care of it. Be sure to perform some QC like PCA to see whether clustering is reasonable or whether some samples show evidence of a batch effect.
Thank you for this, guys. The -z / --writeMappings run was taking forever! It has been going for about 3 hours now; I don't think it's parallelized. Regardless, I'm going to follow your suggestions!
If you are not using multiple threads with salmon then it will take longer.

It's weird... I was using -j28 / --cores 28, and I didn't hear all my fans spin up the way they usually do when that many threads are in use. The server usually gets loud. I didn't check my system monitor though, so I can't say for sure how many CPUs were actually being used! I'll check that (whether multiple CPUs are in use, via the system monitor) when I run salmon again with the -z / --writeMappings flag and the thread count set.
If someone wants to move their comment to an answer, I would accept it.
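For next time, roughly how I'd launch it with the thread count set explicitly (a sketch; paths are hypothetical, and I'm assuming the thread flag is -p / --threads as in the salmon docs), checking CPU use from another shell:

    # Re-run with an explicit thread count and write the mappings to a file
    salmon quant -i salmon_index -l A \
        -1 sampleA_R1.fastq.gz -2 sampleA_R2.fastq.gz \
        -p 28 --validateMappings \
        --writeMappings=sampleA_mappings.sam \
        -o quant_sampleA

    # In a separate terminal: see how many CPUs salmon is actually using
    top -u "$USER"    # or: htop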
What do you need that for? Keep in mind that the coordinates in the SAM file are in spliced-transcriptome space, so you cannot easily visualize this BAM in a genome browser.
Honestly, ATpoint, I thought it would be a way to see whether the samples in question had more secondary alignments, since GenoMax had suggested that as a potential reason for the greater number of mapped reads. Not sure if that is feasible now, but as you and Phil Ewels recommended -> don't worry about them and carry on with the analysis (normalization should take care of this).
Currently, I'm about to go from tximport to DESeq2 and then carry on with the downstream analysis. Excited to finally have results. I've been trudging through this process for like 3 days now lol