I got some RNA-seq FASTQ data from a customer, who told me the samples were mainly from human cell lines but may have some contamination with mouse cells. My question is whether I should align those sequences against both the human and the mouse reference genomes, or just the human one. Any suggestions?
That is true, but many of the mouse reads will remain unmapped. You can use BLAST (or SNAP) to look at the unmapped reads more closely (i.e. determine which organism they belong to).
Do you plan on trying to use the contaminated samples? I personally would advise against that.
In this setting, wouldn't it make more sense to align against a conjoined human/mouse reference, or to align separately to both human and mouse and assign each read's species of origin based on the quality of its alignment in sp1 vs sp2?
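To make the "compare alignment quality in sp1 vs sp2" idea concrete, here is a minimal sketch. It assumes you have already extracted a best alignment score per read (for example the `AS` tag) from each of the two separate alignments. The function name `assign_species` and the `min_margin` cutoff are hypothetical, chosen only for illustration.

```python
def assign_species(human_scores, mouse_scores, min_margin=5):
    """Call each read 'human', 'mouse', or 'ambiguous'.

    human_scores / mouse_scores: dict mapping read name -> best alignment
    score in that genome. A read missing from a dict did not align there.
    min_margin is an assumed cutoff, not a recommended value.
    """
    calls = {}
    for read in set(human_scores) | set(mouse_scores):
        h = human_scores.get(read)
        m = mouse_scores.get(read)
        if m is None:
            calls[read] = "human"      # aligned to human only
        elif h is None:
            calls[read] = "mouse"      # aligned to mouse only
        elif h - m >= min_margin:
            calls[read] = "human"      # clearly better in human
        elif m - h >= min_margin:
            calls[read] = "mouse"      # clearly better in mouse
        else:
            calls[read] = "ambiguous"  # too close to call
    return calls


calls = assign_species(
    {"r1": 100, "r2": 80, "r3": 90},
    {"r2": 98, "r3": 92, "r4": 70},
)
```

Reads called "ambiguous" are typically from conserved regions. Whether to drop them or keep them depends on how much contamination you expect.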
First subset the files (seqtk) and then use fastq_screen to get an idea what the contamination rate is. I've found it useful to only pay close attention to the "single alignment in a single organism" (or whatever that's called) category, since the others are more an indicator of sequence complexity. I happen to do this with all sequencing runs produced at our institute, since it immediately allows us to flag problematic samples (anything over 0.5% off-species unique alignment is a problem).
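As a rough sketch of the flagging step: assuming you have pulled, for each sample, the number of subsampled reads and the number of reads with a single alignment in the off-species genome only, the 0.5% rule could be applied like this. The function names and the input layout are hypothetical, and this does not parse fastq_screen output itself.

```python
def off_species_rate(n_reads, n_off_species_unique):
    """Percentage of screened reads that align uniquely to the off-species genome."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 100.0 * n_off_species_unique / n_reads


def flag_samples(counts, threshold_pct=0.5):
    """Return sorted names of samples whose off-species rate exceeds the threshold.

    counts: dict mapping sample name -> (n_reads, n_off_species_unique).
    threshold_pct defaults to the 0.5% cutoff mentioned above.
    """
    return sorted(
        name
        for name, (n_reads, n_off) in counts.items()
        if off_species_rate(n_reads, n_off) > threshold_pct
    )


flagged = flag_samples({
    "sampleA": (100000, 120),   # 0.12% off-species
    "sampleB": (100000, 2300),  # 2.3% off-species
})
```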
Ideally you won't have much contamination and if you do you can just exclude the sample. If you can't exclude the sample, then you'll need to simultaneously align to both genomes (get one from Ensembl and the other from UCSC, so the chromosome names differ, and then concatenate them). Align against the concatenated genome and then extract only the human reads with some meaningful MAPQ threshold. One can get more elegant with this, but that should suffice 99.9% of the time.
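For the extraction step, a minimal sketch that works on plain SAM text might look like the following. It assumes you know which contig names belong to the human half of the concatenated reference (here, hypothetically, unprefixed Ensembl-style names for human and "chr"-prefixed names for mouse). The `min_mapq=10` default is only a placeholder, since MAPQ scales differ between aligners.

```python
def keep_human(sam_lines, human_contigs, min_mapq=10):
    """Yield SAM alignment lines that map to a human contig with MAPQ >= min_mapq.

    sam_lines: iterable of SAM lines (header lines starting with '@' are skipped).
    human_contigs: set of reference names from the human part of the
    concatenated genome (an assumption about how the reference was built).
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        rname = fields[2]       # column 3: reference name
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4:          # read is unmapped
            continue
        if rname in human_contigs and mapq >= min_mapq:
            yield line


sam = [
    "@SQ\tSN:1\tLN:1000",
    "r1\t0\t1\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",     # human, confident
    "r2\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",  # mouse
    "r3\t0\t1\t200\t1\t50M\t*\t0\t0\tACGT\tIIII",      # human, low MAPQ
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",          # unmapped
]
human_reads = [l.split("\t")[0] for l in keep_human(sam, {"1", "2", "X"})]
```

In practice you would probably do the same filter with samtools on the BAM file. The sketch is only meant to show the logic.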
BBSplit from BBMap was designed for this kind of situation: it bins reads by species, to the extent that they can be assigned by alignment. It is a one-step process.
Yes, but after I use BBSplit I get FASTQ files that I can remap with STAR. What about the counts, though? featureCounts doesn't work well with the BBSplit output.
So my question is: after BBSplit, what can I use to map the reads and calculate the counts?
Thanks
There should be no direct relation to the splitting. If your reads are not aligning to exons, then you can have issues with counting (assuming you are using a reference and annotation whose chromosome IDs match).
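One quick sanity check for the chromosome ID point is to compare the contig names in the annotation with those in the alignment header. The sketch below is a minimal version that works on plain text lines; the helper names are hypothetical, and it assumes a standard GTF (contig in column 1) and SAM header (`@SQ` lines with an `SN:` tag).

```python
def gtf_contigs(gtf_lines):
    """Collect contig names from column 1 of a GTF, skipping comments and blanks."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def sam_header_contigs(header_lines):
    """Collect reference names from the SN: tag of @SQ header lines."""
    names = set()
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


gtf = [
    "#!genome-build GRCh38",
    '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1";',
    'MT\tensembl\texon\t10\t90\t.\t+\t.\tgene_id "G2";',
]
header = ["@HD\tVN:1.6", "@SQ\tSN:chr1\tLN:1000", "@SQ\tSN:chrM\tLN:500"]

shared = gtf_contigs(gtf) & sam_header_contigs(header)
```

If `shared` comes back empty, as it would here ("1" vs "chr1"), the counting tool will not be able to assign any reads to features, no matter how well they aligned.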
When I use featureCounts, the percentage of reads assigned to exons is too low, while STAR reports mapping percentages greater than 80%, using the same annotation file. Why?
Thanks @Lando Ringel. One problem could be that (maybe very likely) a sequence was actually from mouse but it can be mapped to both human and mouse.