So I am transitioning into using STAR as our primary aligner for RNA-seq workflow and was curious if there's a way to add RG without knowing what it is beforehand - primarily because I am working with publicly available TCGA fastq files. These were not sequenced in-house.
I see --outSAMattrRGline and --outSAMreadID under STAR options but 1) the file needs to be in SAM/BAM format to extract RG (?) but our premapped files are in fastq format and 2) to add RGID I need to manually load (have prior knowledge of the RGID of our samples).
I looked up the filename format for TCGA files but it does not involve specific read group info. I am wondering if there's a way to extract RGID from fastq file and automatically input into the aligner for downstream analysis using GATK.
Oh okay, I see. Double checking - is it recommended/preferred to have the same RG for paired-end reads? I will ultimately have them merged into a single bam for variant calling downstream.
It is my understanding (although refer to Devon when he comes back) that the RGID is supposed to be unique to the lane/chip/sequencer as its the primary unit GATK uses to model noise in the data.
The samples ID (RGSM) needs to be the same for all samples you want to merge together and call SNPs on.
So my RGID is like "RG:Z:DBV2SVN1.D263EACXX.7.CTTGTA" in one bam,
"RG:Z:DBV2SVN1.D263EACXX.3.AATTCA" in another bam, (same chip, different lane and multiplex barcode),
but in BOTH bams the RGSM is "C57BL/6J" because they're both from the same isogenic mouse line.
When i merge them together, GATK will (hopefully) be able to call SNPs right... lets see :P
I'm in complete agreement with all of your comments :) For samples that you download you typically just use the SRA/ENA/whatever accession as the sample ID and library, unless the metadata online happens to suggest something more reasonable.
Since these files are single samples, I think I'm going to add an arbitrary RGID after merging paired-ends using picard (addorreplacereadgroup function). For variant calling (we're looking at rare, uncharacterized somatic mutations) is it preferred to merge all my samples into a single bam or call variants individually/separately?
Again, im not the authority on this - just someone who's read the docs a lot lately.
My understanding is that it's better to do joint calling than to do the calling individually and then look at overlapping (or all) SNPs called. To do joint calling however, you don't have to merge the BAM - you just have to make gVCFs for each BAM, and then run the joint caller on those gVCFs to 'merge' them at the variant level. However, as I found out the hard way, those gVCFs will have the sample ID in them, and they need to be the same for them to be joined.
So yeah, just give everything you want to treat as "the same" (genetically) the same sample ID, and you should be fine.
Awesome. The info will be very helpful. Thank you! Upvoted.
You should be aligning mates in a pair at the same time...