10.2 years ago
thjnant
Hello,
I am going through the STAR_2PASS of the GATK pipeline to get SNPs out of RNA-seq data.
I have run the first round of alignment for my 6 samples. Now I am at the second round, where I must run this command:
genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>
For this option:
--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab
Should I use the SJ.out.tab file from just one of my samples for all of them, or should I use each sample's own file?
Thanks in advance
I would think that you'd get the best results from merging the SJ.out.tab files from all of your samples and then using the merged file.
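In case it helps, here is a rough sketch of what merging could look like. It assumes the usual SJ.out.tab layout, where the first four tab-separated columns are chromosome, intron start, intron end and strand code, and it keeps one line per distinct junction. The function name merge_sj and the output file name are made up for illustration, and no filtering on read support is done here.

```shell
# Hypothetical helper (not part of STAR): merge several first-pass
# SJ.out.tab files into one de-duplicated junction list.
# Assumes columns 1-4 are chromosome, intron start, intron end, strand code.
merge_sj() {
    awk 'BEGIN { OFS = "\t" }
         {
             key = $1 OFS $2 OFS $3 OFS $4
             # Print each junction only the first time it is seen.
             if (!seen[key]++) print key
         }' "$@"
}

# Example usage (paths are placeholders):
#   merge_sj /path/to/1pass/sample*/SJ.out.tab > merged.SJ.tab
#   ... --sjdbFileChrStartEnd merged.SJ.tab ...
```

As far as I know, STAR's --sjdbFileChrStartEnd can also take several files at once, so it is worth checking the manual for your version before merging by hand.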
Or by running the first pass of STAR on a large subset of your entire dataset (FASTQ files from several representative samples, or from all of them).
Yup and that'd probably be a bit faster since you don't need all of the instances to run to completion. Do you happen to know if anyone's looked for an optimal subset percentage? While the real value will vary, I expect there's a decent ball-park starting place to be found (perhaps as a function of total number of reads).
If you believe the old RUM paper, perhaps 40-100M reads will get you the vast, vast majority of splice junctions that are available in a dataset. One can always test by simply staging the analysis. Run 5%, 10%, 15%, etc. to see where the return plateaus, but that is probably overkill.
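If someone does want to try the staged approach, a minimal sketch for checking where the junction count levels off might look like this. It assumes you already have one SJ.out.tab per staged run (5%, 10%, and so on) and that the first four columns identify a junction. The name count_junctions and the directory names in the usage line are hypothetical.

```shell
# Hypothetical helper: for each SJ.out.tab given, print the file name and
# the number of distinct junctions (columns 1-4), tab-separated.
count_junctions() {
    for f in "$@"; do
        n=$(cut -f1-4 "$f" | sort -u | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$f" "$n"
    done
}

# Example usage (directory names are placeholders):
#   count_junctions run_05pct/SJ.out.tab run_10pct/SJ.out.tab run_15pct/SJ.out.tab
```

Once the counts stop growing much between stages, adding more reads to the first pass is probably not buying many new junctions.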
The rarefaction curve route would end up taking as long as just processing everything at once (well, unless you really had a LOT of samples). 40-100M reads seems reasonable.