I have 2x75 bp TruSeq stranded RNA-Seq data from rat samples, collected on an Illumina NextSeq machine. I have removed adapters from the FASTQ files and quality-trimmed them using Trimmomatic. I'd like to align them using STAR and generate count matrices for downstream differential expression analysis. I am confused about which options to use during the STAR alignment.
Here is what I have:
STAR --genomeDir $STARINDICES/ \
--readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
--outFileNamePrefix out_ \
--runThreadN 4 \
--outSAMattrRGline ID:"sample1" SM:"sample1" LB:"sample1" PL:"ILLUMINA" \
--outBAMsortingThreadN 4 \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--readFilesCommand zcat \
--chimSegmentMin 20 \
--genomeLoad NoSharedMemory
Specifically, am I correct to select these three options?
--outSAMunmapped Within  # output unmapped reads within the main SAM file.
--outSAMstrandField intronMotif  # strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.
--outFilterIntronMotifs RemoveNoncanonicalUnannotated  # filter out alignments that contain non-canonical unannotated junctions when using an annotated splice junctions database. Annotated non-canonical junctions are kept.
I will be using htseq-count or featureCounts (but may use Cufflinks as well) to generate expression counts.
Have I missed anything? And do I need to modify the resulting BAM file in any way before using it as input for htseq-count / featureCounts?
Thanks.
If you have not already done so, refer to section 3.2.2 of the STAR manual, "ENCODE options" (for the long RNA-Seq pipeline).
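As a rough sketch, the options listed under "ENCODE options" in the STAR manual can be collected in one variable and appended to a command like the one in the question. The variable name `ENCODE_OPTS` is only an illustrative choice, and the values are the ones I recall from the manual, so check them against the manual for your STAR version. The command is printed rather than executed so you can inspect it first.

```shell
# Alignment-filter options as listed under "ENCODE options" in the STAR manual
# (verify against the manual shipped with your STAR version).
ENCODE_OPTS="--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000"

# Dry run: print the full command instead of running it.
# Remove the leading "echo" to actually start STAR.
# $ENCODE_OPTS is left unquoted on purpose so it splits into separate arguments.
echo STAR --genomeDir "$STARINDICES/" \
  --readFilesIn sample1_read1.fq.gz sample1_read2.fq.gz \
  --readFilesCommand zcat \
  --runThreadN 4 \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix out_ \
  $ENCODE_OPTS
```

The remaining options from the question (`--outSAMunmapped Within`, `--outSAMattrRGline`, and so on) can be added to the same command unchanged.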
If you want to read more about the latest ENCODE options for RNA-Seq, you will find documentation here.
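For the counting step, here is a minimal sketch of the two commands, assuming the standard TruSeq Stranded (dUTP) protocol, which is reverse-stranded, and a GTF annotation at the hypothetical path `annotation.gtf`. The BAM name is what STAR writes for the prefix `out_` with `--outSAMtype BAM SortedByCoordinate`. The commands are only printed here; run them once the paths match your setup.

```shell
BAM=out_Aligned.sortedByCoord.out.bam   # STAR output for --outFileNamePrefix out_
GTF=annotation.gtf                      # hypothetical path to the rat annotation

# featureCounts: -p marks the data as paired-end, -s 2 means reverse-stranded.
# Recent Subread releases may also need --countReadPairs to count fragments.
FC_CMD="featureCounts -p -s 2 -T 4 -a $GTF -o counts.txt $BAM"

# htseq-count: -r pos for a coordinate-sorted BAM, -s reverse for TruSeq Stranded.
HTSEQ_CMD="htseq-count -f bam -r pos -s reverse $BAM $GTF"

# Dry run: print both commands for inspection.
echo "$FC_CMD"
echo "$HTSEQ_CMD"
```

As far as I know, the coordinate-sorted BAM from STAR can be passed to either tool without further modification. The STAR manual also notes that `--outSAMstrandField intronMotif` is intended for Cufflinks on unstranded data; for stranded libraries it says no special STAR option is needed and the library type is set in Cufflinks itself (`--library-type fr-firststrand` for dUTP protocols).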