Question

SAM flags needed for paired-end strand-specific differential expression

0

Entering edit mode

5.4 years ago

Etstevens • 0

Hello! I analyzing paired-end RNA-seq reads made from a strand-specific dUTP Illumina library construction kit (Read2 corresponding to the original transcript orientation). These reads are from one organism that has a reference genome, and I am hoping to perform differential expression analyses with HT-seq. I have aligned my strand-specific reads to the genome with Bowtie2 (using the --fr --nofw flags), and now I need to pull the appropriate alignments out of the SAM output file.

I've read Istvan Albert's great post on using SAMtools to split strand-specific data, and looked at how SAM flags are visualized, but I am still lost. My original thought was to output SAM flag -3 (read paired, read mapped in proper pair), but does this not take in to consideration strand-specificity? What are the appropriate flags to pull out proper mate-pairs of reads that also correspond to the strand that it originated from? Is this extremely important for downstream HT-seq counts to get differential expression?

Thank you!

RNA-Seq SAM SAMTOOLS alignment • 1.7k views

ADD COMMENT • link updated 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.1k • written 5.4 years ago by Etstevens • 0

score 2 · Answer 1 · 2019-11-14

Hi a few things. There's a bit of confusion so I;m going to clear things up. I hope you don't mind the corrections.

differential expression analyses with HT-seq

HTSeq count script does not perform DE analysis. That's for tools like edgeR or DESeq2.

Bowtie2 (using the --fr --nofw flags)

Why are you excluding unpaired reads? IMO this is not needed for DE analysis. Also, just to confirm your organism is for example a bacterium that does not undergo mRNA splicing? If it does, you need to use a splice aware aligner such as STAR or HISAT2. Bowtie2 is not splice aware.

Is this extremely important for downstream HT-seq counts to get differential expression?

From my understanding no and... yes, it's complicated. You don't have to worry about filtering out non-stranded non-mate read pairs. HTSeq from my understanding uses a (relatively) simple counting method described here. Depending on what you select as the count method, it will change how unaligned reads are counted.

BTW, a very important flag is --stranded option. You should specify --stranded yes or --stranded reverse. reverse is for some illumina library prep that produces reads in reverse orientation (so forward = reverse, etc). Unsure what which your library is. Also you need to sort your sam files by read name before counting.

The reason I say no and yes is because you need a good reason for filtering out non-paired or non stranded reads. Anything performed upstream in the alignment/counting can have major ramification downstream in the DE analysis. From my reading the best approaches use very little filtering/trimming of the reads upstream. Whether a paired/non-paired should be counted/filtered is a question I don't know the answer to. But IMO you should count non-paired reads.

I hope that answers your questions.

score 1 · Answer 2 · 2019-11-15

1

Entering edit mode

5.4 years ago

Kristoffer Vitting-Seerup ★ 4.1k

I agree with @Amar - there is no need for requiring the paired end information but standness is important. But why not use some of the more modern pseudo aligners for mappung/quantification? You can read about such considerations here - they also give better DE results as I answered in this previous answer.

ADD COMMENT • link 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.1k