SAM flags needed for paired-end strand-specific differential expression
2
0
Entering edit mode
5.0 years ago
Etstevens • 0

Hello! I analyzing paired-end RNA-seq reads made from a strand-specific dUTP Illumina library construction kit (Read2 corresponding to the original transcript orientation). These reads are from one organism that has a reference genome, and I am hoping to perform differential expression analyses with HT-seq. I have aligned my strand-specific reads to the genome with Bowtie2 (using the --fr --nofw flags), and now I need to pull the appropriate alignments out of the SAM output file.

I've read Istvan Albert's great post on using SAMtools to split strand-specific data, and looked at how SAM flags are visualized, but I am still lost. My original thought was to output SAM flag -3 (read paired, read mapped in proper pair), but does this not take in to consideration strand-specificity? What are the appropriate flags to pull out proper mate-pairs of reads that also correspond to the strand that it originated from? Is this extremely important for downstream HT-seq counts to get differential expression?

Thank you!

RNA-Seq SAM SAMTOOLS alignment • 1.6k views
ADD COMMENT
2
Entering edit mode
5.0 years ago
Mark ★ 1.6k

Hi a few things. There's a bit of confusion so I;m going to clear things up. I hope you don't mind the corrections.

differential expression analyses with HT-seq

HTSeq count script does not perform DE analysis. That's for tools like edgeR or DESeq2.

Bowtie2 (using the --fr --nofw flags)

Why are you excluding unpaired reads? IMO this is not needed for DE analysis. Also, just to confirm your organism is for example a bacterium that does not undergo mRNA splicing? If it does, you need to use a splice aware aligner such as STAR or HISAT2. Bowtie2 is not splice aware.

Is this extremely important for downstream HT-seq counts to get differential expression?

From my understanding no and... yes, it's complicated. You don't have to worry about filtering out non-stranded non-mate read pairs. HTSeq from my understanding uses a (relatively) simple counting method described here. Depending on what you select as the count method, it will change how unaligned reads are counted.

BTW, a very important flag is --stranded option. You should specify --stranded yes or --stranded reverse. reverse is for some illumina library prep that produces reads in reverse orientation (so forward = reverse, etc). Unsure what which your library is. Also you need to sort your sam files by read name before counting.

The reason I say no and yes is because you need a good reason for filtering out non-paired or non stranded reads. Anything performed upstream in the alignment/counting can have major ramification downstream in the DE analysis. From my reading the best approaches use very little filtering/trimming of the reads upstream. Whether a paired/non-paired should be counted/filtered is a question I don't know the answer to. But IMO you should count non-paired reads.

I hope that answers your questions.

ADD COMMENT
1
Entering edit mode
5.0 years ago

I agree with @Amar - there is no need for requiring the paired end information but standness is important. But why not use some of the more modern pseudo aligners for mappung/quantification? You can read about such considerations here - they also give better DE results as I answered in this previous answer.

ADD COMMENT

Login before adding your answer.

Traffic: 2568 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6