Hello,
I am performing RNA-seq using DESeq2 that requires un-normalized raw reads data as input from the BAM files. In order to simplify my problem, I extracted the ERG gene from the BAM file (sampleA.bam sorted by position) as follow:
- Find the position ERG on the genome from GENCODE:
chr21 HAVANA gene 39751949 40033704 . - . gene_id "ENSG00000157554.18_3"; gene_type "protein_coding"; gene_name "ERG"; level 2; havana_gene "OTTHUMG00000090767.3_3"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
Extract the region from the BAM:
samtools view -h sampleA.bam "chr21:39751949-40033704" > ERG_sampleA.sam
Convert SAM to BAM
samtools view -Sb ERG_sampleA.sam > ERG_sampleA.bam
Display on IGV with the counts to verify:
Zoom in:
So, we can expect to obtain reads when we count using HTSeq-count [Python] and GenomicAlignements [R]. PS: as per definition, in the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons.
Counting reads with HTSeq-count with Python
The following command is issued to count the raw reads of sampleA:
htseq-count -f bam -m union -i gene_id -r pos -s yes ERG_sampleA.bam gencode.v28lift37.annotation.gtf > erg.txt
and this is the following output (removed all IDs that had 0 reads):
ENSG00000157554.18_3 28 <- ERG gene
ENSG00000231480.1_3 275 <- SNRPGP13 gene in ERG region
__no_feature 48474
__ambiguous 0
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 105
Counting with GenomicAlignements with R:
genomeAnnot <- "gencode.v28lift37.annotation.gtf"
txdb <- makeTxDbFromGFF(genomeAnnot, format = "gtf", circ_seqs = character())
ebg <- exonsBy(txdb, by="gene")
sampleA <- "ERG_sampleA.bam"
params <- ScanBamParam(what = scanBamWhat(), flag = scanBamFlag(isPaired = T, isProperPair = T, isUnmappedQuery = F, hasUnmappedMate = F, isSecondaryAlignment = F, isNotPassingQualityControls = F, isDuplicate = F), mapqFilter = 10)
countingReads <- summarizeOverlaps(features=ebg, reads=sampleA, mode="Union", singleEnd=FALSE, fragments=TRUE, ignore.strand=F, param = params)
and this is the following output:
One expects that we obtain similar number of counted reads between the two methods. So, I would highly appreciate if someone can give me an explanation on why there are such differences. Thank you in advance.
Other things I tried:
try adding
--stranded=no
to your htseq command and also in R, you are filtering reads by q-value (10). You can also try removing it when you are comparing.I was trying to reproduce the your observations with commands you have furnished with a sample file I have. I made following changes:
Removed few params:
params <- ScanBamParam(what = scanBamWhat(), flag = scanBamFlag(isUnmappedQuery = F, isSecondaryAlignment = F, isNotPassingQualityControls = F, isDuplicate = F))
Results are below. Please note that results exactly do not match. With few more param optimizations one can get matching numbers:
delete your account
how to delete account
Thanks for the reply !
Your comment has solved my problem and as you have said, turning off stranded mode, I will obtain more reads counted on the gene. So, I would like to ask you why does this option outputs very few reads when my samples are sequenced in paired-end and stranded ?
Thanks a lot @cpad0112
copy/pasted from what is difference between strand specific in rnaseq analysis from htseq-count (upvote OP: Devon Ryan ) :
Another method is to use http://rseqc.sourceforge.net/#infer-experiment-py . This will give you a likelihood how the strandedness of the transcripts correlate with the strandedness of the reads.
Thanks for replying.
But my samples are stranded so why should I put
--stranded = no
but I will give it a try.