I want to do differential-gene-expression analysis. I took the RNA-Seq (paired-end)data from BioProject(NCBI), and aligned the fastq files against coding region(cds) from ensemble, using tophat2, command=> tophat Homo_sapiens r1.fastq r2.fastq
the resultant BAM file looks like this =>
SRR7523525.12972962.1 339 ENST00000005286.8 1395 1 101M =1379 -117 CAAGGAGAGCCGGGGCGCCCGGGGGGTGCGAGTGGACTTCTGGTGGCGCCGGCTCCGCGCCTCGCTGCGGCTGACCGTGTGGGCCCCGCTGCTACCGCTGC @GDHHGHDHHCHDDHHDCDCC?HDHIIIHHHCIIHFHHHHFIHHD<CGHGHHFCDE<HDHHDDIIIIHGHHHIHIIIIIIHHDDGHHFG@GIIHGIDACDD AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:87C13 YT:Z:UU NH:i:3 CC:Z:ENST00000453848.6 CP:i:1392 HI:i:0
SRR752525.24445544.1 83 ENST00000005286.8 1395 3 101M =1319 -177 CAAGGAGAGCCGGGGCGCCCGGGGGGTGCGAGTGGACTTCTGGTGGCGCCGGCTCCGCGCCTCGCTGCGGCTGACCGTGTGGGCCCCGCTGCTACCGCCGC HIIIIIIIIIIIIIIHHECIIIIIHIGIIIIHIIIIIIIIIIIIIHHIIIIHHDIHIIIIIIIIHIIHHIIHIIHIIIIIIIIIIIIIIIIIIIIIDDDDD AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:87C10T2 YT:Z:UU NH:i:2 CC:Z:ENST00000453848.6 CP:i:1392 HI:i:0
Now, i am unable to count the aligned-reads with HTSeq or featureCounts, as the BAM file contains only transcript_id ,,,,and the softwares count reads for gene_id or exon. I also set the -t option in HTSeq to transcript_id ,,and i also tried featureCounts with -g Parent parameter, but unable to count the reads. Please suggest me any possible solution , so that i can count my reads, and analyze that result for differential gene expression(in DESeq or edgeR).
thanks in advance for your kind support
This is a sub-optimal way to count your transcripts, because you will have plenty of reads that are mapped to more than one transcript and you can either count them all or filter them out. Other tools like RSEM try to estimate the proportion of reads that are originating from each transcript. Better yet, you can use newer tools like salmon.
If your aim is to do a standard gene-level differential expression analysis instead, you can map your reads to the genome and count them with a GTF file (for example from ensembl).
Look at Figure 1 from Why you should use alignment-independent quantification to see Martombo point.
You should align to the reference genome ideally, or use tools specifically for transcript pseudo-alignment such as Salmon.
Thankyou for suggestions. I counted the reads using Salmon , in non-alignment-based mode , using transcripts (cdna from ensemble} as index.
As the output from Salmon,that is= quant.sf, contains counts for whole transcripts (non-translated rna +translated rna).
Can ,counts for other RNAs be filtered out from quant.sf , so that ,,, only counts for mRNA is left in quant.sf(output).If this is possible, kindly suggest me the process.
Please correct me, if I am wrong. Thanks for your support.
You can use biomRt to get "gene_biotype" and filter quant.sf accordingly.
Thanks for the suggestion!!