Question

unable to count reads that are aligned to coding region

0

Entering edit mode

6.2 years ago

shivyasoni1994 ▴ 10

I want to do differential-gene-expression analysis. I took the RNA-Seq (paired-end)data from BioProject(NCBI), and aligned the fastq files against coding region(cds) from ensemble, using tophat2, command=> tophat Homo_sapiens r1.fastq r2.fastq

the resultant BAM file looks like this =>

SRR7523525.12972962.1   339 ENST00000005286.8   1395    1   101M    =1379   -117    CAAGGAGAGCCGGGGCGCCCGGGGGGTGCGAGTGGACTTCTGGTGGCGCCGGCTCCGCGCCTCGCTGCGGCTGACCGTGTGGGCCCCGCTGCTACCGCTGC   @GDHHGHDHHCHDDHHDCDCC?HDHIIIHHHCIIHFHHHHFIHHD<CGHGHHFCDE<HDHHDDIIIIHGHHHIHIIIIIIHHDDGHHFG@GIIHGIDACDD   AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:87C13  YT:Z:UU NH:i:3  CC:Z:ENST00000453848.6  CP:i:1392   HI:i:0
SRR752525.24445544.1    83  ENST00000005286.8   1395    3   101M    =1319   -177    CAAGGAGAGCCGGGGCGCCCGGGGGGTGCGAGTGGACTTCTGGTGGCGCCGGCTCCGCGCCTCGCTGCGGCTGACCGTGTGGGCCCCGCTGCTACCGCCGC   HIIIIIIIIIIIIIIHHECIIIIIHIGIIIIHIIIIIIIIIIIIIHHIIIIHHDIHIIIIIIIIHIIHHIIHIIHIIIIIIIIIIIIIIIIIIIIIDDDDD   AS:i:-11    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:87C10T2    YT:Z:UU NH:i:2  CC:Z:ENST00000453848.6  CP:i:1392   HI:i:0

Now, i am unable to count the aligned-reads with HTSeq or featureCounts, as the BAM file contains only transcript_id ,,,,and the softwares count reads for gene_id or exon. I also set the -t option in HTSeq to transcript_id ,,and i also tried featureCounts with -g Parent parameter, but unable to count the reads. Please suggest me any possible solution , so that i can count my reads, and analyze that result for differential gene expression(in DESeq or edgeR).

thanks in advance for your kind support

RNA-Seq • 2.1k views

ADD COMMENT • link 6.2 years ago by shivyasoni1994 ▴ 10

4

Entering edit mode

This is a sub-optimal way to count your transcripts, because you will have plenty of reads that are mapped to more than one transcript and you can either count them all or filter them out. Other tools like RSEM try to estimate the proportion of reads that are originating from each transcript. Better yet, you can use newer tools like salmon.

If your aim is to do a standard gene-level differential expression analysis instead, you can map your reads to the genome and count them with a GTF file (for example from ensembl).

ADD REPLY • link 6.2 years ago by Martombo ★ 3.1k

1

Entering edit mode

Look at Figure 1 from Why you should use alignment-independent quantification to see Martombo point.

ADD REPLY • link 6.2 years ago by h.mon 35k

0

Entering edit mode

aligned the fastq files against coding region(cds) from ensemble

You should align to the reference genome ideally, or use tools specifically for transcript pseudo-alignment such as Salmon.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Thankyou for suggestions. I counted the reads using Salmon , in non-alignment-based mode , using transcripts (cdna from ensemble} as index.

As the output from Salmon,that is= quant.sf, contains counts for whole transcripts (non-translated rna +translated rna).

Can ,counts for other RNAs be filtered out from quant.sf , so that ,,, only counts for mRNA is left in quant.sf(output).If this is possible, kindly suggest me the process.

Please correct me, if I am wrong. Thanks for your support.

ADD REPLY • link 6.2 years ago by shivyasoni1994 ▴ 10

2

Entering edit mode

You can use biomRt to get "gene_biotype" and filter quant.sf accordingly.

ADD REPLY • link 6.2 years ago by h.mon 35k

0

Entering edit mode

Thanks for the suggestion!!

ADD REPLY • link 6.2 years ago by shivyasoni1994 ▴ 10