Extracting Read Count For Each Gene/Exon From Rna-Seq Bam Files
2
21
Entering edit mode
12.5 years ago
Sahel ▴ 310

Hello all,

I have some RNA-seq bam files which I would like to submit to edgeR for differential expression analysis. My problem is how to extract the read count for each gene/exon from the bam files. I have followed what explained here "http://seqanswers.com/wiki/How-to/RNASeq_analysis" but for some reason that I could not figure out, I only get 0 for the read counts. I also tested "samtools idxstats aln.bam" but it only ruterns read counts for each chromosome not genes. So I was wondering if there is an other way for extraction of read counts from bam files? Also it would be very appreciated if you could give an idea what might have gone wrong with the other method that I used and got only 0.

Thanks in advance :)

rna-seq differential expression • 46k views
ADD COMMENT
0
Entering edit mode

It would be helpful to paste in the code you ran along with the result of sessionInfo().

ADD REPLY
0
Entering edit mode

Hi there, I have a problem along the same line. I have used Trinity to do a de novo assembly and now I want to generate the counts from all my experimental cells using the de novo assemly as my reference. I want to use these counts doing differential expression work using edgeR, say. So in short, I have a de novo reference transcriptome in fasta format, and PE fastq files from each of my 30 experimental cells. I have not figured out if this is suitable for HTSeq, or if there is any other software suitable out there. Comments/suggestions appreciated. Thanks. jahn

ADD REPLY
9
Entering edit mode
12.5 years ago
Rm 8.3k

Try htseq-count: it will take annotation file like gtf or bed and your sam file

htseq-count [options] <sam_file> <gff_file>
ADD COMMENT
0
Entering edit mode

Hi, Rm

I am wondering how to extract raw reads from the de novo assembly without a gff file available?Thanks!

ADD REPLY
0
Entering edit mode

@ Baoqing: Can you post it as saperate question to receive the answers

ADD REPLY
0
Entering edit mode

oh, sorry, I will do that, thanks!

ADD REPLY
4
Entering edit mode
12.5 years ago

You could also try multicov from BEDtools:

bedtools multicov –bams run.bam -bed genes.bed

If you have more than one BAM file, you can easily get the read counts for all of them:

bedtools multicov –bams run1.bam run2.bam run3.bam -bed genes.bed

It can also handle 12 column BED files (with intron-exon structure, just counting the reads overlapping with the exons)... just download the gene annotations for your species and assembly, using the UCSC Genome Browser (Tables) and safe it as BED file. It's just that simple.

ADD COMMENT
0
Entering edit mode

Hi David, Thanks for the comment. I followed exactly your instructions... my data is hg18 so I got the bed file from UCSC for hg18 and ran: ./BEDTools-Version-2.16.2/bin/bedtools multicov -bams TAnalysis/sample1/sample1.sorted.bam -bed hg18.BED > Sample1_readCounts.txt but it immediately terminates and gives me all 0s in the readcount file... How I can check what is wrong ??? Thank you so much for your help :)

ADD REPLY
0
Entering edit mode

Did you sort and index your bam file? And did you select "output in bed format" in the Table Browser?

ADD REPLY
0
Entering edit mode

Yes, I did sort and index my bam files by doing this:

samtools sort sample1-3-lanes.bam sample1.sorted
samtools index sample1.sorted.bam

But when I look at the header of the sorted bam, it does not really look sorted!

samtools view -H sample1.sorted.bam | less  -- 
@HD VN:1.0 SO:coordinate @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:1 LN:247249719 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:M LN:16571 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954.

Is this the source of problem? Do you know other tools for sorting bams? Thanks alot... :)

ADD REPLY
0
Entering edit mode

'samtools sort' sorts the mappings by their position and not the header. So, 'bedtools multicov' does not throw any error message? It just terminates and then there are only 0s in the last column? Have you checked only the head of your file, or the complete output? 'multicov' is very fast, so perhaps for the first 1000 entries, there are just no overlaps. You could, by hand, check a gene, that has for sure some overlapping mappings.

ADD REPLY
0
Entering edit mode

"It can also handle 12 column BED files (with intron-exon structure, just counting the reads overlapping with the exons)... "

From what I tested (see below), it's not true; bed12 format is treated same as bed6 (i.e., counting reads overlapping with the whole transcript, not just exons).

$ grep ENST00000002596.5 gencode.v14.annotation.bed15 | cut -f1-12 | bedtools multicov -D -bams ~/scratch/Sample_3576_H_01/accepted_hits.bam -bed stdin
chr41139477311431389ENST00000002596.50-1140070511401629026964,1067,0,35549,518
$ grep ENST00000002596.5 gencode.v14.annotation.bed15 | cut -f1-6 | bedtools multicov -D -bams ~/scratch/Sample_3576_H_01/accepted_hits.bam -bed stdin
chr41139477311431389ENST00000002596.50-518
$ grep ENST00000002596.5 gencode.v14.annotation.gtf | grep exon | bedtools multicov -D -bams ~/scratch/Sample_3576_H_01/accepted_hits.bam -bed stdin
chr4HAVANAexon1143032311431389.-.gene_id "ENSG00000002587.5"; transcript_id "ENST00000002596.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HS3ST1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HS3ST1-001"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS3408.1"; havana_gene "OTTHUMG00000090547.4"; havana_transcript "OTTHUMT00000207073.3";33
chr4HAVANAexon1139477411401737.-.gene_id "ENSG00000002587.5"; transcript_id "ENST00000002596.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HS3ST1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HS3ST1-001"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS3408.1"; havana_gene "OTTHUMG00000090547.4"; havana_transcript "OTTHUMT00000207073.3";415</p>

btw, I am using bedtools v2.16.2.

ADD REPLY
1
Entering edit mode

OK. I saw bedtools v2.17.0 has already added the option -split to fix the problem. Thanks

ADD REPLY
0
Entering edit mode

Hi David. Thanks so much for helping us. I tried to run this command But I did get results. I believe that the problem is the bed file. I would to know what the next columns I need to include to get my table in ucsc and then my genes.bed

name Name of gene chrom Reference sequence chromosome or scaffold strand + or - for strand txStart Transcription start position txEnd Transcription end position cdsStart Coding region start cdsEnd Coding region end exonCount Number of exons exonStarts Exon start positions exonEnds Exon end positions proteinID UniProt display ID, UniProt accession, or RefSeq protein ID alignID Unique identifier (GENCODE transcript ID for GENCODE Basic

Thanks

ADD REPLY

Login before adding your answer.

Traffic: 1553 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6