Question

how can I convert a table of EXON counts (coordinate) to a table of GENE or TRANSCRIPT counts (eg Summarized Experiment in R)?

0

Entering edit mode

8.0 years ago

trstueve • 0

Greetings,

Briefly, my question is, "how can I covert a (samples x feature) table of counts by EXON to a table of counts by GENE or TRANSCRIPT (Ideally a ‘Summarized Experiment’ in R)? I am new to RNA-seq and bioinformatics generally. I have inherited a standard 'samples by features' table of raw RNA-seq counts from a colleague, and would like to perform a straightforward differential gene expression (DEG) analysis with this file, e.g. w/DESeq2 or EdgeR, which I have done previously starting from BAM files.

However, the dataset I inherited and am inquiring about is an excel file, representing an exonic 'table of counts', in which > 500 TCGA sample IDs are arrayed in columns, and ~240k exons are arrayed in rows (each row contains GRanges info for exonic feature it represents: chr; hg19 coordinate start; stop; strand). Raw counts fill the table. My question is how should I convert this 'Table of EXON counts' into a 'Table of Gene Counts' (or Transcript)? I do not have access to the original BAM files. There are also overlapping exons (pls see example).

My guess is the pros would recommend I treat each exon as a ‘feature’ and perform a ‘differential exon expression’ analysis. Concerns w/suggestions of a ‘differential exon expression’ analysis are: (a) most R packages require BAMs as input (b) not sure I have the computational or mental bandwidth for DEG of 240k exons and 500 samples. Differential gene/transcript expression is really my goal. I use R for most projects, but can run command-line stuff, bedtools etc. I am totally unversed in Python. I have banged around in the ‘TCGA2STAT” R package to get raw counts by gene for my research question, but there are only 162 samples available, and the boss would prefer analysis for the 500 samples in the behemoth exonic counts file I inherited. As always, your help (and patience w/the green people) is appreciated.

sample_1 sample_2 sample_3 sample_4 sample_5 sample_6

chr1:12595-12721:+ 5 6 0 19 4 5

chr1:12613-12721:+ 5 6 0 18 4 5

RNA-Seq TCGA Assembly • 2.6k views

ADD COMMENT • link updated 8.0 years ago by anp375 ▴ 190 • written 8.0 years ago by trstueve • 0

0

Entering edit mode

I'm not sure why this was deleted. Since it might be useful for everyone I reopened this post. Please let me know if you or someone else disagrees.

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2016-11-29

Your exons are based on coordinates. You may need to compromise.

Either you assign counts to multiple genes, or you somehow make a choice.

First, you need a bedfile. Are your coordinates 0-based or 1-based? You may need a gtf to find out. You also need to find out what annotation source your coordinates came from if they weren't just determined. The following will work with an ensembl gtf for GRCh37.

Grep for one of your coordinates (12595\t12721) in the gtf. If you find it, then your coordinates are 1-based. If you don't, the grep for 1+the start position (12596-12721). If you find that, then your coordinates are 0-based. They should be 0-based.

Make the GTF 0-based as well. Subtract 1 from each start position.

Convert your coordinates to a bed file. Get rid of the "chr". cat counts.file | tr -d 'chr' | tr ':' '\t' | tr '-' '\t' > bedfilewithcounts.bed

One option is to load this file and exons from the gtf into R and intersect the start and stop positions using %in%. grep -e $'\texons\t' EnsemblGTF | sort > exons.gtf Then you can assign a gene to each exon.

If that doesn't work for whatever reason, then you can do this: grep -e $'\tgene\t' EnsemblGTF | sort > genes.gtf bedtools intersect -a bedfilewithcounts.bed -b genes.gtf -wao -bed | tr -d '"' | tr ';' '\t' > intersect.file cut -f 1-165,167,174,175,179 temp.bed | tr ' ' '\t' > genesandcounts.txt

Now you may need to decide which gene to assign an exon to, and add the counts to it.. If you want, you can go back to being 1-based after this. Just add 1 to each start position.

Let me know if this works.