Question

Read Counts For Genes

1

Entering edit mode

12.7 years ago

Varun Gupta ★ 1.3k

Hi everyone

I am working on RNA SEQ data for pombe genome. I mapped the reads with the pombe genome and created sam/bam files.

Now my question is i have a list of pombe genes with its coordinates.

I want to find the number of read counts for those genes.

How can i do that. Any code would be really helpful.

I will explain a bit more.

Suppose my gene coordinate for gene SPAC227.11c is 567876 568100

Now i want to find the read coordinates for the gene SPAC227.11c.

How should i do it.

Hope to hear from you guys soon

Regards Varun

read counts • 8.2k views

ADD COMMENT • link updated 3.9 years ago by jerry ▴ 130 • written 12.7 years ago by Varun Gupta ★ 1.3k

Ram · Answer 1 · 2012-03-21

8

Entering edit mode

12.7 years ago

David Langenberger 11k

First of all you need a contig identifier, which is the header of the fasta file you mapped your data to (in human it would be the chromosome).

If your gene coordinate for SPAC227.11c is then something like:

contig = ??? (let's assume it is chr1 here)
start = 567876
end = 568100

Just call:

samtools view yourBAMfile.bam chr1:567876-568100

This command will give you all mapped read positions overlapping with SPAC227.11c.

samtools view yourBAMfile.bam chr1:567876-568100 | wc -l

will give you the read count then.

To make that thingy high throughput, you can try BEDtools:

bedtools multicov -bams yourBAMfile.bam -bed yourGenes.bed

This call will return your bed file with a new column at the end, containing the read counts overlapping with your genes.

You can create a bed out of your coordinates by trying something like this:

perl -ane 'print "yourContig\t$F[1]\t$F[2]\t$F[0]\t0\t*\n";' yourGenes.file > yourGenes.bed

ADD COMMENT • link updated 3.9 years ago by Ram 44k • written 12.7 years ago by David Langenberger 11k

0

Entering edit mode

actually the contig-name your read was mapped to is written in the third column of your sam file.

ADD REPLY • link 12.7 years ago by David Langenberger 11k

0

Entering edit mode

Hey David

This works. But i will explain you how my yourGenes.file looks like column1 = chromosome column2 = gene_start(ofcourse 1-based) column3 = gene_end(ofcourse 1 -based)

Now for a bed file we just want to change column2 such that we have value 1 less than the original number i.e if column2 = gene_start is 567876 then in the output of bed file it would be 567875 and column3 = gene_end would remain the same which is 568100.

Is Your perl code above doing the same thing?

Let me know if i am wrong somewhere.

Varun

ADD REPLY • link 12.7 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

Sorry, but I actually have no idea what you are talking about! ;)

But if you just want to subtract 1 of your start or end position, you can do it like this:

perl -ane 'print "$F[0]\t".($F[1]-1)."\t".($F[2]-1)."\t$F[0]\t0\t*\n";' yourGenes.file > yourGenes.bed

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.7 years ago by David Langenberger 11k

0

Entering edit mode

sorry David,

i have a bed file came from converting the accepted_hits.bam (tophat2 output) but in the column one there is no gene feature or just Latin numbers I, II, II, IV, so one...i used gtf file but i don't know where is the gene coordinate... then how i can get how many reads have mapped to each gene????

thank you

ADD REPLY • link 9.2 years ago by zizigolu ★ 4.3k

Ram · Answer 2 · 2012-03-21

3

Entering edit mode

12.7 years ago

Damian Kao 16k

You can use htseq-count. It will take a .gtf annotation file and your .sam file as input.

ADD COMMENT • link updated 3.9 years ago by Ram 44k • written 12.7 years ago by Damian Kao 16k

0

Entering edit mode

Hi Dk I don't have a gff file just the gene name and start and end coordinates.

ADD REPLY • link 12.7 years ago by Varun Gupta ★ 1.3k

score 1 · Answer 3 · 2020-12-14

1

Entering edit mode

3.9 years ago

jerry ▴ 130

Easiest way to get gene read counts is to use a program called featurecounts

Install:

$ sudo apt-get install -y subread

Create a gene annotation file in SAF format (mygenes.saf), which looks like so:

GeneID      Chr Start   End Strand
SPAC227.11c     chr1    567876  568100  +

And then run featurecounts (assuming you did paired-end sequencing). Example:

$ featureCounts -O -B --minOverlap 30 -F SAF -s 2 -p -a mygenes.saf -o output.featurecounts.txt my.bam

ADD COMMENT • link 3.9 years ago by jerry ▴ 130

0

Entering edit mode

I'd recommend using conda to install software, because it works across platforms and one does not need sudo privileges. apt-get is specific to Ubuntu and needs admin privilege, so it's not optimal.

ADD REPLY • link 3.9 years ago by Ram 44k