Htseq-count result assistance
1
0
Entering edit mode
5.8 years ago
Liftedkris ▴ 30

i want to use Htseq-count to be used to count the number of reads for the features i have. i have run the program and got the following as part of my result.

ENSG00000202526 0
ENSG00000206882 0
ENSG00000207129 0
ENSG00000207186 0
ENSG00000207277 0
ENSG00000207341 0
ENSG00000210082 5001
ENSG00000211459 1006
ENSG00000212138 0
ENSG00000212154 0
ENSG00000212171 0
ENSG00000212176 0
ENSG00000212204 0
ENSG00000212237 0

i got up to 500 genes but only about 4 or 5 has counts greater than one. is this correct or have i done something wrongly. thanks so much for your assistance

RNA-Seq • 2.1k views
ADD COMMENT
1
Entering edit mode

i got up to 500 genes

You get exactly the number of genes which were present in your gtf annotation file.

ADD REPLY
1
Entering edit mode

It is not uncommon that many genes are lowly or not expressed in RNA-seq. If you want beyond-eyeballing idea, load the counts into R (or similar) and create a histogram of the counts. It should approximately follow a negative binomial distribution.

ADD REPLY
0
Entering edit mode

Are you suggesting that nothing is wrong with my analysis? I re-downloaded a new gtf file and got more counts between 1 and 3 but most genes still returned a count of zero

ADD REPLY
1
Entering edit mode

Its possible nothing is wrong. You really need the full numbers. Depending on your GTF, most genes will have a count of zero, even in a good sample.... but how many can still be a mesaure of how good the sample is. If you are using the full ensembl GTF, which, I believe has about 50,000 genes in it, I'd probably expect that around 20k or so of them to be non-zero.

If you don't want to look at the full distribution in something like R, try the following:

awk '$2 > 0' counts.tsv | wc -l
ADD REPLY
0
Entering edit mode

Thanks very much sir! I have a total of 6601 genes. I am thinking this number represents the total genes present in my gtf file... the ones with values greater than zero should be the genes present in my sample. Is this correct?

ADD REPLY
0
Entering edit mode

Thanks very much sir! I have a total of 6601 genes.

You would expect more genes in a human genome... how did you obtain this gtf?

ADD REPLY
0
Entering edit mode

I obtained it from ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/

I downloaded the 3rd one

ADD REPLY
0
Entering edit mode

And where did you get the genome reference (fasta) from. The one you used to align the reads?

ADD REPLY
0
Entering edit mode

Check your alignment stats. While aligning RNA-SEQ data to all patches, haplotype annotations you might get a lot of multi-mapping reads which will not be counted with default Htseq-count settings.

ADD REPLY
0
Entering edit mode

Getting zero for a lot of genes is expected as not all genes are expressed in all tissues.

ADD REPLY
0
Entering edit mode

Hi @citynsukka I edit your post to make the list of counts format correctly. You can do this by highlighting the part of the post to be formatted as-is and pressing the code button (labelled 101010). I've also changed the category - the tool category is used to announce/promote new tools.

ADD REPLY
2
Entering edit mode
5.8 years ago
michael.ante ★ 3.9k

Hi citynsukka,

The genes with a read count greater than 0 are MT-genes. Please check if your GTF is consistent with your reference used. Use head to get the chromosome names out of yor GTF file and samtools view -H to get the names out of your bam file.

Cheers,

Michael

ADD COMMENT

Login before adding your answer.

Traffic: 3550 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6