Question

Understanding the output from HTseq-count

4.3 years ago

r.barton17 • 0

Hello,

I'm currently analysing some RNAseq data for differential gene expression analysis. I have completed the alignment stage using RNA STAR and have counted the reads mapped to each gene using HTseq-count, but I am having real trouble understanding the output files. The file extension is ".counts", which doesn't help much, but I can open the files in R as tab-delimited files. However, the files don't seem to have any column names, so I have no idea what each of the columns contains. I've looked through the documentation and online, and I can't find anything that tells me exactly what they are. This is the first line of the file, i.e. the first entry for each of the 16 columns:

A00917:211:H35WCDSXY:2:1156:23773:13855 99 chr1 14481 255 150M chr1 14616 285 GGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCACCCCAGCTGTGTGGCCTCAGGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCGCACAGTGCTGGTTCCATCACCCCCACCCAGGGAAGCAGGTCTGAGCAGCTTGTCCT FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:282 nM:i:8 XF:Z:ENSG00000227232

Does anyone have experience using HTseq-count and can tell me what this information means? The files are also massive (~24 GB). Is that normal? Thank you!

RNA-Seq • 2.6k views

updated 4.3 years ago by swbarnes2 14k • written 4.3 years ago by r.barton17 • 0

score 1 · Answer 1 · 2020-07-21

4.3 years ago

2nelly ▴ 350

This looks like the alignment file. Are you sure that you are looking at the HTseq output? Normally, you should get a two-column file (gene name or ID, and counts per gene).
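For comparison, here is a minimal Python sketch of what reading the expected two-column table could look like. The gene IDs and counts in `EXAMPLE` are made up for illustration, and `read_htseq_counts` is a hypothetical helper name. It assumes the usual htseq-count layout, where the trailing summary rows (such as `__no_feature` and `__ambiguous`) start with a double underscore.

```python
import csv
import io

# Made-up excerpt of an htseq-count table (illustrative values only).
EXAMPLE = (
    "ENSG00000223972\t0\n"
    "ENSG00000227232\t12\n"
    "__no_feature\t340\n"
    "__ambiguous\t7\n"
)


def read_htseq_counts(handle):
    """Split a two-column count table into per-gene counts and summary counters."""
    genes, summary = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        name, count = row[0], int(row[1])
        # Summary rows such as __no_feature start with a double underscore.
        target = summary if name.startswith("__") else genes
        target[name] = count
    return genes, summary


genes, summary = read_htseq_counts(io.StringIO(EXAMPLE))
print(genes)
print(summary)
```

If a file does not parse this way (for example, it has 16 columns per line), it is probably not the count table.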

score 1 · Answer 2 · 2020-07-21

4.3 years ago

Shalu Jhanwar ▴ 540

The output file size is huge and is not normal. Could you share the command line used to generate counts from HTseq?

score 0 · Answer 3 · 2020-07-21

Despite the name, that file is a SAM file, so it's supposed to be massive. (It also really should be converted to a compressed BAM file.)
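To make the 16 columns concrete, here is a rough Python sketch that labels the 11 mandatory SAM columns and collects the optional `TAG:TYPE:VALUE` fields. It uses the record from the question, with SEQ and QUAL replaced by `*` for brevity. The helper names `parse_sam_line` and `count_xf` are hypothetical. The `XF` tag is assumed here to be the per-read feature assignment that htseq-count adds when it is asked to write annotated SAM output.

```python
from collections import Counter

# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Record from the question; SEQ and QUAL are shortened to "*" for readability.
EXAMPLE_LINE = "\t".join([
    "A00917:211:H35WCDSXY:2:1156:23773:13855", "99", "chr1", "14481", "255",
    "150M", "chr1", "14616", "285", "*", "*",
    "NH:i:1", "HI:i:1", "AS:i:282", "nM:i:8", "XF:Z:ENSG00000227232",
])


def parse_sam_line(line):
    """Label the mandatory columns and collect optional tags into a dict."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    tags = {}
    for item in cols[11:]:
        tag, _type, value = item.split(":", 2)  # e.g. XF:Z:ENSG00000227232
        tags[tag] = value
    record["tags"] = tags
    return record


def count_xf(lines):
    """Tally XF assignments per SAM record (records, not read pairs)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        xf = parse_sam_line(line)["tags"].get("XF")
        if xf is not None:
            counts[xf] += 1
    return counts


record = parse_sam_line(EXAMPLE_LINE)
print(record["RNAME"], record["POS"], record["CIGAR"], record["tags"]["XF"])
print(count_xf([EXAMPLE_LINE]))
```

Note that tallying `XF` per line counts each mate of a pair separately, so this is only a way to inspect the file, not a replacement for the actual count table.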

The gene counting of STAR is supposed to mimic HTSeq-count, so you shouldn't have to run them both, unless you are doing something really clever when running HTSeq-count.

Does your STAR command line have --outSAMtype BAM or --quantMode GeneCounts?
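If `--quantMode GeneCounts` was used, STAR writes a four-column `ReadsPerGene.out.tab` file: gene ID, unstranded counts, counts for reads on the same strand as the gene (like htseq-count `-s yes`), and counts for the reverse strand (like `-s reverse`), preceded by a few `N_` summary rows. Below is a small Python sketch of reading it. The values in `EXAMPLE` are made up, and `read_star_gene_counts` is a hypothetical helper name.

```python
import io

# Made-up excerpt of a STAR ReadsPerGene.out.tab file (illustrative values only).
EXAMPLE = (
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t30\t400\t35\n"
    "N_ambiguous\t5\t1\t2\n"
    "ENSG00000227232\t12\t0\t12\n"
)

# Which column to read for each htseq-count style strandedness setting.
COLUMN_FOR_STRANDEDNESS = {"no": 1, "yes": 2, "reverse": 3}


def read_star_gene_counts(handle, stranded="no"):
    """Return (gene counts, summary counters) for the chosen strandedness."""
    col = COLUMN_FOR_STRANDEDNESS[stranded]
    genes, summary = {}, {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        # Summary rows are named N_unmapped, N_multimapping, and so on.
        target = summary if fields[0].startswith("N_") else genes
        target[fields[0]] = int(fields[col])
    return genes, summary


genes, summary = read_star_gene_counts(io.StringIO(EXAMPLE), stranded="reverse")
print(genes)
print(summary)
```

Which column is the right one depends on the library preparation, so the `stranded` argument here has to be chosen by the user.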