HTSEq-Count returns 0 counts for all genes in the counts file!
0
0
Entering edit mode
6.9 years ago
manish.nu • 0

Hi all!

I've been teaching myself to do RNA-Seq.

I prepared my libraries using KAPA mRNA HyperPrep Kit and had them sequenced single read 50bp

To generate my counts' file for DESeq2, I

Downloaded the reference/GTF files from Genecode:

Gencode/GRCh38.p10.genome.fa
Homo_sapiens.GRCh38.91.gtf

Indexed them as:

STAR --runMode genomeGenerate -- genomeDir $genomeDir --genomeFastaFiles /hg/GRCh38.p10.genome.fa --sjdbGTFfile /hg/Homo_sapiens.GRCh38.91.gtf --sjdb Overhang 49 --runThreadN 20

Aligned the fastq files as:

STAR --runThreadN 20 --genomeDir $genome_dir --readFilesCommand zcat --readFilesIn /projects/FastQ/post/SK1-1_S1.fastq.gz --outSAMtype BAM Unsorted --outFileNamePrefix /projects/aligned/post/P1_pre_

created HTSeqCount files using:

htseq-count -f bam -q -m union -s yes -t exon -i gene_id /projects/aligned/P1_post_Aligned.out.bam /Homo_sapiens.GRCh38.91.gtf  > /projects/counts/htseq/P1_pre_hts_counts

My counts file look like this:

ENSG00000284741 0
ENSG00000284742 0
ENSG00000284743 0
ENSG00000284744 0
ENSG00000284745 0
ENSG00000284746 0
ENSG00000284747 0
ENSG00000284748 0
__no_feature    22375495
__ambiguous 0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  18649806

All of the genes are reflected = 0. Can someone please tell me what am I doing wrong?

RNA-Seq htseq-count • 7.7k views
ADD COMMENT
1
Entering edit mode

Does your GTF contain exon lines, actually?

ADD REPLY
0
Entering edit mode

this is what my gtf file shows:

$ head Homo_sapiens.GRCh38.91.gtf 
#!genome-build GRCh38.p10
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.25
#!genebuild-last-updated 2017-06
1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1   havana  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1   havana  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

It looks like it has 'exon' but I wouldn't trust myself right now.

ADD REPLY
1
Entering edit mode

Hi, Two suggestions - 1) Plz check the Log.final.out created by STAR at the end of the run. Check the mapping % there.

2) If the mapping% is ok, then as a check you could re-run STAR with

--quantMode GeneCounts

option. You wouldn't need to run a count generator separately.

ADD REPLY
0
Entering edit mode

Very stupidly deleted the log files but as far as I remember seeing them in the beginning, all files mappped >30% except one which was in 20-30%

I'm very tempted to run the alignment again with --quantMode GeneCounts

ADD REPLY
0
Entering edit mode

Are the chromosome identifiers in your fasta genome the same as in the gtf?

ADD REPLY
1
Entering edit mode

The standard Genecode fasta has a header like "chr1 1". As far as I remember, STAR uses the full header to name its chromosomes. Can you report the result of samtools view -H /projects/aligned/P1_post_Aligned.out.bam | head and compare it to the first column of /Homo_sapiens.GRCh38.91.gtf ?

ADD REPLY
0
Entering edit mode

Hi WouterDeCoster, to me, it looks like they're not the same (see below please). I went by the assumption that if you download your genome file and GTF file from the same source (Gencode in my case), they will be 'compatible'. This makes me doubt my whole alignment step.

Would you recommend that I start from scratch (Indexing>Alignment) and could you please suggest a source for a GTF file? Can I still salvage this by switching the GTF file?

ADD REPLY
2
Entering edit mode

If you get the correct GTF file from Gencode (http://www.gencodegenes.org/releases/current.html ) then the counting will work.

##description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90)
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2017-08-01
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; 
gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
ADD REPLY
0
Entering edit mode

It's working! Thanks so much!

ADD REPLY
2
Entering edit mode

Just using the right gtf for counting would be sufficient, no need to realign, unless you want to use the gtf in alignment as well for defining the known exon structure.

ADD REPLY
0
Entering edit mode
$ samtools view -H P1_post_Aligned.out.bam | head
@HD VN:1.4
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717

GTF File:

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; 1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1"; 1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1"; 1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

Does this information help? To me it looks like they're not the same.

ADD REPLY

Login before adding your answer.

Traffic: 1541 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6