Question

Using HTseq-count for Pseudomonas Aeruginosa data

0

Entering edit mode

6.3 years ago

jspainhour3 ▴ 10

I am trying to use HTseq to generate count data from a Pseudomonas Aeruginosa rna-seq sample. I end up with all of my output going to no feature.

__no_feature    6171314
__ambiguous 0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  478089

When I look at my gtf file:

##gff-version 2
##source-version rtracklayer 1.38.3
##date 2018-08-06
NC_002516   PseudoCAP   region  1   6264404 .   .   .   ID "NC_002516"; Name "Pseudomonas aeruginosa PAO1 NC_002516,complete genome."; Dbxref "refseq:NC_002516";
NC_002516   PseudoCAP   gene    483 2027    .   +   0   ID "gene134012"; Dbxref "GeneID:878417"; Alias "PA0001"; name "dnaA";
NC_002516   PseudoCAP   CDS 483 2027    .   +   0   ID "CDS134013"; name "chromosomal replication initiator protein DnaA"; Parent "gene134012"; locus "PA0001"
NC_002516   PseudoCAP   CDS 2056    3159    .   +   0   ID "CDS134019"; name "DNA polymerase III,beta chain"; Parent "gene134018"; locus "PA0002"
NC_002516   PseudoCAP   gene    2056    3159    .   +   0   ID "gene134018"; Dbxref "GeneID:879244"; Alias "PA0002"; name "dnaN";
NC_002516   PseudoCAP   CDS 3169    4278    .   +   0   ID "CDS134021"; name "RecF protein"; Parent "gene134020"; locus "PA0003"
NC_002516   PseudoCAP   gene    3169    4278    .   +   0   ID "gene134020"; Dbxref "GeneID:879229"; Alias "PA0003"; name "recF";

This gtf file was generated using rtracklayer from the gff file

##gff-version 3
##sequence-region chromosome 1 6264404
chromosome  PseudoCAP   region  1   6264404 .   .   .   ID=chromosome;Name=Pseudomonas aeruginosa PAO1 chromosome, complete genome.;Dbxref=refseq:NC_002516
chromosome  PseudoCAP   gene    483 2027    .   +   0   ID=gene134012;Alias=PA0001;name=dnaA;Dbxref=GeneID:878417
chromosome  PseudoCAP   CDS 483 2027    .   +   0   ID=CDS134013;Parent=gene134012;locus=PA0001;name=chromosomal replication initiator protein DnaA;
chromosome  PseudoCAP   CDS 2056    3159    .   +   0   ID=CDS134019;Parent=gene134018;locus=PA0002;name=DNA polymerase III, beta chain;
chromosome  PseudoCAP   gene    2056    3159    .   +   0   ID=gene134018;Alias=PA0002;name=dnaN;Dbxref=GeneID:879244
chromosome  PseudoCAP   CDS 3169    4278    .   +   0   ID=CDS134021;Parent=gene134020;locus=PA0003;name=RecF protein;
chromosome  PseudoCAP   gene    3169    4278    .   +   0   ID=gene134020;Alias=PA0003;name=recF;Dbxref=GeneID:879229

The GFF file and accompanying fasta file (Pseudomonas_aeruginosa_PAO1_107, both from ncbi) were used with Star 2.5.3 to generate the bam file. I changed the chromosome name from chromosome to NC_002516 so it would match the fasta file for star indexing. I used picard to deduplicate the bam file.

The GFF/GTF lists all entries as gene CDS or rna.

Does anyone have any suggestions on what I might be doing wrong or if there is some error in my formatting of the gtf file that may be causing this problem.

Thanks in advance for any advice.

RNA-Seq htseq-count • 2.0k views

ADD COMMENT • link 6.3 years ago by jspainhour3 ▴ 10

0

Entering edit mode

Still working on the solution, thank you both very much.

ADD REPLY • link 6.3 years ago by jspainhour3 ▴ 10

0

Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLY • link 6.3 years ago by GenoMax 147k

score 1 · Answer 1 · 2018-08-08

1

Entering edit mode

6.3 years ago

Ian 6.1k

When you run htseq-count check that the -i flag is set to ID (-i ID) and not 'gene_id', which is the GTF default.

ADD COMMENT • link 6.3 years ago by Ian 6.1k

1

Entering edit mode

And if you have only CDS features, you want to set -t CDS. Otherwise htseq-count is only looking for exon features.

ADD REPLY • link 6.3 years ago by michael.ante ★ 3.9k

0

Entering edit mode

The output is the same except it now includes

CDS102784   0
CDS102786   0
CDS102788   0
CDS102792   0
CDS102794   0
CDS102796   0
CDS102798   0

But with the same final no_feature and alignment_not_unique outputs

My commands used are

htseq-count  -f bam -r pos -s no -i ID -t CDS my.bam my.gtf

I have also tried to use

--mode=intersection-nonempty

But it has had no effect on output. Am I misusing the comands? I have also looked at the bam files using IVG and there are counts found in the regions I list in the GTF file.

ADD REPLY • link 6.3 years ago by jspainhour3 ▴ 10

0

Entering edit mode

What does the header for your BAM file look like? samtools view -H your.bam.

ADD REPLY • link 6.3 years ago by GenoMax 147k

0

Entering edit mode

The header looks like:

@HD     VN:1.4  SO:coordinate
@SQ     SN:NC_002516.2  LN:6264404
@PG     ID:STAR PN:STAR VN:STAR_2.5.3a  CL:STAR   --runThreadN 16   --genomeDir genome   --readFilesIn index13_AGTCAA_L002_R1_001.fastq   index13_AGTCAA_L002_R2_001.fastq      --outFileNamePrefix /home/d13/outputs/   --outSAMtype BAM   SortedByCoordinate      --outSAMstrandField intronMotif   --outSAMattributes NH   HI   AS   NM   MD      --outSAMunmapped Within
@PG     ID:MarkDuplicates       VN:1.131(cd60f90fdca902499c70a4472b6162ef37f919ce_1431022382)   CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/home/d13/in/sorted_bam/index13_AGTCAA_L002_R.genome.bam] OUTPUT=./index13_AGTCAA_L002_R.genome.deduplicated.bam METRICS_FILE=./index13_AGTCAA_L002_R.genome.duplication_metrics REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json  PN:MarkDuplicates
@CO     user command line: STAR --genomeDir genome --outSAMstrandField intronMotif --outSAMattributes NH HI AS NM MD --outSAMunmapped Within --outFileNamePrefix /home/d13/outputs/ --outSAMtype BAM SortedByCoordinate --runThreadN 16 --readFilesIn index13_AGTCAA_L002_R1_001.fastq index13_AGTCAA_L002_R2_001.fastq

ADD REPLY • link 6.3 years ago by jspainhour3 ▴ 10

1

Entering edit mode

Looks like you will have to add .2 to NC chromosome name in your annotation file. The name needs to exactly match between the files.

ADD REPLY • link 6.3 years ago by GenoMax 147k

0

Entering edit mode

Thank you for your help and sharp eyes.

ADD REPLY • link 6.3 years ago by jspainhour3 ▴ 10

0

Entering edit mode

Hello,

May I ask how you changed the name from "chromosome" to NC_#" in your gtf file? I am getting 0 counts for my downstream htseq, and I realize I need to do the same thing to match the bam files.

Thank you!

ADD REPLY • link 6.3 years ago by alexadeanfitz ▴ 10

0

Entering edit mode

̶T̶h̶e̶ ̶c̶h̶r̶o̶m̶o̶s̶o̶m̶e̶ ̶n̶a̶m̶e̶ ̶i̶n̶ ̶t̶h̶e̶ ̶b̶a̶m̶ ̶f̶i̶l̶e̶ ̶i̶s̶ ̶b̶a̶s̶e̶d̶ ̶o̶n̶ ̶t̶h̶e̶ ̶f̶a̶s̶t̶a̶ ̶h̶e̶a̶d̶e̶r̶ ̶u̶s̶e̶d̶ ̶i̶n̶ ̶i̶n̶d̶e̶x̶ ̶g̶e̶n̶e̶r̶a̶t̶i̶o̶n̶.̶ ̶Y̶o̶u̶r̶ ̶G̶T̶F̶ ̶a̶n̶d̶ ̶y̶o̶u̶r̶ ̶F̶a̶s̶t̶a̶ ̶s̶h̶o̶u̶l̶d̶ ̶b̶e̶ ̶f̶r̶o̶m̶ ̶t̶h̶e̶ ̶s̶a̶m̶e̶ ̶s̶o̶u̶r̶c̶e̶.̶ ̶

It seems it was done by Rtracklayer - an R-tool for annotation manipulation see here

[EDIT] Mixing up things

ADD REPLY • link 6.3 years ago by michael.ante ★ 3.9k