Question

Help related to HTSEq usage specific for downstream DEXSeq analysis

0

Entering edit mode

18 months ago

achanda • 0

I am trying to generate count files from STAR aligned sequencing data from an RNA seq analysis. I am using HTSeq to do that with the ai of performing differential exon usage analysis using DEXSeq in R. I generated flattened GTF file using:

python ~/dexseq_prepare_annotation.py gencode.v44.annotation.gtf dexseq_prepared_annotation.gtf

Then I am running a slurm job:

#!/bin/bash

#SBATCH --job-name=htseq_count
#SBATCH --output=htseq_count_%j.out
#SBATCH --error=htseq_count_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128gb
#SBATCH --time=24:00:00

# Activate conda environment 
source activate htseq_env37

# Navigate to the directory containing your data
cd ~/STAR_output

# Create a directory for count files
mkdir -p $HOME/countfiles

# Run dexseq_count.py for each sample
for sampleDir in */; do
    sampleName=$(basename "$sampleDir")
    bamFile="${sampleDir}/${sampleName}_Aligned.sortedByCoord.out.bam"
    outputPath="$HOME/Arthur_countfiles/${sampleName}_counts.txt"
    htseq-count -f bam -r pos -s no -a 10 -t exonic_part -i gene_id "$bamFile" $
done

However, I am now realizing the -i gene_id part is wrong as it needs to be exon_id or similar. Any idea on what would be the correct attribute? Also, is there other edits required? For reference, the flattened gtf file is:

chr1    dexseq_prepare_annotation.py    aggregate_gene  11869   14409   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"
chr1    dexseq_prepare_annotation.py    exonic_part 11869   12009   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000456328.2"; exonic_part_number "001"
chr1    dexseq_prepare_annotation.py    exonic_part 12010   12057   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "002"
chr1    dexseq_prepare_annotation.py    exonic_part 12058   12178   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000456328.2"; exonic_part_number "003"
chr1    dexseq_prepare_annotation.py    exonic_part 12179   12227   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "004"
chr1    dexseq_prepare_annotation.py    exonic_part 12613   12697   .   +   .   gene_id "ENSG00000290825.1+ENSG00000223972.6"; transcripts "ENST00000450305.2+ENST00000456328.2"; exonic_part_number "005"

Splicing DEXSeq HTSeq • 633 views

ADD COMMENT • link updated 18 months ago by Barry Digby ★ 1.3k • written 18 months ago by achanda • 0

0

Entering edit mode

Your flattened GTF file looks fine (isn't it a .gff file??).

As an alternative to htseq-count, you could use the python scripts packaged with the DEXseq R package, which you already have if you used dexseq_prepare_annotation.py. See this basic workflow: https://github.com/BarryDigby/GSE37001/tree/main?tab=readme-ov-file#differentially-expressed-exons

ADD REPLY • link 18 months ago by Barry Digby ★ 1.3k