I am dealing with Illumina genomic (not transcriptomic) paired-end sequencing. For some purpose, I'd like to have a number of reads that cover a particular feature (let's say a gene) defined in a BED or GFF file.
I have sorted SAM and BAM files.
My BED files looks like this:
RandomSeqSTD 500000 505001
RandomSeqSTD 1000000 1001001
RandomSeqSTD 1500000 1500501
RandomSeqSTD 2000000 2000251
RandomSeqSTD 2500000 2500101
My GFF file (which I made) looks like this:
##gff-version 3
##sequence-region RanSeqSTD 1 4659625
RanSeqSTD random CDS 500000 505000 . . . gene_id=5000bp;transcript_id=same
RanSeqSTD random CDS 1000000 1001000 . . . gene_id=1000bp;transcript_id=same
RanSeqSTD random CDS 1500000 1500500 . . . gene_id=500bp;transcript_id=same
RanSeqSTD random CDS 2000000 2000250 . . . gene_id=250bp;transcript_id=same
RanSeqSTD random CDS 2500000 2500100 . . . gene_id=100bp;transcript_id=same
First I tried getting read number for each of these features with:
bedtools multicov -bams *.bam -bed BEDfile.bed
I got a read count and in theory this might be enough, however the problem is that I have limited control over what is considerate to be part of the feature. So I tried using HTSeq with the sorted SAM file and the above GFF file with the following command line:
htseq-count -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt
However, I keep getting count "0" for all gene_id features.
Any idea why HTSeq gives me this and any idea how this can be done differently?
Note: I am fully aware that HTSeq was not developed for genomic sequencing and thus should not be used for this.
Thanks TP
I think, gene_id and transcript_id need to be separated by a space (no '=' sign) and the actual IDs are escaped using " characters:
gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
That's at least how I parse these IDs in my own tool alfred which should do the read counting you are looking for (requires a position-sorted BAM file):
/src/alfred count -i gene_id -f CDS -g RanSeqGFF.gtf.gz STD100_sorted.bam
What you are describing is a GTF file, I believe. It's worth trying though. I'll update the comment when I get the result.