Hi - I am trying to use HTSeq (or even featureCounts) to generate counts for RNAseq data. The data set is from a bacteria (Illumina HiSeq PE). It was mapped to the reference genome with Bowtie - and all looks good. When I try to run HTSeq all the mapped reads are designated as "no feature". Below are samples of data/output/what I've tried . Here is what I ran in Bowtie and the results :
bowtie2 -x 206_old --local --no-unal -p 12 -1 ~/bioinformatics3/206/206raw/z44h_1a.fastq.gz -2 ~/bioinformatics3/206/206raw/z44h_1b.fastq.gz -S ~/bioinformatics3/206/Bowtie/PEalgin_44h_1.sam
produces:
20417626 reads; of these:
20417626 (100.00%) were paired; of these:
10819260 (52.99%) aligned concordantly 0 times
8177956 (40.05%) aligned concordantly exactly 1 time
1420410 (6.96%) aligned concordantly >1 times
10819260 pairs aligned concordantly 0 times; of these:
1548865 (14.32%) aligned discordantly 1 time
9270395 pairs aligned 0 times concordantly or discordantly; of these:
18540790 mates make up the pairs; of these:
17336065 (93.50%) aligned 0 times
626560 (3.38%) aligned exactly 1 time
578165 (3.12%) aligned >1 times
57.55% overall alignment rate
When I run HTSeq:
htseq-count -s no -t CDS -i ID -a 0 -o ~/bioinformatics3/206/HTSeq/htPE44h_1.sam PEalgin_44h_1.sam ~/bioinformatics3/206/206.gtf
I get:
__no_feature 12053420
__ambiguous 0
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 0
My GTF file:
##gff-version 3
scaffold-0 FIG CDS 99 914 . - 0 ID=fig|6666666.35312.peg.1;Name=Thiazole biosynthesis protein ThiG
scaffold-0 FIG CDS 948 1148 . - 0 ID=fig|6666666.35312.peg.2;Name=Sulfur carrier protein ThiS
scaffold-0 FIG CDS 1180 2415 . - 1 ID=fig|6666666.35312.peg.3;Name=Glycine oxidase ThiO (EC 1.4.3.19);Ontology_term=KEGG_ENZYME:1.4.3.19
scaffold-0 FIG CDS 2744 2866 . - 2 ID=fig|6666666.35312.peg.4;Name=hypothetical protein
And a sample of the (sorted by coordinate) Bowtie alignment:
GWNJ-0850:541:GW1903111949th:1:1101:12236:56334 99 fig|6666666.35312.peg.1 1 41 5 8S92M = 302 509 CGTGCATCCGGTGGCGGGCGGCTGACGCCGCCGCTTCCGGCCCAACCAGGATTCTTCGATGACG TCCCTTCCTTCCGCCGACGCCCTCACGCTGTACGGCGAAACCTTCGCGAGCCGCGTGCTGATCGGCACGTCGCGCTATCCGTCCCT A-A-FJ-F AJAF<JJJFAA<<J<JFJJJJ7JFJF-7A-AJJJFF<JJJFJFJFFFJAJ--7F<J--7A-AFAAAJJAAJJ<JAA-7-AF7FJ7JFJ<A<-7<F- FA7<7--A<77FJJ---7F7A7FF7F-7<FA7--7<<--7-A<A<F AS:i:179 XN:i:0 XM:i:1 XO:i:0 XG:i:0 N M:i:1 MD:Z:26G65 YS:i:256 YT:Z:CP
GWNJ-0850:541:GW1903111949th:1:1102:3275:12402 99 fig|6666666.35312.peg.1 1 16 6 5S80M5S = 1 -155 TCGATGTCGTGCATCCGGTGGCGGGCGGCTGACGCCGCCGCTTCCGGCCCAACCAGGATTCTTC GATGACGTCCCTTCCTTCCGCCGACGCGCTCACGCTGTACGGCGAAACCTTCGCGAGCCGCGTGCTGATCGGCACGTCGCGAGATC AAAFFJJJ JJFJJJJJJJJJJJJ-FFFFJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJ<AAAJJJJJJJJJFJJJJJJJJJ<JFJJJFJJJJ JJJAJJFJFFAJJJJJAJJJJJFJJFJJ7FJJFFJJJJJJF7FFFJ AS:i:160 XS:i:64 XN:i:0 XM:i:0 XO:i:0 X G:i:0 NM:i:0 MD:Z:80 YS:i:160 YT:Z:CP
GWNJ-0850:541:GW1903111949th:1:1102:23937:39405 99 fig|6666666.35312.peg.1 1 36 1 14S36M = 39 302 GTCGCGCGCACCCAGCACGCGGCGCGTGCGCTCGCCGGCGGCGACCGGCTCGATGTCGTGCATC CGGTGGCGGGCGGCTGACGCCGCCGCTTCCGGCCCAACCAGGATTCTTCGATGACGTCCCTTCCTTCCGCCGACGCGCTCACGCTG AAFFFJJJ JJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ JJFJJJFJJJJJFAJJAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:72 XS:i:162 XN:i:0 XM:i:0 XO:i:0 X G:i:0 NM:i:0 MD:Z:36 YS:i:300 YT:Z:CP
I have tried sorting by name and position, I've tried using GFF instead of GTF, I've tried using only the forward data instead of PE, I've tried featureCounts and HTSeq, I've run HTSeq with bam and sam files... I'm at a complete loss for what might be the problem and I've read just about every thread/post on related issue. The only think I can thing is it's an issue with the GTF file (downloaded from RAST). Any help or advice would be greatly appreciated!
Your "GTF" file is actually a malformed GFF file, is there an actual GTF file available for this organism?
Thanks, that is what is was starting to suspect. The "GTF" file was downloaded from RAST under the "GTF" option so I suspect this is an issue with RAST. Is there a good way to generate a GTF file that will be compatible with HTSeq-count? Re-annotation is fine if it will result in a better GTF file (small bacterial genome).
I just noticed I accidentally pasted two versions of the genome above (bowtie vs HTSeq) but the ID's do match when I've run this.
Can you update your post so the contig IDs are compatible (just so we can ensure they are)? The easiest solution to your GFF problem is to write a small program to convert it to a proper GTF file. How easy this is will be entirely dependent upon how familiar you are with a scripting language like perl or python. Alternatively, you might be able to import the GFF file into R and export it as a proper GTF file (you'll have to look up the details of that).
Yes the post is updated with the corresponding GTF file for the Bowtie alignment shown.
I noted by michael.ante, you aligned against the transcriptome rather than the genome. Align against the genome and things will work better.
Hi one of the easy solution may be to try converting the gff file to gig using gff2gtf.. As such I guess this is formatting issue with GTE file
You mapped apparently against the transcritome, since the mapping position in the bam file corresponds to the gene I'd of your annotation file.
I intended to map RNAseq data to the genome. I see multiple reads mapping to each gene... is this incorrect? I don't think I understand your comment, sorry.
Your reads are mapped against entities which are like "
fig|6666666.35312.peg.1
". HTSeq-count is looking for exon covering reads which are located at entries with the name "scaffold-0
". Since there is no overlap, HTSeq is reporting only "__no_features
".You need either to map against the genomic reference (the fasta containing scaffold-0 and so on) and use htseq with the corresponding GTF.
Alternatively, you can map against the transcriptome (the fasta containing names as fig|6666666.35312.peg.1), extract the uniquely mapping reads, and preforme a
samtools idxstats
.These are the headers on the genomic fasta file:
I am specifically telling HTSeq count to read the ID, which as above in the GFF says ID=fig|6666666.35312.peg.1 So why would it be looking for Scaffold-0? I made the Bowtie index using the genomic fasta file with headers shown above
It looks like using the Scaffold FASTA file worked (with the original GFF)! Thank you everyone for your help, it is greatly appreciated!