I am analyzing Nanopore direct RNA seq data. I tried to analyze the reads using htseq (gives me 0 count for all reads), featureCounts (0 successful alignment assignment), and stringtie2 (WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences).
Based on some searches, it should be a mismatch between the naming convention of chromosome in my bam file versus the gtf file. I used minimap2 to do the alignment.
I found that in my bam file, there is no chromosome number information. Usually, the chromosome number should be in the third column, but instead, it looks like this:
77407ca3-9603-4b25-8a43-bcab8384e080 0 AT2G34250_ID9 1135 0 2S52M1D9M3I4M1D62M1D22M5D5M1D75M5D42M1I24M1D6M1D14M2D3M3D16M5S * 0 0 GTTGATTTCATGGGAGCCATCGGGTCGGGAACCGGAATTCTG ...
The third column is the transcript ID.
Could it be the reason why all these methods do not work? If so, how should I address it?
These are the codes I am using.
htseq-count -f bam -q $a.sorted.bam AtRTD2_19April2016_sorted.gtf > a.htseq.txt
featureCounts -F gtf -s 2 -p -T 15 -a AtRTD2_19April2016_sorted.gtf -o a_featurecounts_m6A.txt a.sorted.bam
stringtie -L -G AtRTD2_19April2016_sorted.gtf -o a_stringtie.gtf a.sorted.bam
Looks like you are using transcript sequence to align against? Your GTF file must match. Otherwise use the entire genome for alignment and then use the GTF file with gene models in it.
If you are aligning to transcripts you could simply count number reads aligning to each transcript by using
samtools idxstats
.Thanks! But I would also love to use StringTie to annotate. For example, I would want to calculate percentage of reads mapped to CDS vs 5'UTR, 3'UTR, etc. Do you have any advice on that?