featureCounts from the subRead package works quite well with cDNA reads aligned to genomic DNA_ref.fa files, but I am having difficulty getting featureCounts to work with cDNA files aligned to the cDNA_ref.fa.
minimap2 used for mapping Nanopore cDNA reads over to the different reference files. The data revealed by samtools flagstat indicates that my cDNA data aligns better to the cDNA?_ref.fa file, and would like to use this data... if only featureCounts would cooperate. But I always get 0.00% reads aligned with the cDNA .
It was suggested to me to use samtools idxstats and using the table directly from there as my gene output, but it doesn't quite feel right.
Any suggestions?
PS. I have to use R for subReads package as it is not yet fluid with the newer Apple M1 chips, however I do not suspect this is the problem as I experienced this also on Ubuntu terminal.
Problem in this case is likely that your GTF file references the chromosome names where as your cDNA file has names for cDNA's. So those two don't match. You could create a simple SAF format file for your cDNA simply listing the names of cDNA's and their start/stops. You could do that using the suggestion you got
It was suggested to me to use samtools idxstats and using the table
directly from there as my gene output, but it doesn't quite feel
right.
Attached error message from trying to make the Subreader file for MacOS. Rosetta is installed on my computer, checked by forcing some apps to open using rosetta.
But I believe this issue is probably better suited for its own thread.
I created a SAF file of this format using the GTF file and some regex
magic in R.
This is not correct. cDNA entries in your transcript files are standalone sequences. They are of length indicated in first screenshot by LN field. You can't substitute chromosomal coordinates in your SAF file. So the entry in your SAF files should look like this.
ENSRNOT00000033201.6 chr12 1 276 -
When you count the BAM file aligned against the cDNA reference these are the sequences/lengths you need to use.
The subread user guide says these columns are necessary :
GeneID Chr Start End Strand
Leaving out the Chr column generates the error that the data is not in saf format.
Unfortunately, adding in the chromosome column (and playing around with the formatting) still results in 0.0% alignments. Am I just on the wrong path here? Should Nanopore cDNA reads just be mapped against the genome and not the transcriptome? Programs such as IGV, htseq, and featureCounts all seem to work a lot smoother with the genomic mapping, and there really aren't any forum threads online of people facing this problem.
I am posting the example saf file I made. To note is that I tried many different variations of formatting on the GeneID, Chr, and Strand columns, ie: changing +/-, with/without "chr", etc... I also made sure to include transcripts that had high counts in the idxstats output.
Keep in mind that featureCounts does not count multi-mapping reads by default. If you have more than one transcript for genes, this is likely happening. You can add -M option to count multi-mapped reads.
In any case you transcripts are small and not sure how long your nanopore reads are. Examining alignments carefully is in order.
Hm. Unfortunately I don't think it is an issue of multi mapping, but still rather featureCounts is not interpreting some syntax somewhere.
The option -M also produces 0 reads.
To note: I am using a small subset of data for quick alignment and pipeline testing. Hence the low number of reads (this is also just barcode 1 of 12. My reads are 100-1000bp in length, and I have a few thousand reads mapping to the genome (minimap -ax splice ref = genomic file), and ~30000 total primary alignments to the cDNA reference.
Perhaps another column of information is needed in the .saf file? Or a specific syntax of some column is false. I am at a loss.
featureCounts can also be run as a separate tool, but as GenoMax stated, even the R version should run seamlessly on M1 chips