Newbie question here, sorry. I am trying to generate a counts matrix for paired-end non-stranded data. I feel pretty good about the steps I have taken so far, but this is the first time I have used this particular type of data. The counts matrix I have generated has allele specific reads which I wasn't anticipating, and the counts seem vary off, especially between the two alleles. When generating a counts matrix with allele specific information, is there any special steps I need to consider? Also is there a way to get non-allele specific read counts?
I think a potential issue maybe with samtools or ht-seq. Feedback appreciated. I am willing to switch to bioconductor/R tools also. I don't have too much experience outside with RNA-seq outside of these programs though.
samtools sort -n accepted_hits.bam sample_sorted
samtools view sample_sorted.bam >sample.sam
htseq-count -s no --idattr=Parent sample.sam organism.gff > sample.counts
What would be helpful for me to know: Does it look like I'm on the right track? If yes, why are my allele reads so different? If no, what steps did I overlook, and do you have suggestions?
Different how? Could you post an example?
What is the organism and are you mapping to a reference genome containing allele-specific information?
I have a wild, mutant1, and mutant2, 4 replicates each, below is a subset of the counts matrix that was generated. The rows are genes, and I think the A and B allele are marked at the end of the gene name. This is only a small subset, but there are two concerns. The first is that I would expect the alleles to be fairly similar, in some cases the reads between alleles differ by several hundred counts. The second concern is I will occasional see an extreme observation (such as C2_09060C) in which one sample's count differs by several thousand for one replicate but not the others. For example, for C2_09060C Wild, Mutant1, and Mutant2 had about the same numbers for all observations and replicates expect for one sample (Mutant2 below). There are quite a few occurrences of this.
Candida albicans, and I believe the reference genome contains allele specific information.
Indeed, it seems the reference genome assembly is diploid. I don't think HTSeq is the right tool for the job, I would search for allele-specif expression pipelines, or read some papers dealing with RNAseq analysis of Candida albicans.
If I use a haploid reference genome, will it fix the issue?
See my answer to your earlier post C Albicans Genome Count Matrix Total Genes: by selecting just one chromosome set, you may be leaving out a small number of features (and possibly of reference DNA, if there are structural variants between chromosome sets A and B). How small are the differences, I don't know.