I am currently trying to analyze RNA-seq data using DEXSeq. I am facing problem during the step where reads are counted. The output of read counting step generated a file which has very few transcripts which have count more than 0 (around 20,000). The output looks like this:
ENSG00000000003:001 0
ENSG00000000003:002 0
ENSG00000000003:003 0
ENSG00000000003:004 0
ENSG00000000003:005 0
ENSG00000000003:006 0
ENSG00000000003:007 0
ENSG00000000003:008 0
ENSG00000000003:009 0
ENSG00000000003:010 0
The mapping was performed using STAR. The BAM files generated after mapping were sorted by coordinate. Following code was used during mapping:
Indexing
STAR --runThreadN 12 --runMode genomeGenerate --sjdbGTFfile Homo_sapiens.GRCh38.98.gtf --genomeDir /home/erpl/star/indexing --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
Mapping
STAR --runThreadN 20 --genomeDir /home/erpl/RNA-seq_Alignment_tools/star/indexing --sjdbGTFfile /home/erpl/RNA-seq_Alignment_tools/star/indexing/Homo_sapiens.GRCh38.98.gtf --readFilesIn C24_1_1.fq C24_1_2.fq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/erpl/RNA_seq_Novogene/RNA_Sequencing_Novogene_Results/output_17.10.19/C24_1
For DEXSeq first we have to prepare annotation file followed by read counting using 2 python scripts provided with the R package. Following is the code that I have used to reach to this step:
Counting
python /home/erpl/R/x86_64-pc-linux-gnu-library/3.6/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.98.gtf Homo_sapiens.GRCh38.98.gff
python /home/erpl/R/x86_64-pc-linux-gnu-library/3.6/DEXSeq/python_scripts/dexseq_count.py Homo_sapiens.GRCh38.98.gff -p yes -f bam C24_1Aligned.sortedByCoord.out.bam DEXSeq_C24_1.txt
So is there some problem with the raw data itself or am I missing something here?
What are your alignment statistics with STAR?
Over 90% uniquely mapped reads with all samples:
Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 1041968 % of reads unmapped: too short | 3.20% Number of reads unmapped: other | 63241 % of reads unmapped: other | 0.19% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%
You have really high alignment rate but really low counts... something has gone wrong with the counting imo. Can you try another program such as HTSeq or FeatureCount to compare the counts?
I repeated with Salmon. Indexing was done using transcriptome fasta file. I used the following code:
Generating Transcriptome Fasta file
Generating index and quantification
After doing this also I am getting TPM values more than 0 for about 40,000 transcripts only. The alignment statistics were:
Salmon uses different algorithms for alignment/quantification but nonetheless your alignment rate is extremely low. Which indicates there's something going on with your library. Try to view the BAM files from the previous alignment to see what's mapping and compare to salmon. Or do as I suggested before and use HTseq to count.
Something very strange is going on.
I was able to do it successfully with DEXSeq. The problem was the missing -r flag. Thanks for the valuable suggestions