So, I downloaded sra and then converted to fastq from this link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99671
I have tried salmon and bwa+htseq-count to align and get readcount. Both of the tools are failed to get any result. The salmon result is only 2% mapped and htseq-count cannot even map a single reads to transcript.
My question is, what do you think happen here? Does the sequencing machine affect this? I noticed the platform to sequence is not illumina. The original author use Lifescope to to do the RNA-seq workflow. Anyway to investigate this? I attached Fastqc result
Update:
I have tried to use bowtie to index reference genome in colorspace. The command is this:
bowtie-build -C --threads 18 HG38_90.fa hg38_90_idx
I aligned the file with this command:
bowtie --chunkmbs 500 -C -p 18 -S hg38_90_idx -1 fastq/SRR5644749_1.fastq -2 fastq/SRR5644749_2.fastq SRR5644749.sam
And the result is really bad.
# reads processed: 20826355
# reads with at least one reported alignment: 17671 (0.08%)
# reads that failed to align: 20808684 (99.92%)
Any suggestion how to align in colorspace?
There must something seriously wrong with your setup, it's just hard to tell what. Even with bwa you should be able to achieve >60% aligned reads. However, please change to a a state-of-the-art pipeline first, using either Hisat(2) or STAR as an aligner, then HTseq-count or FeatureCount. Please check that you have the correct and matching versions of the genome or transcriptome reference (for using salmon) and genome annotation.
If you need more help, please also do and report QC at each step using FastQC and MultiQC, and please provide the exact commands run as well as relevant output of fastqc and multiQC.
I have provided the Fastqc report. It seems there are problems with the fastq file itself.
It is rare for current RNA-seq protocols to see failed sequence quality and N content. Check how the fastq extraction went well, just download and extract again using SRA toolkit. Then possibly you need to apply quality trimming or select a better dataset.
I have the sra file. It seems there were no problem when I use sra toolkit fastqdump.
Oh, these could be colorspace data, they need to be analysed differently! Try option -c with bwa and see if that improves the alignment. See http://seqanswers.com/forums/showthread.php?t=16621
I think the latest bwa options are different. In bwa sampe, -c is about: -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
I am not familiar with colorspace data.
Neither am I, sorry, but maybe you can download an older version of bwa that supports colorspace.
I will try bowtie. It seems bowtie support colorspace
Please read more about Solid data conversion to fastq here. You probably lose information with the conversion to fastq.
I am still confused. I have tried using bowtie to make colorspace index and align in colorspace, but it is still almost 0% reads that can be mapped to the reference. It is really weird.
Can you show with head how your fastq file looks like? Is it really in color space?
It is in colorspace. I have checked it. I will post the head result later.
Just to put aside anything related with SRA, you can directly download fastq files for your project here: https://www.ebi.ac.uk/ena/data/view/PRJNA389279][1]
I have looked at one file and indeed it looks like color-space: