Question

FASTQ alignment result is really bad

0

Entering edit mode

6.0 years ago

bharata1803 ▴ 560

So, I downloaded sra and then converted to fastq from this link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99671

I have tried salmon and bwa+htseq-count to align and get readcount. Both of the tools are failed to get any result. The salmon result is only 2% mapped and htseq-count cannot even map a single reads to transcript.

My question is, what do you think happen here? Does the sequencing machine affect this? I noticed the platform to sequence is not illumina. The original author use Lifescope to to do the RNA-seq workflow. Anyway to investigate this? I attached Fastqc result

Update:

I have tried to use bowtie to index reference genome in colorspace. The command is this:

bowtie-build -C --threads 18 HG38_90.fa hg38_90_idx

I aligned the file with this command:

bowtie --chunkmbs 500 -C -p 18 -S hg38_90_idx -1 fastq/SRR5644749_1.fastq -2 fastq/SRR5644749_2.fastq SRR5644749.sam

And the result is really bad.

# reads processed: 20826355
# reads with at least one reported alignment: 17671 (0.08%)
# reads that failed to align: 20808684 (99.92%)

Any suggestion how to align in colorspace?

enter image description here

RNA-Seq htseq-count salmon • 3.9k views

ADD COMMENT • link updated 6.0 years ago by h.mon 35k • written 6.0 years ago by bharata1803 ▴ 560

0

Entering edit mode

There must something seriously wrong with your setup, it's just hard to tell what. Even with bwa you should be able to achieve >60% aligned reads. However, please change to a a state-of-the-art pipeline first, using either Hisat(2) or STAR as an aligner, then HTseq-count or FeatureCount. Please check that you have the correct and matching versions of the genome or transcriptome reference (for using salmon) and genome annotation.

If you need more help, please also do and report QC at each step using FastQC and MultiQC, and please provide the exact commands run as well as relevant output of fastqc and multiQC.

ADD REPLY • link 6.0 years ago by Michael 55k

0

Entering edit mode

I have provided the Fastqc report. It seems there are problems with the fastq file itself.

ADD REPLY • link 6.0 years ago by bharata1803 ▴ 560

0

Entering edit mode

It is rare for current RNA-seq protocols to see failed sequence quality and N content. Check how the fastq extraction went well, just download and extract again using SRA toolkit. Then possibly you need to apply quality trimming or select a better dataset.

ADD REPLY • link 6.0 years ago by Michael 55k

0

Entering edit mode

I have the sra file. It seems there were no problem when I use sra toolkit fastqdump.

ADD REPLY • link 6.0 years ago by bharata1803 ▴ 560

0

Entering edit mode

Oh, these could be colorspace data, they need to be analysed differently! Try option -c with bwa and see if that improves the alignment. See http://seqanswers.com/forums/showthread.php?t=16621

ADD REPLY • link 6.0 years ago by Michael 55k

0

Entering edit mode

I think the latest bwa options are different. In bwa sampe, -c is about: -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]

I am not familiar with colorspace data.

ADD REPLY • link 6.0 years ago by bharata1803 ▴ 560

0

Entering edit mode

Neither am I, sorry, but maybe you can download an older version of bwa that supports colorspace.

ADD REPLY • link 6.0 years ago by Michael 55k

0

Entering edit mode

I will try bowtie. It seems bowtie support colorspace

ADD REPLY • link 6.0 years ago by bharata1803 ▴ 560

0

Entering edit mode

Please read more about Solid data conversion to fastq here. You probably lose information with the conversion to fastq.

ADD REPLY • link 6.0 years ago by Benn 8.3k

0

Entering edit mode

I am still confused. I have tried using bowtie to make colorspace index and align in colorspace, but it is still almost 0% reads that can be mapped to the reference. It is really weird.

ADD REPLY • link 6.0 years ago by bharata1803 ▴ 560

0

Entering edit mode

Can you show with head how your fastq file looks like? Is it really in color space?

ADD REPLY • link 6.0 years ago by Benn 8.3k

0

Entering edit mode

It is in colorspace. I have checked it. I will post the head result later.

ADD REPLY • link 6.0 years ago by bharata1803 ▴ 560

0

Entering edit mode

Just to put aside anything related with SRA, you can directly download fastq files for your project here: https://www.ebi.ac.uk/ena/data/view/PRJNA389279][1]

I have looked at one file and indeed it looks like color-space:

@SRR5644749.11183773 11183773/2
T22033022222231202002113112222221030032031212322000
+
!33333333333.--------------------------------------
@SRR5644749.11183774 11183774/2
T0320120300100101212000302000232130232310..........
+
!=/////////....--------------------------**********
@SRR5644749.11183775 11183775/2
T001111001023332010010.............................
+
!@?8886666............*****************************

ADD REPLY • link 6.0 years ago by jomo018 ▴ 730

score 0 · Answer 1 · 2018-12-12

I think the problem is the data is really crappy, and Bowtie is discarding the reads because of too many errors at the seed region. I mapped the reads with Subread and got an average mapping rate of 65.9% for the same file (SRR5644749) you mapped. Even so, there were a lot of orphaned pairs:

//================================= Summary ==================================\\
||                                                                            ||
||     Total fragments : 20,826,355                                           ||
||              Mapped : 13,718,668 (65.9%)                                   ||
||     Uniquely mapped : 11,529,522                                           ||
||       Multi-mapping : 2,189,146                                            ||
||                                                                            ||
||            Unmapped : 7,107,687                                            ||
||                                                                            ||
||    Correctly paired : 6,067,553                                            ||
|| Not mapped in pairs : 7,651,115                                            ||
|| Only one end mapped : 7,109,909                                            ||
||   Multi-chromosomes : 265,002                                              ||
||   Different strands : 290,108                                              ||
||  Not in PE distance : 60,605                                               ||
||      Abnormal order : 33,402                                               ||
||                                                                            ||
||              Indels : 0                                                    ||
||                                                                            ||
||        Running time : 23.6 minutes                                         ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Subread was run with:

subread-buildindex -B -c -F -o genome genome.fasta
subread-align -b -t 0 -i genome --multiMapping -B 2 \
    -o SRR5644749.bam -r SRR5644749_1.fastq.gz -R SRR5644749_2.fastq.gz

Probably tweaking Subread settings should result in a higher mapping rate.

You can use FastQC on the bam file and use the -f bam_mapped flag to evaluate only mapped reads, this will produce a report ignoring unmapped read. The problem with unmapped reads is FastQC converts colorspace to basespace, but this conversion is unreliable without a reference. Mapped reads should be of good quality, and already corrected by the mapper.

P.S.: note that your bowtie mapping command is not outputting a sam file, you need to use the -S flag with Bowtie, because it has a custom default output. Initially, I didn't see the -S in your command.