Question

Short sequence alignment with low alignment rate

7.5 years ago

jesselee516 ▴ 100

Hi all. I am facing an alignment problem and have run out of ideas. I am trying to align rice DNase-seq reads to the IRGSP build 4 genome. I am using the SRR094111.sra data, converted to FASTQ with default parameters. I ran the alignment with bowtie2 (command: bowtie2 -p 8 --local -x /all_format -U SRR094111.fastq -S SRR094111.sam), but I got the very low alignment rate shown below, and I am not sure what is going on. I have tried trimming the last several bases, but I don't know how many base pairs I should trim. Can anyone show me a detailed pipeline to get a correct alignment? Thanks a lot.

*20896348 reads; of these:
  20896348 (100.00%) were unpaired; of these:
    20676176 (98.95%) aligned 0 times
    152269 (0.73%) aligned exactly 1 time
    67903 (0.32%) aligned >1 times*

First several lines of the FASTQ data (may be helpful):

@SRR094111.1 HWUSI-EAS465_0004:3:1:1043:13479 length=36
GGTAGTAATTGACAAAAGNTCTCGTATGCCGTCTTC
+SRR094111.1 HWUSI-EAS465_0004:3:1:1043:13479 length=36
??6;<@CCCCCC@?@<79!:59897C@CCBCBBCC#

@SRR094111.2 HWUSI-EAS465_0004:3:1:1043:15713 length=36
GAATGCCTGATTGCCTGTAGGTCGTATGCCGTCTTC
+SRR094111.2 HWUSI-EAS465_0004:3:1:1043:15713 length=36
CCCCCBCCCCCCCCCCCCCCCCCC?CCCCCCCCCCC

@SRR094111.3 HWUSI-EAS465_0004:3:1:1043:15796 length=36
ATGGACCATCATCAGCCATCTTCGTATGCCGTCTTC
+SRR094111.3 HWUSI-EAS465_0004:3:1:1043:15796 length=36
CCCCC;CC;CCCBBBCBACCCCCCAA=CCCCBCACC

@SRR094111.4 HWUSI-EAS465_0004:3:1:1043:14078 length=36
TGTTACTTGACGCACAATAATTCGTATGCCGTCTTC
+SRR094111.4 HWUSI-EAS465_0004:3:1:1043:14078 length=36
BAB@:BBA?BBB@B:AAA:<B?BB;B<ABBBBBBB?

The GEO description about data:

The degree of DNase I digestion was assessed by pulsed-field gel electrophoresis (PFGE: 20–60 switch time, 18 h, 6 V/cm; Bio-Rad). High molecular weight (HMW) DNA after DNase I digestion was isolated, blunt ended with T4 DNA polymerase. Biotinylated adaptor I (5’ Bio ACAGGTTCAGAGTTCTACAGTCCGAC and 5’ P- GTCG GACTGTAGAACTCTGAAC) was ligated to the DNA molecules. Dynal M-280 beads (Invitrogen) were used for enriching DNase I digested DNA ends after MmeI digestion. Adaptor II (5’ P-TCGTATGCCGTCTTCTGCTTG and 5’ CAAGCAGAAGACGGCATACGANN) was then ligated to the MmeI treated ends. The DNA sample was amplified by PCR using linker-specific primers (5’ CAAGCAGAAGACGG CATACGA and 5’AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA), and purified by PAGE for isolation of DNA fragments with about 90 bp in size. The final Illumina sequencing was performed using a primer specific to linker I (5’CCACCGACAGGTTCAGAGTTCTACAGTCCGAC).
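Comparing the protocol above with the reads is one quick sanity check: all four example reads end in TCGTATGCCGTCTTC, which is the start of Adaptor II as quoted in the GEO description. The Python sketch below trims that adaptor from the 3' end of each read using exact matching only. The function name `trim_adaptor` and the `min_overlap` threshold are illustrative assumptions, not part of any published pipeline; a dedicated trimming tool would also tolerate mismatches.

```python
# Adaptor II (5' P- strand) as quoted in the GEO protocol description.
ADAPTOR_II = "TCGTATGCCGTCTTCTGCTTG"

# The four example reads from the FASTQ excerpt (36 bp each).
reads = [
    "GGTAGTAATTGACAAAAGNTCTCGTATGCCGTCTTC",
    "GAATGCCTGATTGCCTGTAGGTCGTATGCCGTCTTC",
    "ATGGACCATCATCAGCCATCTTCGTATGCCGTCTTC",
    "TGTTACTTGACGCACAATAATTCGTATGCCGTCTTC",
]


def trim_adaptor(read, adaptor=ADAPTOR_II, min_overlap=10):
    """Remove a 3' adaptor from a read (exact matches only).

    min_overlap is an assumed threshold: the read must end with at
    least this many adaptor bases before anything is trimmed.
    """
    # Case 1: the full adaptor occurs somewhere inside the read.
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adaptor, so look for
    # the longest read suffix that is also an adaptor prefix.
    for start in range(len(read) - min_overlap + 1):
        if adaptor.startswith(read[start:]):
            return read[:start]
    # No adaptor found: return the read unchanged.
    return read


trimmed = [trim_adaptor(r) for r in reads]
for original, insert in zip(reads, trimmed):
    print(len(original), "->", len(insert), insert)
```

For the four reads shown, this leaves a 21 bp insert each, so 15 of the 36 bases in each read are adaptor rather than rice genome. Whether this fully explains the low alignment rate is untested here.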

*The following are the FastQC modules that reported failures; all other modules passed.*

https://drive.google.com/file/d/0B-nCMrsqGWH3RmdQOXl3T1dkblU/view?usp=sharing

https://drive.google.com/file/d/0B-nCMrsqGWH3VEVNX09SOVJRUWM/view?usp=sharing

https://drive.google.com/file/d/0B-nCMrsqGWH3bTBQa29ORUVNOE0/view?usp=sharing

bowtie alignment next-gen • 2.0k views

updated 7.4 years ago by Biostar 20 • written 7.5 years ago by jesselee516 ▴ 100

While the Q-scores are not great, they are not atrocious either. Since this is old GAII data, I would suggest that you take the quality encoding into account (it is likely in Illumina format, phred+64). Try bowtie (instead of bowtie2) to see if ungapped alignments improve things. Before you veer off in other directions, first try to replicate the analysis from whatever paper this came from as closely as possible.

ADD REPLY • link 7.5 years ago by GenoMax 147k
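
The quality encoding can be checked from the FASTQ excerpt in the question. The Python sketch below uses a common heuristic: phred+64 (and the older Solexa+64) encodings cannot produce characters below `;` (ASCII 59), so any lower character points to phred+33. The function name `guess_phred_offset` is hypothetical, and the heuristic can only rule phred+64 out, not prove it.

```python
# Quality strings copied from the four FASTQ records in the question.
quality_lines = [
    "??6;<@CCCCCC@?@<79!:59897C@CCBCBBCC#",
    "CCCCCBCCCCCCCCCCCCCCCCCC?CCCCCCCCCCC",
    "CCCCC;CC;CCCBBBCBACCCCCCAA=CCCCBCACC",
    "BAB@:BBA?BBB@B:AAA:<B?BB;B<ABBBBBBB?",
]


def guess_phred_offset(lines):
    """Guess the quality offset (33 or 64) from the lowest character seen.

    Heuristic only: a result of 64 means "no evidence against phred+64",
    because high-quality phred+33 data can also lack low characters.
    """
    lowest = min(ord(ch) for line in lines for ch in line)
    # ';' is ASCII 59, the lowest value Solexa+64 can encode (-5 + 64).
    return 33 if lowest < 59 else 64


offset = guess_phred_offset(quality_lines)
print("lowest character:", min(ch for line in quality_lines for ch in line))
print("guessed offset:", offset)
```

The first quality string contains `!` and `#`, so this excerpt looks like phred+33, which bowtie2 assumes by default (`--phred33`).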

The results from (what I think is) the original paper are remarkably different:

We obtained a total of 43 million sequence reads from the seedling libraries and 57 million reads from the callus libraries (Supplemental Table S1). Approximately 70% of the reads were mapped to unique positions in the rice genome.

MAQ with a 1bp mismatch was used to align the DNAseq reads.

ADD REPLY • link 7.5 years ago by h.mon 35k

Are you sure that you are aligning to the correct reference? Did you download and index the genome yourself, or was it provided by someone else? Have you successfully aligned other data to that reference before? That would be the most obvious explanation to rule out before you start chasing ghosts.

ADD REPLY • link 7.4 years ago by ATpoint 85k