Question

One library, two sequencing runs, two different alignment efficiencies?

1

Entering edit mode

9.1 years ago

twrzes ▴ 20

Dear BioStars Community,

I have a problem with an alignment to the transcriptome.

I have 8 RNA-Seq libraries sequenced on Illumina HiScanSQ system in one lane (2x100bp, paired-end) per sequencing run. These 8 libraries (1 pool) were put into two sequencing runs to obtain a decent number of reads. After demultiplexing (using bcl2fastq-1.8.4) the reads were trimmed using TrimGalore and aligned to the previously assembled transcriptome (because there is no reference genome for the organism - Pinus sylvestris - I am trying to analyze...) by Bowtie2-2.2.6. In the case of 7 libraries there were almost no difference in the alignment efficiency (~85-95%, with ~60-85% of uniquely mapped reads), but in case of one library something strange happened:

Run1:

11109859 reads; of these:
  11109859 (100.00%) were paired; of these:
    1701658 (15.32%) aligned concordantly 0 times
    6666961 (60.01%) aligned concordantly exactly 1 time
    2741240 (24.67%) aligned concordantly >1 times
    ----
    1701658 pairs aligned concordantly 0 times; of these:
      11078 (0.65%) aligned discordantly 1 time
    ----
    1690580 pairs aligned 0 times concordantly or discordantly; of these:
      3381160 mates make up the pairs; of these:
        3218192 (95.18%) aligned 0 times
        86430 (2.56%) aligned exactly 1 time
        76538 (2.26%) aligned >1 times
85.52% overall alignment rate

Run2:

14719563 reads; of these:
  14719563 (100.00%) were paired; of these:
    7641835 (51.92%) aligned concordantly 0 times
    4991995 (33.91%) aligned concordantly exactly 1 time
    2085733 (14.17%) aligned concordantly >1 times
    ----
    7641835 pairs aligned concordantly 0 times; of these:
      7874 (0.10%) aligned discordantly 1 time
    ----
    7633961 pairs aligned 0 times concordantly or discordantly; of these:
      15267922 mates make up the pairs; of these:
        15039673 (98.51%) aligned 0 times
        94443 (0.62%) aligned exactly 1 time
        133806 (0.88%) aligned >1 times
48.91% overall alignment rate

So my question is: what should I do to find out what went wrong? I excluded (maybe too soon...) the human error because this was the same pool used for two runs (from one Eppendorf tube).

I also did FastQC on demultiplexed and trimmed reads - links for this library with low alignment efficiency are provided below:

Run1 (the good one), demultiplexed: http://twrzes.wtvk.pl/run1_R1_fastqc.html and http://twrzes.wtvk.pl/run1_R2_fastqc.html

Run1, after trimming: http://twrzes.wtvk.pl/run1_R1_trimmed_fastqc.html and http://twrzes.wtvk.pl/run1_R2_trimmed_fastqc.html

Run2 (the bad one), demultiplexed: http://twrzes.wtvk.pl/run2_R1_fastqc.html and http://twrzes.wtvk.pl/run2_R2_fastqc.html

Run2, after trimming: http://twrzes.wtvk.pl/run2_R1_trimmed_fastqc.html and http://twrzes.wtvk.pl/run2_R2_trimmed_fastqc.html

Command-line commands I used for:

1) Demultiplexing:

/path/to/configureBclToFastq.pl --input-dir /path/to/folder/with/BCLs/Data/Intensities/BaseCalls --output-dir /path/to/folder/with/BCLs/Unaligned --sample-sheet /path/to/folder/with/BCLs/sample-sheet.csv --fastq-cluster-count 0 --mismatches 1 --with-failed-reads

2) Trimming (TrimGalore-0.4.0, a wrapper for cutadapt-1.8.3):

trimgalore --paired --quality 20 --illumina --stringency 1 -e 0.2 --length 40 -o /path/to/trimmed/fastq --trim1 run1_R1.fastq run1_R2.fastq

3) Alignment (Bowtie2-2.2.6)

bowtie2 -p 12 -I 0 -X 2000 --dovetail --very-sensitive-local -N 1 -x /path/to/index/index -1 run1_R1_trimmed.fastq -2 run1_R2_trimmed.fastq -S /path/to/aligned/sam/run1.sam

If you need any additional info, I would be more than happy to provide it.

Thank you very much for your efforts on solving this problem.

Kind regards,
Tomasz Wrzesinski

--
Tomasz Wrzesinski, MSc
PhD Student
Laboratory of High Throughput Technologies
Institute of Molecular Biology and Biotechnology
Faculty of Biology
Adam Mickiewicz University in Poznan
Umultowska 89/1.117
61-614 Poznan, Poland
tel. +48 61 829 5833
e-mail: twrzes@amu.edu.pl

RNA-Seq alignment next-gen • 3.1k views

ADD COMMENT • link 9.1 years ago by twrzes ▴ 20

0

Entering edit mode

You mixed your links: "run1, after trimming" actually points to raw run2, and "run2, demultiplexed" points to run1 after trimming.

ADD REPLY • link 9.1 years ago by h.mon 35k

0

Entering edit mode

I am very sorry for that, I edited my post so now everything should be OK.

Thank you for pointing out my mistake.

Kind regards,

Tomasz Wrzesinski

ADD REPLY • link 9.1 years ago by twrzes ▴ 20

0

Entering edit mode

Did you run FastQC before and after read cleaning of both runs?

ADD REPLY • link 9.1 years ago by Irsan ★ 7.8k

0

Entering edit mode

Yes, I did, links are provided in my post (below Bowtie2 reports).

Kind regards,

Tomasz Wrzesinski

ADD REPLY • link 9.1 years ago by twrzes ▴ 20

0

Entering edit mode

9.1 years ago

twrzes ▴ 20

Dear h.mon,

I haven't done the assembly yet, but I ran few reads through BLAST and these reads came from Homo sapiens.

After a consultation with a person who actually is responsible for running our sequencer it turned out that she put additional sample (human DNA), but she didn't see that indices for the sample I am analyzing and this human DNA library are identical. Therefore, apparently, the base composition of each read was different.

Thank you once again for your tremendous help - I really appreciate your input.

Kind regards,

Tomasz

ADD COMMENT • link 9.1 years ago by twrzes ▴ 20

1

Entering edit mode

You are welcome. So now you have an easy solution, map run2 against the human genome, the unmapped reads should be mostly from your species.

ADD REPLY • link 9.1 years ago by h.mon 35k

0

Entering edit mode

I will definitely follow your advice - thank you once again for your input.

Kind regards,

Tomasz

ADD REPLY • link 9.1 years ago by twrzes ▴ 20

score 2 · Accepted Answer · 2015-10-13

2

Entering edit mode

9.1 years ago

h.mon 35k

My first approach would be assemble the run2 unmapped reads and blast the contigs, looking for contaminants. Alternatively, you could blast a sample of the unmapped raw reads. It may be interesting to assemble and map run1 unmapped reads as well, as a control.

Are you using a stranded or unstranded library preparation protocol?

Looking at the FastQC reports, both runs seems just fine. The only suspicious thing I noticed is GC content seems to be slightly different (1%-2%) between runs, and on run2 %A is consistently higher than %T.

ADD COMMENT • link 9.1 years ago by h.mon 35k

0

Entering edit mode

Dear h.mon,

Thank you very much for your answer. I will perform the assembly on unmapped reads to search for contaminants. However, if this was the contamination of a library, shouldn't it be visible in these two runs?

This is an unstranded protocol.

I checked other samples for this phenomenon and the difference you wrote about is visible only in the case of this library - remaining 7 libraries look the same when looking at the base %. This is strange...

Thank you very much for once again for your time.

Kind regards,
Tomasz Wrzesinski

ADD REPLY • link 9.1 years ago by twrzes ▴ 20