One library, two sequencing runs, two different alignment efficiencies?
2
1
Entering edit mode
9.1 years ago
twrzes ▴ 20

Dear BioStars Community,

I have a problem with an alignment to the transcriptome.

I have 8 RNA-Seq libraries sequenced on Illumina HiScanSQ system in one lane (2x100bp, paired-end) per sequencing run. These 8 libraries (1 pool) were put into two sequencing runs to obtain a decent number of reads. After demultiplexing (using bcl2fastq-1.8.4) the reads were trimmed using TrimGalore and aligned to the previously assembled transcriptome (because there is no reference genome for the organism - Pinus sylvestris - I am trying to analyze...) by Bowtie2-2.2.6. In the case of 7 libraries there were almost no difference in the alignment efficiency (~85-95%, with ~60-85% of uniquely mapped reads), but in case of one library something strange happened:

Run1:

11109859 reads; of these:
  11109859 (100.00%) were paired; of these:
    1701658 (15.32%) aligned concordantly 0 times
    6666961 (60.01%) aligned concordantly exactly 1 time
    2741240 (24.67%) aligned concordantly >1 times
    ----
    1701658 pairs aligned concordantly 0 times; of these:
      11078 (0.65%) aligned discordantly 1 time
    ----
    1690580 pairs aligned 0 times concordantly or discordantly; of these:
      3381160 mates make up the pairs; of these:
        3218192 (95.18%) aligned 0 times
        86430 (2.56%) aligned exactly 1 time
        76538 (2.26%) aligned >1 times
85.52% overall alignment rate

Run2:

14719563 reads; of these:
  14719563 (100.00%) were paired; of these:
    7641835 (51.92%) aligned concordantly 0 times
    4991995 (33.91%) aligned concordantly exactly 1 time
    2085733 (14.17%) aligned concordantly >1 times
    ----
    7641835 pairs aligned concordantly 0 times; of these:
      7874 (0.10%) aligned discordantly 1 time
    ----
    7633961 pairs aligned 0 times concordantly or discordantly; of these:
      15267922 mates make up the pairs; of these:
        15039673 (98.51%) aligned 0 times
        94443 (0.62%) aligned exactly 1 time
        133806 (0.88%) aligned >1 times
48.91% overall alignment rate

So my question is: what should I do to find out what went wrong? I excluded (maybe too soon...) the human error because this was the same pool used for two runs (from one Eppendorf tube).

I also did FastQC on demultiplexed and trimmed reads - links for this library with low alignment efficiency are provided below:

Run1 (the good one), demultiplexed: http://twrzes.wtvk.pl/run1_R1_fastqc.html and http://twrzes.wtvk.pl/run1_R2_fastqc.html

Run1, after trimming: http://twrzes.wtvk.pl/run1_R1_trimmed_fastqc.html and http://twrzes.wtvk.pl/run1_R2_trimmed_fastqc.html

Run2 (the bad one), demultiplexed: http://twrzes.wtvk.pl/run2_R1_fastqc.html and http://twrzes.wtvk.pl/run2_R2_fastqc.html

Run2, after trimming: http://twrzes.wtvk.pl/run2_R1_trimmed_fastqc.html and http://twrzes.wtvk.pl/run2_R2_trimmed_fastqc.html

Command-line commands I used for:

1) Demultiplexing:

/path/to/configureBclToFastq.pl --input-dir /path/to/folder/with/BCLs/Data/Intensities/BaseCalls --output-dir /path/to/folder/with/BCLs/Unaligned --sample-sheet /path/to/folder/with/BCLs/sample-sheet.csv --fastq-cluster-count 0 --mismatches 1 --with-failed-reads

2) Trimming (TrimGalore-0.4.0, a wrapper for cutadapt-1.8.3):

trimgalore --paired --quality 20 --illumina --stringency 1 -e 0.2 --length 40 -o /path/to/trimmed/fastq --trim1 run1_R1.fastq run1_R2.fastq

3) Alignment (Bowtie2-2.2.6)

bowtie2 -p 12 -I 0 -X 2000 --dovetail --very-sensitive-local -N 1 -x /path/to/index/index -1 run1_R1_trimmed.fastq -2 run1_R2_trimmed.fastq -S /path/to/aligned/sam/run1.sam

If you need any additional info, I would be more than happy to provide it.

Thank you very much for your efforts on solving this problem.

Kind regards,
Tomasz Wrzesinski

--
Tomasz Wrzesinski, MSc
PhD Student
Laboratory of High Throughput Technologies
Institute of Molecular Biology and Biotechnology
Faculty of Biology
Adam Mickiewicz University in Poznan
Umultowska 89/1.117
61-614 Poznan, Poland
tel. +48 61 829 5833
e-mail: twrzes@amu.edu.pl

RNA-Seq alignment next-gen • 3.1k views
ADD COMMENT
0
Entering edit mode

You mixed your links: "run1, after trimming" actually points to raw run2, and "run2, demultiplexed" points to run1 after trimming.

ADD REPLY
0
Entering edit mode

I am very sorry for that, I edited my post so now everything should be OK.

Thank you for pointing out my mistake.

Kind regards,

Tomasz Wrzesinski

ADD REPLY
0
Entering edit mode

Did you run FastQC before and after read cleaning of both runs?

ADD REPLY
0
Entering edit mode

Yes, I did, links are provided in my post (below Bowtie2 reports).

Kind regards,

Tomasz Wrzesinski

ADD REPLY
2
Entering edit mode
9.1 years ago
h.mon 35k

My first approach would be assemble the run2 unmapped reads and blast the contigs, looking for contaminants. Alternatively, you could blast a sample of the unmapped raw reads. It may be interesting to assemble and map run1 unmapped reads as well, as a control.

Are you using a stranded or unstranded library preparation protocol?

Looking at the FastQC reports, both runs seems just fine. The only suspicious thing I noticed is GC content seems to be slightly different (1%-2%) between runs, and on run2 %A is consistently higher than %T.

ADD COMMENT
0
Entering edit mode

Dear h.mon,

Thank you very much for your answer. I will perform the assembly on unmapped reads to search for contaminants. However, if this was the contamination of a library, shouldn't it be visible in these two runs?

This is an unstranded protocol.

I checked other samples for this phenomenon and the difference you wrote about is visible only in the case of this library - remaining 7 libraries look the same when looking at the base %. This is strange...

Thank you very much for once again for your time.

Kind regards,
Tomasz Wrzesinski

ADD REPLY
0
Entering edit mode
9.1 years ago
twrzes ▴ 20

Dear h.mon,

I haven't done the assembly yet, but I ran few reads through BLAST and these reads came from Homo sapiens.

After a consultation with a person who actually is responsible for running our sequencer it turned out that she put additional sample (human DNA), but she didn't see that indices for the sample I am analyzing and this human DNA library are identical. Therefore, apparently, the base composition of each read was different.

Thank you once again for your tremendous help - I really appreciate your input.

Kind regards,

Tomasz

ADD COMMENT
1
Entering edit mode

You are welcome. So now you have an easy solution, map run2 against the human genome, the unmapped reads should be mostly from your species.

ADD REPLY
0
Entering edit mode

I will definitely follow your advice - thank you once again for your input.

Kind regards,

Tomasz

ADD REPLY

Login before adding your answer.

Traffic: 1291 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6