Question

How to improve the alignment performance in targeted sequencing data

4.2 years ago

SkyL ▴ 10

Hi, I am new to the sequencing field and am trying to align a targeted sequencing dataset (most reads are 90 to 150 bp long) to a reference genome.

I did some preprocessing, namely trimming the adapter and trimming a barcode sequence from R1, and then aligned R1 and R2 to the reference using bowtie2. The summary is shown in the figure: [figure: bowtie2 alignment summary]

The overall alignment rate looks OK, but I am concerned about the "aligned concordantly" breakdown, which suggests that a lot of reads were not aligned.

I also got a lot of warnings, such as: [figure: bowtie2 warning messages]

I think this is because trimming produced a lot of short reads. However, I filtered out reads shorter than 20 bp with cutadapt, and the warnings are still there.

Can anyone familiar with targeted sequencing tell me whether this is normal? Any suggestions for improving the alignment would be much appreciated. Thanks!

DNA-seq targeted-sequencing alignment • 1.5k views

ADD COMMENT • link updated 4.2 years ago by swbarnes2 14k • written 4.2 years ago by SkyL ▴ 10

score 0 · Answer 1 · 2021-02-12

4.2 years ago

swbarnes2 14k

I'm not sure how anyone can help you if you don't show all the code you used to trim. If 100% of your reads are paired, that means that either you removed both elements of a pair if one was too short, or you didn't remove anything no matter the length. Given the error messages, it looks like you did the latter.

Anyway, if you have a lot of reads which are mostly adapter, then you can't program your way out of that. You can remove them so the software doesn't complain, but that won't actually increase the number of usable reads you have.
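To see which of those two cases applies, it may help to count the reads, and the short reads, in each trimmed file. The sketch below assumes gzip-compressed FASTQ with standard four-line records; the helper name `fastq_length_summary` is made up for illustration and is not part of cutadapt or bowtie2.

```shell
# Print "<total reads> <reads shorter than MIN_LEN>" for a gzipped FASTQ file.
# Assumes standard 4-line FASTQ records (the sequence is line 2 of every 4).
# fastq_length_summary is a hypothetical helper name.
# $1 = FASTQ.gz file, $2 = minimum length (default 20)
fastq_length_summary() {
    gzip -dc "$1" | awk -v min="${2:-20}" '
        NR % 4 == 2 { total++; if (length($0) < min) short++ }
        END { printf "%d %d\n", total, short }'
}
```

Calling it as `fastq_length_summary R1.trimmed1.fastq.gz 20` and again on `R2.trimmed.fastq.gz` gives two pairs of numbers. Equal totals with a nonzero second number would suggest that no length filter was applied to that file; unequal totals would mean the two files are no longer in sync.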

I hope you are aligning to the entire genome, and not just your target. That is the better way to align. You can filter down to your target after alignment.
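Filtering a genome-wide alignment down to the targets afterwards could look like the sketch below. It assumes samtools is installed and that the target regions are available as a BED file; `filter_to_targets` and the file names are placeholders, not something taken from this thread.

```shell
# Keep only alignments that overlap the target regions.
# $1 = genome-wide BAM, $2 = BED file of target regions, $3 = output BAM
# filter_to_targets is a hypothetical helper name.
filter_to_targets() {
    samtools view -b -L "$2" -o "$3" "$1"
}
```

For example, `filter_to_targets aligned.bam targets.bed on_target.bam`. Comparing read counts before and after also gives a rough on-target rate.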

ADD COMMENT • link 4.2 years ago by swbarnes2 14k

Thanks for the reply. Yes, I used the whole hg19 genome as the reference. I ran FastQC, and the Adapter Content check suggests the Illumina Universal Adapter may be present: [figure: FastQC Adapter Content plot]

I looked up the adapter sequence, AGATCGGAAGAG, and then used cutadapt to trim it from the 3' end of both R1 and R2:

cutadapt -a AGATCGGAAGAG -o R1.trimmed.fastq.gz -p R2.trimmed.fastq.gz R1.fastq.gz R2.fastq.gz

Then I found a repeated sequence near the 5' end of R1 and trimmed it from R1 only:

cutadapt -g SPECIFIC_SEQ -o R1.trimmed1.fastq.gz R1.trimmed.fastq.gz

I did not find any such repeated sequence in R2, so I left R2.trimmed.fastq.gz unchanged.

I also aligned R1 and R2 separately to the reference genome. In the summaries, R1 with the adapter and barcode trimmed looks normal, but R2 still looks bad (see the percentage of reads that aligned 0 times):

[figure: bowtie2 summaries for R1 and R2 aligned separately]

ADD REPLY • link 4.2 years ago by SkyL ▴ 10

Your R1 trimmed and R2 trimmed have the same percentages. Whatever you trimmed from read 1 worked; there must be something you read through on read 2 that you need to trim.
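One thing that may be worth checking against the cutadapt commands quoted above: in cutadapt's paired-end mode, `-a` and `-g` apply to R1 only, and the corresponding options for R2 are `-A` and `-G`. A call that passes only `-a` therefore leaves the adapter on R2. Below is a minimal sketch that trims the same adapter from both mates and drops a pair when either mate ends up shorter than 20 bp. The function name and output file names are placeholders, the 20 bp cutoff is the one mentioned in the question, and whether this explains the R2 result here is an assumption.

```shell
# Trim the 3' adapter from both mates and discard pairs with a short mate.
# $1/$2 = input R1/R2, $3/$4 = output R1/R2 (all file names are placeholders)
# -a trims R1, -A trims R2; -m 20 drops the pair if either mate is < 20 bp
# (cutadapt's default --pair-filter=any behaviour).
ADAPTER=AGATCGGAAGAG   # adapter prefix reported in the thread

trim_both_mates() {
    cutadapt \
        -a "$ADAPTER" \
        -A "$ADAPTER" \
        -m 20 \
        -o "$3" -p "$4" \
        "$1" "$2"
}
```

For example, `trim_both_mates R1.fastq.gz R2.fastq.gz R1.trimmed.fastq.gz R2.trimmed.fastq.gz`. Because both files pass through a single paired-end call, the outputs stay in sync for bowtie2.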

ADD REPLY • link 4.2 years ago by swbarnes2 14k

Yes, that is what I thought: R1, with the adapter and barcode trimmed, aligns well to the reference (at least I think so). At this stage I only have enough information to trim the adapter from R2, and nothing tells me what else should be trimmed. I suspect something else needs to be trimmed from R2 for most of it to align to the reference, but I do not know how to identify it. Do you have any suggestions? Many thanks!
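One way to look for a leftover sequence is to tally the most common bases at the 3' end of the R2 reads; a k-mer that dominates the list is a candidate for trimming. Since R1 carried a fixed sequence at its 5' end, the reverse complement of that sequence appearing near the 3' end of R2 would be one plausible finding when inserts are shorter than the read length, though that is an assumption and not something this thread confirms. The helper name `top_end_kmers` and the defaults (k = 12, top 10) are illustrative.

```shell
# List the N most frequent K-base suffixes (3' ends) in a gzipped FASTQ.
# Assumes standard 4-line FASTQ records; reads shorter than K are skipped.
# $1 = FASTQ.gz file, $2 = K (default 12), $3 = N (default 10)
# top_end_kmers is a hypothetical helper name.
top_end_kmers() {
    gzip -dc "$1" |
        awk -v k="${2:-12}" 'NR % 4 == 2 && length($0) >= k {
            print substr($0, length($0) - k + 1)
        }' |
        sort | uniq -c | sort -k1,1nr | head -n "${3:-10}"
}
```

For example, `top_end_kmers R2.trimmed.fastq.gz 12 10` prints counts next to each suffix. The Overrepresented sequences module in the FastQC report for R2 is another place to look for the same signal.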

ADD REPLY • link 4.2 years ago by SkyL ▴ 10