Is there any read-mapping package that could map only the first half or the second half of reads?
1
0
Entering edit mode
7.1 years ago
johnnytam100 ▴ 110

I have a 150bp short read whole-genome library by analyzing which I am trying to discover the breakpoint of a possible insertional mutation in a plant genome.

I am wondering Is there any read-mapping package that could map only the first half (e.g. 75bp) or the second half of reads?

Thank you very much.

mapping bwa sequencing • 1.7k views
ADD COMMENT
0
Entering edit mode

why don't you just cut your fastq before aligning ?

ADD REPLY
0
Entering edit mode

I am doing the exact same thing, but since the depth of coverage of my library is not that high, I am considering if I map 75bp-cut reads would have missed some of the true hit which actually should map as length e.g. 90bp, 120bp etc..

ADD REPLY
0
Entering edit mode

Why not just map everything as is and allow a LOT of soft-clipping? The soft-clipped portion is then likely to be your insert.

ADD REPLY
0
Entering edit mode

How to do soft-clipping in bwa? And how could I extract those soft-clipped reads? Thanks!

ADD REPLY
1
Entering edit mode

How to do soft-clipping in bwa?

bwa does soft-clipping by default.

And how could I extract those soft-clipped reads?

Extract soft-clipped reads from BWA -generated SAM file

ADD REPLY
0
Entering edit mode

Thank you very much!

I have read that CIGAR alignment operation S corresponds to soft-clipped reads. Is it a simpler extraction method using awk to select "S" in the CIGAR field?

ADD REPLY
0
Entering edit mode
7.1 years ago
d-cameron ★ 2.9k

bwa mem performs split read alignment and, for the reads where the breakpoint is indeed in the middle, will report two alignments for the one read. I've found that bwa doesn't report them all. I found that feeding the non-split-read soft clipped alignments back into bwa gives additional split read alignments not initially reported by bwa. I've already written a program (gridss.SoftClipsToSplitReads) to do this.

On a higher level, have you considering using a structural variant calling program to detect your breakpoints? There's over 50 general purpose NGS SV callers and I'm sure at least some are likely to fulfill your requirements. Many of these tool authors (such as myself) will have already considered some data complexities/artefacts that you may not be aware of. For example, are you aware that in the presence of certain sequencing errors and/or micro-homologies, bwa will align reads with long soft clips to position A, reads with medium length soft clips to position B, and reads with short soft clips to also to position A? I've encountered cases where positions A and B are 15bp apart. If your SV detection analysis/tool is not aware of this, you might think you have two different insertion events when in reality, it's just an alignment artefact.

TLDR: use a soft-clipped/split-read based called such as CREST, Gustaf, Socrates, SoftSV, or GRIDSS. I recommend GRIDSS as my benchmarking indicates it produces the best results (followed by CREST, although the CREST call set is conservative).

ADD COMMENT
0
Entering edit mode

Wow! Actually I have checked out several structural variant calling packages (e.g. fermikit, Pindel, breakdancer, Platypus) but I was stuck in the compilation step of those packages... After failing using those packages, and since I am still quite new to Linux, I turned to use more manual methods through which I understand what I am doing each step.

I am checking out your gridss.SoftClipsToSplitReads and the paper about GRIDSS. I will discuss with the PhD student and get back to you. Thank you so much!

ADD REPLY
0
Entering edit mode

Cameron, I have just skimmed through the paper. I have the following questions:

1) Since I don't have the completed genome as a reference (the first step of GRIDSS), do you think I should use de novo contigs assembled from the same full-set of reads (or target-containing reads) OR a completed genome of a close-relative which I am confident that it does not contain the same insert at the same position as the same species I am investigating?

2) How do you compare GRIDSS and other existing structural variant calling programs (e.g. fermikit, Pindel, breakdancer, Platypus)?

Originally I am trying to follow the manual selection method as stated in this paper (FYI): http://www.plantphysiol.org/content/plantphysiol/early/2017/08/15/pp.17.00715.full.pdf (refer to Structural variation breakpoint search) and parallely I would like to try out GRIDSS.

ADD REPLY
1
Entering edit mode

Try bioconda to avoid "stuck at compilation" problems. It will change your life (for the better)

https://bioconda.github.io/recipes.html

ADD REPLY

Login before adding your answer.

Traffic: 2225 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6