I have a 150bp short read whole-genome library by analyzing which I am trying to discover the breakpoint of a possible insertional mutation in a plant genome.
I am wondering Is there any read-mapping package that could map only the first half (e.g. 75bp) or the second half of reads?
I am doing the exact same thing, but since the depth of coverage of my library is not that high, I am considering if I map 75bp-cut reads would have missed some of the true hit which actually should map as length e.g. 90bp, 120bp etc..
I have read that CIGAR alignment operation S corresponds to soft-clipped reads. Is it a simpler extraction method using awk to select "S" in the CIGAR field?
bwa mem performs split read alignment and, for the reads where the breakpoint is indeed in the middle, will report two alignments for the one read. I've found that bwa doesn't report them all. I found that feeding the non-split-read soft clipped alignments back into bwa gives additional split read alignments not initially reported by bwa. I've already written a program (gridss.SoftClipsToSplitReads) to do this.
On a higher level, have you considering using a structural variant calling program to detect your breakpoints? There's over 50 general purpose NGS SV callers and I'm sure at least some are likely to fulfill your requirements. Many of these tool authors (such as myself) will have already considered some data complexities/artefacts that you may not be aware of. For example, are you aware that in the presence of certain sequencing errors and/or micro-homologies, bwa will align reads with long soft clips to position A, reads with medium length soft clips to position B, and reads with short soft clips to also to position A? I've encountered cases where positions A and B are 15bp apart. If your SV detection analysis/tool is not aware of this, you might think you have two different insertion events when in reality, it's just an alignment artefact.
TLDR: use a soft-clipped/split-read based called such as CREST, Gustaf, Socrates, SoftSV, or GRIDSS. I recommend GRIDSS as my benchmarking indicates it produces the best results (followed by CREST, although the CREST call set is conservative).
Wow! Actually I have checked out several structural variant calling packages (e.g. fermikit, Pindel, breakdancer, Platypus) but I was stuck in the compilation step of those packages... After failing using those packages, and since I am still quite new to Linux, I turned to use more manual methods through which I understand what I am doing each step.
I am checking out your gridss.SoftClipsToSplitReads and the paper about GRIDSS. I will discuss with the PhD student and get back to you. Thank you so much!
Cameron, I have just skimmed through the paper. I have the following questions:
1) Since I don't have the completed genome as a reference (the first step of GRIDSS), do you think I should use de novo contigs assembled from the same full-set of reads (or target-containing reads) OR a completed genome of a close-relative which I am confident that it does not contain the same insert at the same position as the same species I am investigating?
2) How do you compare GRIDSS and other existing structural variant calling programs (e.g. fermikit, Pindel, breakdancer, Platypus)?
why don't you just cut your fastq before aligning ?
I am doing the exact same thing, but since the depth of coverage of my library is not that high, I am considering if I map 75bp-cut reads would have missed some of the true hit which actually should map as length e.g. 90bp, 120bp etc..
Why not just map everything as is and allow a LOT of soft-clipping? The soft-clipped portion is then likely to be your insert.
How to do soft-clipping in bwa? And how could I extract those soft-clipped reads? Thanks!
bwa does soft-clipping by default.
Extract soft-clipped reads from BWA -generated SAM file
Thank you very much!
I have read that CIGAR alignment operation S corresponds to soft-clipped reads. Is it a simpler extraction method using awk to select "S" in the CIGAR field?