Question

Identification of short insertions in genome

1

Entering edit mode

7.4 years ago

valerie ▴ 100

Hi guys,

I need to identify insertions of specific sequences in my genomes. I have WGS data with read length of 250 nucl. Previously I was working on identification of large sequences insertions and simply looked for discordantly mapped reads for that using bowtie2 and bbmap. Now I am looking for sequences that are shorter than read length. As I want these data to be comparable to my previous computations I want to use bowtie2 for alignment again. As far as I understand, I should use local mode. I've tried this, but got alignment of very short parts of my reads. Is there a possibility to require the alignment length to be equal to reference sequence length? When I identify the reads that are mapped to the reference sequences, I will find where are their pairs mapped to the reference genome to find the insertion site, is it a correct strategy?

Thanks id advance!

wgs alignment • 3.0k views

ADD COMMENT • link updated 7.3 years ago by Biostar 20 • written 7.4 years ago by valerie ▴ 100

score 3 · Accepted Answer · 2017-06-27

3

Entering edit mode

7.4 years ago

timstuart ▴ 30

I'd recommend using lumpy or delly2 for small indels, rather than trying reinvent your own method

https://github.com/dellytools/delly

https://github.com/arq5x/lumpy-sv

ADD COMMENT • link 7.4 years ago by timstuart ▴ 30

0

Entering edit mode

Thank you! Actually they are almost the size of a read (~230 nucl), is it a small indel?

ADD REPLY • link 7.4 years ago by valerie ▴ 100

0

Entering edit mode

I think delly and lumpy should both be able to detect indels 200-300 bp without much trouble

ADD REPLY • link 7.4 years ago by timstuart ▴ 30

0

Entering edit mode

Great! Thank you, Tim!

ADD REPLY • link 7.4 years ago by valerie ▴ 100

0

Entering edit mode

230bp ! Looks more like tandem duplication than indel. Just a guess.

ADD REPLY • link 7.4 years ago by poisonAlien ★ 3.2k

score 2 · Accepted Answer · 2017-06-27

2

Entering edit mode

7.4 years ago

Brian Bushnell 20k

You can call insertions near read length with BBMap like this (assuming paired interleaved reads that mostly overlap):

bbmerge.sh in=reads.fq out=merged.fq
reformat.sh in=merged.fq out=filtered.fq minlen=350
bbmap.sh in=filtered.fq out=mapped.sam ref=ref.fa slow
callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa ploidy=2

That's a variant of a longer and slower pipeline that I used to successfully call insertions from 1bp to 219bp using 2x100bp reads, on E.coli. If you have sufficient time, coverage, and memory, you can use this method (adjusted for 2x250bp reads) instead for even longer insertions (it's great on small genomes but would take a lot of memory for human-sized ones):

filterbytile.sh in=reads.fq.gz out=filtered_by_tile.fq.gz
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=220 ref=adapters.fa
clumpify.sh in=trimmed.fq.gz out=eccc.fq.gz passes=4 ecc unpair repair
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc prefilter=2
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=20 er=20 k=62 prefilter=2
bbmerge-auto.sh in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=93 extend2=120 prefilter=2
bbmap.sh in=merged.fq.gz out=merged.sam.gz slow ref=ref.fa
callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa ploidy=2

ADD COMMENT • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Dear Brian, Could you please explain in few words what is going on here? Should I add my sequences of interest to reference genome in genome.fa to use the first option? Thank you!

ADD REPLY • link 7.4 years ago by valerie ▴ 100

1

Entering edit mode

Nope, definitely do not add your sequences to the genome! The genome should just be the reference.

The way this works is...

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa slow
callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa

That maps reads to the reference, and calls variants (including insertions) based on the cigar strings of the mapped reads. BBMap is very good at mapping reads with indels, and with the "slow" flag, will allow insertions up to roughly 50% of the read length. So in this case you'd call insertions up to ~125bp using your 250bp reads. However, 230bp is too long for this approach - to call them, you would need longer reads, at least ~460bp. Depending on your insert size distribution, 2x250bp reads can be merged to produce up to ~490bp merged reads. So, the commands merge the overlapping reads, and filter out the short ones (under 350bp; possibly 400bp would be a better cutoff). Mapping the remaining long merged reads, it is possible to call 230bp insertions, though that's near the limit.

The second set of commands does a better job by extending the reads using kmers. This allows merging of non-overlapping reads using information from other reads that span the gap, and additionally lengths both ends of all the reads, potentially resulting in reads over 800bp long, which could allow detection of ~400bp insertions. Though that would require a library with quite large insert sizes. Several of the lines in the second block (such as filterbytile and clumpify) are devoted to error-reduction and error-correction, because the process of read extension is very sensitive to errors - reads with errors near the ends can't be extended.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Brian, thank you so much for your explanation! May I ask you one more question? Actually I am looking for new copies of repetitive elements in cancer genomes. I suppose that tumors are mosaic and I want kind of "measure" the activity of transposition process not identify precisely each insertion. This means that I want to take into account each read pair that corresponds to transposition event, not only well-covered transpositions, that are present in each cell. Can I somehow play with coverage when calling for variants and look for the variants with minimal coverage too?

ADD REPLY • link 7.4 years ago by valerie ▴ 100

1

Entering edit mode

Sure. CallVariants has coverage and allele-frequency flags for allowing low-depth variants to be called. I'd suggest you add the flag "rarity=0.05" if you want to call down to 5% allele-frequency events without penalizing their score, for example.

However, you may not be able to do a good job of extending reads with kmers for low-depth variants, since that process is reliant on coverage (may not work for a variant that is only covered by 3 reads, for example). Simply merging the reads without extension is not reliant on coverage, though.