Question

Masking redundancy/paralogy/repetitiveness in a genome assembly using a k-mer approach

0

Entering edit mode

10.5 years ago

apelin20 ▴ 490

Hello,

As you may all know, most analysis on variation requires exclusion of paralogous regions which tend to inflate diversity.

Common approaches are to filter SNPs based on sequencing depth, in order to avoid repetitive regions.

Is there a way to mask with NNNNN all regions in an assembly which share k-mers with other regions and only keep regions present only once? I know BBDuk can mask regions from one fast file present in another fasta file, but what I am looking for is to work with one fasta file and mask repetitive k-mers.

Thanks,

Adrian

BBMap k-mer assembly NGS paralogy • 4.2k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by apelin20 ▴ 490

Ram · Answer 1 · 2015-02-02

1

Entering edit mode

10.5 years ago

Brian Bushnell 20k

Hi Adrian,

I actually have a couple of programs that can be used for that general purpose, depending on the specifics.

First off, there's BBMask [bbmask.sh], which can mask low-entropy areas in a genome - for example, ATATATATATAT....etc. You can adjust the entropy window size and entropy level; the default settings mask approximately 1% of the human genome, basically covering all of the areas that are low-complexity enough that human shares them exactly with plants and fungi. BBMask can also accept a sam file mapped to a genome and mask everywhere the sam reads hit. You could, for example, shred a genome into 100bp pieces, map them to itself, and make a sam file of only the multi-mapping reads, then mask everywhere they hit.

But if you want a kmer-based approach, you can use kmercountexact.sh to generate a fasta file containing all kmers that exist at least 2 times in the genome, then mask those with BBDuk, like this:

kmercountexact.sh in=ref.fa out=kmers.fa mincount=2 k=31
bbduk.sh in=ref.fa ref=kmers.fa out=masked.fa ktrim=N k=31 mm=f

Please note that for the purposes of calling variations, I highly recommend mapping to the unmasked genome. You can then ignore variations occurring in regions that would have been masked... but if you map to the masked genome, you can end up with a read that came from the masked portion (say, a gene with 2 identical copies) mapping to a homologous-but-not-identical region, causing false-positives. Typically, if I am interested in calling high-quality variants, I throw away ambiguously-mapped (multi-mapped) reads rather than masking the genome.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Brian Bushnell 20k

0

Entering edit mode

I love your last approach:) Trying right now

ADD REPLY • link 10.5 years ago by apelin20 ▴ 490

0

Entering edit mode

Odd, got an error:

@BioPower3-IBM ~/programs/bbmap $ sh kmercountexact.sh in=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_3months.fasta mincount=2 k=23 out=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_masked.fasta
java -ea -Xmx128561m -Xms128561m -cp /home/adrian/programs/bbmap/current/ jgi.KmerCountExact in=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_3months.fasta mincount=2 k=23 out=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_masked.fasta
Executing jgi.KmerCountExact [in=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_3months.fasta, mincount=2, k=23, out=/home/adrian/Encephalitozoon/Assembly/ECIII_Lemming_assembly_masked.fasta]

ways=131
Initial size set to 63972057
Initial:
Memory: max=129189m, total=129189m, free=128515m, used=674m

Initialization Time:          112.656 seconds.
Before Load:
Memory: max=129190m, total=129190m, free=19779m, used=109411m

Exception in thread "Thread-3" java.lang.AssertionError:
An input file appears to be misformatted.  The character with ASCII code 63 appeared where a base was expected.
false

[63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63.....

Maybe it's because scaffold %non-ACGTN=0.03

    at stream.Read.<init>(Read.java:85)
    at stream.Read.<init>(Read.java:51)
    at stream.FastaReadInputStream.generateRead(FastaReadInputStream.java:287)
    at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:207)
    at stream.FastaReadInputStream.hasMore(FastaReadInputStream.java:136)
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:642)
    at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:634)

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by apelin20 ▴ 490

0

Entering edit mode

ok. I think it was a couple of "?" chars in my genome... No idea how they got there. Removed them and no error.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by apelin20 ▴ 490

0

Entering edit mode

I see what you mean by mapping to an unmasked genome. Indeed, mapping to a masked genome generates a lot of false positives.

Is there anyway to generate a .bed file with information as to what regions were masked? Then it becomes easy to filter variants called in these regions.

ADD REPLY • link 10.5 years ago by Adrian Pelin ★ 2.7k

0

Entering edit mode

None of my programs generate bed files, but I will plan to add that capability. Perhaps there is an existing tool somewhere that can create a bed file indicating the locations of all the Ns in a genome?

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Brian Bushnell 20k

0

Entering edit mode

I was thinking about that too. If there are 20-50 Ns in a row, there might be no point is masking it, might just be a coincidence, but if we get regions of significant length masked (an I am not sure what those would be, maybe over 100bp?) then we can chose to exclude those. I have long looked for .bed generators of NNNN regions, they would be useful for several tools (Scaffolders do not produce .bed files and some tools request .bed locations of gaps in the assembly). I will look into this some more.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Adrian Pelin ★ 2.7k