Hi,
I've read the post bbmap and chip, I'm sure that bbmap is ok also for other chromatin-level applications. I'm trying to map some maize lines (DNA from FAIRE experiment) to reference line. So there are problem, that the aligner must cope with a number of indels, snp and so on. I wonder, how could I allow bbmap for this variation. Now, setting mapq=10 filter in samtools I see, that my mapping percentage is only 15-30 ! The maize line in question is somehow different from the reference.
My current code is for i in {38..39}; do ~/bin/bbmap/bbmap.sh in=../reads_bbtrim/hds${i}_clean.fastq.gz ref=Zea_mays.B73_RefGen_v4.dna.toplevel.fa outm=../bbmaped/vs${i}.bam vslow k=8 maxindel=20 ambig=random minratio=0.1 32bit showprogress=100000 -Xmx100g statsfile=../bbmaped/stats${i}vs covstats=../bbmaped/covstats${i}vs ; done
I thought, that very sensitive setting and some parameters from the cited post would be ok. So there are two problems: 1. use bbmap for FAIRE-seq reads, 2. equalize settings for maize line not-very-similar to the reference maize line.
ok, I have to stop my script, it would end in 21 days. It's too long for me.
Before making any changes to the default you should simply adjust
maxindel=
as Brian had suggested in other thread. You should preferably index the maize genome separately if you are going to reuse the index again. Please usethreads=N
if you have more cores available that will speed things up significantly. This is a tough problem for any aligner (because of ploidy etc) so be patient.Thanks, I've generated index in advance and use all available cores. I'll try adjusting
maxindel
If you have generated index in advance then use
path=
instead ofref=
. Usingref=
will recreate the index each time. I don't seethreads=
option in your command line so I asked about that. Include it explicitly. Why do you have the word32-bit
in the command line?Thanks, I must misunderstood the meaning of
path
andref
, in fact index has been recreated each time. I don't usethreads
because I saw, that bbmap uses all threads by default, I'll add this option, certainly it won't hurt. Most of the time I've also not set memory usage. I've used32bit
because, as written in the documentation:In fact I don't know if I need this, I want that coverage is calculated properly. I've read the bbmap documentation but, as you see, not all parameters are clear for me.
Ok,
maxindel=20
didn't help. Still majority of the reads have mapq<10. Or maybe I'm interpreting this value incorrectly and I should expect that when there are snps and indels?That is because most of your reads are likely multi-mapping. Since maize has a large genome and you have short reads. You could get around that problem by using
ambig=random
. This will pick a random location to place a read that is otherwise multi-mapping. Decide if that makes sense in your specific use case.maxindel=
parameter controls length of opening a gap when aligning two parts of a read. This will generally be determined by how far apart your spliced exons are.I've tried both
ambig=random
and default setting, which isbest
and I believe chooses first best match. My full code, is given below. It give mapq=3 for majority of the reads (55%).bbmap.sh in=hds38_clean.fastq.gz path=b73_2020 outm=mapped38.bam maxindel=20 ambig=random threads=24 -Xmx100g showprogress=250000
. I think I must read something about mapq values meaning.Take a look at this blog post for mapq values. I don't remember how
bbmap
implements these. You can take a look at code if you are a programmer. Brian Bushnell (author of BBMap) unfortunately no longer participates on public forums but I will try to email him and ask.Ok, I'll look at it. I'm not a programmer. BTW, I wanted to use
bbmap
mainly because it seems like cohesive package, i.e. can do many operations, e.g. trimming. I also likebbwrap
because it allows to use both pe and se reads at the same time - I think it is quite unique capability.bbmap
has also good references CIPHERIn fact I also have maize genome (ph207) which is closer to the line I work on (s68). Surprisingly, the mapping is even worse than for the best known reference genome (b73). I suppose it could need some experimentation with the alignment parameters. Second thing, I also have resequenced genome of my investigated line (s68). So potentially I could use it somehow for alignment of FAIRE-seq reads from that line (s68). However I don't know, how to integrate info. from resequencing and variant-calling and FAIRE-seq alignment.