Entering edit mode
3.7 years ago
boczniak767
▴
870
Hi,
I have FAIRE-seq data from two cultivars: reference B73 and less known S68.
I assumed I can align reads from both cultivars to B73 reference sequence, which is standard.
I used bbmap
with default settings.
I've seen, that MAPQ>10 read filtering is frequently applied. The problem is, that (I think) too low percentage of reads pass this filter. It is ca. 50-60% for B73, which I assume should be mapped very well, and ca. 40-48% for S68.
I have folowing questions:
- What settings, for example for
bbmap
orbwa mem
are good for alignment of chromatin-level data from non-standard cultivar to reference genome? - What MAPQ threshold should I apply?
- I haven't filtered
fastq
files for base qualities (looks good infastqc
after adapter removal). Should I do this filtering?
Here are MAPQ plots from bamqc
b73 cultivar
s68 cultivar
How long are your reads? There is usually no need to change defaults, FAIRE is just normal DNA-seq in the end. Could it be that large parts of the genome are quite low-complexity so repetitive?
My reads are 125bp single-end. Yes maize genome is highly repetitive, nearly 85% according to Schanble et al 2009. Good point, I was focused rather on existence of snp's and indels which certainly makes reads to map badly. I've read, that mapping to masked genome is not recommended. So, is the MAPQ>10 filter ok?
I use 20 for mouse and human but there is no special reason for 20 over 10, 15, 18.636261 etc. Try 10 and see whether results make sense I‘d say.
Thanks, now, as I look at my charts I realized there will be slight difference between 10 and 20. At least in terms of reads remaining.
Prior thread for reference discussion : bbmap for FAIRE-seq and other line than reference
I suggest that you run
clumpify.sh
(A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files ). It is possible that you have a large amount of sequence duplication. Even at 125 bp it is surprising that your reads are multi-mapping but then this is the maize genome.Good idea, I thought that
clumpify
is just better compressing alghoritm. I've found Stampy, the authors claim it is good for mapping when snp's and indels are expected. I'll try several methods. I must say, I've found two publications (on maize and mosquito) with similar percent of filtered reads.Real
snp
andindels
is one thing but since you don't know the ground truth there is no good way for you to judge if bbmap (or any other aligner) is doing a reasonable job. If you allow for additional errors then you may have a worse problem with multi-mapping.Post the results of
clumpify
and let us see if you have sequence duplicates that can be removed prior to alignment.I've run
~/bin/bbmap/clumpify.sh in=hds35_clean.fastq.gz out=clumped35.fastq.gz groups=24
Below is the info printed on the screen.
In fact the cultivar I work on has been resequenced, so I have some info about variants. I planned to use this info after peak calling, to look for a nucleotide-level differences.
I've read about removal of optical duplicates but I didn't think it is a big concern. After all I'm looking for peaks, generated from regions with high number of reads. Does tools like
clumpify
cause peaks to disappear?Can you run
clumpify.sh
with thededupe
option on ORIGINAL data. Do not use trimmed reads. Do you have single-end reads or are these interleaved? If you have paired-end reads then please use both. You don't need to provide anout=
file if you don't want to get the deduped reads. This will still produce the stats output you see above.Note: Depending on what sequencer this was sequenced on you will need to change thedist
andspantiles
options. Useoptical dist=40 spantiles=f
after making necessary changes. Take a look at in-line help.I edited the full log to keep stat part we need above.
/thanks for tidying-up my post, I can't figure out how to paste such text properly. Here is the ouptut with original data and
dedupe
. I have reads from Illumina, in this case it is single-end.So you don't have a large fraction of duplicate reads but there are some (~15%). Because you have single end data that is probably contributing to not having dual anchors to place the reads on genome. You could use de-duped sequences and see if that helps improve mapq values to some extent.
Part of post you want to present with mono-spaced font can be selected by highlighting and then you can use the
101010
button in top edit bar to format that part as code.So you propose that I dedupe raw reads using above command, trim deduped file and use it to mapping?
Doesn't that compromise peak-calling?
Without UMI's there is no definite way to confirm if those 15% sequence duplicates are PCR or otherwise. Since you were worried about mapq being low we have been discussing these possibilities to see if that could be improved.
If you have a different option to try at this point then try that out but this may be the last thing to try with BBMap.
OK, thanks for help guys, in the end I'm not that worried about mapq. I'll experiment with some aligners and see what works best.
Nevertheless I've trimmed deduped fastq file and mapped it using default setings of
bbmap
. Here is the result ofbamqc
As I expected there is some improvement in
mapq
. This is a difficult problem to tackle. Let us know if you fare better with other aligners though I suspect this may be one of the better results.What you could do is to take the reference fasta file you have, simulate reads of the same length and then see what the alignment rate is, so whether this low MAPQ / multimapping problem is indeed a consequence of the repetitiveness of the reference, or due to some artifacts in the samples you have.
Ok, so I've tried
bwa
,bwa
+ realignment withstampy
(option--bamkeepgoodreads
). Thebamqc
plot looks somehow 'smeared' in the low mapq range with ca. 10% less high scoring alignments.stampy
helped somehow, but in the endbbduk
gives better results, at least in term of % of high scoring mappings. Ah, and it is much faster. I think I'll rely onbbduk
withmaxindel=20 ambig=random
, maybe I'll also look at peak-calling results on alignment bybwa
+stampy
. As of simulations, unfortunately I don't have time to try it, I think I've already delved too much in aligners.