Hi, I've been trying to map a published ChIP-seq dataset and am getting very low mapping rates. I have raw single-end 75bp FASTQ files. I first tried mapping with bwa mem and got low mapping rates, then tried bwa aln followed by bwa samse and saw no improvement.
So I checked the Methods section of the paper, which says:
"ChIP-seq data processing. The reads were aligned to the hg19 reference genome using BWA with samse (for single-end libraries) or mem (for paired-end libraries). For single-end sequencing, the reads were 5’ extended to 150 bp before aligning. For paired-ends sequencing, reads with a corresponding pair were retained for the subsequent analyses. Reads with mapping quality scores <10 were discarded and the reads that aligned to the same genomic coordinates were counted only once."
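The filtering in the last sentence of that paragraph is at least clear to me. Here is a minimal Python sketch of how I read it, on already-parsed alignments. The `Alignment` record and `filter_and_dedup` function are hypothetical names of mine, not from the paper. Treating "same genomic coordinates" as (chromosome, leftmost position, strand) is my assumption; the paper does not say whether strand is included or which end of the read is used.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (not a real library type)."""
    name: str
    chrom: str
    pos: int      # 1-based leftmost position, as in the SAM POS field
    strand: str   # "+" or "-"
    mapq: int     # mapping quality, as in the SAM MAPQ field


def filter_and_dedup(alns: Iterable[Alignment], min_mapq: int = 10) -> List[Alignment]:
    """Drop alignments with MAPQ < min_mapq, then keep the first read
    seen at each (chrom, pos, strand).

    Assumption: "same genomic coordinates" means identical chromosome,
    leftmost position and strand.
    """
    seen = set()
    kept = []
    for aln in alns:
        if aln.mapq < min_mapq:
            continue  # discarded by the mapping-quality filter
        key = (aln.chrom, aln.pos, aln.strand)
        if key in seen:
            continue  # a read at these coordinates was already counted
        seen.add(key)
        kept.append(aln)
    return kept


if __name__ == "__main__":
    example = [
        Alignment("r1", "chr1", 100, "+", 30),
        Alignment("r2", "chr1", 100, "+", 40),  # same coordinates as r1
        Alignment("r3", "chr1", 100, "-", 20),  # same position, other strand
        Alignment("r4", "chr2", 5, "+", 5),     # MAPQ below the cutoff
    ]
    for aln in filter_and_dedup(example):
        print(aln.name)
```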
I'm not sure how one can extend reads before aligning. Are there tools for that? Or should I map first, then extend the reads, and then map again?
Thanks!
EDIT: Link to the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8650129/
It is possible to do read extensions (within limits) by doing local k-mer assemblies. tadpole.sh from the BBMap suite can do that. A guide is available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/tadpole-guide/
That said, extending 75bp reads only on the 5' end sounds suspicious. Does the paper describe how that was done? (I can't access it now.)
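To illustrate what "extension by local k-mer assembly" means, here is a toy Python sketch. It is not tadpole's algorithm and not what the paper did. The function names `count_kmers` and `extend_right`, and the parameters `max_ext` and `min_count`, are made up for illustration. It counts k-mers across all reads, then grows one read at its right end one base at a time, but only while exactly one next base is supported. Extending the other end would be the same procedure on the reverse complement.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(read: str, counts: Counter, k: int,
                 max_ext: int, min_count: int = 2) -> str:
    """Greedily extend the right end of `read` by at most `max_ext` bases.

    At each step the last k-1 bases are used as context. The read grows
    only if exactly one of A/C/G/T yields a k-mer seen at least
    `min_count` times; a dead end or a branch stops the extension.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    seq = read
    for _ in range(max_ext):
        context = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if counts[context + b] >= min_count]
        if len(candidates) != 1:
            break  # no support, or ambiguous: stop rather than guess
        seq += candidates[0]
    return seq


if __name__ == "__main__":
    # Made-up 21 bp "genome", tiled with 10 bp reads, each read seen twice.
    genome = "ACGTTGCAAGGCTTACCGATA"
    reads = [genome[i:i + 10] for i in range(len(genome) - 9)] * 2
    kmer_counts = count_kmers(reads, k=5)
    print(extend_right(reads[0], kmer_counts, k=5, max_ext=50))
```

The extension can only be as good as the coverage around each read, which is why this works "within limits": in low-coverage or repetitive regions it stops early.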
Personally I would still say that if a ChIP-seq experiment does not align, leave it be and use another dataset or create your own. Even with short reads like 1x50bp you should get reasonable results if the experiment worked properly; otherwise the experiment is just crappy. Still, good to know these types of tools exist.
Thanks, I’ll try tadpole!
Unfortunately that’s all it says about the ChIP-seq data processing. There is no other information, even in the supplement.
I agree with @atpoint. If the data is not aligning well then don't spend time on it. I posted the comment above just for information.
Thanks! That's what I thought too... Do you know when it's appropriate to do read extension? Or is there a proper way to improve short-read alignment?
I have never heard of read extension before alignment. After alignment it can be done, e.g. when creating coverage tracks (bigWigs) to make the tracks look nicer, but before alignment it is not really meaningful imo. I think short-read aligners have very good defaults these days. If a dataset aligns poorly then the ChIP is likely of poor quality, so you'd better look for alternative datasets.
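For what extension after alignment looks like, here is a minimal Python sketch, assuming 0-based half-open intervals (BED-style) and a fixed fragment length that you choose yourself. `extend_read` and its parameters are hypothetical names, and 150 bp is only an example value. Each aligned read is stretched from its 5' end in the direction of its strand up to the fragment length, and clipped at the chromosome ends. Coverage-track tools such as deepTools bamCoverage offer this kind of extension as an option.

```python
from typing import Tuple


def extend_read(start: int, end: int, strand: str,
                fragment_len: int, chrom_len: int) -> Tuple[int, int]:
    """Extend an aligned read to `fragment_len` in its 3' direction.

    Coordinates are 0-based, half-open [start, end). Reads that are
    already at least `fragment_len` long are returned unchanged.
    """
    if end - start >= fragment_len:
        return start, end
    if strand == "+":
        # 5' end is `start`; grow to the right, clipped at the chromosome end
        return start, min(start + fragment_len, chrom_len)
    if strand == "-":
        # 5' end is `end`; grow to the left, clipped at position 0
        return max(end - fragment_len, 0), end
    raise ValueError(f"unknown strand: {strand!r}")


if __name__ == "__main__":
    # A 75 bp read on each strand, extended to an assumed 150 bp fragment
    print(extend_read(100, 175, "+", fragment_len=150, chrom_len=1000))
    print(extend_read(100, 175, "-", fragment_len=150, chrom_len=1000))
```

This only changes how the coverage is drawn; the alignments themselves, and therefore the mapping rate, stay the same.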