Question

Extending reads before aligning

0

2.3 years ago

hksk2 ▴ 20

Hi, I've been trying to map a published ChIP-seq dataset and am getting very low mapping rates. I have raw single-end 75 bp FASTQ files. I first tried mapping with bwa mem and got low mapping rates, then tried bwa aln followed by bwa samse and saw no improvement.

So I checked the Methods section of the paper, and it says:

"ChIP-seq data processing. The reads were aligned to the hg19 reference genome using BWA with samse (for single-end libraries) or mem (for paired-end libraries). For single-end sequencing, the reads were 5’ extended to 150 bp before aligning. For paired-ends sequencing, reads with a corresponding pair were retained for the subsequent analyses. Reads with mapping quality scores <10 were discarded and the reads that aligned to the same genomic coordinates were counted only once."

I'm not sure how one can extend reads before aligning. Are there tools for that? Or should I map first, then extend the reads, and then map again?

Thanks!

EDIT: Link to the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8650129/
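
For reference, the post-alignment filtering in the quoted Methods (discard reads with MAPQ < 10, count reads at the same coordinates only once) could be sketched roughly as below. This is a minimal illustration on plain SAM text lines, not the paper's actual code. The function name `filter_and_dedup` is hypothetical, and treating "same genomic coordinates" as the same chromosome, leftmost position, and strand is an assumption, since the paper does not define it.

```python
from typing import Iterable, Iterator


def filter_and_dedup(sam_lines: Iterable[str], min_mapq: int = 10) -> Iterator[str]:
    """Yield mapped SAM records with MAPQ >= min_mapq, one per (chrom, pos, strand).

    Assumption: duplicates are defined by chromosome, leftmost position and
    strand. The quoted Methods only say "same genomic coordinates".
    """
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        chrom = fields[2]
        pos = int(fields[3])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        if mapq < min_mapq:
            continue  # low mapping quality
        strand = "-" if flag & 0x10 else "+"
        key = (chrom, pos, strand)
        if key in seen:
            continue  # already counted a read at these coordinates
        seen.add(key)
        yield line
```

In practice one would usually do this on BAM files with dedicated tools; the sketch is only meant to make the two filtering rules concrete.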

chipseq shortreads bwa alignment • 1.6k views

ADD COMMENT • link updated 2.3 years ago by GenoMax 148k • written 2.3 years ago by hksk2 ▴ 20

score 1 · Answer 1 · 2022-09-09

1

2.3 years ago

ATpoint 86k

I do not see how this would be possible. For read extension you have to add sequence content, and to know that sequence content you need the alignment position. Data are what they are. If the dataset is crap then that’s the reality. Unfortunately, you cannot magically fix that; any custom fiddling just adds uncertainty to the downstream results. Extending and remapping adds bias because the extension adds non-existent information to the read, so the alignment becomes something of a self-fulfilling prophecy.

ADD COMMENT • link 2.3 years ago by ATpoint 86k

1

It is possible to do read extension (within limits) by doing local k-mer assemblies. tadpole.sh from the BBMap suite can do that. A guide is available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/tadpole-guide/

That said, extending 75 bp reads only at the 5' end sounds suspicious. Does the paper describe how that was done? (I can't access it right now.)
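
To illustrate the general idea of extending a read from k-mers shared with other reads, here is a toy sketch. It is not tadpole's algorithm: it ignores reverse complements, sequencing errors, and coverage thresholds, and the names `build_kmer_index` and `extend_left` are hypothetical. A read is extended leftward one base at a time, and extension stops at a dead end or wherever more than one preceding base has been observed.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each (k-1)-mer to the set of bases observed immediately before it."""
    preceding = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            preceding[kmer[1:]].add(kmer[0])
    return preceding


def extend_left(read, preceding, k, max_extend):
    """Extend `read` leftward by up to `max_extend` bases.

    Stops when the leading (k-1)-mer has no known preceding base (dead end)
    or more than one (ambiguous branch).
    """
    extended = read
    for _ in range(max_extend):
        prefix = extended[:k - 1]
        if len(prefix) < k - 1:
            break  # read is shorter than k-1, nothing to look up
        candidates = preceding.get(prefix, set())
        if len(candidates) != 1:
            break  # dead end or ambiguous branch
        extended = next(iter(candidates)) + extended
    return extended


# Toy example: three overlapping reads drawn from one made-up sequence.
genome = "ACGTTGCAAGCTTAGGCTA"
reads = [genome[0:10], genome[4:14], genome[8:19]]
index = build_kmer_index(reads, k=5)
print(extend_left(reads[2], index, k=5, max_extend=3))
```

Note that the added bases come entirely from other reads in the same library, so this only works where the region is covered by overlapping reads.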

ADD REPLY • link 2.3 years ago by GenoMax 148k

0

Personally, I would still say that if a ChIP-seq experiment does not align, then leave it be and use another dataset or create your own. Even with short reads like 1x50 bp you should get reasonable results if the experiment worked properly; otherwise the experiment is just crappy. Still, it is good to know these types of tools exist.

ADD REPLY • link 2.3 years ago by ATpoint 86k

0

Thanks, I’ll try tadpole!

Unfortunately, that’s all it says about the ChIP-seq data processing. There is no other information, even in the supplement.

ADD REPLY • link 2.3 years ago by hksk2 ▴ 20

1

I agree with @atpoint. If the data are not aligning well, then don't spend time on them. I posted the above comment just for information.

ADD REPLY • link 2.3 years ago by GenoMax 148k

0

Thanks! That's what I thought too... Do you know when it's appropriate to do read extension? Or is there a proper way to improve short-read alignment?

ADD REPLY • link 2.3 years ago by hksk2 ▴ 20

0

I have never heard of read extension before alignment. After alignment it can be done, e.g. when creating coverage tracks (bigWigs) to make the tracks look nicer, but before alignment it is not really meaningful, imo. I think short-read aligners have very good defaults these days. If the data align poorly then the ChIP is likely of poor quality, so you had better look for alternative datasets.
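
As a rough illustration of extension after alignment, the sketch below stretches an aligned single-end read in its 3' direction to an assumed fragment length, which is what coverage-track tools typically do. The function name `extend_alignment`, the 0-based half-open coordinates, and the 150 bp fragment length in the example are assumptions made for illustration only.

```python
def extend_alignment(start, end, strand, fragment_len, chrom_len):
    """Extend an aligned read to `fragment_len` in its 3' direction.

    Coordinates are 0-based, half-open. A "+" read grows to the right,
    a "-" read grows to the left, and the result is clipped to the
    chromosome bounds. Reads already at least `fragment_len` long are
    returned unchanged.
    """
    if end - start >= fragment_len:
        return start, end
    if strand == "+":
        return start, min(start + fragment_len, chrom_len)
    return max(end - fragment_len, 0), end


# Example: a 75 bp forward-strand read extended to a hypothetical 150 bp fragment.
print(extend_alignment(100, 175, "+", 150, 1000))
```

Because the extension uses only the alignment position and strand, no sequence is invented; it only changes how much of the genome each read contributes coverage to.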

ADD REPLY • link 2.3 years ago by ATpoint 86k