Question

How to define the open chromatin region in ATACseq? And how to filter the .bam file based on insert size?

1

Entering edit mode

2.4 years ago

Dan ▴ 180

Hello,

How to define the open chromatin region in ATACseq? The tutorial (https://rockefelleruniversity.github.io/RU_ATAC_Workshop.html) saids insert size less than 100 bp is open chromatin region and insert size in (180, 240) is mono-nucleosome region, I am wondering what is the insert size in (100, 180) belong to? I know the 9th column of .bam file is the inset size, how to filter the .bam file based on insert size? I tried

samtools view $(FILE)"_mm10.bam" | awk '$9<100' | awk '$9>0' | samtools view -b > openregionbam/$(FILE)"_mm10.bam"

But get error

[E::sam_parse1] missing SAM header
[W::sam_read1] parse error at line 1
[main_samview] truncated file.

What should I do? Thanks a lot.

ATAC • 639 views

ADD COMMENT • link updated 2.4 years ago by ATpoint 85k • written 2.4 years ago by Dan ▴ 180

score 3 · Accepted Answer · 2022-06-30

Try this:

It adds -h to the first samtools command to preserve the header (absence of header is the cause of your error), it preserves that header in the awk command (skipping lines starting with @) and it respects that alignments to the bottom DNA strand would have negative insert size so you have to filter for insert size being larger than 100 and smaller than -100.

LEN=100
samtools view -h in.bam | awk -v LEN=$LEN '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0 || $1 ~ /^@/) print $0}' | samtools view -o out.bam

Yes, 100bp is a cutoff I've seen people (and myself) using before but maybe there are (depending on your application) specialized tools that call NFRs mor reliably. NucleoATAC was one that can do it back afair it was quite a pita to use.

Pro-tip: You can replace awk by mawk to speed up the process a lot.