You first align you data with default settings using BWA against your reference genome. Be sure to have trimmed the Nextera adapter from your reads. If you have not done that, you might use skewer to do so:
skewer -x CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -y CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -m pe -n -q 30 -Q 25 -o ${BASENAME} ${BASENAME}_1.fastq.gz ${BASENAME}_2.fastq.gz
Once you have your BAM file, use this code snipped to get the reads you want:
samtools view -h $1 | awk -v LEN=$2 '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0 || $1 ~ /^@/) print $0}' | samtools view -bh -o $3 -
Save it as a bash script, then use:
./script.sh in.bam 100 out.bam
This will give you reads (fragments) with an insert size below 100bp. If you want further size selection, just modify the if-condition according to your needs. You can speed things up by adding the multithreading option -@
to samtools and using mawk
instead of awk
.
EDIT: There is also a python package called NucleoATAC to call NFRs and histone positions directly from ATAC-seq data. If always found the documentation quiet difficult to digest, but maybe you may find it helpful. That having said, I gave up using it because I simply did not really understand what exactly it does and what all these output files mean, so I ended up defining NFRs as peaks called from a BAM file with fragments < 100bp. These peaks were required to overlap with peaks called from the full BAM file. This then (what a surprise) typically gives you peaks in the center of the full-BAM peaks, just the width is smaller.
are you talking about the read length or about the fragment length (distance between a read and its mate ) ??
Apologies, I mean fragment length.
You can filter the alignments after wards to keep required fragment length (Bam : Extract Only Alignment With A Specific Length ).