Question

How can I map reads only of a certain length?

0

Entering edit mode

6.4 years ago

a.rex ▴ 350

I have recently done an ATAC-seq experiment -

I wish to selectively map reads of <100bp from a fragment distribution of fragments from 20bp to 450bp following illumina next-seq PE sequencing. This way I can look at reads that most likely originate from nucleosome-free regions...

How can I achieve this using an aligner such as BWA or bowtie? Can I use a command prior or even following mapping that looks only fragments of this length?

Thank you

bwa atac-seq sequencing • 1.9k views

ADD COMMENT • link updated 6.4 years ago by Pierre Lindenbaum 164k • written 6.4 years ago by a.rex ▴ 350

0

Entering edit mode

are you talking about the read length or about the fragment length (distance between a read and its mate ) ??

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Apologies, I mean fragment length.

ADD REPLY • link 6.4 years ago by a.rex ▴ 350

1

Entering edit mode

You can filter the alignments after wards to keep required fragment length (Bam : Extract Only Alignment With A Specific Length ).

ADD REPLY • link 6.4 years ago by GenoMax 147k

score 1 · Answer 1 · 2018-07-04

You first align you data with default settings using BWA against your reference genome. Be sure to have trimmed the Nextera adapter from your reads. If you have not done that, you might use skewer to do so:

skewer -x CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -y CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -m pe -n -q 30 -Q 25 -o ${BASENAME} ${BASENAME}_1.fastq.gz ${BASENAME}_2.fastq.gz

Once you have your BAM file, use this code snipped to get the reads you want:

samtools view -h $1 | awk -v LEN=$2 '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0 || $1 ~ /^@/) print $0}' | samtools view -bh -o $3 -

Save it as a bash script, then use:

./script.sh in.bam 100 out.bam

This will give you reads (fragments) with an insert size below 100bp. If you want further size selection, just modify the if-condition according to your needs. You can speed things up by adding the multithreading option -@ to samtools and using mawk instead of awk.

EDIT: There is also a python package called NucleoATAC to call NFRs and histone positions directly from ATAC-seq data. If always found the documentation quiet difficult to digest, but maybe you may find it helpful. That having said, I gave up using it because I simply did not really understand what exactly it does and what all these output files mean, so I ended up defining NFRs as peaks called from a BAM file with fragments < 100bp. These peaks were required to overlap with peaks called from the full BAM file. This then (what a surprise) typically gives you peaks in the center of the full-BAM peaks, just the width is smaller.

score 1 · Answer 2 · 2018-07-04

1

Entering edit mode

6.4 years ago

Pierre Lindenbaum 164k

using samjdk: http://lindenb.github.io/jvarkit/SamJdk.html

java -jar dist/samjdk.jar -e 'return record.getReadPairedFlag() && !record.getReadUnmappedFlag() && !record.getMateUnmappedFlag() && record.getReferenceName().equals(record.getMateReferenceName()) && Math.abs(record. getInferredInsertSize())< 100;' input.bam