How can I map reads only of a certain length?
2
0
Entering edit mode
6.4 years ago
a.rex ▴ 350

I have recently done an ATAC-seq experiment -

I wish to selectively map reads of <100bp from a fragment distribution of fragments from 20bp to 450bp following illumina next-seq PE sequencing. This way I can look at reads that most likely originate from nucleosome-free regions...

How can I achieve this using an aligner such as BWA or bowtie? Can I use a command prior or even following mapping that looks only fragments of this length?

Thank you

bwa atac-seq sequencing • 1.9k views
ADD COMMENT
0
Entering edit mode

are you talking about the read length or about the fragment length (distance between a read and its mate ) ??

ADD REPLY
0
Entering edit mode

Apologies, I mean fragment length.

ADD REPLY
1
Entering edit mode

You can filter the alignments after wards to keep required fragment length (Bam : Extract Only Alignment With A Specific Length ).

ADD REPLY
1
Entering edit mode
6.4 years ago
ATpoint 85k

You first align you data with default settings using BWA against your reference genome. Be sure to have trimmed the Nextera adapter from your reads. If you have not done that, you might use skewer to do so:

skewer -x CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -y CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -m pe -n -q 30 -Q 25 -o ${BASENAME} ${BASENAME}_1.fastq.gz ${BASENAME}_2.fastq.gz

Once you have your BAM file, use this code snipped to get the reads you want:

samtools view -h $1 | awk -v LEN=$2 '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0 || $1 ~ /^@/) print $0}' | samtools view -bh -o $3 -

Save it as a bash script, then use:

./script.sh in.bam 100 out.bam

This will give you reads (fragments) with an insert size below 100bp. If you want further size selection, just modify the if-condition according to your needs. You can speed things up by adding the multithreading option -@ to samtools and using mawk instead of awk.

EDIT: There is also a python package called NucleoATAC to call NFRs and histone positions directly from ATAC-seq data. If always found the documentation quiet difficult to digest, but maybe you may find it helpful. That having said, I gave up using it because I simply did not really understand what exactly it does and what all these output files mean, so I ended up defining NFRs as peaks called from a BAM file with fragments < 100bp. These peaks were required to overlap with peaks called from the full BAM file. This then (what a surprise) typically gives you peaks in the center of the full-BAM peaks, just the width is smaller.

ADD COMMENT
0
Entering edit mode

Thank you ATpoint - I agree, I have also found NucleoATAC to be a bit difficult to comprehend. I have used your code, and after having plotted the NFRs, it does seem very similar but narrower than the full bam file. I suppose you could compare reads <100bp with that of mononucleosome reads (147-200bp)? This would then give you where 'open chromatin' and where the nucleosomes are?

ADD REPLY
1
Entering edit mode
6.4 years ago

using samjdk: http://lindenb.github.io/jvarkit/SamJdk.html

java -jar dist/samjdk.jar -e 'return record.getReadPairedFlag() && !record.getReadUnmappedFlag() && !record.getMateUnmappedFlag() && record.getReferenceName().equals(record.getMateReferenceName()) && Math.abs(record. getInferredInsertSize())< 100;' input.bam
ADD COMMENT
0
Entering edit mode

I presume if I change to '....> 140;' this will give me reads greater than 140?

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6