I have a BAM file containing many amplicon RNA-seq reads mapped to a small genomic region.
Most of these reads appear to be correctly spliced (see ), but some contain bases that map to the intronic region. I want to extract (into a separate BAM file) only those reads who have bases that map to the intronic region.
I have searched for hours and haven't been able to find a solution.
Everything I've tried has been some version of:
samtools view \
-hb \
-o out.misspliced.bam \
in.bam \
chr12:132676220 # an intronic position
However, this command still extracts reads that are correctly spliced and don't have bases that cross into the intron (see "after" visual in posted image). When I pass the coordinates to samtools view
, it seems to still include reads that are correctly splice but stil span the intron, even if they don't have bases that map to the intron.
I'm new to working with SAMtools and I'm unfamiliar with BEDtools (as of now).
Can anyone help me extract only the RNA-seq reads that have bases mapping to intronic positions between two exons to a new BAM file?
EDIT [SOLVED]:
I eventually solved this by using the -split
option with bedtools intersect
. This is the option that successfully makes bedtools view a spliced read as not covering an intronic region if it doesn't actually have bases mapping to that intronic region. First, I made a one-line BED file for the intron I wanted reads from:
> echo "chr12 132676204 132676545" > intron-region.bed
Then I used the following command to retrieve only reads that actually have bases mapping into the intron and save them to a new BAM:
bedtools intersect \
-a in.bam \
-b intron-region.bed \
-wa -split > misspliced-all.bam
Now to find how to do this just using samtools, because there must be a way...
I hope this helps anyone with a similar problem.
If you were you able to do this, can you advise what you plan on doing next? I ask because if you next intend to tabulate how many reads extend into each end of each intron, then you might profitably skip extracting them and directly count them through creative use of RSubread's FeatureCounts: A General-Purpose Read Summarization Function, or the Subread package, which it wraps.
Thanks, I will look into how to do this with Subread/FeatureCounts! We managed to do it with a lot of grepping, but I knew there had to be a tool already out there.