As genomax said, you can only split aligned BAM by chromosome. To split an indexed BAM you can exploit samtools view
's ability to retrieve alignments by chromosome, followed by samtools fastq
within a loop:
OUTPUTNAME="out"
samtools idxstats in.bam | cut -f1 | grep -v '*' | \
while read CHR
do
samtools view -h in.bam $CHR | \
samtools sort -n | \
samtools fastq \
-1 ${OUTPUTNAME }_${CHR}_1.fastq.gz \
-2 ${OUTPUTNAME }_${CHR}_2.fastq.gz \
-
done < /dev/stdin
The first line with idxstats
will list all available chromosome in the BAM (you might modify the grep
to remove random and unplaced contigs etc) which will then be parsed to samtools view
in order to extract the alignments by chromosome, sorted by name and then piped into fastq
(-
indicates to read from stdin
) to get the original sequences back. The script might look a bit weird if you are not familiar with pipes but I definitely recommend getting a background with it as these kind of scripts are very convenient in order to avoid unnecessary intermediate files.
Edit 16.3.19:
As in a new closed thread ( Error when split WES data into sub-files ) OP mentioned a problem extracting alignments for chr10. Therefore, here a basic workflow. Given we have a file genome.fa
:
## Index with faidx and then extract chr10:
samtools faidx genoma.fa
samtools faidx genoma.fa chr10 > genome_chr10.fa
## For the BAM file extract chr10 reads and convert to fastq:
samtools index your.bam
samtools view -b your.bam chr10 | samtools fastq -1 chr10_1.fastq.gz -2 chr10_2.fastq.gz -
The loop above in my original answer does this for all chromosomes. As OP asked what /dev/stdin
is: The command on top (samtools idxstats in.bam | cut -f1 | grep -v '*'
) simply lists all chromosomes and then sends them to the while
loop via a Unix pipe indicated by the |
. The loop then reads these information from so-called stdin
which is a pointer to these data coming from the stream (=the pipe). The way to feed streams into while
loops is to add the command < /dev/stdin
after the done
(done < /dev/stdin
), essentially telling stdin
to send its data into the loop.
So question to 948077241 , what exactly did not work?
You can't do this until you align the data to a reference. Fastq data is just reads that came from library fragments you sequenced. They have no location information.
Once you align the data you can separate the reads by using
samtools view region
followed by conversion back to fastq/selection using ID's from original data files.But if I have align it to a reference by BWA, it will become one (.bam) file. Will it produce both reverse and forward (.fastq) files if I convert the aligned (.bam) file back to (.fastq) file?
It should. Make sure you name sort the BAM file before converting back to fastq.
Thanks a lot! It helps a lot! And I am working on it now. BTW will samtools view region work the same if I need to split gatk bundles as well, like ucsc.hg.19 ?
So...my question now is somewhat weird.
Let's talk about it in more detail (may be profix, but I cannot figure out some better way)
Given I have two raw data that are in fastq format. And my target is to split them (paired-end data) and get a sub-file containing data from chr10 only.
So I firstly mapped them to reference with bwa.
And I do for my reference:
But when I run bwa again with my new-got sub-file and sub-reference, the output bam file is only 1kb. My code is like this:
I got some words like this:
But if I run codes like this:
Then it runs well (the size of the output chr10.bam is usual, not 1kb as before), but saying:
And now my question is: since my data is paired-end, I am not quite sure if the argument
-p
will affect my result. I wonder why I will get the resultspaired reads have different names
and how I can due with it.Please give me some advice! I'll appreciate all your helps!