Create smaller bam file based on the list of sequences
3
0
Entering edit mode
10.2 years ago

I have a list of sequences of interest and a huge bam file. I would like to create new bam file that would contain only reads mapping to one of those sequences as a reference. So far I was filtering SAM file with grep -f to achieve this, but I am looking for quicker and more elegant solution, something like seqtk subseq <in.fa> <name.list>

Thanks

filter bam • 6.0k views
ADD COMMENT
0
Entering edit mode

You can blat those sequences against the reference genome and find out their chromosomal positions. Then you can use a list of those regions and samtools to extract sequences from the bam file.

ADD REPLY
0
Entering edit mode

Did you ever find a suitable answer? I wanna do the same thing.

ADD REPLY
1
Entering edit mode
10.2 years ago
matted 7.8k

The methods that traverse the entire file are inefficient, especially if your sequences of interest are only a small fraction of the total. It's better to use the bam index to only include the reads you want. You can do this with samtools view by making a trivial bed file that consists of your chromosome names, 0, and a large number, in order to select all reads on that chromosome:

awk '{print $1"\t0\t1000000000"}' references.txt | samtools view -F 4 -b -L /dev/stdin input.bam > output.bam
ADD COMMENT
1
Entering edit mode

I like your solution, but it might not be the most efficient. I think -L option makes samtools search the file once per input region. See this benchmark:

cat /tmp/a.bed
LmjF.01
LmjF.05
LmjF.10
LmjF.20
LmjF.36

With -L option:

time awk '{print $1"\t0\t1000000000"}' /tmp/a.bed \
> | samtools view -L /dev/stdin $bam | wc -l
5015

real    0m1.114s
user    0m1.031s
sys    0m0.057s

Passing individual regions:

time samtools view $bam LmjF.01 LmjF.05 LmjF.10 LmjF.20 LmjF.36 | wc -l
5015

real    0m0.052s
user    0m0.026s
sys    0m0.018s
ADD REPLY
4
Entering edit mode

Ah, looks like you're right. I forgot about passing multiple regions to samtools view alone; that's better.

I guess to convert it to the same form I had before, your better solution would be:

cat references.txt | xargs samtools view -F 4 -b input.bam > output.bam
ADD REPLY
0
Entering edit mode

Excellent. This is exactly what I had in mind.

ADD REPLY
1
Entering edit mode
10.2 years ago

Wouldn't this suffice assuming the bam file is sorted & indexed?

samtools view -b myaln.bam contig3 contig5 contig7 > subset.bam
ADD COMMENT
0
Entering edit mode
10.2 years ago
RafaelMP ▴ 120

awk '$3 != "*"' file.sam > newfile.sam

ADD COMMENT
0
Entering edit mode

I don't want all reads that are mapped. I want just those that map to contig3, contig5 and contig7 in case that's on my list. This is significantly smaller list than list of all reference sequences. Sorry for not describing this precisely.

ADD REPLY
0
Entering edit mode

Do you want to extract reads that map to a contig or is it some short sequence ? I am just curious because you can easily extract reads that map to a particular contig if your reference fasta file has a separate fasta entry for that contig.

ADD REPLY

Login before adding your answer.

Traffic: 953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6