Run Samtools on multiple files
5
3
Entering edit mode
7.4 years ago
V ▴ 410

I'm trying to run samtools sort in a folder containing about 380 bam files.

I've tried the following command and nothing happens. No errors, it just returns blank. Can anyone help me with what I've done wrong? Thank you :)

samtools sort -@ 7 *.bam > *.sorted.bam
samtools rnaseq • 19k views
ADD COMMENT
12
Entering edit mode
7.4 years ago
Medhat 9.8k

Try this;

for f in *.bam; do filename="${f%%.*}"; samtools sort -@ 7 $f ${filename}.sorted.bam; done

Or

for f in *.bam; do  samtools sort -@ 7 $f   ${f/.bam/sorted.bam}; done
ADD COMMENT
0
Entering edit mode

Seems to be working, hasn't thrown out an error! thanks alot!!

ADD REPLY
0
Entering edit mode

I modified a bit based on Medhat's script. Seems better. for f in *.bam; do samtools sort -@ 7 $f ${f/bam/sorted}; done

ADD REPLY
0
Entering edit mode

hi all, I am working Linux cluster and ı have 1000 bam files under bam/ directory and they have pretty similar name "bam_runid_bf2ff0593235864c9d423d28f1746a42d62df29e_0_0" and just first "0" changing into folder like 1,2,3,.. I am wondering how can ı sort them as a single and merge them after indexing. I tried "for f in bam/*_0.bam; do samtools sort -@ 7 $f ${f/bam/sorted}; done" what u advised but it did not worked. sorry ı am very new in this field .:( could you explain to me more simple? thanks

ADD REPLY
1
Entering edit mode

post this as a new question with a reference to this thread. Show us any error message.

ADD REPLY
10
Entering edit mode
7.4 years ago

Using gnu parallel in its simplest form:

ls *.bam | parallel -j 8 'samtools sort -@ 7  {} > {.}_sorted.bam'

To show progress and expected time of completion

ls *.bam | parallel --progress --eta -j 8 'samtools sort -@ 7  {} > {.}_sorted.bam'
ADD COMMENT
4
Entering edit mode

--eta implies --progress, so --progress is not needed. Also try the newer --bar. If you want one per cpu core, leave out -j8.

ADD REPLY
1
Entering edit mode

Thanks! I am huge fan of parallel, but so many options :o

ADD REPLY
4
Entering edit mode
7.4 years ago

create the following 'Makefile'

SHELL=/bin/bash
BAMS=$(shell ls -1 *.bam)
define run

$$(addsuffix .sorted.bam,$$(basename $(1))): $(1)
    samtools sort $$< $$(basename $$@)
    samtools index $$@
endef

all: $(addsuffix .sorted.bam,$(basename ${BAMS}))

$(eval $(foreach B,${BAMS},$(call run,${B})))

and run it in parallel for you 380 bam using the option -j for the number of parallel operations

make -j 16
ADD COMMENT
2
Entering edit mode

On a separate note, this is really cool, suggestions to learn how to use "make" (tutorial links) would be helpful for all.

ADD REPLY
0
Entering edit mode

May be I am asking silly question, does that "16" means the no. of threads?

ADD REPLY
0
Entering edit mode

does that "16" means the no. of threads?

yes

ADD REPLY
2
Entering edit mode
7.4 years ago
Charles Plessy ★ 2.9k

The * wildcard character can only expand to already existing files (*.bam in your example). The command-line interpreter can not guess your intent and provide contents for *.sorted.bam. Moreover, samtools sort is not designed to run on multiple files in parallel.

However, there are generic tools for doing what you want. See for instance the two following links.

ADD COMMENT
0
Entering edit mode

Hi Charles,

Sorry My bad, I seem to have explained what I want to do wrong. I didn't mean I want to run it on all of them in parallel. I meant I want samtools to sort them out one by one. Usually what I would do would be:

samtools sort -@ 7 File1.bam File1.sorted.bam
samtools sort -@ 7 File2.bam File2.sorted.bam

and then as soon as it finishes with file1 it would go automatically to file2. But having 380 or so files, you can see the issue with having to write all of them out manually :)

ADD REPLY
0
Entering edit mode

Indeed, a for loop better answers to your question. Nevertheless, if you run parallel and limit the number of concurrent jobs to 1 (--jobs 1) you will also be able to run the commands sequentially, basically like in a for loop. And you will have learned a new tool :)

ADD REPLY
0
Entering edit mode
7.4 years ago

try this:

for in in *.bam; do sample_name=`echo $i | awk -F "." '{print $1}'`; samtools sort -@ 7 $i >  ${sample_name}.sorted.bam ; done
ADD COMMENT
1
Entering edit mode

Just a comment, when you are defining output filename, it's always better to use basename in order to remove just the extension and add output format. Something like outFile=$(basename "$BAM" .bam | awk '{print $1".sorted.bam"}').

Why? Because sometimes the filenames include several . which explain a lot about the file. E.g. ICGC_File1.cohortA.rmdup.bam. With your approach you'll get an output which looses some information ICGC_File1.sorted.bam.

ADD REPLY
0
Entering edit mode

+1 Thanks venu for the suggestion, though I am clueless why the loop I suggested does not work for V

ADD REPLY
0
Entering edit mode

Hello,

Thanks. I've tried it but keep getting this error appearing multiple times:

awk: read error (Is a directory)

Usage:   samtools sort [options] <in.bam> <out.prefix>

Options: -n        sort by read name
         -f        use <out.prefix> as full file name instead of prefix
         -o        final output to stdout
         -l INT    compression level, from 0 to 9 [-1]
         -@ INT    number of sorting and compression threads [1]
         -m INT    max memory per thread; suffix K/M/G recognized [768M]
ADD REPLY
0
Entering edit mode

try this: Removed ">" and its for i in not for in in ; i edited

for i in *.bam; do sample_name=`echo $i | awk -F "." '{print $1}'`; samtools sort -@ 7 $i ${sample_name}.sorted; done

I assume you run it for a directory having files:

sample1.bam
sample2.bam
sample3.bam

and so on

ADD REPLY
0
Entering edit mode

Same Error coming up :/

ADD REPLY
0
Entering edit mode

yes, thats correct. They aren't named "sample" they have a serial number but yes all bam

ADD REPLY

Login before adding your answer.

Traffic: 1842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6