SAMTOOLS flagstat outputs
1
0
Entering edit mode
8.0 years ago
jt358 ▴ 10

Hi all,

I have a series of SAM files, which I know that I have to convert to BAM and sort for using FLAGSTAT and VIEW in samtools. Having ref aligned data and previously removing reads that don't map I now want to produce a single file with the remaining number of mapped read outputs into a single file. The below command outputs the result from one single sample, but can anyone tell me how to run multiple (500) files into a single list of outputs as I cannot find a way to do this at all, whatever I run it only puts in the file for 1 single sample - many thanks for your kind help.

samtools view sample1.bam | cut -f1 | sort | uniq | wc -l > output

SAMTOOLS • 6.5k views
ADD COMMENT
0
Entering edit mode

Was a little confused, are you basically wanting to count the number of reads that are mapped in a BAM file? And then do that for 500 BAM files? A bash loop would do it.

In your command, try >> (append) instead of >, as > will delete the file first, and then add data.

ADD REPLY
0
Entering edit mode

Sorry Tonor and thank you for the reply - Not a cumulative total, I want an output of how many reads aligned for each sample independently but without having to run 'view' in samtools singularly for each sample.

In truth it would be easier of there was a way to do it from sam files as that is what my samples are in at the moment but it seems converting to bam may be the only option.

Best

Jamie

ADD REPLY
0
Entering edit mode

jt358 : Please use ADD COMMENT/ADD REPLY when responding to existing posts. This helps keep threads logically organized.

ADD REPLY
0
Entering edit mode

Apologies genomax2 - my misunderstanding

ADD REPLY
0
Entering edit mode

For a sam file, you could do this much more efficiently with the BBMap package:

reformat.sh in=sample1.sam out=out.sam mappedonly primaryonly

I don't think your current command will produce correct output if you are using paired-end data, as only a single line would be retained for each pair of reads.

I don't really write loops in bash, which is one possible solution. Instead, what I typically do is to use Excel:

ls *.sam > ls.txt

Paste that into Excel, write a command like:

="reformat.sh in="&A1&" out="&B1&".sam mappedonly primaryonly"

...drag it down, and paste that into a new shell script.

Or see Tonor's solution for writing a loop in bash :)

But if you only need the number of mapped reads, you don't need the "out=" flag. Reformat will print the total numbe of reads that passed the filters (primary and mapped) at the end.

ADD REPLY
0
Entering edit mode

Hi Brian - thanks for the answer.

For project specific reasons, I've concatenated pairs into a single file before aligning and then look at linkage later down the line.

My input sam files from which I want the read counts also only contained reads that I know map as opposed to total reads mapped or unmapped

Cheers

ADD REPLY
0
Entering edit mode

That's fine, but bear in mind that the sam format mandates that read 1 and read 2 have the exact same name if they are mapped as pairs, and if they are not mapped as pairs, many programs will still truncate the names at the first whitespace, meaning that normal Illumina names (such as "hiseq:1:2:3:4 1:ACGT") will be truncated such that read 1 and 2 will have identical names, so after sort/uniq you will be missing half of your reads.

ADD REPLY
0
Entering edit mode

Thanks everyone on the thread for all your kind help - really grateful!!!!

ADD REPLY
2
Entering edit mode
8.0 years ago
Tonor ▴ 480

If all your BAMs are in a single directory, you should check out a bash for loop, e.g.:

for i in *.bam; do
    samtools view $i | cut -f1 | sort | uniq | wc -l >> output
done

samtools view will work on SAM file

ADD COMMENT
0
Entering edit mode

Also - you probably want to check out this post for info on how to calculate the number of mapped reads: Number of mapped reads from BAM file

ADD REPLY
0
Entering edit mode

Hey Tonor -

Many thanks - I've never really got loop scripts but that makes perfect sense and produces the required output. Is it possible to include the sample prefix in the output file?

Trying to run it on the raw sam files from ref allignment however, produces the errors ' this is not a bam file' - I have change bam to sam in the loop script + EOF marker is absent + fail to read the header 0 I assume this is an issue with lack of sorting

Many thanks

ADD REPLY
0
Entering edit mode

You could do following to get a separate outfile for each bam.

for i in *.bam; do
    samtools view $i | cut -f1 | sort | uniq | wc -l > $i.out
done
ADD REPLY
0
Entering edit mode

For adding sample names in the output file, simply do something like this:

for i in *.bam; do
    echo $i >> output
    samtools view $i | cut -f1 | sort | uniq | wc -l >> output
done

This will create a file containing a name on odd lines followed by the number of reads on the even line. There is probably a way to get them on one line, too ;-) I would probably run something like:

for i in *.bam; do
    echo ${i}% >> temp.output
    samtools view $i | cut -f1 | sort | uniq | wc -l >> temp.output
done
tr '%' '\t' < temp.output > output

But that just looks ridiculous.

ADD REPLY
0
Entering edit mode

Many thanks WDC - the second script didn't seem to work, but the first was great and I could pull out the data I needed - many thanks indeed

ADD REPLY
0
Entering edit mode

Right, now I see my mistake in the second loop. I've corrected it...

ADD REPLY

Login before adding your answer.

Traffic: 1830 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6