Hi,
I would like to calculate the total number of mapped bases for whole exome data from Illumina.
I'm already calculating the coverage per base with samtools mpileup or samtools depth. I was thinking of just taking the sum of the coverage of each base, but when doing this in Linux with 'awk' I get an 'out of memory alloc' error. That's why I was wondering if there are some specific tools that calculate the total number of mapped bases?
What's your awk command? That should be doable in awk.
cat myfile.txt | awk '{SUM += $NF} END {print SUM}'
That seems like an odd way to do things. NF is the number of fields in the current line. Also, you never initialized SUM (although perhaps things are auto-initialized to 0 in awk). I would assume that something like cat pileup.txt | awk 'BEGIN{SUM=0}{SUM += $3}END{print SUM}' would work.
I want to take the sum of all the values in the last column of my file, except for the first line (which is a header).
Ah, it would be really helpful if you posted a couple lines from that file. It's rather difficult to read your mind. I'm going to reply below in a comment about what I suspect the root of the "out of memory alloc" error is.
It's just a tab-delimited file with 'n' columns and 'm' lines. The first line is a header. I found a command to skip the header line with awk: awk '{if(NR>1)print}'. If I combine this with awk '{print $NF}', it prints only the values from the last column (without the header line). Then I found this command to sum the values of a column: paste -sd+ | bc. So when I combined those four commands, I expected to get the sum of the values in the last column. But when I do that, I get a memory allocation error.
Oh, I see what you're doing wrong. You're linearizing the last column, which is probably pretty long, and then trying to parse it with awk. As I mentioned below, this sort of approach is pretty memory intensive. I'll just post an answer with a single-line solution. In the future, please post all of the relevant information when you post the question.
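The single-line answer itself is not part of this thread, so what follows is only a minimal sketch of what such a one-liner could look like. It assumes a tab-delimited file with one header line and the depth in the last column. The file name sample_depth.txt and its contents are made up for illustration. The running total is kept in a single awk variable, so nothing is joined into one long line.

```shell
# Hypothetical sample file standing in for the real table
# (tab-delimited, header on line 1, depth in the last column).
printf 'chrom\tpos\tdepth\nchr1\t100\t12\nchr1\t101\t15\nchr1\t102\t0\n' > sample_depth.txt

# Skip the header (NR > 1) and add the last field ($NF) of every other line.
# printf "%.0f" keeps large totals from being printed in scientific notation.
total=$(awk -F'\t' 'NR > 1 { sum += $NF } END { printf "%.0f\n", sum }' sample_depth.txt)
echo "$total"   # prints 27

# With samtools depth, whose output has no header and the depth in column 3,
# the same idea would be (input.bam is a placeholder name):
#   samtools depth input.bam | awk '{ sum += $3 } END { printf "%.0f\n", sum }'
```

Because awk processes one line at a time here, memory use should stay flat regardless of how many positions the file contains.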
The root cause of the "out of memory alloc" error is likely that awk is trying to read your whole file as a single line. awk has a good bit of overhead when dealing with arrays, so this is going to eat up all of the memory on your system. You might try either changing the line endings on the original file or, better yet, just tell awk what the correct line ending is. Alternatively, just parse the pileup as in the example from my comment above.
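If the line endings really are the problem, the two suggestions above could be sketched as follows. This assumes, purely for illustration, that the file uses carriage-return-only line endings; the file name cr_only.txt and its contents are hypothetical, and the actual line ending of the original file is not known from this thread.

```shell
# Hypothetical file whose lines end with a carriage return only (no newline).
printf 'chrom\tpos\tdepth\rchr1\t100\t12\rchr1\t101\t15\r' > cr_only.txt

# A quick check: wc -l counts newline characters, so a file that awk
# would see as one giant line reports 0 here.
newlines=$(wc -l < cr_only.txt | tr -d ' ')

# Option 1: tell awk that records end with a carriage return.
total_rs=$(awk 'BEGIN { RS = "\r" } NR > 1 { sum += $NF } END { printf "%.0f\n", sum }' cr_only.txt)

# Option 2: convert the line endings first, then use awk's default record separator.
total_tr=$(tr '\r' '\n' < cr_only.txt | awk 'NR > 1 { sum += $NF } END { printf "%.0f\n", sum }')

echo "$newlines $total_rs $total_tr"   # prints 0 27 27
```

Either option lets awk see one record per position again, so the sum is accumulated line by line instead of over one huge record.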