Add head to coverage output for many files
1
0
Entering edit mode
7.4 years ago
Cindy ▴ 70

Hi, all, I used samtools depth to output read depth for all samples, those files have three columns but no head. I want to merge all samples together to calculate the mean read depth and sd, so I need to add sample name as a header before merge them together.

When I use echo, it also output ref pos and $F as head, but what I need is that $F is the sample name. My code is

find ReadDepth/ -name "OUT_*" | while read F ; 
do
echo "$(echo 'ref pos $F' | cat - $F)" > ${F%}_head
done

Please help me to correct it.

Thanks

sequencing • 2.0k views
ADD COMMENT
0
Entering edit mode

Can you illustrate by an example? Btw the sequencing tag in Q is irrelevant!

ADD REPLY
1
Entering edit mode
7.4 years ago
James Ashmore ★ 3.5k

You could also use the bedtools suite to create your coverage matrix with header information:

Calculate read coverage across the entire genome for multiple BAM files:

bedtools genomecov -bga -i sample1.bam -g genome.txt > sample1_genomecov.bedGraph
bedtools genomecov -bga -i sample2.bam -g genome.txt > sample2_genomecov.bedGraph
bedtools genomecov -bga -i sample3.bam -g genome.txt > sample3_genomecov.bedGraph

#  Easier and faster to do this with GNU parallel if you have it installed...
parallel 'bedtools genomecov -bga -i {} -g genome.txt > {.}_genomecov.bedGraph' ::: sample*.bam

Combine read coverage files into a single file with header:

bedtools unionbedg -header -names sample1 sample2 sample3 -g genome.txt -empty -i sample1_genomecov.bedGraph sample2_genomecov.bedGraph sample3_genomecov.bedGraph > genomecov.txt

You could then write a quick script to extract the statistics you want from the coverage file:

#!/usr/bin/env python3

import collections
import csv
import numpy

with open('genomecov.txt'. 'r') as handle:
    reader = csv.reader(handle, delimiter = '\t')
    header = next(reader)
    Row = collections.namedtuple('Row', header)
    for line in reader:
        row = Row(*line)
        coverage = row[3:]
        # Print the mean and standard deviation
        print(row.chrom, row.start, row.end, numpy.mean(coverage), numpy.std(coverage))

There are probably more elegant solutions available, but this should work if you're in a squeeze...

ADD COMMENT

Login before adding your answer.

Traffic: 1580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6