Calculate coverage from SAM file
0
1
Entering edit mode
4.9 years ago
NB ▴ 960

I have a really old sam format file that does not have any header or information on what build it has been aligned to. It looks something like this

    NS500377:28:H16A7BGXX:3:21410:1181:5368 99      chr11   3000035 255     40M     =       3000039 44      GTTTTTCACTGTTTCTCCCCATATTCCAGGTCTTACAGTG        AAAAAFA<AFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFF        XA:i:1  MD:Z:17G10A11   NM:i:2
NS500377:28:H16A7BGXX:3:21410:1181:5368 147     chr11   3000039 255     40M     =       3000035 -44     TTCACTGTTTCTCCCCATATTCCAGGTCTTACAGTGTGTG        FFF<FFFFFFFFFFFFFAFFFFAFFFFFFFFFFFFAAAAA        XA:i:2  MD:Z:13G10A15   NM:i:2
NS500377:28:H16A7BGXX:2:22105:14414:16397       163     chr11   3000074 255     39M     =       3000199 164     GTGTGTTTCTTATTTTGTAGGTTTATTCAGTGTTTGTGG AAAAAFF<FFAFFFFFFFAFFFFF.FFFFFAFFFFFFFA XA:i:1  MD:Z:24T14      NM:i:1
NS500377:28:H16A7BGXX:1:23206:13035:2291        163     chr11   3000187 255     40M     =       3000325 178     ACCATAGGTGTGTCTCATTTTCACATTTTATCAGTTTACA        AA<AAFF7<FFAFFAFA)7.)FFFAFF)<)F.<F)AF.F.        XA:i:0  MD:Z:29T9T0     NM:i:2

I have been asked to check for coverage, ie, calculating how many times part of a read maps to each coordinate in the genome.

But cannot think of any way of doing so other than using the tools such as picard, bedtools etc. But those will not work as this SAM file is incomplete/invalid without header/build info.

Does anyone have a custom script or know of a way to calculate coverage for such files ?

Thank you very much.

sam coverage no header samtools • 2.1k views
ADD COMMENT
0
Entering edit mode

Do you know the reference genome? If so you can simply take the header of any BAM file that was produced against the same reference. Extract the @HD and @SQ part of the header and then do something like:

cat header.txt this.sam | samtools view -o out.bam

From there on use any standard tool of choice.

ADD REPLY
0
Entering edit mode

unfortunately not. these are really old clinical samples that have been sent across.

ADD REPLY
0
Entering edit mode

See if you can infer the genome version from the first/last coordinates of mapped reads on each chromosome. The chromosome lengths and how much of them are hard masked has changed between genome versions.

If that doesn't work, just choose what you hope to be the correct genome version, since the results won't be off by much.

ADD REPLY
0
Entering edit mode

Samtools recently added a coverage sub-command, didn't it? Try samtools coverage and make sure you have the latest version of the tool (as I said it was added recently.

ADD REPLY
0
Entering edit mode

The problem here is that OP has no information about the genome version, it is not about tools.

ADD REPLY
0
Entering edit mode

I see. In this case, I would re-create the original FASTA from the SAM he has (it contains sequence id and sequence), and would properly realign to the reference (which seems to be the human genome). And then use samtools coverage.

Does it make sense @Nandini?

ADD REPLY

Login before adding your answer.

Traffic: 2808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6