I wrote a mapping tool that outputs mapping information in a tab-delimited format. Unfortunately, my tool does not output SAM format. I want to use other tools that require SAM files to analyze my output. Is there a command-line tool that will take mapping information as arguments and output it in SAM format? It's fine if I have to do this one mapping at a time; that's just a matter of writing a script.
Yes, I could learn everything about SAM format and write code to do this but that takes brain power and time.
Thanks.
If you already have the mapping information computed, then it seems like it would be less work and time to output it to the SAM text format (which isn't all that complicated compared to writing a mapping algorithm), rather than finding an existing tool that will translate text information to SAM text and then tailoring your output to that tool's requirements. All you need to do is take a look at the SAM specification and work your way through it column-by-column.
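To give a feel for how little is involved, here is a minimal Python sketch of writing a header and one alignment line with the 11 mandatory SAM columns. The function names and arguments are made up for illustration, and it assumes ungapped, single-end reads, a 1-based leftmost position, and a sequence already given on the reference strand. MAPQ 255 is the value the spec reserves for "not available".

```python
def sam_header(ref_lengths):
    """Build a minimal SAM header from a {reference_name: length} dict."""
    lines = ["@HD\tVN:1.6\tSO:unsorted"]
    for name, length in ref_lengths.items():
        lines.append(f"@SQ\tSN:{name}\tLN:{length}")
    return "\n".join(lines)


def sam_record(read_name, ref_name, pos, seq, reverse=False,
               mapq=255, qual="*", mismatches=None):
    """Build one SAM alignment line for an ungapped, single-end read.

    pos is the 1-based leftmost reference position (an assumption about
    the input); seq is assumed to be given on the reference strand.
    """
    flag = 16 if reverse else 0      # 0x10 = read is reverse-complemented
    cigar = f"{len(seq)}M"           # M = aligned base, match or mismatch
    fields = [read_name, flag, ref_name, pos, mapq, cigar,
              "*", 0, 0,             # RNEXT, PNEXT, TLEN: no mate
              seq, qual]
    if mismatches is not None:
        # Optional NM tag: edit distance, which equals the mismatch
        # count when there are no indels.
        fields.append(f"NM:i:{mismatches}")
    return "\t".join(str(f) for f in fields)


print(sam_header({"chr1": 1000}))
print(sam_record("read1", "chr1", 100, "ACGT", mismatches=1))
```

Writing the header plus one such line per mapping to a file should give something you can feed to other tools, though real data will likely need more care (paired reads, base qualities, unmapped reads).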
One thing you could do is use
samtools view
to sanity-check your output. You could pretty easily make a unit-testing jig like this, which would help you work out the kinks in your output.
Thanks Deedee. That's not the answer I was hoping for. You kind of convinced me that it might not be that complicated. My mapping information is kind of a special case, however, since it is alignment-free, i.e., I don't use an alignment algorithm as a base method. Basically, my mapping information consists of "read x maps to location y with z mismatches". I don't know what that would mean for mapping quality and CIGAR strings, for example.
I think Pierre has a great approach as well. Depending on what you want to accomplish, you can fudge some of the data. For example, you could just put [number of bases]M as the CIGAR string for every mapped read (M marks an aligned base, whether or not it matches the reference, so this stays valid even when there are mismatches), and have the MAPQ be the same for every mapped read as well. This is essentially what bed2bam is doing. In the end, if you don't care about the accuracy of most of the fields, you should just output to BED (which is far simpler) and then use bed2bam.
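If you go the BED route, the main thing to watch is the coordinate system: BED starts are 0-based and ends are exclusive. A rough Python sketch, assuming your tool reports 1-based positions and ungapped reads (the function name and arguments are hypothetical):

```python
def bed_line(ref_name, pos_1based, read_length, read_name,
             score=0, reverse=False):
    """Build one 6-column BED line for an ungapped read.

    Assumes pos_1based is the 1-based leftmost position of the read on
    the reference; BED wants a 0-based start and an exclusive end.
    """
    start = pos_1based - 1
    end = start + read_length
    strand = "-" if reverse else "+"
    return "\t".join(str(f) for f in
                     [ref_name, start, end, read_name, score, strand])


print(bed_line("chr1", 100, 50, "read1"))
```

In bedtools the conversion step is, as far as I know, the `bedtobam` subcommand, which takes the BED file with `-i` and a genome (chromosome sizes) file with `-g`.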
I should mention that based on what you said, you should be able to easily construct CIGAR strings. CIGAR is pretty simple:
10M1X39M
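means that, for the given read, the first ten bases matched the reference, the eleventh base mismatched, and the next 39 matched. (Strictly speaking, the SAM spec uses = for a sequence match and lets M cover both matches and mismatches, so 10=1X39= says the same thing more explicitly.)
Ah, I see. Looks like I have to do a little reading regarding bedtools, BED format, and SAM format. Thanks, people.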
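A minimal Python sketch of building a CIGAR string like the one above, assuming the mapper can report where the mismatches fall as 0-based offsets into the read, and that there are no insertions or deletions. The function name and the `match_op` parameter are made up for illustration.

```python
from itertools import groupby


def cigar_from_mismatches(read_length, mismatch_positions, match_op="="):
    """Run-length encode matches and mismatches into a CIGAR string.

    mismatch_positions: 0-based offsets into the read (an assumption
    about what the mapper reports). match_op is "=" by default; pass
    "M" to get the style used in the example above.
    """
    mismatches = set(mismatch_positions)
    if any(p < 0 or p >= read_length for p in mismatches):
        raise ValueError("mismatch position outside the read")
    # One operation per base, then collapse consecutive identical ones.
    ops = ["X" if i in mismatches else match_op
           for i in range(read_length)]
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


print(cigar_from_mismatches(50, [10], match_op="M"))  # 10M1X39M
print(cigar_from_mismatches(50, [10]))                # 10=1X39=
```

If only the number of mismatches is known and not their positions, a plain [read length]M is the safer choice, with the count stored in the NM tag.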