Alright, so what you're actually asking is background info about the SAM header.
Just to cover all the basics (you may be aware of them already, but bear with me), here's an image I often use when I teach, which shows a schema of a typical SAM/BAM file.
As you can see, the header section and the alignment section are very different in terms of both their content and their format.
- the header contains information about how the alignment was generated and stored
- every line belonging to the header section begins with @, followed by a "record type", such as SQ, followed by tag:value pairs where tag is a two-letter string (such as LN). Every record type has well-defined tags that belong to it, and every tag has a specific way in which its values are denoted. Take, for example, the record type SQ, which stands for "reference sequence dictionary" in SAM spec speak or "reference genome" in bioinfo terms
- if you look up the SAM file specs, pages 3-5, you can see that for SQ the following tags are allowed: SN, LN, AH, AN, AS, DS, M5, SP, UR
For a hypothetical organism with 3 chromosomes of length 1000, 1500, and 3000, the entries in the header section could look as follows (in the actual file the fields are separated by tabs):
@SQ SN:chr1 LN:1000
@SQ SN:chr2 LN:1500
@SQ SN:chr3 LN:3000
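To make the tag:value layout concrete, here is a minimal Python sketch that reads header lines like the ones above and collects the chromosome lengths. The function name `parse_header_line` is just something I made up for illustration, and the sketch assumes tab-separated fields as in a real SAM file; it is not a replacement for a proper SAM/BAM library.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, {tag: value}).

    Hypothetical helper for illustration; assumes tab-separated fields.
    """
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # drop the leading "@"
    if record_type == "CO":
        # @CO lines are free-text comments, not tag:value pairs
        return record_type, {"text": "\t".join(fields[1:])}
    tags = {}
    for field in fields[1:]:
        # split only on the first colon, values may contain colons themselves
        tag, value = field.split(":", 1)
        tags[tag] = value
    return record_type, tags


header = (
    "@SQ\tSN:chr1\tLN:1000\n"
    "@SQ\tSN:chr2\tLN:1500\n"
    "@SQ\tSN:chr3\tLN:3000\n"
)

# Build a {chromosome name: length} lookup from the @SQ lines
lengths = {}
for line in header.splitlines():
    record_type, tags = parse_header_line(line)
    if record_type == "SQ":
        lengths[tags["SN"]] = int(tags["LN"])

print(lengths)  # {'chr1': 1000, 'chr2': 1500, 'chr3': 3000}
```

This is roughly the kind of lookup that downstream tools build from the header when they need the chromosome lengths.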
So, in summary:
- the header is theoretically optional, but very basic information such as the lengths of the chromosomes of the reference genome is often required by downstream tools
- EDIT following a comment by Genomax: if you decide to include a header with certain entries, such as SQ, there are tags that may be required for a properly formatted SAM/BAM file (those are marked by asterisks in the SAM specs)
- the CO line is handy to keep track of the specific alignment command that was used to generate a BAM file. If you're merging multiple BAM files, you may want to keep multiple CO lines to indicate the differences between the commands that were used, retain a single one if you used the same command for all the individual files, or do something else entirely. The choice is yours as to how much metadata you want to keep in the header.
If you're just starting out, you probably don't want to add your own custom-brewed entries to the header. I would recommend using the header that contains the info that is relevant and correct for all the BAM files you're merging.
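One way to check whether a single header really is "correct for all the BAM files" is to compare the headers before merging. The sketch below is only an illustration under a few assumptions: the header text of each file is already available as a string (for example as printed by `samtools view -H`), the function names `split_header` and `merge_plan` are hypothetical, and the @CO contents are placeholders rather than real commands. It verifies that all files share the same @SQ lines and collects the distinct @CO lines.

```python
def split_header(header_text):
    """Return the @SQ lines and the @CO lines of one header (hypothetical helper)."""
    sq_lines, co_lines = [], []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            sq_lines.append(line)
        elif line.startswith("@CO"):
            co_lines.append(line)
    return sq_lines, co_lines


def merge_plan(headers):
    """Check that all headers share the same @SQ lines and gather unique @CO lines."""
    reference_sq, _ = split_header(headers[0])
    comments = []
    for header_text in headers:
        sq_lines, co_lines = split_header(header_text)
        if sq_lines != reference_sq:
            # Different reference dictionaries: one header cannot describe all files
            raise ValueError("headers disagree on their @SQ lines")
        for line in co_lines:
            if line not in comments:  # keep first occurrence, preserve order
                comments.append(line)
    return reference_sq, comments


# Placeholder headers; the @CO text stands in for whatever command was recorded
header_a = "@SQ\tSN:chr1\tLN:1000\n@CO\tcommand used for sample A\n"
header_b = "@SQ\tSN:chr1\tLN:1000\n@CO\tcommand used for sample B\n"

sq, comments = merge_plan([header_a, header_b])
print(len(sq), "shared @SQ line(s);", len(comments), "distinct @CO line(s)")
```

If the @SQ lines differ, the files were aligned against different references (or differently sorted ones), and that is worth sorting out before merging hundreds of samples.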
One more comment: I don't think you meant RQ; I assume you're referring to @RG. To find out more about the significance of that particular entry, you may find this biostars post helpful.
And one last question: Why do you want to merge the files in the first place?
How did you cut the BAM files? Can you post the command that you used?
I used Galaxy for that to avoid downloading them to my PC. I left the default options and just entered the region that I wanted.
Which function did you use in Galaxy and with which settings?
Just so you get consistent help you may want to post this over at https://help.galaxyproject.org/ which is the official support site for PSU Galaxy.
Out of curiosity: why didn't you use the "Merge BAM Files" function in Galaxy?
I have merged them in both galaxy and samtools. It worked in both cases. But this is only for two BAMs. I am going to have to work soon with hundreds of samples. Right now, I am doing investigations about merging to make sure that this is the appropriate next step. I want to calculate the coverage per locus from the resulting BAM, in the end, after some more downstream processing.
I still do not understand what the exact link between merging and the SAM headers is, and why some people choose to add/replace headers (like @RQ and @CO). I mean, I know that @RQ has to be unique, but besides this?