Question

Merging SAM/BAM files

0

Entering edit mode

6.4 years ago

Sammy ▴ 30

Hello. I am fairly new to bioinformatics. I want to merge quite a few BAM files resulted from TopHat alignments. So I can play with them easier (as I don't have a lot of storage space) i cut them by the region of interest. So now I only have around 10kb per BAM file. I am not sure how to correctly merge BAM files with samtools (in Linux). Do I have to change the headers? More specifically to remove the headers for which I don't have the alignment lines anymore? I noticed that I have the same headers before and after I cut them. Also, do I need to introduce an @RQ header before merging them, assuming that I don't have one?

RNA-Seq • 5.7k views

ADD COMMENT • link 6.4 years ago by Sammy ▴ 30

0

Entering edit mode

How did you cut the BAM files? Can you post the command that you used?

ADD REPLY • link 6.4 years ago by Friederike 9.0k

0

Entering edit mode

I used Galaxy for that so I avoid downloading them in my PC. I left the default options. I just introduced the region that I wanted.

ADD REPLY • link 6.4 years ago by Sammy ▴ 30

0

Entering edit mode

Which function did you use in Galaxy and with which settings?

ADD REPLY • link 6.4 years ago by Friederike 9.0k

0

Entering edit mode

samtools view -b -h my.bam RNAME chr11:1200000-1300000

ADD REPLY • link 6.4 years ago by Sammy ▴ 30

0

Entering edit mode

Just so you get consistent help you may want to post this over at https://help.galaxyproject.org/ which is the official support site for PSU Galaxy.

ADD REPLY • link 6.4 years ago by GenoMax 153k

0

Entering edit mode

out of curiosity: why didn't you the "Merge BAM Files" function in Galaxy?

ADD REPLY • link 6.4 years ago by Friederike 9.0k

0

Entering edit mode

I have merged them in both galaxy and samtools. It worked in both cases. But this is only for two BAMs. I am going to have to work soon with hundreds of samples. Right now, I am doing investigations about merging to make sure that this is the appropriate next step. I want to calculate the coverage per locus from the resulting BAM, in the end, after some more downstream processing.

I still do not understand what is the exact link between merging and the headers in SAM and why some people choose to add/replace headers (like @RQ and @CO). I mean, I know that @RQ has to be unique but besides this?

ADD REPLY • link 6.4 years ago by Sammy ▴ 30

score 2 · Answer 1 · 2019-04-22

2

Entering edit mode

6.4 years ago

Friederike 9.0k

Alright, so what you're actually asking is background info about the SAM header.

Just to cover all the basics (you may be aware of them already, but bear with me), here's an image I often use when I teach, which shows a schema of a typical SAM/BAM file.

The header section and the alignment section are very different in terms of their content and their format as you can see.

the header contains information about how the alignment was generated and stored
every line belonging to the header section begins with @, followed by a "record type", such as SQ, followed by tag:value pairs where tag is a two-letter string (such as LN). Every record type has well defined tags that belong to it and every tag has a specific way in which its values are denoted. Take, for example, the record type SQ, which stands for "reference sequence dictionary" in SAM spec speak or "reference genome" in bioinfo terms
- if you look up the SAM file specs, pages 3-5, you can see that for SQ the following tags are allowed: SN, LN, AH, AN, AS, DS, M5, SP, UR
  - all of this is optional

A typical entry for a hypothetical organism with 3 chromosomes of length 1000, 1500, and 3000, could be represented as follows in the header section:

@SQ SN:chr1 LN:1000
@SQ SN:chr2 LN:1500
@SQ SN:chr3 LN:3000

So, in summary:

the header is theoretically optional, but often the very basic information such as the lengths of the chromosomes of the reference genomes are required by downstream tools
EDIT following a comment by Genomax: if you decide to include a header with certain entries, such as SQ, there are tags that may be required for a properly formatted SAM/BAM file (those are marked by asterisks in the SAM specs)
the CO line is handy to keep track of the specific alignment command that was used to generate a BAM file -- if you're merging multiple BAM files, you either want to have multiple CO lines to indicate the differences between the commands that may have been used or you may just want to retain a single one if you used the same command for all the individual files or something else entirely - the choice is yours as to how much meta-data you want to keep in the header.

If you're just starting out you probably don't want to add your own custom-brewed entries to the header, I would recommend to use the one that contains the info that are relevant and correct for all the BAM files you're merging.

One more comment: I don't think you meant RQ, I assume you're referring to @RG. To find out more about the significance of that particular entry, you may find this biostars post helpful.

And one last question: Why do you want to merge the files in the first place?

ADD COMMENT • link 6.4 years ago by Friederike 9.0k

0

Entering edit mode

Yes, I meant @RG. Thanks for the answers. Sounds like probably I would want the CO record type.

Right, so to answer your question: I want to merge them so I can have a single BAM file with all the reads there. After that, I am planning to use samtools to create another tab-delimited file (by extracting certain fields with samtools view from the BAM) with the number of reads per position (that comes out from the all 100 samples). Seems pretty straight forward but I actually haven't talked with anyone about this. Am I talking Sci-Fi?

I mean creating the tab-delimited file seems pretty simple. I just have to copy the fields in a .txt file. At this point I just have to learn how to create columns and rows and add them to the same .txt file from my BAM file.

ADD REPLY • link 6.4 years ago by Sammy ▴ 30

0

Entering edit mode

Why do you have separate BAM files to begin with?

Going from SAM to text file may not be problematic in a toy example, but if you have a typical NGS experiment with millions to hundred millions of reads, it can become computationally daunting. What is the final question that you'd like to address?

ADD REPLY • link 6.4 years ago by Friederike 9.0k

0

Entering edit mode

Because multiple alignments, multiple BAMs? No more questions. Thanks for your help.

ADD REPLY • link 6.3 years ago by Sammy ▴ 30

0

Entering edit mode

sure, but why don't you just do one alignment for all the fastq files, creating one bam file from the get-go? (not saying it's wrong the way you do it, just curious)

ADD REPLY • link 6.3 years ago by Friederike 9.0k

0

Entering edit mode

Huh. Probably because I don't think I know what you mean by that. Can you please give me more details? I am intrigued.

ADD REPLY • link 6.3 years ago by Sammy ▴ 30

0

Entering edit mode

I guess I don't know enough about your set-up and overall goal. Typically, there is little need to merge BAM files if you have access to the FASTQ files. For example, if you have multiple FASTQ files because your samples were multiplexed (sequenced across different lanes) and very deeply, then you could supply all FASTQ files belonging to the same sample to the alignment tool, generating one BAM file. So, how did you end up having multiple BAM files to begin with?

EDIT: I guess, I've done too much RNA-seq these days. I may very well be that only STAR accepts multiple FASTQ files, but BWA, for example, might not.

ADD REPLY • link 6.3 years ago by Friederike 9.0k