Hi.
I performed an RNAseq analysis using RSEM on a computer cluster. It made an output BAM file, which I downloaded and tried to view in the Genome Viewer so I could verify whether the expression that it mapped really maps to the established genes. However, when I imported my BAM file, it gave an error, since my organism's default genome expects chromosomes in the form "chr1, chr2, chr3" whereas my BAM file has "I, II, III".
I tried editing the header file with Samtools using this command:
samtools view -H file.bam | sed -e 's/SN:III/SN:chr3/' -e 's/SN:II/SN:chr2/' -e 's/SN:I/SN:chr1/' | samtools reheader - file.bam > newfile.bam
However, it still didn't work, giving me the same error. I converted it to SAM and opened it up and basically found that while the header files contain my new naming convention, the rest of the file still uses Roman numerals. Is there a simple way that I can rename all instances of the old chromosome naming convention?
Thanks.
Because you used
-H
option for samtools view only headers got converted.Mmhh I think what the OP has done should work. S/he used -H to edit only the header of the original bam file and the edited header sent to
samtools reheader
.But that left the identifiers in the rest of the file as they were. @edsouza was trying to change them in the entire file.
Right. Is there an option to
sed
all the instances of Roman Numerals throughout the body of the file too?There are no identifiers in the rest of the file. The only identifiers are in the header.
Hi Devon,
That's what I first thought. However, after doing a
head
for the first 500 lines of the original file, I got data in this form:The column that says "II" or "III" is a reference to the chromosome, and appears in more than just the header.
You didn't modify the original file, you made a new file with the modified header. The SAM output is distinct from the underlying BAM file. In the BAM file, the alignments each have an index into the header saying which chromosome they're on. When that's printed out as a SAM file, the associated chromosome name is then used (the same goes for
=
in the 7th column).Ah! It seems that I'd been confusing the BAM file with the unpacked SAM file. Meaning, when you unpack BAM to SAM, all that matters is the BAM header; is this correct? I'm going to re-download the edited bam file and see if it works in IGV.
Correct, all that matters in BAM is the header. That's one of the reasons that
samtools reheader
can be as quick as it is (and also why BAM files aren't larger).BTW, there's apparently now an
-i
option that can be used to tellsamtools reheader
to modify a file in place (it's a new option to me, though perhaps I'd just never noticed it before).duplicate of Bam File: Change Chromosome Notation
Hi, that's actually the link I based my script off of. However, like @genomax2 said, the identifiers in the rest of the file have stayed the same.
You used
-H
instead of-h
withsamtools view
(@Pierre's example in other thread is the correct usage). Small difference but a very different result.My guess is that samtools reheader isn't dealing nicely with the pipe. Write the modified header to a file and then try reheader. This will also allow you to double check that you got everything with sed.