Bam File: Change Chromosome Notation
3
15
Entering edit mode
13.2 years ago
Lisanne ▴ 160

Dear All,

I am using BAM files for chip-seq analysis. The chromosome notation in a usual BAM file is like: chr1. In my file the chromosomme notation is 1. Is there a way to change that in the BAM file? For further analysis it is very important to change the notation.

Many thanks!

Greetz Lisanne

samtools bam • 50k views
ADD COMMENT
16
Entering edit mode
13.2 years ago

Edit: ~10 years later.

Some options:

1) See I wrote http://lindenb.github.io/jvarkit/ConvertBamChromosomes.html

2) You can use samtools view to dump your data with the header: http://samtools.sourceforge.net/samtools.shtml and replace the chromosomes with sed or awk. (not tested but it could be something like below):

samtools view -h file.bam |\
sed -e '/^@SQ/s/SN\:/SN\:chr/' -e '/^[^@]/s/\t/\tchr/2'|\
awk -F ' ' '$7=($7=="=" || $7=="*"?$7:sprintf("chr%s",$7))' |\
tr " " "\t"

3) You could also use PICARD ReplaceSamHeader.

4) If you need to work with samtools or the gatk, another hack is to create a 'mock' faidx index to the reference genome. See my blog.

ADD COMMENT
1
Entering edit mode

I know it works, but i would like to understand it a bit better, if possible. I tried the same without the awk and tr and it also works fine. the sed command add the 'chr' into the right place. What exactly does awk do afterwards?

ADD REPLY
1
Entering edit mode

the 7th column contains the name of the chromosome. You have to add the chr prefix if this value is different from "=" or "*"

ADD REPLY
0
Entering edit mode

ok. I get it. somehow my bam file has the chromosomes on the 3rd. column:

HWI-EAS225_0031_FC:2:1:19139:1546#0     99      2R...   
HWI-EAS225_0031_FC:2:1:19139:1546#0     147     2R...

On the 7th column I have only [*|=]

My problem was that it added in my header on the 7th column also a 'chr'.

ADD REPLY
0
Entering edit mode

@Pierre This command works fine, but what if you want to save the changes so that you create a new .bam file with the new chr notations? I just putted " > newfile.bam" at the end (after the final " in the tr command) but that doesn't seem to work. Am I overlooking something?

ADD REPLY
0
Entering edit mode

there's a command 'replace header' for samtools. See the doc.

ADD REPLY
0
Entering edit mode

@Pierre, Could you please give me an example of how to replace the chromosome names with the sed or awk commands? Because in the samtools view i can not find the commands. Thanks you!

ADD REPLY
0
Entering edit mode

@Pierre, is there a concern about sequences that may not be the same in the aligned BAM and in the new ref fasta she wants to set? I mean, perhaps, alignment has been made with a patched version or something else... in a word are we sure by doing this that a particular position is exactly the same in both references ?

ADD REPLY
0
Entering edit mode

quick note: not all sed versions can match a tab with t ... pretty annoying I agree ... in those case one should rewrite the sed line to awk with gsub

ADD REPLY
0
Entering edit mode

@Pierre I tried the code but something was not right. It really replaced '1' with "chr1". But next I failed to get the index file for the bam file. The error is "samtools index: file.bam is in a format that cannot be usefully indexed". I used samtools view -H to check @HD VN:1.3 SO:coordinate chr @SQ SN:chr1 LN:249250621 AS:NCBI37 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/net/1000g/mktrost/seqshop/gotcloud/gotcloud.ref/human.g1k.v37.fa chr

ADD REPLY
1
Entering edit mode

I guess the issue comes with the awk line of the code as it checks the 7th column; being either "=" or "*" and adds "chr" if doesn't match. Then the header lines are also added a "chr" on the 7th column. I don't know about the speed but modifying the awk and tr lines as below could work awk -F '\t' 'BEGIN{OFS="\t";} { if ($1 !~ /^@/ && $7 != "=" && $7 != "*") {$7 = "chr"$7; print$0;} else print$0}'

ADD REPLY
0
Entering edit mode

what is the output of the following command

file file.bam

?

ADD REPLY
15
Entering edit mode
9.9 years ago
petervangalen ▴ 210

The following line works well for me. It will replace the names of chromosome 1 to 22, X, Y and MT (mitochondrial DNA). It will create a new file with the name test_chr.bam instead of test.bam.

use Samtools

samtools view -H test.bam |\
   sed -e 's/SN:1/SN:chr1/' | sed -e 's/SN:2/SN:chr2/' | \
   sed -e 's/SN:3/SN:chr3/' | sed -e 's/SN:4/SN:chr4/' | \
   sed -e 's/SN:5/SN:chr5/' | sed -e 's/SN:6/SN:chr6/' | \
   sed -e 's/SN:7/SN:chr7/' | sed -e 's/SN:8/SN:chr8/' | \
   sed -e 's/SN:9/SN:chr9/' | sed -e 's/SN:10/SN:chr10/' | \
   sed -e 's/SN:11/SN:chr11/' | sed -e 's/SN:12/SN:chr12/' | \
   sed -e 's/SN:13/SN:chr13/' | sed -e 's/SN:14/SN:chr14/' | \
   sed -e 's/SN:15/SN:chr15/' | sed -e 's/SN:16/SN:chr16/' | \
   sed -e 's/SN:17/SN:chr17/' | sed -e 's/SN:18/SN:chr18/' | \
   sed -e 's/SN:19/SN:chr19/' | sed -e 's/SN:20/SN:chr20/' | \
   sed -e 's/SN:21/SN:chr21/' | sed -e 's/SN:22/SN:chr22/' | \
   sed -e 's/SN:X/SN:chrX/' | sed -e 's/SN:Y/SN:chrY/' | \
   sed -e 's/SN:MT/SN:chrM/' | samtools reheader - test.bam > test_chr.bam

If you want to do this for a bunch of bam files at once, use a loop. The variable $filename is your file without the extension (assuming there is no period except the one in ".bam").

for file in *.bam
do
  filename=`echo $file | cut -d "." -f 1`
  samtools view -H $file | \
      sed -e 's/SN:1/SN:chr1/' | sed -e 's/SN:2/SN:chr2/' | \
      sed -e 's/SN:3/SN:chr3/' | sed -e 's/SN:4/SN:chr4/' | \
      sed -e 's/SN:5/SN:chr5/' | sed -e 's/SN:6/SN:chr6/' | \
      sed -e 's/SN:7/SN:chr7/' | sed -e 's/SN:8/SN:chr8/' | \
      sed -e 's/SN:9/SN:chr9/' | sed -e 's/SN:10/SN:chr10/' | \
      sed -e 's/SN:11/SN:chr11/' | sed -e 's/SN:12/SN:chr12/' | \
      sed -e 's/SN:13/SN:chr13/' | sed -e 's/SN:14/SN:chr14/' | \
      sed -e 's/SN:15/SN:chr15/' | sed -e 's/SN:16/SN:chr16/' | \
      sed -e 's/SN:17/SN:chr17/' | sed -e 's/SN:18/SN:chr18/' | \
      sed -e 's/SN:19/SN:chr19/' | sed -e 's/SN:20/SN:chr20/' | \
      sed -e 's/SN:21/SN:chr21/' | sed -e 's/SN:22/SN:chr22/' | \
      sed -e 's/SN:X/SN:chrX/' | sed -e 's/SN:Y/SN:chrY/' | \
      sed -e 's/SN:MT/SN:chrM/' | samtools reheader - $file > ${filename}_chr.bam
done
ADD COMMENT
8
Entering edit mode

or just

sed  -e 's/SN:\([0-9XY]*\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/'
ADD REPLY
4
Entering edit mode

Thanks that is much more elegant. With an additional slight improvement my line is now:

for file in *.bam
do
    filename=`echo $file | cut -d "." -f 1`
    samtools view -H $file | sed -e 's/SN:\([0-9XY]\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/' | samtools reheader - $file > ${filename}_chr.bam
done
ADD REPLY
0
Entering edit mode

Hi petervangalen and Pierre,

I also downloaded BAM files, where prefix "chr" is missing.

Is it necessary to add prefix "chr" only in header or also in each line of BAM/SAM ?

For eg:

samtools view -H HDG04N.exome.bam | sed -e 's/SN:\([0-9XY]*\)/SN:chr\1/' -e 's/SN:MT/SN:chrM/'  | head -5

@SQ    SN:chr1    LN:249250621
@SQ    SN:chr2    LN:243199373
@SQ    SN:chr3    LN:198022430
@SQ    SN:chr4    LN:191154276
@SQ    SN:chr5    LN:180915260

samtools view HDG04N.exome.bam  | awk 'BEGIN {OFS="\t"} { print $1,$2,"chr"$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15 }' | head -4
HWI-ST1033:90:C0K0PACXX:8:2201:8091:30260    163    chr1    10004    8    100M    =    10048    144    CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJGHIIII@FHIIIGIIHGCCFDDFFAECCDB@C?ABDBDCA?BDDC?BDDDDDDDD    NM:i:0    AS:i:100    XS:i:100    RG:Z:1
HWI-ST1033:90:C0K0PACXX:8:2201:8091:30260    83    chr1    10048    8    100M    =    10004    -144    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC    8883DDB<<9DCA<8+BA<2<5DCA?=5DCA=>.CEB@;)HHHA=)IHGC@.IHCCB6IHDD?JJHGD:JIHFEAJJIHGEJJIGHGHHHGHDFFFFBBB    NM:i:0    AS:i:100    XS:i:93    RG:Z:1
HWI-ST1033:90:C0K0PACXX:8:2302:17841:157944    163    chr1    10332    40    21M1I74M    =    10365    135    CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTA    CCCFFFFFHHHGHJJJJJJJJJJJJJJJJJJJJJJJJIJJJJGGIJJIIJEGJCGGIJIIGHGEDDBDCDD@ABBAB<?BBB???BD@BCC?AABC    NM:i:1    AS:i:88    XS:i:88    RG:Z:1
HWI-ST1033:90:C0K0PACXX:8:1304:9484:179936    163    chr1    10343    40    10M1I90M    =    10358    117    CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCC    CCCFFFFFHHHHHIIGJJJJIGJJIJIIJJIICHHFHFFGHIJIHCGHIICGGGGG;==ECEFDBDEEAAC@CCD2=<ACB(:>A3@8A?CC8<AACC@?<    NM:i:2    AS:i:88    XS:i:88    RG:Z:1

Concatenate the output from both files as SAM and convert to BAM. Or simply converting it only in header is fine.

Thanks !

ADD REPLY
0
Entering edit mode

@Petervangalen if I had to use the above command for a vcf file then is the below command correct

| sed -e 's/SN:1/SN:chr1/' | sed -e 's/SN:2/SN:chr2/' | sed -e 's/SN:3/SN:chr3/' | sed -e 's/SN:4/SN:chr4/' | sed -e 's/SN:5/SN:chr5/' | sed -e 's/SN:6/SN:chr6/' | sed -e 's/SN:7/SN:chr7/' | sed -e 's/SN:8/SN:chr8/' | sed -e 's/SN:9/SN:chr9/' | sed -e 's/SN:10/SN:chr10/' | sed -e 's/SN:11/SN:chr11/' | sed -e 's/SN:12/SN:chr12/' | sed -e 's/SN:13/SN:chr13/' | sed -e 's/SN:14/SN:chr14/' | sed -e 's/SN:15/SN:chr15/' | sed -e 's/SN:16/SN:chr16/' | sed -e 's/SN:17/SN:chr17/' | sed -e 's/SN:18/SN:chr18/' | sed -e 's/SN:19/SN:chr19/' | sed -e 's/SN:20/SN:chr20/' | sed -e 's/SN:21/SN:chr21/' | sed -e 's/SN:22/SN:chr22/' | sed -e 's/SN:X/SN:chrX/' | sed -e 's/SN:Y/SN:chrY/' | sed -e 's/SN:MT/SN:chrM/' | input.vcf > output.vcf
ADD REPLY
0
Entering edit mode

Running this command is taking a longer time than expected more than an hour. Is this true or there must be some mistake in the command?

ADD REPLY
3
Entering edit mode
ADD COMMENT
0
Entering edit mode

Thanks a lot for the tool. This is something I might need quite a lot when switching between annotations.

I was wondering though how to use the --dict parameter when using the tool. The *dict file from picard contains in the second column a list of chromosomes. How does your tool know which chromosome to switch with which? Does it go top to bottom?

Also, Is it possible (or make sense at all) to use both the --dict and the -f parameters together?

ADD REPLY
0
Entering edit mode

I agree it's not clear. --dict is the source of the destination dictionary. I'm trying to map the chromosomes by adding/removing the chr prefix. You can also set the mapping by yourself by using the option -f . When using both, the mapping is validating versus the dictionary.

ADD REPLY
0
Entering edit mode

Maybe I need to test it first, before asking so many questions. But it is still not clear to me how to use it.

If I want to change from UCSC to Ensembl do I need to write all chromosomes? like chrI\tI\nchrII\tII\nchrIII\tIII\n...\nchrM\tmtDNA\n or would just chr\t\n would be enough to remove all prefixes?

What do you mean by the destination dictionary? would this file be created when I run the command?

I'll test it and see how far I get with the command

Thanks again for this helpful tool.

ADD REPLY
0
Entering edit mode

yes a simple mapping file should be enough.

ADD REPLY

Login before adding your answer.

Traffic: 1756 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6