Reheadering a subsetted bam gives truncated file
3
0
Entering edit mode
6.4 years ago
Luciferine • 0

Hi all,

I have a problem with reheadering a bam file which I subset before. I mapped against a merged genome of human and mouse (since it is a patient derived xenograft sample) and want to remove residual mouse reads. I did this with samtools (version 1.8) by simply subsetting for the human chromosomes.

samtools view -b mybam.bam chrM chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > no-mouse.bam

I want to remove the mouse chromosomes from the header by removing the respective lines from the header with sed (there are also some ERCCs etc. in the reference, which is why I have to remove so many lines) and reheadering the bam afterwards.

samtools view -H  no-mouse.bam | sed '27,253d' | samtools reheader - no-mouse.bam > no-mouse.rehead.bam

I end up with a truncated bam file, if I want to look at e.g. the end of the file with

samtools view no-mouse.rehead.bam | tail

I get

[main_samview] truncated file.

Following steps (e.g. AddOrReplaceReadGroups) give errors as well.

I've seen some posts on similar topics but didn't find any helpful solution unfortunately.

Can someone help me with that? Thanks a lot!

samtools bam • 3.1k views
ADD COMMENT
0
Entering edit mode

I mapped against a merged genome of human and mouse

Do you have a reference that this is good practice and reliable? I could imagine that because of homologies between the species, you'll get plenty of multimappers.

ADD REPLY
3
Entering edit mode

You can probably find a reference in the allele-specific literature, though we also use this strategy in the WGBS world.

In general, if you know your sample is a composite of a couple organisms then concatenating them minimizes the bias.

ADD REPLY
0
Entering edit mode

Ok, never heard of this strategy. Thanks!

ADD REPLY
0
Entering edit mode

Here some other things that didn't do the job either:

  • Converting to sam, changing header, reconverting to bam
  • Sorting the input bam again after subsetting (samtools sort)
  • ReorderSam to the human-only reference, this only works if chromosomes in header are the same as in reference.

I get the following error when I start any Picard tools (e.g. ReorderSam, AddOrReplaceReadGroups), maybe this might help to find the problem:

Exception in thread "main" java.lang.IllegalArgumentException: Reference name for '195' not found in sequence dictionary.

The numbers vary in different files, I also got '185' and others.

Hope somebody can help me! Thanks in advance!

ADD REPLY
0
Entering edit mode

Are you sure that your header order of chromosomes really is the same order as no-mouse.bam?

ADD REPLY
0
Entering edit mode

That's a good hint, thanks! If I look at the bam file the first reads I get are from chr1, whereas the header starts with chrM. But I don't really know how to fix this. I'll try ReorderSam on the subsetted file with the old header, maybe that'll work (reordering the subsetted file with the new header doesn't).

ADD REPLY
1
Entering edit mode

See my cat solution in the comment below, it should work without needing to use ReorderSam.

ADD REPLY
1
Entering edit mode
6.4 years ago

The thing with reheadering files is that you have to be absolutely sure that you're only removing chromosomes from the end of the header and never from the beginning, otherwise you'll end up borking the file. My guess is that your sed command is wrong, so write the header to a file and manually look at it.

ADD COMMENT
0
Entering edit mode

I tried saving the header separately, removing the lines I don't want (which are at the end of the header) and reheadering the file. It still gives me a truncated file.

ADD REPLY
0
Entering edit mode

Is the original file truncated before reheadering?

ADD REPLY
0
Entering edit mode

Nope, the original file is perfectly okay...

ADD REPLY
1
Entering edit mode

What happens if you cat fixed_header <(samtools view no-mouse.bam) | samtools view -bo no-mouse.rehead.bam -? That's a less efficient reheadering, but avoids issues due to swapping chromosome order.

ADD REPLY
0
Entering edit mode
6.4 years ago
Luciferine • 0

Thank you Devon, that indeed worked to find out the actual problem, which seems to be, that I have chimeric reads, where one read of a pair maps to a human chromosome and the other one to a mouse chromosome. These read pairs of course lack one read after filtering out all non-human mapped reads, leaving a human read with a reference to a non-existent mouse read. This causes problems in the following steps. I'm trying to get rid of those now.

ADD COMMENT
0
Entering edit mode
5 months ago
kalavattam ▴ 280

As Devon Ryan wrote in another post on this page,

The thing with reheadering files is that you have to be absolutely sure that you're only removing chromosomes from the end of the header and never from the beginning, otherwise you'll end up borking the file. My guess is that your sed command is wrong, so write the header to a file and manually look at it.

That said, there may still be cases in which users want to remove chromosomes from the beginning of the header, perhaps to filter corresponding alignments from the file and ensure the header accurately reflects that. One workaround is to convert the BAM file to a SAM file, then filter the header and body (alignments) using standard text-processing tools such as grep and awk; then convert the processed SAM file to a BAM file.

For example:

#!/bin/bash

#  This is an example of wanting to remove all chromosomes from a BAM file
#+ except those for the spike-in species in my data, S. pombe.
#+ 
#+ First, initialize a variable containing a space-separated string of
#+ chromosomes of interest. These are custom, non-standard names I gave the
#+ spike-in chromosomes, so you'll want to change this to reflect your data.
chromosomes="SP_I SP_II SP_III"

#  Convert the space-separated string of chromosomes to a pipe-separated string
#+ for use in pattern matching, then construct a `grep` RegEx to (a) retain
#+ the chromosomes of interest in `@SQ` lines and program (`@PG`) lines, all
#+ while (b) excluding alignment lines [those that do not begin with @
#+ (`^[^@])`]. Also, retain the line beginning with `@HD`.
#+ 
#+ Below, shellcheck encourages us to do string manipulation within the brace
#+ expansion, but that makes things hard to read IMO, so I manipulate the
#+ string by piping it to `sed`.
# shellcheck disable=SC2001,SC2028
adapted="$(echo "\\b${chromosomes}" | sed 's/ /\\b|\\b/g')\\b"
pattern="^@HD|^@SQ.*SN:(${adapted})|^@PG|^[^@]"

#  Run `samtools view` to convert a coordinate-sorted BAM infile, `${inbam}`,
#+ to a SAM infile, `${insam}`.
samtools view \
    -@ "${threads}" \
    -h \
    -o "${insam}" \
    "${inbam}"

#  Filter pertinent lines from the SAM header, and perform filtering for
#+ alignment lines by meeting the following conditions: Focus on lines that (1)
#+ don't begin with `@` (meaning they are not in the header) and (2) contain
#+ strings in variable `${chromosomes}` within field 3, the RNAME field (more
#+ info: samtools.github.io/hts-specs/SAMv1.pdf).
#+ 
#+ I like the "useless use of `cat`" here because of its readability, so I
#+ disable the corresponding note from shellcheck.
# shellcheck disable=SC2002
cat "${insam}" \
    | grep -E "${pattern}" \
    | awk \
        -v chromosomes="${chromosomes}" \
        'BEGIN {
            split(chromosomes, chrom_arr, " ")
            for (i in chrom_arr) {
                chrom_map[chrom_arr[i]] = 1
            }
        }
        {
            if ($0 ~ /^@/) {
                print $0
            } else {
                if ($3 in chrom_map) {
                    print $0
                }
            }
        }' \
            > "${outsam}"

#  Convert the SAM outfile `${outsam}` into a BAM outfile, `${outbam}`, then
#+ index the BAM outfile.
samtools view -Sb "${outsam}" > "${outbam}"
samtools index -@ "${threads}" "${outbam}"

#  Remove the intermediate SAM files, which potentially consume a lot of space.
rm "${insam}" "${outsam}"
ADD COMMENT

Login before adding your answer.

Traffic: 2206 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6