As Devon Ryan wrote in another post on this page,
The thing with reheadering files is that you have to be absolutely sure that you're only removing chromosomes from the end of the header and never from the beginning, otherwise you'll end up borking the file. My guess is that your sed command is wrong, so write the header to a file and manually look at it.
That said, there may still be cases in which users want to remove chromosomes from the beginning of the header, perhaps to filter corresponding alignments from the file and ensure the header accurately reflects that. One workaround is to convert the BAM file to a SAM file, then filter the header and body (alignments) using standard text-processing tools such as grep
and awk
; then convert the processed SAM file to a BAM file.
For example:
#!/bin/bash
# This is an example of wanting to remove all chromosomes from a BAM file
#+ except those for the spike-in species in my data, S. pombe.
#+
#+ First, initialize a variable containing a space-separated string of
#+ chromosomes of interest. These are custom, non-standard names I gave the
#+ spike-in chromosomes, so you'll want to change this to reflect your data.
chromosomes="SP_I SP_II SP_III"
# Convert the space-separated string of chromosomes to a pipe-separated string
#+ for use in pattern matching, then construct a `grep` RegEx to (a) retain
#+ the chromosomes of interest in `@SQ` lines and program (`@PG`) lines, all
#+ while (b) excluding alignment lines [those that do not begin with @
#+ (`^[^@])`]. Also, retain the line beginning with `@HD`.
#+
#+ Below, shellcheck encourages us to do string manipulation within the brace
#+ expansion, but that makes things hard to read IMO, so I manipulate the
#+ string by piping it to `sed`.
# shellcheck disable=SC2001,SC2028
adapted="$(echo "\\b${chromosomes}" | sed 's/ /\\b|\\b/g')\\b"
pattern="^@HD|^@SQ.*SN:(${adapted})|^@PG|^[^@]"
# Run `samtools view` to convert a coordinate-sorted BAM infile, `${inbam}`,
#+ to a SAM infile, `${insam}`.
samtools view \
-@ "${threads}" \
-h \
-o "${insam}" \
"${inbam}"
# Filter pertinent lines from the SAM header, and perform filtering for
#+ alignment lines by meeting the following conditions: Focus on lines that (1)
#+ don't begin with `@` (meaning they are not in the header) and (2) contain
#+ strings in variable `${chromosomes}` within field 3, the RNAME field (more
#+ info: samtools.github.io/hts-specs/SAMv1.pdf).
#+
#+ I like the "useless use of `cat`" here because of its readability, so I
#+ disable the corresponding note from shellcheck.
# shellcheck disable=SC2002
cat "${insam}" \
| grep -E "${pattern}" \
| awk \
-v chromosomes="${chromosomes}" \
'BEGIN {
split(chromosomes, chrom_arr, " ")
for (i in chrom_arr) {
chrom_map[chrom_arr[i]] = 1
}
}
{
if ($0 ~ /^@/) {
print $0
} else {
if ($3 in chrom_map) {
print $0
}
}
}' \
> "${outsam}"
# Convert the SAM outfile `${outsam}` into a BAM outfile, `${outbam}`, then
#+ index the BAM outfile.
samtools view -Sb "${outsam}" > "${outbam}"
samtools index -@ "${threads}" "${outbam}"
# Remove the intermediate SAM files, which potentially consume a lot of space.
rm "${insam}" "${outsam}"
Do you have a reference that this is good practice and reliable? I could imagine that because of homologies between the species, you'll get plenty of multimappers.
You can probably find a reference in the allele-specific literature, though we also use this strategy in the WGBS world.
In general, if you know your sample is a composite of a couple organisms then concatenating them minimizes the bias.
Ok, never heard of this strategy. Thanks!
Here some other things that didn't do the job either:
I get the following error when I start any Picard tools (e.g. ReorderSam, AddOrReplaceReadGroups), maybe this might help to find the problem:
The numbers vary in different files, I also got '185' and others.
Hope somebody can help me! Thanks in advance!
Are you sure that your header order of chromosomes really is the same order as no-mouse.bam?
That's a good hint, thanks! If I look at the bam file the first reads I get are from chr1, whereas the header starts with chrM. But I don't really know how to fix this. I'll try ReorderSam on the subsetted file with the old header, maybe that'll work (reordering the subsetted file with the new header doesn't).
See my
cat
solution in the comment below, it should work without needing to useReorderSam
.