I want to create pseudo-replicates from a paired-end BAM file (Both mates from a pair should be placed in the same BAM file). I wrote the following script to achieve this, but wondered if there was software available which already provides this functionality, or avoids loading all the BAM file into memory (which my script does during the read shuffling section of the code). I thought it would be great if samtools implemented a pseudo command to deal with this.
#!/bin/bash
# Set header
set -e
set -u
set -o pipefail
# Command arguments
FILENAME=$1
THREADS=$2
# Global variables
FILEPATH="${PWD}/${FILENAME}"
BASENAME=${FILEPATH%.*}
# Count mapped reads
PAIRS=$(samtools view -c -f 2 "${FILEPATH}")
MATES=$((PAIRS / 2))
# If mates is odd, round down to even number
if [[ $((MATES % 2)) -eq 0 ]];
then LINES=$MATES;
else LINES=$((MATES - 1));
fi
# Shuffle and split file
samtools sort -n -T "${BASENAME}_NAMESORT" -@ "${THREADS}" "${FILEPATH}" | samtools view -f 2 - | awk '{printf("%s%s",$0,(NR%2==0)?"\n":"\0")}' | shuf | tr "\0" "\n" | split -d -l "${LINES}" - "${BASENAME}"
samtools view -H "${FILEPATH}" > "${BASENAME}_HEAD.sam"
# Replace header
cat "${BASENAME}_HEAD.sam" "${BASENAME}00" | samtools sort -@ "${THREADS}" -T "${BASENAME}_PR1" -o "${BASENAME}_PR1.bam" -
cat "${BASENAME}_HEAD.sam" "${BASENAME}01" | samtools sort -@ "${THREADS}" -T "${BASENAME}_PR2" -o "${BASENAME}_PR2.bam" -
# Clean up
rm -f "${BASENAME}_HEAD.sam"
rm -f "${BASENAME}00"
rm -f "${BASENAME}01"
rm -f "${BASENAME}02"
seqtk
can sub-sample files. You will need to convert your BAM to fastq though.Hi, I tried your code but right after the first sorting step I get the following error from samtools view
[bam_header_read] invalid BAM binary header (this is not a BAM file). [main_samview] fail to read the header from "-".
Please advice. Many thanks!
Without looking at your BAM file I can only guess that it's because you have a different version of samtools. Since writing this script there has been a few updates and I believe the way you pass in data through standard input may have changed. You don't specify which line the script fails on, if it's the first instance of samtools view (line 27) try replacing "samtools view -" with "samtools view - -", if it's the second instance of samtools view (line 28) it seems that your original BAM file doesn't have a header or is corrupted in someway.
Thanks James, it is the first instance (line 27). I will try your fix later and keep you posted. The same BAM has been used subsequently for other BASH and R based analyses without problems. That been said I am using samtools-0.1.19 which also generated the known "EOF marker is absent" bug which I ignored since my hexdump -C output is correct.
Hi James, as guessed it was a compatibility issue with samtools, it runs smoothly now. I have another question though. I notice a significant size difference between the original BAM file and the two pseudo-reps. For example an original BAM may have a size of 6 Gb and the sum of the two pseudo-rep BAMs may total to 4.5 Gb. My original mapping is performed with tophat (I an doing RNA-seq analysis) with a concordant read percentage up to 90%, therefore the discorcordant reads cannot simply account for this difference. Have you observed something similar? Is there a certain bias that I should keep in mind? I will perform severral diagnostics on my side I just wanted to have your opinion about it.
Once again thank for your time
Glad to heard it still works! The script only extracts properly paired reads. Have a look how many reads are classed as properly paired in your original BAM file, and then double check the number in each of the pseudo BAM files.
Yeap the maths do sum up now thanks James!