I did this recently but I aligned with bowtie2. Please note I provide code but you would have to modify it slightly for your case.
To create the genome index, I edited the chromosome names in each genome fasta file. So instead of chr1, chr2, chr3... I made it hg_chr1, hg_chr2, hg_chr3. And likewise for the other genome fasta.
e.g.
zcat Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | \
sed -E "s/>(.+)( dna:.+)/>chr\1_hg38\2/" > hg38.tmp
zcat Mus_musculus.GRCm38.dna.primary_assembly.fa.gz | \
sed -E "s/>(.+)( dna:.+)/>chr\1_mm10\2/" > mm10.tmp
cat hg38.tmp | cat - mm10.tmp > hg38_mm10.fa
rm hg38.tmp
rm mm10.tmp
Then I combined both genome files with the unique chromosome IDs and used that as the index.
bowtie2-build --threads 8 -f ./hg38_mm10.fa hg_mm
After alignment, you can easily distinguish which genome each read is aligned to based on the chromosome name (hg_chr1 vs mm_chr1). I used the XS:i
flag in the BAM file to only pull out uniquely mapped reads. It may sound like a lot of work but it was pretty simple to execute.
Here was the more complicated part of separating the BAM into each species with only unique reads.
for file in *.sam
do
# filter out multimappers (reads with 'XS:i' inform us of multimapping)
cat $file | grep -v 'XS:i' > tmp.unique.sam
# filter for high quality reads and convert to bam (below)
samtools view -b -@ 6 -q 35 *unique.sam > tmp.filt.unique.bam
rm tmp.unique.sam
# split into separate human and mouse sam files
samtools view tmp.filt.unique.bam | grep -E $'\tchr.+hg38' > hg38.noheader.sam
samtools view -H tmp.filt.unique.bam | grep -v 'mm10' | cat - hg38.noheader.sam | sed 's/_hg38//' | \
samtools view -b -@ 6 > ./hg38/"${file:0:-4}".hg38.bam
rm hg38.noheader.sam
##
samtools view tmp.filt.unique.bam | grep -E $'\tchr.+mm10' > mm10.noheader.sam
samtools view -H tmp.filt.unique.bam | grep -v 'hg38' | cat - mm10.noheader.sam | sed 's/_mm10//' | \
samtools view -b -@ 6 > ./mm10/"${file:0:-4}".mm10.bam
rm mm10.noheader.sam
##
rm tmp.filt.unique.bam
done
I suggest that you should try
bbsplit.sh
as noted in this thread (A: Tool to separate human and mouse ran seq reads ). It will make this process much simpler.You can filter your original fastq files using BBSplit.sh
@Antonio: I am moving this to a comment since this is not an answer for the original question but an alternate way of achieving the end result OP wants.
Many thanks genomax and Antonio,
I've came across with this tool which I found very useful. However, my end goal is to see hoy many reads across samples are from mouse and how many are from plasmodium from the total number of reads. That being so (maybe I'm wrong) but I thought filtering the fastq files before will be easier but won't allow me to see "the big picture" about the distribution of reads per species in my samples. Can I make this comparison with the fastq files filtered? Should I align afterwards with the same reference file and have two different bam files per sample (one with unique mouse reads and other with unique plasmodium reads)?
Again, many thanks for your help.
You are not filtering the reads but actually binning them into pools by using
bbsplit.sh
(by aligning them to the references provided). This allows you to find uniquely mapping reads for each genome and get then in individual files. The program will also bin reads that multi-map across species in separate (ambiguous data) files. While it is possible to getbam
files directly frombbsplit.sh
it is better to do the split and then do alignments on split data files withbbmap.sh
to avoid issues visualizing the data with a genome viewer like IGV.