I aligned an NGS read dataset to a conactenated reference consisting of the human, mouse and pig genomes. Now I would like to extract the human (GRCh38) aligned reads only into a seperate .bam file (and same for pig and mouse).
One option I found in a paper used the following workflow (for GRC37/hg19):
samtools view -b ${SAMPLE}.aligned_to_ConcatRef.bam chrM.hg19 chr1.hg19 ... chrX.hg19
chrY.hg19 > ${SAMPLE}.aligned_to_ConcatRef.hg19_only.bam
But how could this work? If I look at the headers in the human genone they are as per the following
grep 'GRCh38' GCF_000001405.39_GRCh38.p13_genomic.fna
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
I replaced '.hg19' with 'hg38' in the samtools view command above and as expected the regions (eg chr1.hg38) are invalid.
The other option I can think of is to use SeqIO and pySam.
1) read the animal chromosome, get all the accesion numbers in the genome 2) use pySam to read the bam file, get each matching read from the input bam file to each of the accessions in 1) 3) write out the matching reads in a sam file format and convert to bam
But I assume this would be a fairly common thing to do and no need to spend many hours messing with loops and sam file formatting just to get each animal genome aligned reads from a bam file. Any help would be appreciated.
I am not answering your question since you seem to know what is needed.
What I will suggest that in future you use a
bbsplit
from BBMap suite to bin the reads into genome specific bins.bbsplit
allows you to intelligently map multi-mapping reads. See this post to get started.Thanks, I have used BBsplit as well, but would like a 'second optinion' as per Jo et al. (2019) https://doi.org/10.1186/s13059-019-1849-2