Question

Extract human, mouse and pig aligned reads from .bam file alignment to conactenated genomes

0

Entering edit mode

3.1 years ago

daewowo ▴ 80

I aligned an NGS read dataset to a conactenated reference consisting of the human, mouse and pig genomes. Now I would like to extract the human (GRCh38) aligned reads only into a seperate .bam file (and same for pig and mouse).

One option I found in a paper used the following workflow (for GRC37/hg19):

samtools view -b ${SAMPLE}.aligned_to_ConcatRef.bam chrM.hg19 chr1.hg19 ... chrX.hg19
chrY.hg19 > ${SAMPLE}.aligned_to_ConcatRef.hg19_only.bam

But how could this work? If I look at the headers in the human genone they are as per the following

grep 'GRCh38' GCF_000001405.39_GRCh38.p13_genomic.fna

>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly

I replaced '.hg19' with 'hg38' in the samtools view command above and as expected the regions (eg chr1.hg38) are invalid.

The other option I can think of is to use SeqIO and pySam.

1) read the animal chromosome, get all the accesion numbers in the genome 2) use pySam to read the bam file, get each matching read from the input bam file to each of the accessions in 1) 3) write out the matching reads in a sam file format and convert to bam

But I assume this would be a fairly common thing to do and no need to spend many hours messing with loops and sam file formatting just to get each animal genome aligned reads from a bam file. Any help would be appreciated.

Extract ConcatRef bam • 1.6k views

ADD COMMENT • link 3.1 years ago by daewowo ▴ 80

1

Entering edit mode

I am not answering your question since you seem to know what is needed.

What I will suggest that in future you use a bbsplit from BBMap suite to bin the reads into genome specific bins. bbsplit allows you to intelligently map multi-mapping reads. See this post to get started.

ADD REPLY • link 3.1 years ago by GenoMax 148k

0

Entering edit mode

Thanks, I have used BBsplit as well, but would like a 'second optinion' as per Jo et al. (2019) https://doi.org/10.1186/s13059-019-1849-2

ADD REPLY • link 3.1 years ago by daewowo ▴ 80

score 1 · Accepted Answer · 2021-12-08

You can use the following python code to do this

from Bio import SeqIO
import pysam

def get_fasta_ids(sra_file):
    #read in the chromosome ids from the genome reference
    chr_list=[]
    for record in SeqIO.parse(sra_file, "fasta"):
        chr_list.append(record.id)
    return chr_list

def write_sam(bam_file, chr_list, out_file, bin_code='wb'):
    #write out a subset of reads in the bam_file, 'w' for sam, 'wb' for bam
    sam = pysam.AlignmentFile(bam_file, 'rb')
    with pysam.AlignmentFile(out_file, bin_code, header=sam.header) as outf:
        if sam is not None:
            for record in sam:
                if record.reference_name in chr_list:
                    outf.write(record)