Question

Extract uniquely mapped reads from one species

0

Entering edit mode

6.8 years ago

PSB • 0

Hi,

Using STAR and a reference fasta file with both the genomes of mouse and plasmodium I've mapped my fastq files (with reads from both mouse and plasmodium). I used a GFF annotation file too in the alignment. Now I have BAM files with reads mapped uniquely to mouse or plasmodium genome and multimapped to both mouse and plasmodium.

I'm interested on extracting the uniquely mapped reads of mouse, the uniquely mapped reads from plasmodium and the multimapped reads.

Since the uniquely mapped reads in the BAM file could be from mouse or plasmodium, I was wondering how can I extract the uniquely mapped reads from each specie. My first idea is to use bamtools/bedtools with the GFF file but I'm unsure about how to do this. Any ideas?

Thanks in advance and thanks for reading.

RNA-Seq dual RNA-seq host-pathogen • 5.0k views

ADD COMMENT • link updated 6.8 years ago by goodez ▴ 640 • written 6.8 years ago by PSB • 0

1

Entering edit mode

I suggest that you should try bbsplit.sh as noted in this thread (A: Tool to separate human and mouse ran seq reads ). It will make this process much simpler.

ADD REPLY • link 6.8 years ago by GenoMax 150k

1

Entering edit mode

You can filter your original fastq files using BBSplit.sh

ADD REPLY • link 6.8 years ago by Antonio R. Franco ★ 5.2k

0

Entering edit mode

@Antonio: I am moving this to a comment since this is not an answer for the original question but an alternate way of achieving the end result OP wants.

ADD REPLY • link 6.8 years ago by GenoMax 150k

0

Entering edit mode

Many thanks genomax and Antonio,

I've came across with this tool which I found very useful. However, my end goal is to see hoy many reads across samples are from mouse and how many are from plasmodium from the total number of reads. That being so (maybe I'm wrong) but I thought filtering the fastq files before will be easier but won't allow me to see "the big picture" about the distribution of reads per species in my samples. Can I make this comparison with the fastq files filtered? Should I align afterwards with the same reference file and have two different bam files per sample (one with unique mouse reads and other with unique plasmodium reads)?

Again, many thanks for your help.

ADD REPLY • link 6.8 years ago by PSB • 0

1

Entering edit mode

However, my end goal is to see hoy many reads across samples are from mouse and how many are from plasmodium from the total number of reads.

You are not filtering the reads but actually binning them into pools by using bbsplit.sh (by aligning them to the references provided). This allows you to find uniquely mapping reads for each genome and get then in individual files. The program will also bin reads that multi-map across species in separate (ambiguous data) files. While it is possible to get bam files directly from bbsplit.sh it is better to do the split and then do alignments on split data files with bbmap.sh to avoid issues visualizing the data with a genome viewer like IGV.

ADD REPLY • link 6.8 years ago by GenoMax 150k

score 2 · Answer 1 · 2018-06-20

2

Entering edit mode

6.8 years ago

goodez ▴ 640

I did this recently but I aligned with bowtie2. Please note I provide code but you would have to modify it slightly for your case.

To create the genome index, I edited the chromosome names in each genome fasta file. So instead of chr1, chr2, chr3... I made it hg_chr1, hg_chr2, hg_chr3. And likewise for the other genome fasta.

e.g.

zcat Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | \
sed -E "s/>(.+)( dna:.+)/>chr\1_hg38\2/" > hg38.tmp

zcat Mus_musculus.GRCm38.dna.primary_assembly.fa.gz | \
sed -E "s/>(.+)( dna:.+)/>chr\1_mm10\2/" > mm10.tmp

cat hg38.tmp | cat - mm10.tmp > hg38_mm10.fa

rm hg38.tmp
rm mm10.tmp

Then I combined both genome files with the unique chromosome IDs and used that as the index.

bowtie2-build --threads 8 -f ./hg38_mm10.fa hg_mm

After alignment, you can easily distinguish which genome each read is aligned to based on the chromosome name (hg_chr1 vs mm_chr1). I used the XS:i flag in the BAM file to only pull out uniquely mapped reads. It may sound like a lot of work but it was pretty simple to execute.

Here was the more complicated part of separating the BAM into each species with only unique reads.

for file in *.sam
do
        # filter out multimappers (reads with 'XS:i' inform us of multimapping)
        cat $file | grep -v 'XS:i' > tmp.unique.sam
        # filter for high quality reads and convert to bam (below)
        samtools view -b -@ 6 -q 35 *unique.sam > tmp.filt.unique.bam
        rm tmp.unique.sam
        # split into separate human and mouse sam files
        samtools view tmp.filt.unique.bam | grep -E $'\tchr.+hg38' > hg38.noheader.sam
        samtools view -H tmp.filt.unique.bam | grep -v 'mm10' | cat - hg38.noheader.sam | sed 's/_hg38//' | \
        samtools view -b -@ 6 > ./hg38/"${file:0:-4}".hg38.bam
        rm hg38.noheader.sam
        ##
        samtools view tmp.filt.unique.bam | grep -E $'\tchr.+mm10' > mm10.noheader.sam
        samtools view -H tmp.filt.unique.bam | grep -v 'hg38' | cat - mm10.noheader.sam | sed 's/_mm10//' | \
        samtools view -b -@ 6 > ./mm10/"${file:0:-4}".mm10.bam
        rm mm10.noheader.sam
        ##
        rm tmp.filt.unique.bam
done

ADD COMMENT • link 6.8 years ago by goodez ▴ 640

0

Entering edit mode

Most of the ideas in this post are already in the original question. Unless you add actual code or more details this would be moved to a comment instead. If you have code snippets/workflow available you can host that via a gist on GitHub and include the link here.

ADD REPLY • link 6.8 years ago by GenoMax 150k

0

Entering edit mode

It's unclear in the question how they did the alignment. If the chromosome names are unique as I'm suggesting, then it's simple to pull out reads from either organism with grep. I suggested using XS:i to only take uniquely mapped reads, which I don't see mentioned above. Move my post wherever you'd like though.

ADD REPLY • link 6.8 years ago by goodez ▴ 640

0

Entering edit mode

Thanks for adding the example code.

Give the bbsplit.sh method suggested above a try as a simpler alternative in future :-)

ADD REPLY • link 6.8 years ago by GenoMax 150k

0

Entering edit mode

Hi,

Thanks for your answer. That's an interesting approach and it'll be very helpful if you can provide explanation/code about how you did it.

Many thanks.

ADD REPLY • link 6.8 years ago by PSB • 0

0

Entering edit mode

Sure I will edit my answer to include some code

ADD REPLY • link 6.8 years ago by goodez ▴ 640

0

Entering edit mode

Hi goodez,

Thank you for the answer and the code. I do have unique chromosome names for each genome, so the file looks like

7001289F:143:CBY2FANXX:1:1107:2105:1964 99      chr11   96938063       255     1S124M  =       96938130        192     NGACCCAGGCTGAAGTGGGGGATCAACACAGGACAAGGTTGCCTCGGGAGGAGCCTTTGTTAGTGACAGAGGCTGAAGCACACAGCCACAAAGTCAAAACAAGGCAAGTGTGTCCAGGGTGATAA
#<<@AF;EGEGGGGGGG>FGGGGGEGGGGGEGGGGGGGGGFGGGFGGGGGGDGGGGGGGGGGGGEGGF>GEGGGBGGGGGE>DBGGGBGBGGG0;0F@DGGGE=@FGGGE@@DBGEGGG.@D6EG NH:i:1  HI:i:1  AS:i:247        nM:i:0

My idea us to use chr*, 255 (MAPQ) and NH:i:1 to extract the unique mapped reads of mouse, and the same but changing the chr for the appropriate name to extract the unique mapped reads of plasmodium. I was thinking to use grep or samtools. Do you think this is a good approach? To filter for both MAPQ and NH:i:x?

Thanks for your time and help.

ADD REPLY • link 6.8 years ago by PSB • 0

0

Entering edit mode

Yes that sounds good to me

ADD REPLY • link 6.8 years ago by goodez ▴ 640