How to extract only the alignments to the chloroplast reference genome of a sam file
1
0
Entering edit mode
5.6 years ago

Hi! I’m pretty new to bioinformatics in linux :D

My lab received paired-end reads (151 bp) of WGS of purple corn and we mapped them (using bowtie2) against the maize B73v4 reference genome (against all the 10 chromosomes, mitochondria and chloroplast genomes). They got a SAM file (zm.sam) of 371,9 GB.

We used the following commands:

#Maize B73v4 reference genome: GCF_000005005.2_B73_RefGen_v4_genomic.fna

#To index the reference genome:
bowtie2-build GCF_000005005.2_B73_RefGen_v4_genomic.fna GCF_000005005.2_B73_RefGen_v4_genomic.fna

#To align the reads to the reference genome:
bowtie2 -p 10 -I 0 -X 600 --fr --very-fast-local -x ../ref/GCF_000005005.2_B73_RefGen_v4_genomic.fna -1 ZM1_R1.fastq.gz -2 ZM1_R2.fastq.gz -S zm.sam

My task is to assembly the chloroplast genome of purple maize, so I only want the alignments of the reads that mapped/aligned with the chloroplast reference genome (Name: Pltd, NC_001666.2).

Please, can someone guide me in how to extract only the alignments to the chloroplast reference genome (Name: Pltd, NC_001666.2) from that SAM file?

(I will map the reads against the chloroplast reference genome alone later, but my advisor wants me to do this first, to extract the alignments with the chloroplast reference genome from this big sam file)

I have been told to use grep, but I have looked at other posts and I don’t know if that is the best way to deal with this.

Thank you in advance for your help :)

genome alignment next-gen Assembly • 2.6k views
ADD COMMENT
0
Entering edit mode

You could use grep... but use samtools. (After compressing the sam to bam)

ADD REPLY
0
Entering edit mode

Yes! I'm going to do that! thank you :)

ADD REPLY
4
Entering edit mode
5.6 years ago
ATpoint 85k

Hi macielrodriguez2,

a couple of things:

First, do not store SAM files. They are uncompressed and only take up space. Use BAM files via:

bowtie2 (commands...) | samtools sort -o zm_sorted.bam

Second, do not align against single chromosomes. Always align against the full genome as using single chromosomes can lead to false-positive alignments. The aligner always try to find the best match and if the actual chromosome that a read belongs to is not present in the reference it will be mapped to the next best (=false) region that is included in the reference. For quick extraction of a chromsome, index the bam file and then use samtools view:

samtools index zm.bam
samtools view -o zm_chloroplast.bam zm.bam NC_001666.2
ADD COMMENT
0
Entering edit mode

Oh! So, it is not convenient to map my WGS paired-end reads only against the chloroplast reference genome???? I wanted to do that because I read papers in which the authors did just that (map all WGS reads against a chloroplast reference genome :O )

Thank you so much ATpoint! I'll try that right away :D

ADD REPLY
0
Entering edit mode

Dear macielrodriguez2, how do you map your FASTQ against a chloroplast reference genome? I have a chloroplast genome from another species (but very close with my study species) maybe I could use this genome. What do you think? Best

Mariana

ADD REPLY
0
Entering edit mode

Dear ATpoint, Can I find chloroplast SNPs from my FASTQ if I do not have a reference genome? I´m working on a species with no reference genome, I used UNEAK to find SNPs but now I would like to find only chloroplast SNPs and I do not know how to do it, could you help me please? Thanks! Mariana

ADD REPLY
0
Entering edit mode

Please open a new question after using google and the search function towards your issue :)

ADD REPLY

Login before adding your answer.

Traffic: 2106 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6