Question

Realign BAM files to other reference file

0

Entering edit mode

6.4 years ago

marongiu.luigi ▴ 730

Dear all,

I completed an alignment of two groups of fastq files to a reference. To re-align the bam files to another reference I did the following:

# align
bwa mem -R <read_group1> <ref.fa> <file1_1.fq> <file1_2.fq> -o <file1_aln.sam>
bwa mem -R <read_group2> <ref.fa> <file2_1.fq> <file2_2.fq> -o <file2_aln.sam>
# convert
samtools view -Sb <file1_aln.sam>  > <file1_aln.bam>
samtools view -Sb <file2_aln.sam>  > <file2_aln.bam>
# sort
samtools sort <file1_aln.bam> -o <file1_alnSRT.bam>
samtools sort <file2_aln.bam> -o <file2_alnSRT.bam>
# merge
samtools merge <file_alnSRT.bam> <file1_aln.bam> <file2_aln.bam>
# select mapped
samtools view -h -F 4 <file_alnSRT.bam> -o <file_alnMAP.bam>
# convert to fastq
samtools fastq -1 <filemap_1.fq.gz> -2 <filemap_1.fq.gz> <file_alnMAP.bam>
# re-align
bwa mem -R <read_group> <ref2.fa> <filemap_1.fq.gz> <filemap_1.fq.gz> | samtools sort -o <file_realign.bam>

but I got this error at the last step:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 831790 sequences (80000139 bp)...
[M::process] read 872512 sequences (80000021 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1981, 3107, 95, 1551)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (7, 14, 27)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 67)
[M::mem_pestat] mean and std.dev: (16.00, 12.21)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 87)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (42, 47, 60)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (6, 96)
[M::mem_pestat] mean and std.dev: (52.98, 15.79)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 116)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (5897, 5897, 5897)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (5897, 5897)
[M::mem_pestat] mean and std.dev: (5897.00, 0.00)
[M::mem_pestat] low and high boundaries for proper pairs: (5897, 5897)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (6, 17, 31)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 81)
[M::mem_pestat] mean and std.dev: (19.94, 15.35)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 106)
[M::mem_pestat] skip orientation RF
[mem_sam_pe] paired reads have different names: "HISEQ:90:C3UNJACXX:1:1107:16388:61141", "HISEQ:90:C3UNJACXX:3:1303:2008:58095"
[mem_sam_pe] [mem_sam_pe] paired reads have different names: "HWI-ST1437:64:C3UM1ACXX:4:1214:3833:62731", "HWI-ST1437:64:C3UM1ACXX:4:1202:18519:45906"

[mem_sam_pe] paired reads have different names: "HISEQ:90:C3UNJACXX:1:2111:17172:73588", "DHGDKXP1:247:C3MF6ACXX:2:1304:11299:10212"
[mem_sam_pe] @HD    VN:1.3  SO:coordinate
@SQ SN:V    LN:348678459
@PG ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa mem -t 8 [...]

Apparently, the two read groups have given the problem.

How can I solve it?

Thank you

bam samtools • 11k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 6.4 years ago by marongiu.luigi ▴ 730

1

Entering edit mode

While I've never used samtools fastq before, my impression is that it will simply output reads in the same order as they are found in the bam file. The error suggests you need to sort your bam file by name before converting it to fastq.

ADD REPLY • link 6.4 years ago by Eric Lim ★ 2.2k

0

Entering edit mode

are you using a latest version?

ADD REPLY • link 6.4 years ago by naive_user ▴ 80

0

Entering edit mode

i am using Version: 1.7

ADD REPLY • link 6.4 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Newer versions of samtools sort will sort sam files, so you can skip the samtools view step.

Also, why are you filtering away unmapped reads if you are aligning to a different reference? How can you be sure those reads won't align to your new reference?

Your alignment to the new reference is obliterating read groups from the old files. If you don't care about read groups, you could just cat the initial files together, so you don't have to run everything twice and then merge.

ADD REPLY • link 6.4 years ago by swbarnes2 14k

0

Entering edit mode

dear drkennetz, see comment below. the goal i would like to achieve is filtering stuff out...

ADD REPLY • link 6.4 years ago by marongiu.luigi ▴ 730

1

Entering edit mode

6.4 years ago

drkennetz ▴ 560

You don't want to realign aligned.bams, just start over from the file_1.fq and file_2.fq and repeat the same alignment process.The original fastqs wouldn't have been modified in any of these steps so just start from the beginning with the second reference.

ADD COMMENT • link 6.4 years ago by drkennetz ▴ 560

0

Entering edit mode

but what I want to do is to get what aligns to the first reference and from that filtering out what aligns to the second reference. I reckon the problem is with the reading group. If I have one sample that is run on two different occasions (hence the two reading groups) can I simply merge them? or at least use the same reading group? The reading group is required for the downstream analysis, for instance with GATK.

ADD REPLY • link 6.4 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

You can merge the reads if they are from the same sample and if they are the same prep (IE WES, WGS, RNASeq). Merge them before alignment, so you don't have to do it after and re-sort them.

For your specific analysis, we will call it a metagenomic comparison because you are trying to see if reads are aligning to two genomes.

1) Create a new reference genome by just merging the two reference genomes of interest, and then using bwa index:

$ bwa index merged_genome.fasta

2) align multimapped reads to new merged_genome.fasta (the merged.fastqs are just the merged fastqs of the same sample from two sequencing runs):

bwa mem merged_R1.fastq merged_R2.fastq merged_genome.fasta > merged_aln_pe.sam

3) convert to bam:

samtools view -h -b -S merged_aln_pe.sam > merged_aln_pe.bam

4) If you want to extract reads only aligned to the merged_genome.fasta (if not just skip this step):

samtools view -b -F 4 merged_aln_pe.bam > merged_aln_pe.mapped.bam

5) extract uniquely mapped reads (reads that only mapped to one place, so if a read mapped to both genomes it will get tossed out):

samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > merged_reads.uniqueMap.bam

The XT:A:U flag is a bwa flag meaning the read was unambiguously uniquely mapped (mapped to only one place).

I hope this gives you what you are looking for!

ADD REPLY • link 6.4 years ago by drkennetz ▴ 560

0

Entering edit mode

Thank you, but in this way, the computational demands increase (the reference file is bigger than the original one); the filtering step I am seeking is actually carried out in some publications to filter out, for instance, the human genome and speed up the analysis...

ADD REPLY • link 6.4 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

The read group is not the problem. The problem is that bwa expects the first read of file one to be the mate of the first read of file 2. That assumption obviously does not hold throughout the whole file, because sorting by coordinates and throwing away reads ruins that.

ADD REPLY • link 6.4 years ago by swbarnes2 14k

0

Entering edit mode

Hi Swbarnes2,

So, your mean is to sort based on name instead of coordinate for bwa? I also have an issue with bwa and identifying paired read by it. Could you please take a look at my issue and let me know your idea?

ADD REPLY • link 5.5 years ago by seta ★ 1.9k

0

Entering edit mode

No, I realigned by coordinates as follows:

# keep only pairs not aligned
samtools view -u -f12 -F256 <AlnSrt.bam> > <R1uR2u.bam> 
# sort
samtools sort -n -O BAM <R1uR2u.bam> -o <UmpSrt.bam>
# convert
bamToFastq -i <UmpSrt.bam> -fq <Ump_1.fq> -fq2 <Ump_2.fq>

ADDENDUM Anyway, I eventually used the following approach:

(1) map to composite genome primary + secondary

(2) deduplicate sambamba markdup -r -t 8 --overflow-list-size 1000000 --hash-table-size 1000000 --tmpdir=SCRATCH_DIR <AlnSrt.bam> <AlnSrtDed.bam> (3) select reads that are read pair, read pair mapped, first in pair (-f 67) and that are not second in pair, secondary alignment, optical or PCR duplicate (-F 1408) on the chromosome of interest samtools view -h -q 10 -f67 -F1408 <AlnSrtDed.bam> "chr..." and work on the reads directly.

ADD REPLY • link 5.5 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Thanks for the feedback. I think my problem is related to bwa mem, the previous steps sounds well work and I got correct PE reads during converting the extracted bam file to fastq reads. however, mapping these fastq reads with my own reference by bwa mem is not good. Please let me know if you have any suggestion

ADD REPLY • link 5.5 years ago by seta ★ 1.9k

0

Entering edit mode

Hi, what kind of 'not good'? I also found that recreating the fastq generates issues so I chose the second approach, merge the genome of reference in the other one and take it from there.

ADD REPLY • link 5.5 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

My mean is: after mapping the generated fastq reads to my own reference with bwa mem, the properly paired percentage is very low. I explained the issue in this post. Could you please take a look at it and let me know your idea.

Thanks

ADD REPLY • link 5.5 years ago by seta ★ 1.9k

score 4 · Accepted Answer · 2018-07-16

4

Entering edit mode

6.4 years ago

marongiu.luigi ▴ 730

Looks like I solved it by using PICARD: extracting the aligned reads but without sorting, I used

java -jar picard.jar SamToFastq I=<file_alnMAP.bam> FASTQ=<filemap_1.fq> SECOND_END_FASTQ=<filemap_2.fq>
bwa mem -R <read_group> <ref2.fa> <filemap_1.fq> <filemap_1.fq>

From samtools flagstat the output looks fine. I'd say case closed.

ADD COMMENT • link 6.4 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

okay. you have converted them (bam) back to fastqs and then realigned. Thanks.

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

How long does picard.jar SamToFastq run? samtools fastq?

ADD REPLY • link 6.4 years ago by Eric Lim ★ 2.2k

0

Entering edit mode

I haven't count it, but it was for a few minutes. I can update next time I'll run it...

ADD REPLY • link 6.4 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Why is this superior to just using the original fastqs?

ADD REPLY • link 6.4 years ago by swbarnes2 14k

0

Entering edit mode

The fact that I have removed hundreds of thousands of reads that I did not want, which were a noise that increased the computational demand. Now, instead of hours, I can re-align in minutes.

ADD REPLY • link 6.4 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

You are saying that you filtered away huge numbers of reads with -F 4 without disturbing the read 1 and read 2 pairing? -F 12 would probably work, I'm really surprised that Picard took all those orphan reads in stride.

ADD REPLY • link 6.4 years ago by swbarnes2 14k

0

Entering edit mode

How would I check whether there is a difference between the files produced by -F4 and -F12? Specifically, how to check that the mates are still in sync?

ADD REPLY • link 6.3 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

I think the problem originally was due to the wrong flag: the mapped reads are selected with -f, not -F. When I re-run the thing with -f, the filtered file could be easily converted in fastq with Picard:

java -jar picard.jar SamToFastq I=<file_sorted.bam> FASTQ=file_1.fq SECOND_END_FASTQ=file_2.fq

even if the original bam file was sorted. There is truly no need to keep intermediate unsorted bam.

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

No, samtools -f 4 gets unmapped reads, samtools -F 4 gets mapped ones.

-f INT only include reads with all of the FLAGs in INT present [0] -F INT only include reads with none of the FLAGS in INT present [0]

ADD REPLY • link 6.2 years ago by swbarnes2 14k

0

Entering edit mode

This was a bit confusing. I went back to the BIOSTAR's manual and I understand this:

f – selects reads that match the flag’s number
F – filters the reads that match the flag’s number

So I got not the f/F flag wrong, but actually the value: 4 does not mean mapped but UNMAPPED. So with -F 4 I am kind of doing a double negation. My bad.

ADD REPLY • link 6.2 years ago by marongiu.luigi ▴ 730