Question

Skip orientation as there are not enough pairs by bwa mem

0

Entering edit mode

5.8 years ago

seta ★ 1.9k

Hi all,

I extracted the specific region from a BAM file generated by aligning fastq (Illumina/whole genome sequencing) on hg38 and converted to corresponding fastq reads of that region by Picard (SamTofastq) and try to realign to another (my own) reference sequence by bwa mem. But, it seems that bwa mem cannot understand the sequencing is as paired end since it frequently says “skip orientation FF as there are not enough pairs” during the mapping, it also happened for other orientations (FR, RF, and RR); however, there is 1259352 paired read. Subsequently, the properly paired percentage is very low. Could you please kindly help me out how I can solve the problem?

Edit and update: I also found that during the conversion of bam to fastq by SamTofastq (picard), it returned me: SAM validation error: ERROR: Found 13548 unpaired mates. I think, it may not be a real problem; actually, it seems that Picard just consider paired read to creat Fastq file. So, the real problem is with bwa mem, yes? Please kindly share your idea with me?

Here is the used commands:

samtools sort file.bam

sammtools index file.bam file.bam.bai

Extracted the region of interest:
samtools view -h -b file.bam  chr6:28,510,120-33,480,577 > extracted_new.bam

Create fastq
java -jar SamToFastq.jar I=extracted_new.bam F=file_1.fastq F2=file_2.fastq   


Alignment by BWA-MEM 

bwa index my_reference.fasta
bwa mem -L 10000 -a my_reference.fasta file_1.fastq F2=file_2.fastq > output.sam

Many thanks

whole genome bwa paired end sequencing • 11k views

ADD COMMENT • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

Show us the headers of files you extracted from original alignment. They likely have lost read origination information.

While trivial this error has come up in the past when people try to use the same file (e.g. R1/R1 instead of R1/R2) when doing the second alignment. Something to check on.

ADD REPLY • link 5.8 years ago by GenoMax 150k

0

Entering edit mode

Sorry, file is on the cluster and can't transfer it. Could you please tell me where is the read orientation information in the BAM header, what I should look for? I did alignment with two fastq file R1/R2.

ADD REPLY • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

What does zcat filename_R1.fq.gz | head -4 and zcat filename_R2.fq.gz | head -4 show? If your files are uncompressed then replace zcat with cat in the command.

ADD REPLY • link 5.8 years ago by GenoMax 150k

0

Entering edit mode

The read name in two fastq files generated by SamTofastq is the same except for /1 and /2. Also, the read count of two fastq file is the same. So, the problem is another place. In addition, someone says "bwa expects the first read of file one to be the mate of the first read of file 2. That assumption obviously does not hold throughout the whole file, because sorting by coordinates and throwing away reads ruins that" in this post. What do you think?

ADD REPLY • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

You should post the exact commands you used to extract the reads from the bam file and subsequently to map the fastq files again

You should also do your best to provide the information requested when someone asks for additional information.

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

I added the used commands to the original post.

ADD REPLY • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

Either the original alignment or your conversion from samtofastq eliminated the identifiers that show which of the reads is from R1 and which is from R2. Since you assure us that reads from R1/R2 are in sync, you can fix that issue by using following tool from BBMap suite:

reformat.sh in1=R1.fq.gz in2=R2.fq.gz out1=fixed.R1.fq.gz out2=fixed.R2.fq.gz

add one of the two options below. Either should work.

addslash=f              Append ' /1' and ' /2' to read names, if not already present. 
addcolon=f              Append ' 1:' and ' 2:' to read names, if not already present.

ADD REPLY • link 5.8 years ago by GenoMax 150k

0

Entering edit mode

Hi genomax.

Here

is the output of head -4 of two fastq files (sorry if the resolution is not enough high); as you can see read name in both fastq files is the same and are along with /1 and /2, respectively; so the read name and their identifier (/1 and /2) is correct. Is it possible the read name or the related identifiers become wrong in the whole fastq files? do you still suggest using repair.sh?

Thank you

ADD REPLY • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

My apologies. I meant to say reformat.sh instead of repair.sh (fixed now).

BTW: We don't see your fastq examples. You can copy and paste the lines (don't post as an image) and then format using the code button.

code_formatting

Your reads do have /1 and /2. So it is possible that there just are not enough reads for bwa to do that estimate.

ADD REPLY • link 5.8 years ago by GenoMax 150k

0

Entering edit mode

Yes, the reads have /1 and /2 based on the head of two fastq files. There is 1259352 paired read, it isn't enough for bwa mem in your opinion? I also tested bowtie2, but the overall alignment rate was very bad (about 5.4%). With bwa mem, the mapping percentage is about 95%, but the properly paired is very low (about 10%). Please kindly share me any suggestions and ideas?

ADD REPLY • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

Are you getting that error/warning with a) original data AND b) extracted reads (~13K) for a specific region. I thought it was only the latter and thus there may not be enough data.

Why are you cherry picking data like this BTW? Is your own reference just for this smaller region? If it is for the whole genome why can't you use the full data to align?

/1 and /2 have not been used for a while. Is this old data?

ADD REPLY • link 5.8 years ago by GenoMax 150k

0

Entering edit mode

Hi genomax,

Thank you for following the issue. Regarding your questions, I get the error/warning with extracted reads; there is more than 1M PE reads (not 13K) in the extracted bam file as I mentioned in my previous comment. Assuming Bwa mem may predict the wrong insert size distribution, I used CollectInsertSizeMetrics (from picard) to predict the insert size distribution in the extracted bam file; however, I’m not sure how these numbers should be feed to bwa mem via I flag. Could you please kindly show me an example?

2) yes, my own reference is related to just this small region. Actually, this region is HLA and I would like to visualize the reported alleles from HLA typing (of whole genome sequencing data) in IGV. So, I extracted this region, converted to fastq files and try to re-align to hla genomic sequences as reference. It will be great if you please share me any idea/suggestion on this issue.

Thanks

ADD REPLY • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

When you re-align these, are you not supposed to re-align to the HLA FASTA sequences from, e.g., IMGT?

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, not re-align to HLA fasta sequence! please kindly tell me what is the reference sequence to realign? I'm getting more confused

ADD REPLY • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

I think that I showed you (in some other thread) a published work by my colleagues at UCL (?) - in that, there is a few steps to follow.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Yes, your mean is LOHHLA I think. However, at the moment I would like to focus just HLA alleles visualization (HLA typing from whole genome sequencing was done by another person). After solving the current problem, I certainly test LOHHLA; but it requires Novalign that is not free!. So, now for just visualization of the typed HLA alleles in IGV, I realigned the extracted fastq reads to HLA fasta sequence as reference by bwa mem and faced with very low properly paired mapping. Considering that the reads in two generated fastq files is paired based on head output, which I showed in the previous comments in this post, I think that bwa mem may wrongly estimate the insert size distribution. That's why I calculated it by picard, but don't sure how to feed it to bwa mem via I option. Kevin, could you please kindly help me to solve the current issue?

ADD REPLY • link 5.8 years ago by seta ★ 1.9k

0

Entering edit mode

I did try to replicate that pipeline myself (LOHHLA), but ended up becoming too busy. I actually had a meeting with the developer (Dr. McGranahan) but cannot recall the finer details of it, at this stage. Yes, NovoAlign was a strange choice, and remains so. However, I believe that older versions of it are free?

HLA typing has been an obscure and strange area in bioinformatics these past few years. My first attempt at it was in 2015 on Roche 454 data and I did everything manually - it was excruciating.

Can you take a look at the suggestion by Wouter, perhaps? - see below.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks. As I found there are several programs for HLA typing, so no need work manually these days, but I prefer to work manually on one sample for better learning; please kindly share me any well-documented guide/tutorial if you know. The tool suggested by Wouter is another HLA typing program; as I told you, at the moment I have to focus just on the HLA alleles visualization.