Hi all,
I extracted the specific region from a BAM file generated by aligning fastq (Illumina/whole genome sequencing) on hg38 and converted to corresponding fastq reads of that region by Picard (SamTofastq) and try to realign to another (my own) reference sequence by bwa mem. But, it seems that bwa mem cannot understand the sequencing is as paired end since it frequently says “skip orientation FF as there are not enough pairs” during the mapping, it also happened for other orientations (FR, RF, and RR); however, there is 1259352 paired read. Subsequently, the properly paired percentage is very low. Could you please kindly help me out how I can solve the problem?
Edit and update: I also found that during the conversion of bam to fastq by SamTofastq (picard), it returned me: SAM validation error: ERROR: Found 13548 unpaired mates. I think, it may not be a real problem; actually, it seems that Picard just consider paired read to creat Fastq file. So, the real problem is with bwa mem, yes? Please kindly share your idea with me?
Here is the used commands:
samtools sort file.bam
sammtools index file.bam file.bam.bai
Extracted the region of interest:
samtools view -h -b file.bam chr6:28,510,120-33,480,577 > extracted_new.bam
Create fastq
java -jar SamToFastq.jar I=extracted_new.bam F=file_1.fastq F2=file_2.fastq
Alignment by BWA-MEM
bwa index my_reference.fasta
bwa mem -L 10000 -a my_reference.fasta file_1.fastq F2=file_2.fastq > output.sam
Many thanks
Show us the headers of files you extracted from original alignment. They likely have lost read origination information.
While trivial this error has come up in the past when people try to use the same file (e.g. R1/R1 instead of R1/R2) when doing the second alignment. Something to check on.
Sorry, file is on the cluster and can't transfer it. Could you please tell me where is the read orientation information in the BAM header, what I should look for? I did alignment with two fastq file R1/R2.
What does
zcat filename_R1.fq.gz | head -4
andzcat filename_R2.fq.gz | head -4
show? If your files are uncompressed then replacezcat
withcat
in the command.The read name in two fastq files generated by SamTofastq is the same except for /1 and /2. Also, the read count of two fastq file is the same. So, the problem is another place. In addition, someone says "bwa expects the first read of file one to be the mate of the first read of file 2. That assumption obviously does not hold throughout the whole file, because sorting by coordinates and throwing away reads ruins that" in this post. What do you think?
You should post the exact commands you used to extract the reads from the bam file and subsequently to map the fastq files again
You should also do your best to provide the information requested when someone asks for additional information.
I added the used commands to the original post.
Either the original alignment or your conversion from samtofastq eliminated the identifiers that show which of the reads is from R1 and which is from R2. Since you assure us that reads from R1/R2 are in sync, you can fix that issue by using following tool from BBMap suite:
add one of the two options below. Either should work.
Hi genomax.
is the output of
head -4
of two fastq files (sorry if the resolution is not enough high); as you can see read name in both fastq files is the same and are along with /1 and /2, respectively; so the read name and their identifier (/1 and /2) is correct. Is it possible the read name or the related identifiers become wrong in the whole fastq files? do you still suggest using repair.sh?Thank you
My apologies. I meant to say
reformat.sh
instead ofrepair.sh
(fixed now).BTW: We don't see your fastq examples. You can copy and paste the lines (don't post as an image) and then format using the
code
button.Your reads do have
/1
and/2
. So it is possible that there just are not enough reads forbwa
to do that estimate.Yes, the reads have /1 and /2 based on the head of two fastq files. There is 1259352 paired read, it isn't enough for bwa mem in your opinion? I also tested bowtie2, but the overall alignment rate was very bad (about 5.4%). With bwa mem, the mapping percentage is about 95%, but the properly paired is very low (about 10%). Please kindly share me any suggestions and ideas?
Are you getting that error/warning with a) original data AND b) extracted reads (~13K) for a specific region. I thought it was only the latter and thus there may not be enough data.
Why are you cherry picking data like this BTW? Is your own reference just for this smaller region? If it is for the whole genome why can't you use the full data to align?
/1
and/2
have not been used for a while. Is this old data?Hi genomax,
Thank you for following the issue. Regarding your questions, I get the error/warning with extracted reads; there is more than 1M PE reads (not 13K) in the extracted bam file as I mentioned in my previous comment. Assuming Bwa mem may predict the wrong insert size distribution, I used CollectInsertSizeMetrics (from picard) to predict the insert size distribution in the extracted bam file; however, I’m not sure how these numbers should be feed to bwa mem via
I
flag. Could you please kindly show me an example?2) yes, my own reference is related to just this small region. Actually, this region is HLA and I would like to visualize the reported alleles from HLA typing (of whole genome sequencing data) in IGV. So, I extracted this region, converted to fastq files and try to re-align to hla genomic sequences as reference. It will be great if you please share me any idea/suggestion on this issue.
Thanks
When you re-align these, are you not supposed to re-align to the HLA FASTA sequences from, e.g., IMGT?
Hi Kevin, not re-align to HLA fasta sequence! please kindly tell me what is the reference sequence to realign? I'm getting more confused
I think that I showed you (in some other thread) a published work by my colleagues at UCL (?) - in that, there is a few steps to follow.
Yes, your mean is LOHHLA I think. However, at the moment I would like to focus just HLA alleles visualization (HLA typing from whole genome sequencing was done by another person). After solving the current problem, I certainly test LOHHLA; but it requires Novalign that is not free!. So, now for just visualization of the typed HLA alleles in IGV, I realigned the extracted fastq reads to HLA fasta sequence as reference by
bwa mem
and faced with very low properly paired mapping. Considering that the reads in two generated fastq files is paired based onhead
output, which I showed in the previous comments in this post, I think thatbwa mem
may wrongly estimate the insert size distribution. That's why I calculated it by picard, but don't sure how to feed it to bwa mem viaI
option. Kevin, could you please kindly help me to solve the current issue?I did try to replicate that pipeline myself (LOHHLA), but ended up becoming too busy. I actually had a meeting with the developer (Dr. McGranahan) but cannot recall the finer details of it, at this stage. Yes, NovoAlign was a strange choice, and remains so. However, I believe that older versions of it are free?
HLA typing has been an obscure and strange area in bioinformatics these past few years. My first attempt at it was in 2015 on Roche 454 data and I did everything manually - it was excruciating.
Can you take a look at the suggestion by Wouter, perhaps? - see below.
Thanks. As I found there are several programs for HLA typing, so no need work manually these days, but I prefer to work manually on one sample for better learning; please kindly share me any well-documented guide/tutorial if you know. The tool suggested by Wouter is another HLA typing program; as I told you, at the moment I have to focus just on the HLA alleles visualization.
Have you seen tools like https://github.com/DiltheyLab/HLA-LA ?