BWA-MEM using long PacBio reads
2
0
Entering edit mode
7.8 years ago
Rox ★ 1.4k

Hello again everyone,

I'm struggling with the outputs of the tool BWA-MEM on my assembly alignement.

I have two input files : - pb_268_p_ctg.fasta : an assembly of my genome, that I want to be polish, that's why I need an alignement - filtered-reads.fastq : my raw reads from PacBio, a almost 90 Gb file

I tried to align my raw reads on my assembly using BWA-MEM in the following way :

./bwa index /path/pb_268_p_ctg.fasta
./bwa mem -x pacbio -t 20 /path/pb_268_p_ctg.fasta /path/filtered_subreads.fastq > /path/falcon_aligned_reads.sam

But the results are very strange. After indexing the sam file, IGV show a very strange message :

File: /media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam does not contain any sequence names which match the current genome.  File:      null, Genome: 000000F, 000001F, 000002F, 000003F, ...

The sequences name seems to be the same within the assembly file and the sam header file...

Did I missed something ?

Cheers,

Roxane

PacBio alignement bwa-mem sam • 12k views
ADD COMMENT
0
Entering edit mode

Can you post a few sequence headers from pb_268_p_ctg.fasta? Are there spaces in fasta headers?

ADD REPLY
0
Entering edit mode

Ah ! And here some sequence headers from pb_268_p_ctg.fasta :

Loutre:~$ head -1 '/media/loutre/SUZUKII/assembly_tries/pb_268_p_ctg.fasta' 
>000000F 002170011:B~002650282:E~001494317:E~001208944:E ctg_linear 20318800 109944874
ADD REPLY
0
Entering edit mode

Here is an example of the lines contained within the sam file pro

Loutre:~$ head '/media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam' 
@SQ SN:000000F  LN:20279970
@SQ SN:000001F  LN:15392955
@SQ SN:000002F  LN:12197397
@SQ SN:000003F  LN:8269814
@SQ SN:000004F  LN:7009319
@SQ SN:000005F  LN:6740481
@SQ SN:000006F  LN:5550981
@SQ SN:000007F  LN:4663438
@SQ SN:000008F  LN:4675199
@SQ SN:000009F  LN:4040179
ADD REPLY
0
Entering edit mode

Which genome do you load for igv? Are the chromosome identifiers in that genome the same as those you used for the alignment using bwa mem?

ADD REPLY
0
Entering edit mode

Within igv, I load pb_268_p_ctg.fasta, so yes the identifiers are supposed to be the same, but I think that something went wrong with identifiers

ADD REPLY
0
Entering edit mode

You should either truncate the fasta headers after the first space (as long as they are unique) or convert the spaces to an "_" (I would go with the former). That should fix the IGV issues. You may need to realign the data.

ADD REPLY
0
Entering edit mode

Okay, so within the assembly file I change spaces with "_" or troncate them, and then I realign using, right ? Let's try this, I'll tell you. Thanks for your quick answers!

ADD REPLY
0
Entering edit mode

Go for the truncation (as long as ID's are unique). Those "chromosome" names would get ugly if you change them with an "_".

ADD REPLY
0
Entering edit mode

Sorry for the late reply, it was quite long to re run. Well, sorry as I wasn't totally sure that the ID's were uniq, I opt for the ugly version with the "_". But in fact, it didn't solved my issue, the problem remains the same.

head -2 '/media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam' 
@SQ SN:000000F_002170011:B~002650282:E~001494317:E~001208944:E_ctg_linear_20318800_109944874    LN:20279970
@SQ SN:000001F_000757637:B~002089426:B~001707116:B~002616577:E_ctg_linear_15392828_83151272 LN:15392955

head -1 '/media/loutre/SUZUKII/assembly_tries/pb_268_p_ctg_s.fasta' 
>000000F_002170011:B~002650282:E~001494317:E~001208944:E_ctg_linear_20318800_109944874

The message from IGV remains the same

ADD REPLY
0
Entering edit mode

I actually suspect my raw reads file... Because when I try to type head -1 my_raw_read_file.fastq, everything start is shown on my screen...

An example of strange line :

,&**..+-/.+*+".%*"."-&-,+*-..$/,%+&-'#,//,-)%%%+.-++,*.,*+'.--./,-",*/%/+#..$/&/-/.*+-/+-&,.,-+--/+*(,.+,++)+-+#*'%.,&#-'-#-+*.%".&-.*%$*.,*(/,,-.,/*#-&-+%"'+*.,-'.+./"-.-+-)./**/.-)./*,//--//#-//$&$+%./+#+,.)*"$)..,"/-.,//)+,+&),,*-../)'('.&*.$.+"..+,&'(.',&#.,//.-',)/+,/,-%(./)-/(-$.%,..(*-$%*%%)+"#,.,--++*,#.)%+,./*-(+,"-/,,,&.,+.".+)*$"&,)+-(',#.&.(-/.+')/.+"/&.*(+$//)-,---/.-)+.,--,.++$./,(..-,/'.*/-."-.."//.,../.')/,%/''-//.".../-)/-//,.&%.,".-&*-,&)($+,*--+%.+.-.)./.(/,*/-'.'&..(&,),//.*.&-)/./,)&@m160112_141523_42237_c100966892550000001823200405121626_s1_p0/398/13732_27193TTTGACATAGGCTATCGGAAGGCTTCTGGGTAATATATCTTAGCACAGAACTCCTAACCTAATTCGATAGTGCTGCGTAATTGAGTAGATTGTCATATATCTGTTATGTGATCTTTCGCCTCCGTGGCACTCAAGCTTTGTGCGCCTATGAGGCATGCATTTTTTGCATGGTTTAATAGCCTTTAACCGCGCATATAAGGCTCCAAACGATTAAACGTCTTCAGGGATCCCACTTAAGATGTTAGGAATACAACTGAACCTGATTCTGAAATCAACTTATATTTTTCGCAGCGATAAAGGTATTTCTTTTATATAATACCATTTTTAATGTAAGTTAAGGAATTAAGTCGATTTTTATGATACCATTTAATATTGGTAAAAATGCGATTTGACAAAAAAAGCAACAGAAATCTGTTTGCTTAACGAAATAGTTATTAAAAAACAAATGTCAAA

And so, everything is on one line... Myabe the problem is here. Because I was first tickled when bwa mem said that he read only 1 read...

I'm going to try to fix this file now.

ADD REPLY
0
Entering edit mode

Actually, I have absolutely no idea of how to seperate the different lines...

Should I use awk ? I like also perl scripts even if it's a bit old school...

ADD REPLY
0
Entering edit mode

How did that happen? If the file was corrupted somewhere in the process you would be working with tarnished data.

Does the true "original" also look like this?

ADD REPLY
0
Entering edit mode

Sadly, this is the original file...

A collaborator from Munich gave it to me, I downloaded it from a server, and I never modified it.

I even used this file to perform an assembly using canu, and Canu worked fine with it... This is very strange.

Should I rerun hdf5 tools on it ? Or should I manually clean the file ?

ADD REPLY
1
Entering edit mode

What do you feel about this? What any of us think may be less important since you would be the one presenting/defending this data/analysis, which seems to have some flaw in the raw data.

ADD REPLY
0
Entering edit mode

Well, I'm still a student, that's why I'm not always sure about my opinions, and ask a lot of questions. But you guys were right, I don't know how and when, but it seems that the file was corrupted. The last modification date is the 10 January... I didn't even used this file at this moment. I really don't know how it could happen. Thankfully, I can download it again from a distant server. It will take time, but maybe one day I will be able to successfully polish my data ! Thanks again for all the answers and the support !

ADD REPLY
0
Entering edit mode

I wouldn't expect that Canu would work with a corrupted file like that and assume that the corruption happened later...

ADD REPLY
0
Entering edit mode
7.8 years ago

BWA is OK to use with CCS reads (preferably with at least 5 or 10 cycles), but that .fastq file seems quite large (and, unfortunately, unless you get a more raw data format, you can't use PacBio functions to define CCS reads).

You could also use BLASR with a .fasta file, but the recommendation is to use an unaligned .bam file for aligning subreads with BLASR (which includes additional information about each read). Also, some downstream tools designed for Illumina data might have issues with some differences in the PacBio .bam file format used by BLASR.

There could also be issues with the chromosome name formatting, but it looks like you've got a lot of good comments addressing that issue.

ADD COMMENT
0
Entering edit mode

Hi Charles ! Thanks for your answer. Indeed, I already know Blasr, but I have a lot of trouble with it, the last version is not working anymore, because blasr do not support --bam or --sam option anymore.. That's why I was looking for an alternative ! Bwa mem seems to be the solution

ADD REPLY
0
Entering edit mode

Last time I checked, the acceptable output formats depended upon the input format - for example, I remember the change to not allow direct .sam file creation if the input is a .bam file. However, I have a version of BLASR installed in this Docker image that will at least work with the following command:

blasr [unaligned.bam] [ref.fa] --bam --bestn 1 --nproc [2] --out [aligned.bam]
ADD REPLY
0
Entering edit mode

Well, me and some other people are encountering the same problem with blasr. Just look at this issue in github : https://github.com/PacificBiosciences/blasr/issues/324 It indeed seems that it accept only "valid" (whatever it means...) input files. Even if my files come directly from PacBio tools, I'm still struggling with blasr... Well, I think I'll just do it with bwa-mem then ! At least, it works

ADD REPLY
0
Entering edit mode
7.8 years ago

BBMap works well with raw PacBio data, depending on your needs. It does break reads longer than 6kbp into 6kbp segments, but it's capable of handling raw PacBio's high error rates with extremely high accuracy.

Usage:

mapPacBio.sh in=reads.fq out=mapped.sam ref=ref.fa maxlen=6000
ADD COMMENT
1
Entering edit mode

BBMap was covered in an earlier thread that Roxane Boyer had started : C: Alternative to BLASR ?

ADD REPLY
0
Entering edit mode

Ah! Thanks.

Roxane Boyer, did you encounter a problem with it?

ADD REPLY
0
Entering edit mode

I think she thought only first 6kb would be mapped and decided to do bwa mem first.

ADD REPLY
0
Entering edit mode

Ah, that's indeed what I first thought. Maybe I'll try it later to compare my results with BWA-MEM !

ADD REPLY
0
Entering edit mode

Hello all,

I know that this is an old thread but I have a large amount of raw PacBio reads that I wish to map to an existing mitochondrial sequence (only 90kbp) to extract unmapped reads to use in assembly.

Can I use the above script with bbmap? I can't find this usage in the manual. It seems to take only one input (reads.fq)?

many thanks for your help.

ADD REPLY
0
Entering edit mode

You should be able to. Specify outu= to capture reads that do not map to that file.

ADD REPLY
0
Entering edit mode

Thanks, I saw this in the manual. What I am unclear about is the inputs. I will try this to see how it runs:

bbmap.sh ref=AP_MT.fa
bbmap/mapPacBio.sh in=/project/Assembly/*.fasta outu=unmapped.fasta outm=mapped.fasta ref=AP_MT.fa maxlen=6000
ADD REPLY
0
Entering edit mode

The above script does not work. For mutiple fasta inputs you need to list them all. *.fasta is not recognised. The following seems to work.

bbmap.sh ref=AP_MT.fa
mapPacBio.sh in=reads1.fasta reads2.fasta reads3.fasta outu=unmapped.fasta outm=mapped.fasta maxlen=6000
ADD REPLY
0
Entering edit mode

another way to list all the fasta files is:

mapPacBio.sh in=`echo ./foo/*.fasta` outu=unmapped.fasta outm=mapped.fasta maxlen=6000
ADD REPLY
0
Entering edit mode

Just wanted to add that bbmap only seems able to take one in=file. My above trials at listing files does not work. Perhaps you could concatenate but my fasta files are huge so not going to risk. Will run them individually and output multiple unmapped and mapped .fasta files.

ADD REPLY
0
Entering edit mode

Good to know that mapPacBio takes in fasta files. Do you not have fastq format files for this data? It would be useful to have the Q-scores when BBMap does the alignment.

ADD REPLY
0
Entering edit mode

Thanks very much for your comment, genomax. I initially used dextract to output fasta, quiver and arrow files from bam and hdf5. https://dazzlerblog.wordpress.com/tag/arrow/ Quiver files are fastq-like but not exactly fastq format - though a simple conversion is probably possible. I intend to polish the assembly using arrow files once complete so was not too concerned about using fasta inputs at this stage. Perhaps I should though?

ADD REPLY
0
Entering edit mode

You could use bam2fastx utility from PacBio to get the fastq format files.

ADD REPLY
0
Entering edit mode

thanks, appreciated!

ADD REPLY

Login before adding your answer.

Traffic: 2015 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6