Hello again everyone,
I'm struggling with the outputs of the tool BWA-MEM on my assembly alignement.
I have two input files : - pb_268_p_ctg.fasta : an assembly of my genome, that I want to be polish, that's why I need an alignement - filtered-reads.fastq : my raw reads from PacBio, a almost 90 Gb file
I tried to align my raw reads on my assembly using BWA-MEM in the following way :
./bwa index /path/pb_268_p_ctg.fasta
./bwa mem -x pacbio -t 20 /path/pb_268_p_ctg.fasta /path/filtered_subreads.fastq > /path/falcon_aligned_reads.sam
But the results are very strange. After indexing the sam file, IGV show a very strange message :
File: /media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam does not contain any sequence names which match the current genome. File: null, Genome: 000000F, 000001F, 000002F, 000003F, ...
The sequences name seems to be the same within the assembly file and the sam header file...
Did I missed something ?
Cheers,
Roxane
Can you post a few sequence headers from
pb_268_p_ctg.fasta
? Are there spaces in fasta headers?Ah ! And here some sequence headers from pb_268_p_ctg.fasta :
Here is an example of the lines contained within the sam file pro
Which genome do you load for igv? Are the chromosome identifiers in that genome the same as those you used for the alignment using bwa mem?
Within igv, I load pb_268_p_ctg.fasta, so yes the identifiers are supposed to be the same, but I think that something went wrong with identifiers
You should either truncate the fasta headers after the first space (as long as they are unique) or convert the spaces to an "_" (I would go with the former). That should fix the IGV issues. You may need to realign the data.
Okay, so within the assembly file I change spaces with "_" or troncate them, and then I realign using, right ? Let's try this, I'll tell you. Thanks for your quick answers!
Go for the truncation (as long as ID's are unique). Those "chromosome" names would get ugly if you change them with an "_".
Sorry for the late reply, it was quite long to re run. Well, sorry as I wasn't totally sure that the ID's were uniq, I opt for the ugly version with the "_". But in fact, it didn't solved my issue, the problem remains the same.
The message from IGV remains the same
I actually suspect my raw reads file... Because when I try to type head -1 my_raw_read_file.fastq, everything start is shown on my screen...
An example of strange line :
And so, everything is on one line... Myabe the problem is here. Because I was first tickled when bwa mem said that he read only 1 read...
I'm going to try to fix this file now.
Actually, I have absolutely no idea of how to seperate the different lines...
Should I use awk ? I like also perl scripts even if it's a bit old school...
How did that happen? If the file was corrupted somewhere in the process you would be working with tarnished data.
Does the true "original" also look like this?
Sadly, this is the original file...
A collaborator from Munich gave it to me, I downloaded it from a server, and I never modified it.
I even used this file to perform an assembly using canu, and Canu worked fine with it... This is very strange.
Should I rerun hdf5 tools on it ? Or should I manually clean the file ?
What do you feel about this? What any of us think may be less important since you would be the one presenting/defending this data/analysis, which seems to have some flaw in the raw data.
Well, I'm still a student, that's why I'm not always sure about my opinions, and ask a lot of questions. But you guys were right, I don't know how and when, but it seems that the file was corrupted. The last modification date is the 10 January... I didn't even used this file at this moment. I really don't know how it could happen. Thankfully, I can download it again from a distant server. It will take time, but maybe one day I will be able to successfully polish my data ! Thanks again for all the answers and the support !
I wouldn't expect that Canu would work with a corrupted file like that and assume that the corruption happened later...