Dear all,
I have used the fasta genome provided by NCBI as reported in this post . The headers of this file are:
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
>chr2 AC:CM000664.2 gi:568336022 LN:242193529 rl:Chromosome M5:f98db672eb0993dcfdabafe2a882905c AS:GRCh38
>chr3 AC:CM000665.2 gi:568336021 LN:198295559 rl:Chromosome M5:76635a41ea913a405ded820447d067b0 AS:GRCh38
[...]
>chrUn_GL000218v1 AC:GL000218.1 gi:224183305 LN:161147 rl:unplaced M5:1d708b54644c26c7e01c2dad5426d38c AS:GRCh38
>chrEBV AC:AJ507799.2 gi:86261677 LN:171823 rl:decoy M5:6743bd63b3ff2b5b8985d8933c53290a SP:Human_herpesvirus_4 tp:circular
I need to use a fusion genome built by concatenating this human genome with one obtained from selected virus sequences. This virus genome is formed by a single header and a long stretch of nucleotides derived from individual virus sequences.
Following the indications given in this other post I prepared the header for the virus genome as follows:
>chrV AC:XXXXXXXX.1 gi:00000000 LN:370064105 rl:Chromosome M5:5aa5be7025d7baa666a8651e0909e4ce AS:1 SP:All_viruses tp:linear
I made up accession number AC to XXXXXXXX.1 because there is no real entry for my made-up genome in Genbank & NCBI; since the IDs given in the human genome are 8 digit long, I gave a 8 letters fake entry and a ".1" because this is first time I am using this genome (maybe I should have used two letter, 6 numbers?).
Same for the GI number: the made up genome is not recorded in GenBank, thus I simply gave a fake 8 digit number.
LN is the length of the genome, I treated it as a real chromosome and M5 derives for md5sum I made on the fasta file. AS and SP are free text fields (I assumed) and the genome is linear.
By the way, I separated the fields with two spaces.
I concatenated the human genome and the made up virus genome with `cat <human.fa> <virus.fa> > <fusion.fa> and I prepared the indices for this genome and aligned the samples with
bwa index -a bwtsw <fusion.fa>
bwa mem -t 8 -R <read_group> <fusion.fa> <R1.fq.gz> <R2.fq.gz> | \
samtools sort -o <file_ALN-SRT.sam>
However, I got this error message:
[bns_restore_core] Parse error reading <fusion.fa>.amb
and the SAM file is virtually empty:
cat <file_ALN-SRT.sam>
@HD VN:1.3 SO:coordinate
May I ask what I got wrong? When aligning against either one or the other genome separately the alignment is OK, thus it must be a problem with the headers I guess.
Thank you
Does this help?
That would be weird because I could align to the individual genomes individually. How could spaces be introduced by simply concatenating two files? And that post indicates that "spaces in the header are OK". Anyway, I'll try this, thank you.
Nope, it did not. I tried with both
sed -i 's/\s*$//g'
(for spaces in the sequence) andsed -i 's/^[^>]\s*$//g'
(for spaces in the header) followed bybwa index
but the result was always the sae.I also tried with printf "field\t...field\n" but I got the same error.
Let's check if the line ending terminators are correct.
If the output is
<fusion.fa>: ASCII text, with CRLF line terminators
you should installdos2unix
and rundos2unix <fusion.fa>
.fin swimmer
That was a good tip that I shall remember, but there no Windows here:
Could you please try to reindex
fusion38-10k.fa
. But this time without any parameters.For -a bwtsw` you used I see a warning in the manual about it:
Without setting the algorithm bwa will auto select the best one.
fin swimmer
exactly the same error...
<human.fa>
only?<virus.fa>
only?grep "chrV" -A 5 -B 5 <fusion.fa>
?Hm, so the problem arises only in the concatenated file. The position in <fusion.fa> where the file were concatenated looks fine to me.
Which version of
bwa
are you using? Have you done anything else aftercat <human.fa> <virus.fa> > <fusion.fa>
other thanbwa index <fusion.fa>
?fin swimmer
BWA is Version:
0.7.17-r1188
. I did nothing else. You might notice that I changed the fields of chrV to match those of the human chromosome, but no improvements.Although unlikely, could the word length of the sequences (excluding the headers) affect the alignment? The human genomes has 80 characters per line and the virus has 100.