Failure in alignment with BWA using fusion reference genome
1
0
Entering edit mode
6.1 years ago

Dear all,

I have used the fasta genome provided by NCBI as reported in this post . The headers of this file are:

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
[...]
>chrUn_GL000218v1  AC:GL000218.1  gi:224183305  LN:161147  rl:unplaced  M5:1d708b54644c26c7e01c2dad5426d38c  AS:GRCh38
>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular

I need to use a fusion genome built by concatenating this human genome with one obtained from selected virus sequences. This virus genome is formed by a single header and a long stretch of nucleotides derived from individual virus sequences.

Following the indications given in this other post I prepared the header for the virus genome as follows:

>chrV  AC:XXXXXXXX.1  gi:00000000  LN:370064105  rl:Chromosome   M5:5aa5be7025d7baa666a8651e0909e4ce  AS:1  SP:All_viruses  tp:linear

I made up accession number AC to XXXXXXXX.1 because there is no real entry for my made-up genome in Genbank & NCBI; since the IDs given in the human genome are 8 digit long, I gave a 8 letters fake entry and a ".1" because this is first time I am using this genome (maybe I should have used two letter, 6 numbers?).

Same for the GI number: the made up genome is not recorded in GenBank, thus I simply gave a fake 8 digit number.

LN is the length of the genome, I treated it as a real chromosome and M5 derives for md5sum I made on the fasta file. AS and SP are free text fields (I assumed) and the genome is linear.

By the way, I separated the fields with two spaces.

I concatenated the human genome and the made up virus genome with `cat <human.fa> <virus.fa> > <fusion.fa> and I prepared the indices for this genome and aligned the samples with

bwa index -a bwtsw <fusion.fa>
bwa mem -t 8 -R <read_group> <fusion.fa> <R1.fq.gz> <R2.fq.gz> | \
samtools sort -o <file_ALN-SRT.sam>

However, I got this error message:

[bns_restore_core] Parse error reading <fusion.fa>.amb

and the SAM file is virtually empty:

cat <file_ALN-SRT.sam>  
@HD VN:1.3  SO:coordinate

May I ask what I got wrong? When aligning against either one or the other genome separately the alignment is OK, thus it must be a problem with the headers I guess.

Thank you

alignment software error • 2.7k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

That would be weird because I could align to the individual genomes individually. How could spaces be introduced by simply concatenating two files? And that post indicates that "spaces in the header are OK". Anyway, I'll try this, thank you.

ADD REPLY
0
Entering edit mode

Nope, it did not. I tried with both sed -i 's/\s*$//g' (for spaces in the sequence) and sed -i 's/^[^>]\s*$//g' (for spaces in the header) followed by bwa index but the result was always the sae.

ADD REPLY
0
Entering edit mode

I also tried with printf "field\t...field\n" but I got the same error.

ADD REPLY
1
Entering edit mode

Let's check if the line ending terminators are correct.

$ file <fusion.fa>
<fusion.fa>: ASCII text

If the output is <fusion.fa>: ASCII text, with CRLF line terminators you should install dos2unix and run dos2unix <fusion.fa>.

fin swimmer

ADD REPLY
0
Entering edit mode

That was a good tip that I shall remember, but there no Windows here:

$ file GRCh38.fa 
GRCh38.fa: ASCII text
$ file virChr10510.fa 
virChr10510.fa: ASCII text
$ file fusion38-10k.fa
fusion38-10k.fa: ASCII text
ADD REPLY
0
Entering edit mode

Could you please try to reindex fusion38-10k.fa. But this time without any parameters.

For -a bwtsw` you used I see a warning in the manual about it:

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes.

Without setting the algorithm bwa will auto select the best one.

fin swimmer

ADD REPLY
0
Entering edit mode

exactly the same error...

ADD REPLY
0
Entering edit mode
  • Can you index and align against <human.fa> only?
  • Can you index and align against <virus.fa> only?
  • What's the output of grep "chrV" -A 5 -B 5 <fusion.fa>?
ADD REPLY
0
Entering edit mode
  • yes
  • yes
  • $ grep "chrV" -A 5 -B 5 fusion38-10k.fa

GGGGGCGGCGCGGGAGCCTGCACGCCGTTGGAGGGTAGAATGACAGGGGGCGGGGACAGAGAGGCGGTCG
CGCCCCCGGCCGCGCCAGCCAAGCCCCCAAGGGGGGCGGGGAGCGGGCAATGGAGCGTGACGAAGGGCCC
CAGGGCTGACCCCGGCAAACGTGACCCGGGGCTCCGGGGTGACCCAGCCAAGCGTGACCAAGGGGCCCGT
GGGTGACACAGGCAACCCTGACAAAGGCCCCCCAGGAAAGACCCCCGGGGGGCATCGGGGGGGGTGTTGG
CGGGGGCATGGGGGGGTCGGATTTCGCCCTTATTGCCCTGTTT
>chrV  AC:KI270741.1  gi:86261678  LN:370064105  rl:Chromosome   M5:5aa5be7025d7baa666a8651e0909e4ce  AS:GRCh38  SP:All_viruses
AACATGGATTTTGATAAAATTTATAAAGATTTAGTAATGGATATTATTAAGAACGGAGTAACAGAATTACCAGAAGGGACTCGTGCAGTATATGCTGATG
GGACTCCTGCAACAACTAAGTTTATTGAAGGTGTAAACTTTACAATCACACCAGATATGGGGGCTCCTTTATTACGTAGTAAGCATGTAGGTATTAAATG
GGCGCTAACGGAATTGCAGTGGATTTGGCAAGAAATGTCTAATGAGGTTTCGTGGTTAAAAGAACGAGGTGTAACTATTTGGGACGAATGGGAGCAAGAA
GATGGCACTATTGGAAAAGCGTATGGTTACGCATTATTTAACAAAGAACGTTTTATTCCTGTTTATAGTGGAGATGTTGAGTTAGCTAAAAATCGCAAAG
CAGGTTCACTTATTCCAAAAGTCGTACCTAAAAAAGGTGGTTGGATAGCAGGAACACATCAACCATACCTAAAGCTAAATCAGGTAGAGGCAGTTATTGA
ADD REPLY
0
Entering edit mode

Hm, so the problem arises only in the concatenated file. The position in <fusion.fa> where the file were concatenated looks fine to me.

Which version of bwaare you using? Have you done anything else after cat <human.fa> <virus.fa> > <fusion.fa> other than bwa index <fusion.fa>?

fin swimmer

ADD REPLY
0
Entering edit mode

BWA is Version: 0.7.17-r1188. I did nothing else. You might notice that I changed the fields of chrV to match those of the human chromosome, but no improvements.

Although unlikely, could the word length of the sequences (excluding the headers) affect the alignment? The human genomes has 80 characters per line and the virus has 100.

ADD REPLY
4
Entering edit mode
6.0 years ago

You might notice that I changed the fields of chrV to match those of the human chromosome, but no improvements.

Everything after the first whitespace is just a description and have no influence on anything. bwa, samtools and other just ignore that part.

Although unlikely, could the word length of the sequences (excluding the headers) affect the alignment? The human genomes has 80 characters per line and the virus has 100.

I also thought about that, even if this would be weird. But at least in my testing here, this have no influence.

According to your answers here the problem arises only in the concatenated file. Let's check if there are any non valid characters within the sequences. Does this little awk script produce any output:

$ awk '/[^a-zA-Z]/ && $0 !~"^>" {print "invalid char at line ", NR}' <fusion.fa>

fin swimmer

ADD COMMENT
0
Entering edit mode

Ah, that is very clever! Yes it did find some troubles:

$ awk '/[^a-zA-Z]/ && $0 !~"^>" {print "invalid char at line ", NR}' fusion38-10k.fa
invalid char at line  44558934
invalid char at line  46471885
$ sed '44558934q;d' fusion38-10k.fa
TGCTTAGATGTAAGAGATAAACATTTAAAAGTGGAGTGAGCTAACCTCAATTGCAAGAGTCAAGCCCTGGAGCGCCAACTCCCAGAAGGGGGCGAACCAA
                    TTGAGCACTCTCACACACACCAAAAAGATTTTCCCTTTTCTTTGTACAGGGTTGGTTTCACCTCAACACCATAAAAGATCCGGAAACGGTGAAGGGCTTG
$ sed '46471885q;d' fusion38-10k.fa
GTGTTCAAAATC@AACGCTAAGTGTCGCCCCGAACGACGTAGGGAAGCTAAAGAAAATAAGACACCTGGGTATCTTCTTGAACTATGTCGTAAACGAATG

looks like there is a newline or a gap in the former and the "@" character I don't know how it came out in the first place. Both are present in the file I prepared:

$ grep -n "^>" fusion38-10k.fa
1:>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
...
44282437:>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular
44284893:>chrV  AC:KI270741.1  gi:86261678  LN:370064105  rl:Chromosome   M5:5aa5be7025d7baa666a8651e0909e4ce  AS:GRCh38  SP:All_viruses

I amended the "@" with:

$ sed 's/@//' fusion38-10k.fa > fusion38-10k2.fa
$ grep @ fusion38-10k2.fa

I tried to solve the former problem with:

sed '44558934s/[[:space:]]*/\n/' fusion38-10k.fa > fusion38-10k2.fa

but the illegal character was not a space/tab; by using a text editor I found that it was some kind of unformatted value. So I removed it manually from the text editor. Checked the quality of the file with the awk command, re-cat it and re-index it. I am testing the alignment now, I will update on the outcome. Thanks!

ADD REPLY
0
Entering edit mode

The alignment has started! So the problem was indeed about those faulty characters. Case closed. Thank you so much again.

ADD REPLY
1
Entering edit mode

Fine if we got it. I moved my comment to an answer, so you can mark this thread as solved.

I would suggest you investigate a bit more, what goes wrong on your system when using cat. Not that your system is broken in someway and you get more problems because of this unexpected behaviour.

fin swimmer

ADD REPLY
0
Entering edit mode

I tried to investigate but none of the steps introduced funny characters. It will remain a mistery...

ADD REPLY

Login before adding your answer.

Traffic: 2121 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6