Question

Samtools To Analyze Bwa Output

0

Entering edit mode

12.8 years ago

Dawnoflife ▴ 10

I have a query file with 70,000 lines of sequences and when I do the bwasw each sam output file is about 35 mb and obviously contains all the sequence lines. I need to perform the alignment on about 1000 files, so you can imagine that I don't want to have 35 gigs worth of output to sort. (using cygwin on Windows 7 btw)

Exact bowtie command used:

./bwa bwasw INDEXES/AC_000091.fna QUERY/635plus104_500bp_reads.fasta > results.sam

My input is the index created with

./bwa index AC_000091.fna "AC_000091.fna";

My query file is a FASTA file. The sequences are 500bp, i am only pasting some of them.

> 1
CATGACTGATTCACGCCGTTCGGGGTTATTAAACCAAACTCGCCCTGCAGGTGTGGCAACATATCCA
> 2
ACGGCCGCAGCCATAATGGTCCAATCAGCTTGAGGATATGCGGCTAACATTGCAGCACGCATTTCTG
> 3
GATGCGGGTAACGGCATCCAAAAATCAAACCGGGCTGAATAGCTTCGTAAGCCATATGCTGAGCAGT
> 4
CGTACCATCGGCATTCAATTGGGCACCGTACAGCATGCCTTTCTCCAGGAGTTTGGAGTTGGCTTTG
> 5
ATCTCCGGTGGGTTCAACCCCTACTTTTGCGAGTAAGGCGTCCTGGCCATGATCGGCAAGGGCCATA
> 6
GTAGAAAAGTCTGCTCCAGAGACAGTCCGTGAGCAAGATGCATCAATTGAAGCCTCGGATGAGGCCGA
> 7
CAGTCGGTGGGTGGCTGAGAGCTTAAGGATGGCTTTGGTTTGGGCGGCTTTGGGATTTTTGATATTTT
> 8
GCCCAAAGTGACTGAACTAAAGACAATCGGAAACAGGTTAGCGGCTAACACAAACCCCAGTATGGACA
> 9
TCGGGATTGGGCATCAAAAGGGATTTCAACTGTGCCTGCTAATCCCACCAAATCATCATGGCTAATTA
> 10
AGCGCTATCCTCCTTATTCATACAATAGTGATTGTTCTGTTGTCTCCACTGCTGCTCATATTCACAGG
> 11
TAGGGAACATGCCCCCCAAGCGATTTCAAAATGGGGGATTCAGGCGATTATTGGCGAGAGTTTTGCCG

I am trying to get rid of the non-aligned reads from the SAM output file ans I searched through forums and tried the following code with no avail.

./samtools view -bS -f 4 testing.sam > output.bam

All my sam files give the following error:

`[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '0' is recognized as '*'.
Parse warning at line 3: mapped sequence without CIGAR
Parse error at line 3: missing colon in auxiliary data
Aborted (core dumped)
`

Here's what a part of my sam files looks like

@SQ    SN:gi|89106884|ref|AC_000091.1|    LN:4646332

    4    *    0    0    *    *    0    0    <CGTAAAAAGGA>     *
    4    *    0    0    *    *    0    0    <TAGCCGTAGGG>   *

I'd appreciate help with this!

samtools bwa output • 8.6k views

ADD COMMENT • link updated 8.4 years ago by Biostar 20 • written 12.8 years ago by Dawnoflife ▴ 10

0

Entering edit mode

That sam output looks odd. How exactly did you generate it?

ADD REPLY • link 12.8 years ago by Andreas ★ 2.5k

0

Entering edit mode

I agree the output looks odd, there are no sequence names f.e. Could there be a problem with the input file, lacking proper names etc.? Could you please post one (upload to public ftp/web) of your input files (or at least the first 100 sequence files) + the exact command line of bwa, I assume it's fastq input (if it was fasta you wouldn't have the non matching output with bwasw)?

ADD REPLY • link 12.8 years ago by Michael 55k

0

Entering edit mode

Added the code i used for bwa and my query file! Thank you for the help!

ADD REPLY • link 12.8 years ago by Dawnoflife ▴ 10

0

Entering edit mode

I will try that out tomorrow. Your reference sequence is AC_000091.1 in RefSeq (E. coli K12 substr. W3110)? It is marked obsolete and removed in NCBI nucl., not that it matters here, but just wanted to note.

ADD REPLY • link 12.8 years ago by Michael 55k

0

Entering edit mode

Ha yes I noticed that too. My data is just random, i want to check the efficiency of bwa and bowtie etc.

ADD REPLY • link 12.8 years ago by Dawnoflife ▴ 10

score 3 · Answer 1 · 2012-03-12

3

Entering edit mode

12.8 years ago

Michael 55k

I have just tried it and found the reason for the error. First, with your input you are providing I see exactly the same error.

it's the id line in your query file:

> 1
ATGC...
> 2
ATGCG

That doesn't seem to work with a space between the > and the identifier, so simply remove the space.

Try running your input file through the following awk command (I hope cygwin includes awk):

 awk '{sub(/> /,">");print}' query.fna > query2.fna

Then running the following commands on query2.fna I get:

Confucius:test mdondrup$ bwa bwasw AC_000091.fna query2.fna > results.sam
[bsw2_aln] read 11 sequences/pairs (743 bp)...
[main] Version: 0.6.1-r104
[main] CMD: bwa bwasw AC_000091.fna query2.fna
[main] Real time: 0.010 sec; CPU: 0.012 sec
Confucius:test mdondrup$ samtools view -bS -f 4 results.sam > output.bam
[samopen] SAM header is present: 1 sequences.

ADD COMMENT • link 12.8 years ago by Michael 55k

0

Entering edit mode

Will you delete my post as it has been wrongly accepted as the answer? Cheers.

ADD REPLY • link 12.8 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Ah nope, I think you could do that yourself if you wanted, but just leave it, it documents try-and-error strategy in debugging. Further, it's up to the OP to decide on the accepted answer, even if contradicting evidence, that's how the system plays out.

ADD REPLY • link 12.8 years ago by Michael 55k

0

Entering edit mode

This works really well! Thank you!!! Is there a way I can save just the aligned sequences in the output file? I can't open bam with notepad.

ADD REPLY • link 12.8 years ago by Dawnoflife ▴ 10