I have a query file with 70,000 lines of sequences and when I do the bwasw
each sam output file is about 35 mb and obviously contains all the sequence lines. I need to perform the alignment on about 1000 files, so you can imagine that I don't want to have 35 gigs worth of output to sort. (using cygwin on Windows 7 btw)
Exact bowtie command used:
./bwa bwasw INDEXES/AC_000091.fna QUERY/635plus104_500bp_reads.fasta > results.sam
My input is the index created with
./bwa index AC_000091.fna "AC_000091.fna";
My query file is a FASTA file. The sequences are 500bp, i am only pasting some of them.
> 1
CATGACTGATTCACGCCGTTCGGGGTTATTAAACCAAACTCGCCCTGCAGGTGTGGCAACATATCCA
> 2
ACGGCCGCAGCCATAATGGTCCAATCAGCTTGAGGATATGCGGCTAACATTGCAGCACGCATTTCTG
> 3
GATGCGGGTAACGGCATCCAAAAATCAAACCGGGCTGAATAGCTTCGTAAGCCATATGCTGAGCAGT
> 4
CGTACCATCGGCATTCAATTGGGCACCGTACAGCATGCCTTTCTCCAGGAGTTTGGAGTTGGCTTTG
> 5
ATCTCCGGTGGGTTCAACCCCTACTTTTGCGAGTAAGGCGTCCTGGCCATGATCGGCAAGGGCCATA
> 6
GTAGAAAAGTCTGCTCCAGAGACAGTCCGTGAGCAAGATGCATCAATTGAAGCCTCGGATGAGGCCGA
> 7
CAGTCGGTGGGTGGCTGAGAGCTTAAGGATGGCTTTGGTTTGGGCGGCTTTGGGATTTTTGATATTTT
> 8
GCCCAAAGTGACTGAACTAAAGACAATCGGAAACAGGTTAGCGGCTAACACAAACCCCAGTATGGACA
> 9
TCGGGATTGGGCATCAAAAGGGATTTCAACTGTGCCTGCTAATCCCACCAAATCATCATGGCTAATTA
> 10
AGCGCTATCCTCCTTATTCATACAATAGTGATTGTTCTGTTGTCTCCACTGCTGCTCATATTCACAGG
> 11
TAGGGAACATGCCCCCCAAGCGATTTCAAAATGGGGGATTCAGGCGATTATTGGCGAGAGTTTTGCCG
I am trying to get rid of the non-aligned reads from the SAM output file ans I searched through forums and tried the following code with no avail.
./samtools view -bS -f 4 testing.sam > output.bam
All my sam files give the following error:
`[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '0' is recognized as '*'.
Parse warning at line 3: mapped sequence without CIGAR
Parse error at line 3: missing colon in auxiliary data
Aborted (core dumped)
`
Here's what a part of my sam files looks like
@SQ SN:gi|89106884|ref|AC_000091.1| LN:4646332
4 * 0 0 * * 0 0 <CGTAAAAAGGA> *
4 * 0 0 * * 0 0 <TAGCCGTAGGG> *
I'd appreciate help with this!
That sam output looks odd. How exactly did you generate it?
I agree the output looks odd, there are no sequence names f.e. Could there be a problem with the input file, lacking proper names etc.? Could you please post one (upload to public ftp/web) of your input files (or at least the first 100 sequence files) + the exact command line of bwa, I assume it's fastq input (if it was fasta you wouldn't have the non matching output with bwasw)?
Added the code i used for bwa and my query file! Thank you for the help!
I will try that out tomorrow. Your reference sequence is AC_000091.1 in RefSeq (E. coli K12 substr. W3110)? It is marked obsolete and removed in NCBI nucl., not that it matters here, but just wanted to note.
Ha yes I noticed that too. My data is just random, i want to check the efficiency of bwa and bowtie etc.