This is a very basic question. I converted a FASTQ file to FASTA using the following command:
seqtk seq -a combined_RP.fastq.gz > RP.fasta
When I open the FASTA file, I get the following lines:
>E00502:101:ZA170417199:8:1101:27904:1731 1:N:0:ATGGGC
GTCNGTGAACTAGAAAATTTCTTGAAGTTGGAACCGCAAGTATTTGTTACCAATCCTCCTCAAAGTAGTATATGGCAAGAACTT....
>E00502:101:ZA170417199:8:1101:28168:1731 1:N:0:ATGAGC
TTGNAGTTTCAGTCAAAATCTAACTATTAAAATAAGGAATTTAAAACCTTACTCGCGCAGCATCCCGATCGCGGTGAGGTCAC...
>E00502:101:ZA170417199:8:1101:28716:1731 1:N:0:ATGAGC
AATNGGTTTTACTTTAATTTCTCTACTTCTATACTCTGTACATAATGTAATTAAGGGTGAATGAAGGGGTCACTAACAC....
My next step will be ab initio gene identification. Can I proceed with this FASTA file, or how do I get a FASTA file with continuous stretches of sequence? Thank you in advance.
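You now have reads, which are not a full genome FASTA. Converting FASTQ to FASTA only drops the quality scores; each record is still a single read. You first need to perform a de novo assembly to combine the reads into contigs in one genome FASTA, then you can do gene identification.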
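If you are unsure whether a FASTA file holds raw reads or assembled contigs, a quick length summary usually settles it: reads give a very large number of short records of roughly equal length, while an assembly gives far fewer and much longer ones. Below is a minimal sketch of such a check using only the Python standard library. The function names read_fasta and summarize_lengths are made up for this illustration and are not part of seqtk or any other tool.

```python
import gzip
from typing import Dict, Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    name = None
    chunks: List[str] = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    yield name, "".join(chunks)
                parts = line[1:].split()
                name = parts[0] if parts else ""
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize_lengths(lengths: List[int]) -> Dict[str, int]:
    """Return record count, total bases, min, max and N50 of the lengths."""
    if not lengths:
        return {"records": 0, "total_bp": 0, "min": 0, "max": 0, "n50": 0}
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    # N50: length of the record at which the running sum of the
    # longest-first lengths reaches half of the total bases.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "records": len(ordered),
        "total_bp": total,
        "min": ordered[-1],
        "max": ordered[0],
        "n50": n50,
    }


# Example use (RP.fasta is the file name from the question):
# stats = summarize_lengths([len(seq) for _, seq in read_fasta("RP.fasta")])
# print(stats)
```

For a large read set this keeps only the lengths in memory, not the sequences. If the maximum length and the N50 are both about the read length of your sequencing run, the file is still unassembled reads.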
I have a similar issue. I used BWA to generate a mapped genome assembly and called a consensus sequence (.fq). I wanted to run QUAST on my mapped assembly, but couldn't get QUAST to work with my BAM files, so I tried to convert my .fq file into a FASTA file and got a similar result. Although it looks like multiple reads or sequences, I believe it may be my data mapped to scaffolds, since the names (after the '>') match the reference scaffold names. I too had hoped to get a continuous stretch of sequence to input into QUAST, but am not sure how...
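One way to test that guess: if every record name in the converted file is also a scaffold name in the reference, the file is most likely one consensus sequence per scaffold rather than individual reads. A multi-record FASTA like that is the normal shape of an assembly, and QUAST takes such files as input. Here is a small sketch of the name check, again standard-library Python only. The function names and the file names in the usage comment are hypothetical placeholders, not taken from your setup.

```python
from typing import List, Tuple


def fasta_names(path: str) -> List[str]:
    """Return the first word of every header line in a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                names.append(parts[0] if parts else "")
    return names


def compare_names(consensus_path: str, reference_path: str) -> Tuple[List[str], List[str]]:
    """Split consensus record names into those found in the reference and the rest."""
    reference = set(fasta_names(reference_path))
    matched = []
    unmatched = []
    for name in fasta_names(consensus_path):
        if name in reference:
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


# Example use (placeholder file names):
# matched, unmatched = compare_names("consensus.fasta", "reference.fasta")
# print(len(matched), "names match the reference;", len(unmatched), "do not")
```

If the unmatched list is empty and the number of matched names is close to the number of reference scaffolds, you have a per-scaffold consensus and not a pile of reads.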
How?