I recently prefetched 157 SRA files from an NCBI BioProject using the SRA Toolkit. I then used the toolkit to download those files in FASTA format.
Each individual FASTA file looks something like this:
(example for the SRR5678544 file)
">SRR5678544.1 HWI-ST1146_88:5:1112:6472:81473 length=194
CCATGCGGGGTATCGTATGCTTCCTTCTGCACTACCCTTTAGCTGTTCTATATGCTGCCACTCCTCAATTGGATTAGTCTCATCCTTCAATGCTATCACAAGAAGTAGAGACACAAATGCAAGAGGAGCATATAAATTACAAAACACCATCACTGAGGGCCCTAAAGCGGTTCCCACGAAAAAAAGGAGAGTAG
'>SRR5678544.2 HWI-ST1146_88:5:1113:13218:62635 length=194
TGGATGGTGGTTGGGAGTGGTAAGGTTGAATGAGACACGGTAACGAGTGGGGAGGTAGGGTAATGGAGGGTAAGTTGAGAGACAGGTTCGTCAGGGGACACACCACACCACACCCACACCCACACACCCACACACCACACCCACACACCACACCCACACCCACACCCCCACACACCACACCCCCACCCCACCCA
'>SRR5678544.3 HWI-ST1146_88:5:2206:11224:22269 length=194
GTCTCTTAACTTACCCTCCATTACCCTACCTCCCGACTCGTTACCCTGTCTCATTCAACCATACCACTCCCGACCACCATCCATCCCTCTACTTACTACAGTATGGTGAGTGGGACATGGTGGATGGTAGGGTAAGTGGCAGTGGAGTTGGATATGGGCAATTGGAGGGTAACGGTTATGGTGGACGGGGGGTG"
And so on and so forth. (Ignore the " and ' marks; I had to add those to get it to format correctly here.)
Question 1: Can someone explain why this file has so many sequences (>)? Every FASTA file I have worked with has one line with ">" at the top and a header sentence followed by a single big paragraph of the genome sequence.
Question 2: If I wanted to combine all the sequences in a single FASTA file to look like a "normal" FASTA file - just a single paragraph - how would I do that?
Reasoning: I'm asking simply because I need to make a BLAST database out of the 157 sequences and the "makeblastdb" command only takes in a single file as the input. I was going to try and combine each individual FASTA file into single paragraphs and then make a huge FASTA file that includes all 157 sequences separated by ">".
1) This is a sequencing run. These are short reads, not an assembly.
2) Run a short-read mapper and/or a de novo assembler.
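A quick way to confirm point 1 on your own files is to count the records and look at their lengths. The following is only a minimal sketch using standard `grep` and `awk`; the function names `count_records` and `length_summary` are made up for illustration, and it assumes an uncompressed FASTA file.

```shell
# Count FASTA records, i.e. header lines that start with ">".
count_records() {
    grep -c '^>' "$1"
}

# Print the number of records plus min, max and mean sequence length.
# Sequences that wrap over several lines are summed per record.
length_summary() {
    awk '
        function flush_record() {
            n++
            total += len
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        /^>/ { if (seen) flush_record(); seen = 1; len = 0; next }
        { len += length($0) }
        END {
            if (seen) flush_record()
            if (n) printf "n=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
        }
    ' "$1"
}

# Example usage (file name taken from the question):
# count_records SRR5678544.fasta
# length_summary SRR5678544.fasta
```

A run of short reads will typically show a very large record count with lengths in the low hundreds, whereas an assembled genome shows a handful of very long records.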
Oh perfect, I thought I was going crazy for a second.
Thank you for clarifying for me!
While technically possible, it is unusual to map sequencing reads from fastA rather than fastQ files. You should download fastQ from NCBI and map or assemble that, simply because not all tools support fastA when fastQ is expected.
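If you are not sure which of the two formats a downloaded file is in, a rough check is to look at its first character. This is just a sketch: `detect_format` is a hypothetical helper name, it assumes an uncompressed text file, and it does not validate the rest of the file.

```shell
# Guess the format of a sequence file from its first character:
# ">" suggests FASTA, "@" suggests FASTQ.
detect_format() {
    case "$(head -c 1 "$1")" in
        '>') echo fasta ;;
        '@') echo fastq ;;
        *)   echo unknown ;;
    esac
}

# Example usage:
# detect_format SRR5678544.fasta
```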
I was confused about the data, but now I understand it's short unassembled reads. My next question is: what now? Do I assemble them? Or do I align them? I have a reference genome I could use, but the purpose of the BLAST database is to eventually do some phylogenetic analysis: analyzing closely related species, looking at specific genes in those species, etc. There's a lot I want to be able to do with the data I am collecting; I am just confused about how to process them before those analyses.
For instance, creating a local BLAST database requires a SINGLE fastA file as the input. And as I have 157 files (to start), I need to somehow combine all the 157 sequences into one file. I have the fastQ files as well, which I can use for cleaning up the data, and then I can use the FastX-Toolkit to convert the fastQ files into fastA and then use this line of code to combine all 157 "cleaned" sequences into a single fastA file:
cat *.fasta > allfiles.fasta
But I can't include all of the short raw reads without making them a consensus or aligned or assembled sequence first. That's the part I'm stuck at mainly.
What is your project?
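For reference, the convert-then-concatenate step described in the post above can be sketched in plain shell. This is not how the FastX-Toolkit does it, just an `awk` stand-in. It assumes uncompressed fastQ files with exactly four lines per record, and the names `fastq_to_fasta`, `merge_as_fasta`, `indir` and `outfile` are made up for illustration. Note that the merged file still holds raw reads, not assembled sequences.

```shell
# Convert one FASTQ file to FASTA on stdout.
# Assumption: every record is exactly 4 lines
# (header, sequence, "+", qualities).
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # "@name" becomes ">name"
         NR % 4 == 2 { print }                     # sequence line, unchanged
    ' "$1"
}

# Convert every *.fastq file in a directory and append the
# results to a single FASTA file.
merge_as_fasta() {
    indir=$1
    outfile=$2
    : > "$outfile"                    # start from an empty output file
    for fq in "$indir"/*.fastq; do
        [ -e "$fq" ] || continue      # skip if the glob matched nothing
        fastq_to_fasta "$fq" >> "$outfile"
    done
}

# Example usage (directory name is a placeholder):
# merge_as_fasta cleaned_reads allfiles.fasta
```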