(I have almost no experience with bioinformatics or biology in general prior to this summer; please excuse any gross abuses of terminology or general misunderstandings regarding the field)
I'm working with some FASTQ files for a project, about 40 gigabytes of them (17-18 Gb of sequence, 70-150 bp per read), and I suspect they're the result of shotgun sequencing, because there's no way the genome the files are supposed to represent is that large. If my understanding of shotgun sequencing is correct, this means there's significant overlap between the individual reads, which should allow them to be reconstructed into larger, contiguous sequences (contigs), dramatically reducing the size of the data and making it far easier to work with.
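(Just to check whether I'm picturing this correctly: below is a toy sketch of the greedy "merge the two reads with the biggest suffix/prefix overlap" idea, with made-up reads and a made-up minimum-overlap threshold. I realize a real assembler working on 40 GB of data does something far more sophisticated; this is only to illustrate what I mean by reconstructing reads into contigs.)

# Toy illustration of overlap-based assembly -- NOT something to run on real data.
# The reads and the minimum-overlap threshold below are made up.

def overlap(a, b, min_len):
    """Length of the longest suffix of a matching a prefix of b (at least min_len), else 0."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)   # next position where the overlap could begin
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest suffix/prefix overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:        # no overlaps left; stop with whatever contigs remain
            break
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads

# Three overlapping fragments of "ACGTACGTGACCTT" come back as a single contig.
print(greedy_assemble(["ACGTACGT", "TACGTGACC", "GACCTT"]))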
So far, the only promising lead I've found is an application by the name of ARACHNE, which appears to be exactly what I'm looking for, except that I don't have a sufficiently powerful Linux machine at hand with the correct software installed (although it might be possible to rectify this if no other options present themselves).
Short version: How can I go about turning this giant pile of tiny sequences into a smaller pile of larger sequences?
Thanks for the reply. I believe they were sequenced on the Illumina HiSeq platform. I'm not sure what single-end and paired-end reads are, but I'll look into that. Same for the reference genome (I suspect there isn't one, though).
I'll continue reading up on sequence assembly, and see if I can convince IT to install the necessary software on one of the more powerful computers we've got.
Thanks!
http://elements.eaglegenomics.com/
Here you have some tools used in bioinformatics, presented in an accessible way. Go for the assemblers (long-read or short-read; after some reading you will know which is best suited for your type of reads).
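If you want a quick sanity check on what you actually have before choosing, a few lines of Python will count the reads and show the length range in one of your FASTQ files (the filename below is only a placeholder, and it assumes a plain, uncompressed FASTQ with the usual 4 lines per record):

# Quick FASTQ sanity check: count reads and report the min/max read length.
# "reads_1.fastq" is a placeholder name; assumes an uncompressed, well-formed
# FASTQ file with 4 lines per record.

def fastq_stats(path):
    count, min_len, max_len = 0, None, 0
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                      # the sequence line of each record
                n = len(line.rstrip("\n"))
                count += 1
                min_len = n if min_len is None else min(min_len, n)
                max_len = max(max_len, n)
    return count, min_len, max_len

count, lo, hi = fastq_stats("reads_1.fastq")
print(f"{count} reads, {lo}-{hi} bp")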