Hi!
I'm not schooled very much in bioinformatics that's why I'm hoping there is a better way to do this.
I want to test an index-based compression method which highly scales with the similarity between a sequence and a reference-sequence. In order to do that I need some genomes from the 1000 human genome project. I downloaded some of them from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.sequence.index as .fastq files.
Around 2.6 gb each. They seem to be split in 2 .fastq files each.
All I need is the raw sequence in form of one huge String. In order to do that I thought the best way might be to write a little program for this. The problem is that one .fastq file seems to have around 112000000 lines and after one hour of compilation time I have checked around 2000000 lines of the .fastq file. There needs to be a better way to get the raw String data of an human genome.
That's the code I'm using.
Thank you very much for help!
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
public class ReadFastq {
public String readFastq() throws IOException{
int lineCount;
String sequence = "";
String line = "";
LineNumberReader lnr = new LineNumberReader(
new FileReader(
new File(
"C:\\Users\\Patrick\\Documents\\Bachelorarbeit\\genom\\ERR012616_2.fastq")));
System.out.println("File opened successful!");
lnr.skip(Long.MAX_VALUE);
lineCount = lnr.getLineNumber();
System.out.println("Input file contains " + lineCount
+ " lines of information");
lnr.close();
BufferedReader input = new BufferedReader(
new FileReader(
new File(
"C:\\Users\\Patrick\\Documents\\Bachelorarbeit\\genom\\ERR012616_2.fastq")));
System.out.println("File opened successful!");
String firstChar;
for (int x = 0; x < lineCount; x++) {
line = input.readLine();
firstChar = ""+line.charAt(0);
//System.out.println(firstChar);
if(firstChar.contains("N") || firstChar.contains("A") || firstChar.contains("G") || firstChar.contains("T") || firstChar.contains("C")){
sequence = sequence +line;
System.out.println("Checked line "+x+" and added to Sequence");
}
}
System.out.println("Sequence length: "+sequence.length());
return sequence;
}
}
"All I need is the raw sequence in form of one huge String."
you cannot do this. For a common Fastq file it won't fit in memory and why would you do this?
In order to do that i thought the best way might be to write a little program for this. The problem is that one .fastq file seems to have around 112,000,000 lines and
"after one hour of compilation time"
no compilation time, you meant execution.
why do you read the file twice when you could use only one loop using a
java.io.BufferedReader
?At first,
thanks for your answers!
I never studied this, but have to deal with it now in connection with my bachelor thesis, so all my knowledge is self-induced. Sorry for that.
I was aware of how the .fastq-files are structured, but I thought if if this two files contain all the raw reads of the sequenced genome, the entire sum would be the genomes sequence data.
Unfortunately I need to use the 1000 genome project data in order to reproduce some results of an important related work.
This question might seem a bit stupid but what's the difference between the raw reads and an "real" genome data?
I aware of that every letter of the raw data has a quality value which determines how exact this letter is.
This is not a forum. This is Q&A. If you have new questions they go in new threads.
I want to say there are no stupid questions; you're just new to the field and the technologies. Need to do some research on how "next gen sequencing" works, Read up on sequence alignment and why it's a thing that matters. There might be some questions about that here already asked and answered, but since they're foundational questions rather than technical ones, you might find better answers at wikipedia.