I need help with reading a FASTA file into Java and counting the number of A, G, C and T nucleotides. I have been trying to use BioJava, since everything I google takes me there, but the documentation is not clear on how to do this. I can read the FASTA file in as either a String[] or a LinkedHashMap. I don't know which is better.
That will print out the number of nucleotides, fraction A/G/C/T/N/other, and so forth. jgi.CountGC will also work, and if you want to look at code, it's much shorter because it doesn't track scaffold statistics.
Thanks, could you help me with this though? I want to be able to write simple programs to look at sequences: read in a fasta file and do basic analysis, such as the total number of nucleotides and how many of each. This is a simple program for work, just to make my life easier.
If you could help with that, that would be awesome! Here is what I have, which basically just prints everything in the file. I want to count the number of each nucleotide and the total, but I have trouble with the String[]: it keeps everything as lines instead of characters.
package textfiles;

import java.io.IOException;

public class FileData {
    public static void main(String[] args) throws IOException {
        String file_name = "somefile.fna";
        try {
            // ReadFile (not shown) returns the file as one String per line
            ReadFile file = new ReadFile(file_name);
            String[] sequence = file.OpenFile();
            for (int i = 0; i < sequence.length; i++) {
                System.out.println(sequence[i]);
            }
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
I'm no Java expert but a while ago I happened to write a simple method, getNextSequence(br), to read one sequence at a time from a fasta file. In your case the code would be along these lines I guess:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CountNucsInFasta {
    public static void main(String[] args) throws IOException {
        String fastafile = "somefile.fna";
        // try-with-resources closes the reader when the loop is done
        try (BufferedReader br = new BufferedReader(new FileReader(fastafile))) {
            String[] fa;
            while ((fa = getNextSequence(br)) != null) {
                String name = fa[0];
                String sequence = fa[1];
                // Code to count nucs in sequence:
                // ...
            }
        }
    }

    /**
     * Read next sequence from FASTA file and put it in a String array of length two:
     * String[0]: Name
     * String[1]: Sequence
     * @param br BufferedReader connected to the fasta file to read.
     * @return the {name, sequence} pair, or null at end of file
     * @throws IOException if reading from br fails
     */
    public static String[] getNextSequence(BufferedReader br) throws IOException {
        String[] fastaseq = new String[2];
        // Read-ahead limit for mark(); a header line longer than this could make reset() fail.
        int BUFFER_SIZE = 8192;
        String line = br.readLine();
        if (line == null) {
            return null;
        } else if (line.startsWith(">")) {
            fastaseq[0] = line.substring(1);  // drop the leading '>'
        } else {
            System.err.println(line);
            System.err.println("Invalid sequence name or format");
            System.exit(1);
        }
        StringBuilder sb = new StringBuilder();
        while (true) {
            // Remember this position so the next header line can be un-read below.
            br.mark(BUFFER_SIZE);
            line = br.readLine();
            if (line == null || line.startsWith(">")) {
                break;
            } else {
                sb.append(line);
            }
        }
        fastaseq[1] = sb.toString().toUpperCase();
        // Rewind to just before the next header (or to end of file).
        br.reset();
        return fastaseq;
    }
}
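For the counting step left as a placeholder in the loop above, one possible sketch is a small helper that walks the sequence string and tallies each character. NucleotideCounter, count and LABELS are made-up names for illustration, not part of BioJava or of the code above, and lumping everything that is not A, C, G, T or N into an "other" bucket is just one reasonable choice.

```java
// Hypothetical helper class; the names here are illustrative only.
class NucleotideCounter {
    // Index order of the returned array: A, C, G, T, N, then "other" last.
    static final String LABELS = "ACGTN";

    /** Tally the bases in one sequence, ignoring case. */
    static long[] count(String sequence) {
        long[] counts = new long[LABELS.length() + 1];
        for (int i = 0; i < sequence.length(); i++) {
            char base = Character.toUpperCase(sequence.charAt(i));
            int idx = LABELS.indexOf(base);
            // Anything that is not A/C/G/T/N goes into the last slot.
            counts[idx >= 0 ? idx : LABELS.length()]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String sequence = args.length > 0 ? args[0] : "ACGTNNacgtAA-";
        long[] counts = count(sequence);
        System.out.println("total: " + sequence.length());
        for (int i = 0; i < LABELS.length(); i++) {
            System.out.println(LABELS.charAt(i) + ": " + counts[i]);
        }
        System.out.println("other: " + counts[LABELS.length()]);
    }
}
```

Inside the while loop this could be called as NucleotideCounter.count(sequence), adding up the per-sequence arrays if file-wide totals are wanted.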
I don't use BioJava so I can't help you much, but if you have the sequence in a String s, you can do this:
But that won't work correctly if the header is mixed in with the sequence, so it depends on what ReadFile is doing.
Yeah... the problem I have is that sequence is a String[] and I can't do the .charAt. If I try to convert it to a string, I don't get the sequence, I get the name.
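If the String[] holds the file one line per element (an assumption, since ReadFile is not shown), one way around the .charAt problem might be to glue the non-header lines into a single String first. LinesToSequence and joinSequenceLines are hypothetical names for this sketch. Note that it treats every line starting with ">" as a header and merges all records in the file into one string, which is fine for file-wide totals but not for per-sequence counts.

```java
// Hypothetical helper; assumes each array element is one line of the FASTA file.
class LinesToSequence {

    /** Concatenate all non-header lines into a single sequence string. */
    static String joinSequenceLines(String[] lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            // Skip missing lines and FASTA header lines such as ">seq1 ...".
            if (line == null || line.startsWith(">")) {
                continue;
            }
            sb.append(line.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by the poster's file.OpenFile().
        String[] lines = {">seq1 example", "ACGT", "GGCA"};
        String s = joinSequenceLines(lines);
        int a = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == 'A') {
                a++;
            }
        }
        System.out.println(s.length() + " bases, " + a + " of them A");
    }
}
```

Converting the array directly (for example by taking sequence[0]) gives the name because, under the same assumption, the first element is the header line.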