BioJava/FASTA file help
2
0
Entering edit mode
9.9 years ago
hood821 • 0

I need help with reading a fasta file into java, and being able to count the number of A, G, C, T nucleotides. I have been trying to use BioJava since everything I google takes me there but the literature is not clear how to do this. I can read the fasta file in as either a String[] or LinkedHashMap. I don't know what is better?

sequence • 2.8k views
ADD COMMENT
0
Entering edit mode
9.9 years ago

You should not do either... process it one line at a time so you don't have to load the whole file into memory at once.

You can also do this very quickly with my fasta stats program, which is open-source and written in java:

java -ea -Xmx120m jgi.AssemblyStats2 in=file.fasta

That will print out the number of nucleotides, fraction A/G/C/T/N/other, and so forth. jgi.CountGC will also work, and if you want to look at code, it's much shorter because it doesn't track scaffold statistics.

ADD COMMENT
0
Entering edit mode

Thanks, could you help me with this though? I want to be able to write simple programs to look at sequences. Read in a fasta file and do basic analysis- number of total nucleotides, how many of each? This is a simple program for work, just to make my life easier.

If you could help with that, that would be awesome! Here is what I have, which basically just prints everything in the file. I want to count the number of nucleotides and total.. but I have trouble with the String [] and it keeps everything in lines instead of characters.

package textfiles;
import java.io.IOException;

public class FileData {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        String file_name = "somefile.fna";

        try {
            ReadFile file = new ReadFile(file_name);
            String [] sequence = file.OpenFile();

            int i;        
            for (i=0; I < sequence.length; i++){

                System.out.println(sequence[i]);

            }

        }
        catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }

}
ADD REPLY
0
Entering edit mode

I don't use BioJava so I can't help you much, but if you have sequence in a String s, you can do this:

for(int j=0; j<s.length(); j++){
   char c=s.charAt(j);
   //do something with c, like increment a counter.
}

But that won't work correctly if the header is mixed in with the sequence, so it depends on what ReadFile is doing.

ADD REPLY
0
Entering edit mode

Yeah... the problem I have is that sequence is a String [] and I can't do the .charAt.

If I try to convert it to a string, I don't get the sequence, I get name.

ADD REPLY
0
Entering edit mode
9.9 years ago

I'm no Java expert but a while ago I happened to write a simple method, getNextSequence(br), to read one sequence at a time from a fasta file. In your case the code would be along these lines I guess:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CountNucsInFasta {

    public static void main(String[] args) throws IOException {

        String fastafile= "somefile.fna";
        BufferedReader br= new BufferedReader(new FileReader(fastafile));
        String[] fa= new String[2];
        while( (fa= getNextSequence(br)) != null ){
            String name= fa[0];
            String sequence= fa[1];
            // Code to to count nucs in sequence:
            // ...
        }
    }

    /**
     * Read next sequence from FASTA file and put it in a String array of length two:
     * String[0]: Name
     * String[1]: Sequence
     * @param br BufferedReader connected to the fasta file to read.
     * @return
     * @throws IOException
     */
    public static String[] getNextSequence(BufferedReader br) throws IOException {

            String[] fastaseq= new String[2];

            int BUFFER_SIZE = 8192;
            String line= br.readLine();

            if(line == null){
                    return null;
            } else if (line.startsWith(">")){
                    fastaseq[0]= line.replaceFirst(">", "");
            } else {
                    System.err.println(line);
                    System.err.println("Invalid sequence name or format");
                    System.exit(1);
            }
            StringBuilder sb= new StringBuilder();
            while(true){
                    br.mark(BUFFER_SIZE);
                    line= br.readLine();
                    if(line == null || line.startsWith(">")){
                            break;
                    } else {
                            sb.append(line);
                    }
            }              
            String sequence= sb.toString().toUpperCase();
            fastaseq[1]= sequence;
            br.reset();
            return fastaseq;                
    }    
}
ADD COMMENT

Login before adding your answer.

Traffic: 2399 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6