Question

BioJava FastaReaderHelper reads only 2814 ProteinSequence entries

9.0 years ago

Roberto Tellez Ibarra • 0

Hi everyone:

I have been trying to read a FASTA file containing 11374 protein sequences, but only the first 2814 sequences are read. I am using biojava-4.1.0, and the line that reads the sequences is:

LinkedHashMap<String, ProteinSequence> entries = FastaReaderHelper.readFastaProteinSequence(file);

I know that all sequences are different because I created the file with different values. My Fasta file is here

I previously tried biojava3-core-3.0.8 and biojava3-core-3.1.0 with the same reading code and got the same number of sequences.

Any help is welcome.

EDIT: Any other alternative for reading and writing Fasta files is also accepted.

sequence biojava reader fasta • 2.0k views

Ram · Accepted Answer · 2015-11-19

I can replicate the issue. It appears BioJava doesn't like the pipe character | in the sequence name. Or rather, I suspect it trims the sequence name at the first |, so when a duplicate name is found, the existing entry in the HashMap is silently replaced.
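
To illustrate why that would lose entries, here is a minimal sketch in plain Java. It assumes, as I suspect above, that the map key is the header text before the first |; this is not confirmed BioJava behaviour. The class HeaderKeyCollision, the method keyOf, and the sample headers are all made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderKeyCollision {

    // Assumed keying rule (hypothetical, not BioJava API): keep the text before the first '|'.
    static String keyOf(String header) {
        int pipe = header.indexOf('|');
        return pipe < 0 ? header : header.substring(0, pipe);
    }

    // Indexes headers by their truncated key; put() overwrites a repeated key without any warning.
    static Map<String, String> index(List<String> headers) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String header : headers) {
            entries.put(keyOf(header), header);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up headers: the first two share the token before the first pipe.
        List<String> headers = List.of("AMP|0001|x", "AMP|0002|y", "OTHER|0003|z");
        Map<String, String> entries = index(headers);
        System.out.println(headers.size() + " headers, " + entries.size() + " map entries");
    }
}
```

If the headers in the real file share their first token in the same way, the map would end up smaller than the number of records, which would match the 2814 vs. 11374 count.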

If you remove the pipe characters from the sequence names, then your code will read all the sequences:

sed 's/|/_/g' sequences_87394380194.fasta > tmp.fa

Then this will do:

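import java.io.File;
import java.io.IOException;
import java.util.LinkedHashMap;
// BioJava 4.x package names
import org.biojava.nbio.core.sequence.ProteinSequence;
import org.biojava.nbio.core.sequence.io.FastaReaderHelper;

public class ReadFasta { // class name is only illustrative
    public static void main(String[] args) throws IOException {
        File file = new File("/Users/berald01/Downloads/tmp.fa");
        LinkedHashMap<String, ProteinSequence> entries = FastaReaderHelper.readFastaProteinSequence(file);
        System.out.println(entries.size()); // 11374
    }
}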

It is striking that no warning of any kind is issued, though!

If you're interested, here's my implementation for reading FASTA files: BioJava/FASTA file help
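
Since the question also asks for alternatives, here is a minimal sketch of a plain-Java FASTA parser with no BioJava dependency. It is not the implementation linked above. It keys each record by its full header line and fails loudly on a duplicate instead of overwriting. The names SimpleFastaParser and parse are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SimpleFastaParser {

    // Parses FASTA lines into header -> sequence, keyed by the full header (without the leading '>').
    // Throws IllegalStateException if the same header appears twice.
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                String header = line.substring(1);
                current = new StringBuilder();
                if (records.put(header, current) != null) {
                    throw new IllegalStateException("Duplicate FASTA header: " + header);
                }
            } else if (current != null) {
                current.append(line); // sequence may span several lines
            }
            // Sequence lines that appear before any header are ignored.
        }
        Map<String, String> result = new LinkedHashMap<>();
        records.forEach((header, seq) -> result.put(header, seq.toString()));
        return result;
    }

    public static void main(String[] args) {
        // Made-up records for illustration.
        List<String> lines = List.of(">AMP|0001", "MKV", "LLA", ">AMP|0002", "GIG");
        System.out.println(parse(lines).size() + " records");
    }
}
```

For a real file, something like parse(java.nio.file.Files.readAllLines(java.nio.file.Paths.get("tmp.fa"))) should work, at the cost of loading the whole file into memory.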

Ram · Accepted Answer · 2015-11-20

I finally solved it. The problem was in how I created the file, because of the issue described above. So I now put a new, unique first token on each header, with code like this:

File tmpFolder = new File(System.getProperty("user.dir"), "tmp");
tmpFolder.mkdirs();

long h = 0;
int i = 1;
for (Map.Entry<String, ProteinSequence> entry : uniqueSequences.entrySet()) {                
    StringTokenizer st = new StringTokenizer(entry.getValue().getOriginalHeader(), "|");
    //st.nextToken(); //Ignore first because it is going to be replaced

    StringBuilder sb = new StringBuilder();
    sb.append(String.format("AMP_%d|", i++));

    while (st.hasMoreElements()) {
        sb.append(st.nextToken()).append(st.hasMoreElements()?"|":"");
    }

    entry.getValue().setOriginalHeader(sb.toString());
    h += entry.getKey().hashCode();
}

File sequencesFile = new File(tmpFolder, "sequences_" + h + FASTA_EXT);

boolean notExists = sequencesFile.createNewFile();
if (notExists) FastaWriterHelper.writeProteinSequence(sequencesFile, uniqueSequences.values());

Thank you, @dariober, for your reply.