I have been trying to read a FASTA file containing 11374 protein sequences, but only the first 2814 sequences are read. I am using biojava-4.1.0 and the line for reading the sequences is:
I can replicate the issue. It appears that biojava doesn't like the pipe character | in the sequence name. Or rather, I suspect it truncates the sequence name at the first |, so when a duplicate name is found, the existing entry in the HashMap is silently replaced.
If you remove the pipe characters from the sequence names, then your code will read all the sequences:
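To illustrate the suspected mechanism, here is a minimal sketch in plain Java with no biojava involved. The `keyOf` rule (keep only the text before the first pipe) is an assumption standing in for whatever biojava actually does with the header, and the class name and sample records are made up for the demonstration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipeKeyCollision {

    // Hypothetical key rule (an assumption, not biojava's code):
    // keep only the text before the first '|'.
    static String keyOf(String header) {
        int pipe = header.indexOf('|');
        return pipe < 0 ? header : header.substring(0, pipe);
    }

    // Index records (header, sequence) by the truncated key.
    // Map.put() overwrites the previous value when a key repeats, without any warning.
    static Map<String, String> index(List<String[]> records) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String[] record : records) {
            entries.put(keyOf(record[0]), record[1]);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up records: the first two share the same first token.
        List<String[]> records = Arrays.asList(
                new String[] {"AMP|0001|first", "MKV"},
                new String[] {"AMP|0002|second", "GLW"},
                new String[] {"DEF|0003|third", "KKL"});
        Map<String, String> entries = index(records);
        // Prints "3 records read, 2 kept"
        System.out.println(records.size() + " records read, " + entries.size() + " kept");
    }
}
```

If the key really is derived from only part of the header, this is exactly the pattern that would turn 11374 records into 2814 map entries.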
sed 's/|/_/g' sequences_87394380194.fasta > tmp.fa
Then this will do:
    public static void main(String[] args) throws IOException {
        File file = new File("/Users/berald01/Downloads/tmp.fa");
        LinkedHashMap<String, ProteinSequence> entries = FastaReaderHelper.readFastaProteinSequence(file);
        System.out.println(entries.size()); // 11374
    }
It strikes me as odd that no warning of any kind is issued, though!
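Since the reader gives no warning, one way to catch the problem is to count the header lines yourself and compare that number with the size of the returned map. The sketch below uses only the Java standard library; the class and method names are made up for this example, and it assumes every record starts with a line beginning with ">".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class FastaRecordCount {

    // Count FASTA records by counting the lines that start with '>'.
    static long countHeaders(Path fasta) throws IOException {
        try (Stream<String> lines = Files.lines(fasta, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.startsWith(">")).count();
        }
    }

    // Fail loudly when the parsed map holds a different number of entries than the file has records.
    static void requireAllRead(Path fasta, int parsedEntries) throws IOException {
        long expected = countHeaders(fasta);
        if (parsedEntries != expected) {
            throw new IllegalStateException(
                    "FASTA file has " + expected + " records but " + parsedEntries + " were parsed");
        }
    }

    public static void main(String[] args) throws IOException {
        // Small made-up file with two records.
        Path fasta = Files.createTempFile("demo", ".fa");
        Files.write(fasta, Arrays.asList(">a|1", "MKV", ">a|2", "GLW"), StandardCharsets.UTF_8);
        System.out.println(countHeaders(fasta)); // prints 2
        requireAllRead(fasta, 2);                // passes silently
        Files.delete(fasta);
    }
}
```

With the code above, a call such as requireAllRead(file.toPath(), entries.size()) right after reading would have flagged the 11374 versus 2814 mismatch immediately.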
I finally solved it. The problem was in how the file was created, because of the aforementioned issue. So I now put a new, unique first token in each header, with code like this:
    File tmpFolder = new File(System.getProperty("user.dir"), "tmp");
    tmpFolder.mkdirs();
    long h = 0;
    int i = 1;
    for (Map.Entry<String, ProteinSequence> entry : uniqueSequences.entrySet()) {
        StringTokenizer st = new StringTokenizer(entry.getValue().getOriginalHeader(), "|");
        //st.nextToken(); //Ignore first because it is going to be replaced
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("AMP_%d|", i++));
        while (st.hasMoreElements()) {
            sb.append(st.nextToken()).append(st.hasMoreElements() ? "|" : "");
        }
        entry.getValue().setOriginalHeader(sb.toString());
        h += entry.getKey().hashCode();
    }
    File sequencesFile = new File(tmpFolder, "sequences_" + h + FASTA_EXT);
    boolean notExists = sequencesFile.createNewFile();
    if (notExists) FastaWriterHelper.writeProteinSequence(sequencesFile, uniqueSequences.values());
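For reference, the effect of the renaming step on a single header can be sketched as a small standalone function. The class name, method name and sample header are made up for this example. Unlike the StringTokenizer loop, a plain prefix keeps empty fields and trailing pipes as they are, which may or may not matter for your headers.

```java
public class HeaderRenamer {

    // Prepend a unique, pipe-terminated ID so the first token of every header is distinct.
    static String withUniqueId(String originalHeader, int index) {
        return String.format("AMP_%d|%s", index, originalHeader);
    }

    public static void main(String[] args) {
        // Made-up header; prints "AMP_1|sp|P12345|NAME"
        System.out.println(withUniqueId("sp|P12345|NAME", 1));
    }
}
```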