5.8 years ago
genya35
Hello, I have a text file with thousands of unique sequences in fasta format. Each read has a header in the following format:
122391_Tcount2352_Acount2352_Bcount0_length293
It's obvious that 'length' represents the length of the read, but the other numbers are not clear. I do not know which tool was used to generate the file, but blastn was used at some point in the pipeline. I'm curious whether anyone here has encountered this header format before and can tell me which part of the sequence header represents the count of reads.
Thanks for your help in advance,
Lena
Hi Lena,
Can you tell us the tool that provided those fasta headers for you? That might help us know what "Tcount", "Acount" and "Bcount" mean.
Thanks!
Identifying possible tools from the header style/format is the whole question...
Lena,
Take a few separate sequences and run them through Blastn or Blastx. It may become clearer what organism you are dealing with. Then look at NCBI to see who sequenced it. You may even find some articles describing it. Good luck!
How does this help with the question about the information in the header?
Lena said she had thousands of unique sequences.
If the data is published and the source is known, one option is simply to ask the authors.
It may help or not - but any additional information is valuable.
Can you provide a little more background? Where did you get the file? Some co-worker / collaborator passed it to you? If so, ask them. Did you download it from some site / database / paper? Then please tell us where from.
My guess is this is some unpublished internal / personal pipeline, and your only hope at getting a conclusive answer is asking the person who created it.
Just guessing wildly - because guessing is free - I think the first number is the transcript identifier, Tcount (number) is the count of reads for sample T, Acount (number) is the count of reads for sample A, Bcount (number) is the count of reads for sample B, length (number) is the length of the transcript.
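To test a guess like this against the whole file, the header can be split into its numeric fields. The snippet below is only a sketch: the field names (`seq_id`, `tcount`, and so on) and the helper `parse_header` are made up for illustration, and what the fields actually mean is still unknown. One thing worth checking is whether Tcount always equals Acount + Bcount, as it does in the single example header (2352 = 2352 + 0). If that holds across the file, it would hint that T is a total rather than a third sample, but one header proves nothing.

```python
import re

# Pattern for headers like 122391_Tcount2352_Acount2352_Bcount0_length293.
# The group names are hypothetical labels, not names used by any known tool.
HEADER_RE = re.compile(
    r"^>?(?P<seq_id>\d+)"
    r"_Tcount(?P<tcount>\d+)"
    r"_Acount(?P<acount>\d+)"
    r"_Bcount(?P<bcount>\d+)"
    r"_length(?P<length>\d+)$"
)


def parse_header(header):
    """Return the numeric fields of one header line as a dict of ints."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


fields = parse_header("122391_Tcount2352_Acount2352_Bcount0_length293")
print(fields)

# Hypothesis check only: does Tcount equal Acount + Bcount for this header?
print(fields["tcount"] == fields["acount"] + fields["bcount"])
```

Running the same check over every header line (the lines starting with `>`) and counting how often it fails would show quickly whether the "total" reading is plausible.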