Interpreting FASTA header
5.9 years ago
genya35 ▴ 50

Hello, I have a text file with thousands of unique sequences in fasta format. Each read has a header in the following format:

122391_Tcount2352_Acount2352_Bcount0_length293

It's obvious that 'length' represents the length of the read, but the other numbers are not clear. I do not know which tool was used to generate the file, but blastn was used at some point in the pipeline. I'm curious whether anyone here has encountered this header format before and can tell me which part of the sequence header represents the count of reads.

Thanks for your help in advance,

Lena

alignment • 1.6k views

Hi Lena,

Can you tell us the tool that provided those fasta headers for you? That might help us know what "Tcount", "Acount" and "Bcount" mean.

Thanks!


Identifying possible tools from the header style/format is the whole question...


Lena,

Take a few individual sequences and run them through blastn or blastx. That may make it clearer which organism you are dealing with. Then check NCBI to see who sequenced it. You may even find some articles describing it. Good luck!


How does this help with the question about the information in the header?


Lena said she has thousands of unique sequences.

If the data is published and the source is known, one option is simply to ask the authors.

It may or may not help, but any additional information is valuable.


Can you provide a little more background? Where did you get the file? Some co-worker / collaborator passed it to you? If so, ask them. Did you download it from some site / database / paper? Then please tell us where from.

My guess is that this is some unpublished internal / personal pipeline, and your only hope of getting a conclusive answer is asking the person who created it.

Just guessing wildly - because guessing is free - I think the first number is the transcript identifier, Tcount (number) is the count of reads for sample T, Acount (number) is the count of reads for sample A, Bcount (number) is the count of reads for sample B, length (number) is the length of the transcript.
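
If you want to test that guess against the whole file, here is a minimal Python sketch that splits a header of this shape into its parts. The function name `parse_header` and the field names (`seq_id`, `t_count`, `a_count`, `b_count`, `length`) are my own labels, not taken from any known tool, and the meaning attached to each field is only the guess above.

```python
import re

# Pattern for headers shaped like 122391_Tcount2352_Acount2352_Bcount0_length293.
# The group names are hypothetical labels; the real meaning of each field is unknown.
HEADER_RE = re.compile(
    r"^>?(?P<seq_id>\d+)"
    r"_Tcount(?P<t_count>\d+)"
    r"_Acount(?P<a_count>\d+)"
    r"_Bcount(?P<b_count>\d+)"
    r"_length(?P<length>\d+)$"
)


def parse_header(header):
    """Split one FASTA header line into a dict of integer fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"header does not match the expected pattern: {header!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


fields = parse_header("122391_Tcount2352_Acount2352_Bcount0_length293")
print(fields)

# True for this one header; worth checking across the whole file.
print(fields["t_count"] == fields["a_count"] + fields["b_count"])
```

In the one header you posted, Tcount happens to equal Acount + Bcount, so T could just as well be a total rather than a third sample. Running the check over every header would show whether that holds in general, and comparing `length` with the actual sequence length would confirm the one field that does seem clear.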
