Hi all, I have a FASTA file with many sequences whose description lines are not in the NCBI style. However, each description line contains a keyword giving the length of the sequence. I want to extract only the sequences up to, say, 100 bp in length. How can I do this with Perl?
The format is as follows:
>gene1 group5 length=84
AaaaaatGTCTGGATGAGTTCATCCTGTAAAAaTTGCTGTCTGATACAAATACTTTGCTT
AGTCCAGTTAAATCTTCACTACTTTTGTGCACTGAAAGGCTAGCTTTtCTTCCAAAGCGG
TTTTCAATAATTCCTCTGACGCCTCCTTTTTtAGAGTATTTATTGTGTCTTCTATTTCCT
Thanks, raghul
Yes, the length of the sequence is not as stated in the header. I truncated the example for brevity (sorry!). I have many sequences of varying length. I am trying to write a program that counts the sequence characters, but I also felt it would be easier to use the length keyword in the FASTA description line to extract sequences within a range of values. Thanks once again. raghul
The NCBI FASTA description line is as follows. It has the gi number followed by the RefSeq ID, the organism, etc., which is NCBI's way of describing a sequence; as you can see, mine is not like that.
>gi|159476307|ref|XM_001696201.1| Chlamydomonas reinhardtii strain CC-503 cw92 mt+
CACAGTACCTTTCTGGTCAGCTGCACTGCATTGCTTTGTGACTAGTGAAGCTTCGACAGCTCACTGCGGA
CATTCCAAAATTGCTGTAACTCGACATTGATTTAACTACAGTATGCTGTTATATCCATAGCGCAAGAGAG
CTTGCGGCTTGCCTCCCCTCCATGCTCTTGTAGTCTGAGCCTATCCAGCTGCCTCGTCGCCGTTTGCAAA
GTTTTATTACTGAGACACAAGTAGCAGGGGCCGAGCAGGCAGCTGCCTGCGAGGCCGGTGAACCACGCGG
This example is somewhat confusing, since the length clearly is not 84. Also, what is meant by "the description not like NCBI"? Do you mean the format is not like the NCBI description of FASTA format?
OK, so the sequence is in valid FASTA format; that's what I needed to know. Filtering on the length given in the description line will work (as in Cass' answer, below), provided that the value is correct; otherwise it is better to use the calculated length of the sequence.
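For what it's worth, here is a minimal sketch of the two options, written in Python rather than Perl (the same logic should port to Perl line for line). It assumes the header carries a field of the form length=N, as in the gene1 example; the function names read_fasta, declared_length and filter_by_length are made up for this illustration, and the gene2 record is invented test data. By default it filters on the calculated length and only uses the header value if you ask it to.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def declared_length(header):
    """Return N from a 'length=N' field in the header, or None if absent.

    Assumption: the keyword is spelled exactly 'length=' as in the example.
    """
    match = re.search(r"\blength=(\d+)", header)
    return int(match.group(1)) if match else None


def filter_by_length(records, max_len=100, trust_header=False):
    """Yield records whose length is at most max_len.

    With trust_header=False (default) the calculated length is used.
    With trust_header=True the header value is used when present,
    falling back to the calculated length otherwise.
    """
    for header, seq in records:
        length = declared_length(header) if trust_header else None
        if length is None:
            length = len(seq)
        if length <= max_len:
            yield header, seq


if __name__ == "__main__":
    # Hypothetical input: gene1 declares 84 bp but actually holds 120.
    example = [
        ">gene1 group5 length=84",
        "A" * 60,
        "C" * 60,
        ">gene2 group1 length=12",
        "ACGTACGTACGT",
    ]
    for header, seq in filter_by_length(read_fasta(example), max_len=100):
        print(f">{header}\n{seq}")
```

With the made-up input above, gene1 is dropped when the calculated length is used but kept when trust_header=True, which is exactly the mismatch discussed in this thread.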
Thanks, it worked for a newbie like me.