5.8 years ago
dod
Hi,
I have an .fna file, downloaded from the database, that contains all the CDS of a bacterial strain. The format is shown below. I would like to filter out (remove) the CDS shorter than 200 nt. How can I do this from the command line?
I've looked at the previous posts on this topic, but the awk commands suggested there did not work for me.
Thanks!
>lcl|AL111168.1_cds_CAL34182.1_1 [gene=dnaA] [locus_tag=Cj0001] [db_xref=EnsemblGenomes-Gn:Cj0001,EnsemblGenomes-Tr:CAL34182,GOA:Q9PJB0,InterPro:IPR001957,InterPro:IPR003593,InterPro:IPR010921,InterPro:IPR013159,InterPro:IPR013317,InterPro:IPR018312,InterPro:IPR020591,InterPro:IPR024633,InterPro:IPR027417] [protein=chromosomal replication initiator protein] [protein_id=CAL34182.1] [location=1..1323] [gbkey=CDS]
ATGAATCCAAGCCAAATACTTGAAAATTTAAAAAAAGAATTAAGTGAAAACGAATACGAAAACTATTTATCAAATTTAAA
ATTCAACGAAAAACAAAGCAAAGCAGATCTTTTAGTTTTTAATGCTCCAAATGAACTCATGGCTAAATTCATACAAACAA
AATACGGCAAAAAAATCGCGCATTTTTATGAAGTGCAAAGCGGAAATAAAGCCATCATAAATATACAAGCACAAAGTGCT
AAACAAAGCAACAAAAGCACAAAAATCGACATAGCTCATATAAAAGCACAAAGCACGATTTTAAATCCTTCTTTTACTTT
>lcl|AL111168.1_cds_CAL34183.1_2 [gene=dnaN] [locus_tag=Cj0002] [db_xref=EnsemblGenomes-Gn:Cj0002,EnsemblGenomes-Tr:CAL34183,GOA:Q0PCC3,InterPro:IPR001001,InterPro:IPR022634,InterPro:IPR022635,InterPro:IPR022637,UniProtKB/TrEMBL:Q0PCC3] [protein=DNA polymerase III, beta chain] [protein_id=CAL34183.1] [location=1483..2550] [gbkey=CDS]
ATGAAGTTAAGTATCAATAAAAATACTTTAGAATCTGCAGTGATTTTATGTAATGCTTATGTAGAAAAAAAAGACTCAAG
CACCATTACTTCTCATCTTTTTTTTCATGCTGATGAAGATAAACTTCTTATTAAAGCTAGTGATTATGAAATAGGTATCA
ACTATAAAATAAAAAAAATCCGCGTAGAATCAAGTGGTTTTGCTACTGCAAATGCAAAAAGTATTGCAGATGTTATTAAA
AGCTTAAACAATGAAGAAGTTGTTTTAGAAACCATTGATAATTTTTTATTTGTAAGACAAAAAAGTACAAAATACAAACT
. . .
What do you mean by this?
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar whenever you compose or edit a post.
Biopython is one solution.
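For illustration, here is a minimal sketch of such a length filter in plain Python, so it runs without installing anything (Biopython's `SeqIO` could do the parsing instead). It assumes the input is a standard multi-line FASTA like the one in the question, and that "less than 200 nt" means records with 200 nt or more are kept. The function names and file names are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # Sequence lines are wrapped, so collect them and join later.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(lines, min_len=200):
    """Keep only records whose sequence has at least min_len nucleotides."""
    for header, seq in read_fasta(lines):
        if len(seq) >= min_len:
            yield header, seq


def write_fasta(records, out, width=80):
    """Write records to a file-like object, wrapping sequences at width."""
    for header, seq in records:
        out.write(header + "\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


def filter_file(in_path, out_path, min_len=200):
    """Read in_path, drop short records, write the rest to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        write_fasta(filter_by_length(src, min_len), dst)
```

Something like `filter_file("cds.fna", "cds.min200.fna")` would then write only the CDS of 200 nt or longer; both file names are placeholders. The length is computed on the joined sequence, so the 80-column wrapping in the input does not affect the result.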
duplicate: How To Filter Multi Fasta By Length??
Selecting defined length fasta sequence and excluding them from a dataset
Separate by size sequences in a fasta file ; ...