I have a 47GB file to parse. The sequences are in the following format:
>TSCS_00041 gene0EA_12345_rframe2_ORF
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044 gene0EA_12341_rframe2_ORF
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV
I simply want to reformat the file like this:
>TSCS_00041
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV
Could anyone share a script for this?
What have you tried? Hint: 'cut'
Can this be done with cut only? The OP seems to want to shorten the FASTA headers, not the other lines.
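cut -d" " -f 1 will work as long as there are no spaces in the sequence lines. Lines that contain no space at all are passed through unchanged, so only the headers get shortened.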
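To make that concrete, here is a minimal sketch on a small sample built from the records in the question (the first sequence is shortened for readability). The temp directory and the file names in.fasta, out_cut.fasta and out_sed.fasta are placeholders I made up, not anything from the OP's setup. The sed variant is not from this thread; it is just one way to restrict the edit to header lines, in case the sequence lines could ever contain a space.

```shell
# Build a tiny sample file (hypothetical names; the real input would be the 47GB file).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>TSCS_00041 gene0EA_12345_rframe2_ORF' \
  'MLAATHYYKFAIRRLFPLLKDTICASYSISIKHH' \
  '>TSCS_00044 gene0EA_12341_rframe2_ORF' \
  'MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV' > "$tmpdir/in.fasta"

# Keep only the first space-delimited field of every line.
# Lines without a space (the sequences) are printed unchanged.
cut -d ' ' -f 1 "$tmpdir/in.fasta" > "$tmpdir/out_cut.fasta"

# Alternative that only touches header lines: on lines starting with '>',
# delete everything from the first space to the end of the line.
sed '/^>/ s/ .*//' "$tmpdir/in.fasta" > "$tmpdir/out_sed.fasta"

# Show the result of the cut version.
cat "$tmpdir/out_cut.fasta"
```

Both commands read the file as a stream, line by line, so the size of the input should not be a memory problem; on the real data you would just redirect the output to a new file as above.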
and that's how homework is done ;)