How to split a big .faa file into smaller .faa files
3.3 years ago
Shaurya • 0

I have a 10 GB .faa proteome file that I want to run MAFFT on, but it is too big, so I need to divide it. How do I split it into smaller files on Windows without losing any data? The solutions I have come across are for a UNIX/Linux-based environment.

faa proteomes • 1.5k views

From the statement "The solutions I have come across are for a UNIX/LINUX based environment", I am assuming that you are on Windows. Even on Windows (10 or later), you can use GNU/Linux tools through WSL2. I would suggest seqkit (Windows version) from here. Please go through the manual; it describes multiple ways to split a FASTA file with seqkit.

3.3 years ago
Mark ★ 1.6k

Seqkit is the answer.

To split into 100 parts:

seqkit split myfile.faa --by-part 100
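
If installing SeqKit is not an option, splitting into a fixed number of parts can be sketched with Python's standard library, which runs natively on Windows. This is a minimal sketch, not SeqKit's implementation: the function name `split_fasta_into_parts`, the output folder, and the `part_001`-style file names are hypothetical, and records are assigned round-robin, which may differ from how SeqKit distributes records. The file is read and written in binary mode so every byte of each record is copied unchanged.

```python
from pathlib import Path


def split_fasta_into_parts(src, out_dir, n_parts):
    """Distribute FASTA records round-robin over n_parts output files.

    Hypothetical helper for illustration; streams the input line by line,
    so memory use stays small even for a multi-gigabyte file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(src).stem
    paths = [out_dir / f"{stem}.part_{i + 1:03d}.faa" for i in range(n_parts)]
    # All part files stay open at once; fine for ~100 parts.
    handles = [p.open("wb") for p in paths]
    try:
        current = None      # handle that receives the current record
        record_index = -1   # zero-based index of the current record
        with open(src, "rb") as fh:
            for line in fh:
                if line.startswith(b">"):
                    # A header line starts a new record: pick the next part.
                    record_index += 1
                    current = handles[record_index % n_parts]
                if current is not None:
                    # Lines before the first header (if any) are skipped.
                    current.write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

For example, `split_fasta_into_parts("myfile.faa", "parts", 100)` would write 100 files into a `parts` folder. If the input has fewer records than parts, the extra files are created but left empty.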

To split by the desired number of sequences per file (e.g. 5000 per file):

seqkit split myfile.faa --by-size 5000
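
For a fixed number of sequences per file, a minimal Python sketch (standard library only, so it runs on Windows without extra tools) could look like the following. It is an illustration under stated assumptions, not what SeqKit does internally: the function name `split_fasta_by_size`, the output folder, and the `part_001`-style file names are hypothetical. Binary mode is used so that headers and sequence lines are copied byte for byte.

```python
from pathlib import Path


def split_fasta_by_size(src, out_dir, records_per_file):
    """Write consecutive chunks of at most records_per_file FASTA records.

    Hypothetical helper for illustration; streams the input, so only one
    line is held in memory at a time and only one output file is open.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(src).stem
    written = []         # paths of the chunk files created so far
    out = None           # currently open chunk file
    count_in_chunk = 0   # records already written to the open chunk
    try:
        with open(src, "rb") as fh:
            for line in fh:
                if line.startswith(b">"):
                    # Start a new chunk file when none is open or the
                    # current one is full.
                    if out is None or count_in_chunk == records_per_file:
                        if out is not None:
                            out.close()
                        path = out_dir / f"{stem}.part_{len(written) + 1:03d}.faa"
                        out = path.open("wb")
                        written.append(path)
                        count_in_chunk = 0
                    count_in_chunk += 1
                if out is not None:
                    # Lines before the first header (if any) are skipped.
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

For example, `split_fasta_by_size("myfile.faa", "parts", 5000)` would write files of 5000 records each into a `parts` folder, with the last file holding whatever remains.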

"The solutions I have come across are for a UNIX/LINUX based environment"

Yes, use Linux. If you want to do any bioinformatics, you need to use Linux.

3.3 years ago
Divon ▴ 230

You can use my Genozip tool:

genozip myfile.faa
genocat --downsample 3,1 myfile.faa.genozip   <--- get part 1 out of 3

Works on Windows (as well as Linux and Mac)

See here: https://genozip.com/downsampling.html

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

3.3 years ago
Juke34 8.9k

14 methods reviewed here: https://github.com/Juke34/knowledge/blob/main/split_fasta.md
