Question

How to cut a 15000 nt sequence file into multiple files of 1000 nt each and save them as new files like F1, F2 and so on?

3.0 years ago

Ap1438 ▴ 50

I have a file with a sequence of more than 15000 nt, and I want to split it into new files of 1000 nt each, named F1, F2, and so on.

Example

MainFile

  ATGCATATGCCCAGTAGCGAATGATGATCA
  ATGCATA_TCCCAGTAGTGAATGATAATCA
  _CAT_TGCC_ATTAGAGAATGATGATCA_C

Into 3 files of 10 nt each:

File1
```
  ATGCATATGC
  ATGCATA_TC
  _CAT_TGCC_
```
File2
```
  CCAGTAGCGA
  CCAGTAGTGA
  ATTAGAGAAT
```
File3
```
  ATGATGATCA
  ATGATAATCA
  GATGATCA_C
```

awk FASTA • 2.8k views

3.0 years ago

Alex Reynolds 36k

PyFasta might help: https://pypi.org/project/pyfasta/#command-line-interface

The main file is not a single-line FASTA file; it is a multiple sequence alignment file. I want to cut it by column, i.e. into horizontal blocks of 1000 characters each.

Example

MainFile
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_C

Into 3 files of 10 nt each:

File1
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_

File2
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT

File3
ATGATGATCA
ATGATAATCA
GATGATCA_C

ADD REPLY • link 3.0 years ago by Ap1438 ▴ 50

Try the solutions above. If they do not work, come back and post what you tried, the error you are getting, and any other relevant details.

ADD REPLY • link 3.0 years ago by cpad0112 21k

I misunderstood your query. You want the sequences cut vertically (by column). The solutions above may not work.

ADD REPLY • link 3.0 years ago by cpad0112 21k

score 4 · Accepted Answer · 2021-12-07

For the format you pasted (one line per sequence), use the commands below; for the real data, change 10 to 1000, 30 to 15000, and 9 to 999.
For FASTA format (the usual output of multiple sequence alignment software), replace the awk command with another tool, e.g. seqkit subseq -r $s:$e msa.fasta > msa.$s-$e.txt.

Just use awk:

for s in $(seq 1 10 30); do \
    e=$(expr $s + 9); \
    echo $s $e; \
done
1 10
11 20
21 30

for s in $(seq 1 10 30); do \
    e=$(expr $s + 9); \
    awk -v s=$s '{print substr($0, s, 10)}' msa.txt > msa.$s-$e.txt
done

more msa.*.txt
::::::::::::::
msa.1-10.txt
::::::::::::::
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_
::::::::::::::
msa.11-20.txt
::::::::::::::
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT
::::::::::::::
msa.21-30.txt
::::::::::::::
ATGATGATCA
ATGATAATCA
GATGATCA_C
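
The hard-coded 10, 30 and 9 above can be derived from two variables instead. The following is only a sketch under assumptions: the file name `msa_demo.txt` and the variable names `win` and `len` are made up for illustration, every row is assumed to have the same width, and the alignment width is read from the first row.

```shell
# Toy alignment from the question (3 rows, 30 columns); replace with your own file.
printf '%s\n' \
  'ATGCATATGCCCAGTAGCGAATGATGATCA' \
  'ATGCATA_TCCCAGTAGTGAATGATAATCA' \
  '_CAT_TGCC_ATTAGAGAATGATGATCA_C' > msa_demo.txt

win=10   # window width; use 1000 for the real data
# Alignment width, taken from the first row (assumes all rows are equally long).
len=$(awk 'NR == 1 { print length($0); exit }' msa_demo.txt)

for s in $(seq 1 "$win" "$len"); do
    e=$((s + win - 1))
    # Clamp the last window when len is not a multiple of win.
    if [ "$e" -gt "$len" ]; then e=$len; fi
    awk -v s="$s" -v w="$win" '{ print substr($0, s, w) }' msa_demo.txt > "msa_demo.$s-$e.txt"
done
```

With the toy input this writes msa_demo.1-10.txt, msa_demo.11-20.txt and msa_demo.21-30.txt, each holding one 10-column block of all three rows.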

score 3 · Accepted Answer · 2021-12-07

3.0 years ago

cpad0112 21k

https://bioinf.shenwei.me/seqkit/usage/#split or faSplit from kentutils

Thanks, but seqkit split is for splitting a set of many sequences into several parts, not for splitting individual sequences into fragments; that is the job of seqkit subseq or seqkit sliding.

ADD REPLY • link 3.0 years ago by shenwei356 8.7k

input: test.fa

$ seq 1 10 30 | while read line; do  (echo ">seq_"$line"_"`expr $line + 9` && cut -c $line-`expr $line + 9` test.fa | sed '/^$/d') > "seq_"$line"_"`expr $line + 9`".fa" ; done

$ tail -n+1 test.fa seq_*.fa
==> test.fa <==
>seq
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_

==> seq_1_10.fa <==
>seq_1_10
>seq
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_

==> seq_11_20.fa <==
>seq_11_20
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT

==> seq_21_30.fa <==
>seq_21_30
ATGATGATCA
ATGATAATCA
GATGATCA_

Remove the original header (>seq here) from the very first split FASTA file; it ends up there because cut also takes columns 1-10 of the header line.
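
One way to keep the original header out of the first file is to drop header lines before cutting. This is a sketch, not a tested drop-in replacement: it assumes header lines start with ">" and that each aligned row sits on a single line, and the file names `test_demo.fa` and `seq_demo_*.fa` are made up for illustration.

```shell
# Toy input mirroring test.fa above: one header line followed by aligned rows.
printf '%s\n' '>seq' \
  'ATGCATATGCCCAGTAGCGAATGATGATCA' \
  'ATGCATA_TCCCAGTAGTGAATGATAATCA' \
  '_CAT_TGCC_ATTAGAGAATGATGATCA_' > test_demo.fa

for s in $(seq 1 10 30); do
    e=$((s + 9))
    {
        echo ">seq_${s}_${e}"
        # grep -v drops every header line, so only sequence rows reach cut.
        grep -v '^>' test_demo.fa | cut -c "${s}-${e}" | sed '/^$/d'
    } > "seq_demo_${s}_${e}.fa"
done
```

Each output file then starts with its new header only, followed by the cut rows.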

ADD REPLY • link 3.0 years ago by cpad0112 21k

Thank you, it worked. Can you please explain this code?

ADD REPLY • link 3.0 years ago by Ap1438 ▴ 50

$ seq 1 10 30 | while read line; do  (echo ">seq_"$line"_"`expr $line + 9` && cut -c $line-`expr $line + 9` test.fa | sed '/^$/d') > "seq_"$line"_"`expr $line + 9`".fa" ; done

seq 1 10 30 prints the numbers from 1 to 30 in steps of 10 (1, 11, 21), i.e. the start of each non-overlapping window of 10.
The bash loop then does the following for each start number:
a) echo writes the new header line, made of "seq", the start number, and an end number obtained by adding 9 to the start number.
b) cut extracts characters column-wise from test.fa; the range runs from the start number to the end number.
c) sed removes the empty lines.
d) > redirects the result to a file named "seq", followed by the start number and the end number.

Why add 9: the window size is 10 and the start numbers are 1, 11, 21, so the ranges needed for cutting are 1-10, 11-20, 21-30.
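
The same one-liner, spread over several lines with one comment per step, may be easier to follow. This is a sketch with the same behavior as the command above; the file names `explain_demo.fa` and `explain_*.fa` and the variable names `start` and `end` are made up for illustration.

```shell
# Toy input (hypothetical name), same layout as test.fa above.
printf '%s\n' '>seq' \
  'ATGCATATGCCCAGTAGCGAATGATGATCA' \
  'ATGCATA_TCCCAGTAGTGAATGATAATCA' \
  '_CAT_TGCC_ATTAGAGAATGATGATCA_' > explain_demo.fa

# seq emits the window starts 1, 11, 21; the loop reads them one per line.
seq 1 10 30 | while read -r start; do
    end=$((start + 9))   # window ends: 10, 20, 30
    (
        # a) write the new header line
        echo ">seq_${start}_${end}"
        # b) keep columns start..end of every line, then
        # c) drop lines that the cut left empty
        cut -c "${start}-${end}" explain_demo.fa | sed '/^$/d'
    ) > "explain_${start}_${end}.fa"   # d) one output file per window
done
```

As in the original command, the first output file still contains the old ">seq" line, because cut also applies to the header.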

ADD REPLY • link 3.0 years ago by cpad0112 21k

Thank you for your valuable time and explanation. Can you tell me how I can get the output named like seq0001, seq0002, and so on, or like seq_000001_001000.fas? When I run the sort command on these file names, it sorts by the leading characters, like
seq_110001_111000.fas
seq_11001_12000.fas
seq_111001_112000.fas
and so on.
So it is difficult to sort by file number.

ADD REPLY • link 3.0 years ago by Ap1438 ▴ 50

man sort

   -V, --version-sort
          natural sort of (version) numbers within text
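
As an alternative to sort -V, the names themselves can be zero-padded with printf, which is what the seq_000001_001000.fas form asked about amounts to. This is a sketch under assumptions: the alignment width 112000, the six-digit padding and the file name `padded_names.txt` are made up for illustration.

```shell
win=1000     # window width
len=112000   # hypothetical alignment width

# Build one zero-padded name per window and collect the names in a file.
for s in $(seq 1 "$win" "$len"); do
    e=$((s + win - 1))
    printf 'seq_%06d_%06d.fas\n' "$s" "$e"
done > padded_names.txt

head -n 2 padded_names.txt
# sort -c only checks the order: with fixed-width numbers, plain sort order
# is already the numeric order.
LC_ALL=C sort -c padded_names.txt && echo "already in numeric order"
```

In the splitting loop, the same printf call can build the output name, e.g. out=$(printf 'seq_%06d_%06d.fas' "$s" "$e"), with the redirection then going to "$out".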