How to cut a 15000 nt sequence file into multiple files of 1000 nt each and save them as new files like F1, F2 and so on?
3.0 years ago
Ap1438 ▴ 50

I have a file with a sequence of more than 15000 nt, and I want to split it into new files of 1000 nt each, named like F1, F2, ...

Example

  • MainFile

      ATGCATATGCCCAGTAGCGAATGATGATCA
      ATGCATA_TCCCAGTAGTGAATGATAATCA
      _CAT_TGCC_ATTAGAGAATGATGATCA_C
    

Into 3 files of 10 nt each

  • File1

      ATGCATATGC
      ATGCATA_TC
      _CAT_TGCC_
    
  • File2

      CCAGTAGCGA
      CCAGTAGTGA
      ATTAGAGAAT
    
  • File3

      ATGATGATCA
      ATGATAATCA
      GATGATCA_C
    
awk FASTA • 2.8k views
3.0 years ago
  • For the format you pasted (one line per sequence), use the commands below (change 10 to 1000, 30 to 15000, and 9 to 999).
  • For FASTA format (the usual output of multiple sequence alignment software), replace the awk command with another tool, e.g. seqkit subseq -r $s:$e msa.fasta > msa.$s-$e.txt.

Just use awk:

for s in $(seq 1 10 30); do
    e=$((s + 9))    # end column of the window that starts at s
    echo "$s $e"
done
1 10
11 20
21 30

for s in $(seq 1 10 30); do
    e=$((s + 9))
    # substr($0, s, 10) keeps 10 characters of each line, starting at column s
    awk -v s="$s" '{print substr($0, s, 10)}' msa.txt > "msa.$s-$e.txt"
done

more msa.*.txt
::::::::::::::
msa.1-10.txt
::::::::::::::
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_
::::::::::::::
msa.11-20.txt
::::::::::::::
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT
::::::::::::::
msa.21-30.txt
::::::::::::::
ATGATGATCA
ATGATAATCA
GATGATCA_C
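
For an alignment stored as FASTA, the same substr idea can be kept by passing header lines through untouched. This is only a minimal sketch: it assumes each sequence sits on a single line, and the file name msa.fasta, the record names s1 to s3, and the variables width and len are made up for the illustration. The alignment length is read from the first sequence line instead of being typed in by hand.

```shell
# Demo input: a tiny aligned FASTA, one line per sequence (assumed layout).
cat > msa.fasta <<'EOF'
>s1
ATGCATATGCCCAGTAGCGAATGATGATCA
>s2
ATGCATA_TCCCAGTAGTGAATGATAATCA
>s3
_CAT_TGCC_ATTAGAGAATGATGATCA_C
EOF

width=10   # hypothetical variable; use 1000 for the real data

# Alignment length = length of the first non-header line.
len=$(awk '!/^>/ {print length($0); exit}' msa.fasta)

for s in $(seq 1 "$width" "$len"); do
    e=$((s + width - 1))
    # The last window may be shorter than the others.
    if [ "$e" -gt "$len" ]; then
        e=$len
    fi
    # Header lines are copied as-is; sequence lines are cut to columns s..e.
    awk -v s="$s" -v w="$width" '/^>/ {print; next} {print substr($0, s, w)}' \
        msa.fasta > "msa.$s-$e.fasta"
done
```

Each output file keeps every header, so the windows remain valid FASTA. Wrapped (multi-line) records would need to be unwrapped first.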

Thank you, this code also worked. Can you explain the code?

3.0 years ago

https://bioinf.shenwei.me/seqkit/usage/#split or faSplit from kentutils


Thanks, but seqkit split is for splitting a set of many sequences into several parts, not for splitting each sequence into fragments, which is the job of seqkit subseq or seqkit sliding.


input: test.fa

$ seq 1 10 30 | while read line; do  (echo ">seq_"$line"_"`expr $line + 9` && cut -c $line-`expr $line + 9` test.fa | sed '/^$/d') > "seq_"$line"_"`expr $line + 9`".fa" ; done

$ tail -n+1 test.fa seq_*.fa
==> test.fa <==
>seq
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_

==> seq_1_10.fa <==
>seq_1_10
>seq
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_

==> seq_11_20.fa <==
>seq_11_20
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT

==> seq_21_30.fa <==
>seq_21_30
ATGATGATCA
ATGATAATCA
GATGATCA_

Note: remove the original header (>seq here) from the very first split FASTA file.


Thank you, it worked. Can you please explain this code?

$ seq 1 10 30 | while read line; do  (echo ">seq_"$line"_"`expr $line + 9` && cut -c $line-`expr $line + 9` test.fa | sed '/^$/d') > "seq_"$line"_"`expr $line + 9`".fa" ; done
  1. seq 1 10 30 prints the numbers from 1 to 30 in steps of 10 (1, 11, 21), i.e. the start of each non-overlapping window of 10.
  2. The bash loop does the following for each start number:
     a) echo writes the new FASTA header: ">seq_", the start number from step 1, and the end number obtained by adding 9 to it.
     b) cut -c extracts characters column-wise from test.fa; the range runs from the start number to the start number plus 9.
     c) sed '/^$/d' removes the empty lines.
     d) > redirects the output to a file named "seq_", followed by the start number, the end number, and ".fa".

The reason for adding 9: the window size is 10 and the start numbers are 1, 11, 21, so the ranges needed for cutting are 1-10, 11-20, 21-30.
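
The same steps can be spelled out as a longer loop, which may be easier to follow than the one-liner. This is a sketch, not the original command: the variable names win, start, end and out are made up, and a grep -v '^>' step is added (it is not in the one-liner) so that the original >seq header does not end up in the first output file.

```shell
# Demo input matching the example above (one header, three aligned rows).
cat > test.fa <<'EOF'
>seq
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_
EOF

win=10    # window size (hypothetical variable name)

seq 1 "$win" 30 | while read -r start; do
    end=$((start + win - 1))      # 1 -> 10, 11 -> 20, 21 -> 30
    out="seq_${start}_${end}.fa"
    {
        # Step a: the new FASTA header.
        echo ">seq_${start}_${end}"
        # Added step: drop the original header line(s).
        # Step b: keep only columns start..end.
        # Step c: delete lines that are left empty.
        grep -v '^>' test.fa | cut -c "${start}-${end}" | sed '/^$/d'
    } > "$out"                    # Step d: write everything to the new file.
done
```

For the real data, win would be 1000 and the 30 in the seq call would be the alignment length.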


Thank you for your valuable time and explanation. Can you tell me how I can get the output names in a format like seq0001, seq0002, ... and so on, or seq_000001_001000.fas? When I run the sort command on these file names, it sorts by the first characters, like
seq_110001_111000.fas
seq_11001_12000.fas
seq_111001_112000.fas
and so on.
So it is difficult to sort by file number.


man sort

   -V, --version-sort
          natural sort of (version) numbers within text
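
Besides sort -V, the names can be zero-padded when they are created, so that a plain sort already gives numeric order. This is a small sketch under assumptions: the seq_000001_001000.fas pattern is taken from the question, while padded_names.txt and the variable names are made up, and sort -V requires a sort that supports it (GNU coreutils does).

```shell
# Build zero-padded names such as seq_000001_001000.fas with printf %06d.
win=1000
for start in 1 11001 110001 111001; do
    end=$((start + win - 1))
    printf 'seq_%06d_%06d.fas\n' "$start" "$end"
done > padded_names.txt

# Plain lexical sort now agrees with numeric order.
sort padded_names.txt

# For names that were created without padding, sort -V orders them naturally.
printf '%s\n' seq_110001_111000.fas seq_11001_12000.fas seq_111001_112000.fas |
    sort -V
```

In the splitting loop itself, the output name could be built the same way, for example with out=$(printf 'seq_%06d_%06d.fas' "$start" "$end").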

Thank you.


The main file is not a single-line FASTA file; it is a multiple sequence alignment file. I want to cut it by column, i.e. horizontally, into pieces of 1000 characters each.

Example

MainFile:
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_C

Into 3 files of 10 nt each:

File1:
ATGCATATGC
ATGCATA_TC
_CAT_TGCC_

File2:
CCAGTAGCGA
CCAGTAGTGA
ATTAGAGAAT

File3:
ATGATGATCA
ATGATAATCA
GATGATCA_C


Try with the above solutions. If they do not work, come back, post what you tried, the error you are having and any other relevant details.


I misunderstood your query. You want the sequences to be cut vertically (by column). The above solutions may not work.
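
Whether a column-window command such as the awk substr loop earlier in the thread really cuts the alignment by column can be checked by pasting the pieces back together side by side. This is a minimal sketch, assuming the one-line-per-sequence msa.txt layout from the question; the names part.N.txt and rejoined.txt are made up for the illustration.

```shell
# Demo input: the three aligned rows from the question.
cat > msa.txt <<'EOF'
ATGCATATGCCCAGTAGCGAATGATGATCA
ATGCATA_TCCCAGTAGTGAATGATAATCA
_CAT_TGCC_ATTAGAGAATGATGATCA_C
EOF

# Cut into 10-column windows with awk substr.
for s in 1 11 21; do
    awk -v s="$s" '{print substr($0, s, 10)}' msa.txt > "part.$s.txt"
done

# Glue the windows back together line by line; '\0' means "no delimiter".
paste -d '\0' part.1.txt part.11.txt part.21.txt > rejoined.txt

# cmp exits with status 0 only if both files are byte-identical.
cmp msa.txt rejoined.txt && echo "column split is lossless"
```

If cmp reports a difference, the windows overlap, leave gaps, or the input has lines of unequal length.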
