Question

receive GC-content profile for a given window-size and splitting multi fasta file into separate fasta files by contigs name for running control-free

0

Entering edit mode

7.5 years ago

reza ▴ 300

i have two question about control-FREE

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

thanks in advance

splitting fasta • 3.5k views

ADD COMMENT • link updated 7.5 years ago by GenoMax 151k • written 7.5 years ago by reza ▴ 300

0

Entering edit mode

the following format is for GC-content profile for a given window-size (50000) that freec-control need it. first column:chromosome name; column two: window-size, column tree: GC-content, column four: percentage of ACGT-letter per window (1-poly(N)%). the file containing the following information is in .cnp file.

1   0       0.45896 1

1   50000   0.38424 1

1   100000  0.43834 1

1   150000  0.45            0.3456

ADD REPLY • link updated 7.5 years ago by GenoMax 151k • written 7.5 years ago by reza ▴ 300

score 1 · Answer 1 · 2017-12-07

For #1: geecee from EMBOSS.

For #2:

how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

Jim Kent's (UCSC) faSplit utility is the fastest solution. You will need to do chmod a+x faSplit to add execute permissions after you download the file.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.

score 0 · Answer 2 · 2017-12-07

0

Entering edit mode

7.5 years ago

Matteo Schiavinato ★ 3.7k

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

I would do a cut -c xx-yy where xx and yy are the two margins (included) of the region you're interested. Once you have that, you can easily count the number of G+C and divide it by the total number of chars.

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

I personally would make a python script. Something like:

#!/usr/bin/env python

import sys 

for line in sys.stdin:
  if line[0:1] == ">":
    if OUT:
      OUT.close()
    name = # process your line accordingly to get the sequence / scaffold name
    OUT = open("{0}.fa".format(name), "w")
    OUT.write(line)
  else:
    OUT.write(line)
else:
  OUT.close()

Then you:

chmod 755 pythonscript.py
cat filename.fa | pythonscript.py

Be careful: making that many files will clog your machine.

ADD COMMENT • link 7.5 years ago by Matteo Schiavinato ★ 3.7k

1

Entering edit mode

Or something like this (untested pre-coffee code):

from Bio import SeqIO

for rec in SeqIO.parse("myfasta.fa", "fasta"):
    with openrec.id + ".fasta") as f:
        f.write(rec.format("fasta")

ADD REPLY • link 7.5 years ago by WouterDeCoster 48k

0

Entering edit mode

it does not work for me.

cat DROM_5000.fa | ./pythonscript.py

File "./pythonscript.py", line 9 name = # process your line accordingly to get the sequence / scaffold name ^ SyntaxError: invalid syntax

ADD REPLY • link 7.5 years ago by reza ▴ 300

1

Entering edit mode

That's because you have to fill in that line.

ADD REPLY • link 7.5 years ago by WouterDeCoster 48k

1

Entering edit mode

Simply copy-pasting code is never a good idea. I put a commented line on purpose because I don't know how your lines look like. But I guess a combination of .rstrip(), .split(), .replace() will give you a well formatted line. Those built-in functions are properly documented in the python documentation (https://docs.python.org/2/).

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

@reza you should install SeqIO and do as @Wouter suggested, that would be the easiest way. if not installed, use my code. :D

ADD REPLY • link 7.5 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

True, but they should install it anyway ;-D

ADD REPLY • link 7.5 years ago by WouterDeCoster 48k