receive GC-content profile for a given window-size and splitting multi fasta file into separate fasta files by contigs name for running control-free
2
0
Entering edit mode
7.1 years ago
reza ▴ 300

i have two question about control-FREE

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

thanks in advance

splitting fasta • 3.2k views
ADD COMMENT
0
Entering edit mode

the following format is for GC-content profile for a given window-size (50000) that freec-control need it. first column:chromosome name; column two: window-size, column tree: GC-content, column four: percentage of ACGT-letter per window (1-poly(N)%). the file containing the following information is in .cnp file.

1   0       0.45896 1

1   50000   0.38424 1

1   100000  0.43834 1

1   150000  0.45            0.3456
ADD REPLY
1
Entering edit mode
7.1 years ago
GenoMax 148k

For #1: geecee from EMBOSS.

For #2:

how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

Jim Kent's (UCSC) faSplit utility is the fastest solution. You will need to do chmod a+x faSplit to add execute permissions after you download the file.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
ADD COMMENT
0
Entering edit mode
7.1 years ago

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

I would do a cut -c xx-yy where xx and yy are the two margins (included) of the region you're interested. Once you have that, you can easily count the number of G+C and divide it by the total number of chars.

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

I personally would make a python script. Something like:

#!/usr/bin/env python

import sys 

for line in sys.stdin:
  if line[0:1] == ">":
    if OUT:
      OUT.close()
    name = # process your line accordingly to get the sequence / scaffold name
    OUT = open("{0}.fa".format(name), "w")
    OUT.write(line)
  else:
    OUT.write(line)
else:
  OUT.close()

Then you:

  • chmod 755 pythonscript.py
  • cat filename.fa | pythonscript.py

Be careful: making that many files will clog your machine.

ADD COMMENT
1
Entering edit mode

Or something like this (untested pre-coffee code):

from Bio import SeqIO

for rec in SeqIO.parse("myfasta.fa", "fasta"):
    with openrec.id + ".fasta") as f:
        f.write(rec.format("fasta")
ADD REPLY
0
Entering edit mode

it does not work for me.

cat DROM_5000.fa | ./pythonscript.py

File "./pythonscript.py", line 9 name = # process your line accordingly to get the sequence / scaffold name ^ SyntaxError: invalid syntax

ADD REPLY
1
Entering edit mode

That's because you have to fill in that line.

ADD REPLY
1
Entering edit mode

Simply copy-pasting code is never a good idea. I put a commented line on purpose because I don't know how your lines look like. But I guess a combination of .rstrip(), .split(), .replace() will give you a well formatted line. Those built-in functions are properly documented in the python documentation (https://docs.python.org/2/).

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

@reza you should install SeqIO and do as @Wouter suggested, that would be the easiest way. if not installed, use my code. :D

ADD REPLY
0
Entering edit mode

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

True, but they should install it anyway ;-D

ADD REPLY

Login before adding your answer.

Traffic: 2451 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6