How to cut fasta sequences in three equivalement parts ?
2
0
Entering edit mode
8.1 years ago
kevinm ▴ 40

Hello,

I have a list of sequences.fasta and i would like to cut all the sequences in this list in three parts (5'-1/3-1/3-1/3-3'), i think the gt command (GenomeTools) could help me but i have no clue on how use it in this particular case... Does someone have an idea ?

Thanks by advance

Kevin

sequence rna-seq split genometools • 2.7k views
ADD COMMENT
0
Entering edit mode

Cut them into three new fasta sequences or else? Are all sequences the same length?

ADD REPLY
4
Entering edit mode
8.1 years ago

linearize ( http://stackoverflow.com/documentation/bioinformatics/4194/ ), get the length of the DNA sequence, print each part with awk. something like:

awk '{L=length($2);L3=int(L/3);printf("%s 1/3\n%s\n%s 2/3\n%s\n%s 3/3\n%s\n",$1,substr($2,1,L3),$1,substr($2,L3+1,L3),$1,substr($2,1+2*L3));}
ADD COMMENT
0
Entering edit mode

Hi Pierre ! I thank you so much ! your awk one-liner works perfectly ! I am not familiar with AWK language, i certainly will try to learn the basics of it, 'cause it do the job ! Thanks again !

ADD REPLY
0
Entering edit mode

Hi Pierre,

On a separate note, what is the best resource to learn writing one liners like this one?

ADD REPLY
1
Entering edit mode

Internet/search engines :-)

On a serious note look for learning resources (here is one). There is an O'Reilly book on awk as well.

ADD REPLY
3
Entering edit mode
8.1 years ago
ssv.bio ▴ 200

In R:

Example file with sequences:

> gene1 
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
> gene2 
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
> gene3 
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

1) Extract 1/3 of each sequence and store each 1/3 as individual sequence, in a single fasta file i.e new fasta file will have new 3 sequences, but with first 1/3 of each sequence . Source file is named genes.fa and new file is named subgenes.fa

$ library(Biostrings)
$ genes=readDNAStringSet("genes.fa", format = "fasta")
$ subfasta=subseq(genes,1,ceiling(width(genes)/3))
$ writeXStringSet(subfasta, "subgenes.fa")

2) Extract 1/3 of from each sequence, combine all first 1/3 sequences as a single sequence and store it as a single sequence in a fasta format . Source file is named genes.fa and new file is named "new_gene.fasta". New sequence will have an ID "new_gene".

$ library(Biostrings)
$ genes=readDNAStringSet("genes.fa", format = "fasta")
$ subfasta=subseq(genes,1,ceiling(width(genes)/3))
$ csubfasta=DNAStringSet(unlist(subfasta))
$ names(csubfasta)="new_gene"
$ writeXStringSet(DNAStringSet(csubfasta),"new_gene.fasta")
ADD COMMENT

Login before adding your answer.

Traffic: 1618 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6