split sequences into their codons in a table
3
0
Entering edit mode
10 months ago
Assa Yeroslaviz ★ 1.9k

I have a file with multiple DNA sequences

>seq1
TGACCGATAATAGCAACAATCAACTGGTAGTAA
AATGAGGACGAGCTTTCC...
>seq2
GGAGGCTGGTGGGAGGGCACACTCAACG
CGCGAGGTCAAGGC...
>seq3
CGGCCATTGCAGACCAGTGAGAAGTTAAGTTC
CTAGAAGAAATATGTTCT...

I would like to know if there are any computational method to split each of the sequences into their codon triplets (e.g. seq1 - TGA, CCG, ATA, ATA, ...) and summarise it in a table (codons as columns and names as rows).

Looking into BioStrings or seqinr was really helpful. But, maybe I missed something.

thanks

R fasta codon-table • 772 views
ADD COMMENT
0
Entering edit mode

Its easy enough to do, but are those sequences already in the reading frame of interest, or are you trying to find coding sequences within them?

Seems a bit like an XY problem.

ADD REPLY
1
Entering edit mode
10 months ago
Assa Yeroslaviz ★ 1.9k

Thnaks for the answers, but I wanted something in R for my downstream analysis. I have made this snippet (which can be done slicker and nicer probably, but it works).

dna <- readDNAStringSet(filepath = "Homo_sapiens.GRCh38.cds.all.fa")

dna.abs.list <- dna.ratio.list <- list()

for (i in 1:length(dna)) {
  name <- gsub(pattern = "\\s.*" , replacement = "", x = names(dna[i]))
  df.tmp <- as.data.frame(trinucleotideFrequency(x = dna[i], step = 3))
  df.tmp$length <- dna[[i]]@length
  rownames(df.tmp) <- name
  dna.abs.list[[i]] <- df.tmp
}
rowname AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT length
ENST00000415118.1   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   8
ENST00000448914.1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   13

thanks for help

Assa

ADD COMMENT
1
Entering edit mode
10 months ago
barslmn ★ 2.3k

Would you want: TGA CCG ATA or GAC CGA TAA or ACC GAT AAT?

I think you can just do a little script to achieve your goal.

Maybe something like this:

printf ">seq1
TGACCGATAATAGCAACAATCAACTGGTAGTAA
AATGAGGACGAGCTTTCC...
>seq2
GGAGGCTGGTGGGAGGGCACACTCAACG
CGCGAGGTCAAGGC...
>seq3
CGGCCATTGCAGACCAGTGAGAAGTTAAGTTC
CTAGAAGAAATATGTTCT..." | awk '/^>/{f="seq_"++d} {print > f}'


for seq in $(ls -1 seq*); do
 cat $seq |
 grep -v ">" |
 fold -w 3 |
 sort |
 uniq -c |
 sort -r |
 sed "s/^/$seq/"; 
done |
 awk '{print $1"\t"$3"\t"$2}' |
 datamash -s crosstab 1,2 sum 3 |
column


    ..  ... AAC AAG AAT ACA ACC AGT ATA ATC CAG CCA CCG CGC CGG CTA CTC CTT G   GAA GAC GAG GC. GCA GGA GGC GTC TAA TAG TC  TCC TCT TGA TGG TGT TTA TTG
seq_1   N/A 1   1   N/A 1   1   N/A N/A 2   1   N/A N/A 1   N/A N/A N/A N/A 1   N/A N/A 1   2   N/A 1   N/A N/A N/A 1   1   N/A 1   N/A 1   1   N/A N/A N/A
seq_2   1   N/A 1   1   N/A 1   N/A N/A N/A N/A N/A N/A N/A 1   N/A N/A 1   N/A 1   N/A N/A 2   1   N/A 1   2   1   N/A N/A N/A N/A N/A N/A 2   N/A N/A N/A
seq_3   N/A 1   N/A 1   N/A N/A 1   2   1   N/A 1   1   N/A N/A 1   1   N/A N/A N/A 2   N/A 1   N/A N/A N/A N/A N/A N/A N/A 1   N/A 1   N/A N/A 1   1   1
ADD COMMENT
1
Entering edit mode
10 months ago

using linearize fasta, grep -o ... and datamash:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa | while IFS=$'\t' read N S; do echo "${S}" | grep -o ... | awk -F '\t' -vN="${N}" '{printf("%s\t%s\n",N,$1);}' ; done  | datamash crosstab 1,2 
    AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
>1 length=282707 depth=1.00x    2   1   N/A 1   1   1   1   N/A 1   1   N/A N/A N/A N/A 1   N/A 1   1   1   1   N/A 1   1   1   N/A 1   1   N/A N/A 1   N/A 1   1   N/A N/A 1   1   N/A 1   2   1   1   N/A N/A 11  1   1   1   1   1   1   1   N/A 1   N/A N/A 1   N/A N/A 1   1
>2 length=26815 depth=3.45x 1   N/A 1   N/A 1   1   N/A 1   1   N/A 1   1   N/A 1   N/A N/A 1   1   2   1   1   1   1   1   1   1   N/A N/A N/A 1   N/A 1   1   N/A 1   1   1   N/A N/A 1   1   1   N/A N/A 1N/A    1   1   N/A 1   1   N/A 1   1   1   N/A 1   N/A N/A N/A N/A 1
>3 length=24243 depth=1.02x 1   1   1   1   1   1   1   N/A N/A 1   1   N/A 11  1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   11  1   1   1   1   1   1   1   1   1   1   1   1   1   1   N/A 11  1   1   1   1   1   N/A N/A 1   1   1   1   1   N/A 2
ADD COMMENT

Login before adding your answer.

Traffic: 2437 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6