split sequences into their codons in a table
3
I have a file with multiple DNA sequences
>seq1
TGACCGATAATAGCAACAATCAACTGGTAGTAA
AATGAGGACGAGCTTTCC...
>seq2
GGAGGCTGGTGGGAGGGCACACTCAACG
CGCGAGGTCAAGGC...
>seq3
CGGCCATTGCAGACCAGTGAGAAGTTAAGTTC
CTAGAAGAAATATGTTCT...
I would like to know if there are any computational method to split each of the sequences into their codon triplets (e.g. seq1 - TGA
, CCG
, ATA
, ATA
, ...) and summarise it in a table (codons as columns and names as rows).
Looking into BioStrings
or seqinr
was really helpful. But, maybe I missed something.
thanks
R
fasta
codon-table
• 810 views
Thnaks for the answers, but I wanted something in R for my downstream analysis. I have made this snippet (which can be done slicker and nicer probably, but it works).
dna <- readDNAStringSet(filepath = "Homo_sapiens.GRCh38.cds.all.fa")
dna.abs.list <- dna.ratio.list <- list()
for (i in 1:length(dna)) {
name <- gsub(pattern = "\\s.*" , replacement = "", x = names(dna[i]))
df.tmp <- as.data.frame(trinucleotideFrequency(x = dna[i], step = 3))
df.tmp$length <- dna[[i]]@length
rownames(df.tmp) <- name
dna.abs.list[[i]] <- df.tmp
}
rowname AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT length
ENST00000415118.1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
ENST00000448914.1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13
thanks for help
Assa
Would you want: TGA
CCG
ATA
or GAC
CGA
TAA
or ACC
GAT
AAT
?
I think you can just do a little script to achieve your goal.
Maybe something like this:
printf ">seq1
TGACCGATAATAGCAACAATCAACTGGTAGTAA
AATGAGGACGAGCTTTCC...
>seq2
GGAGGCTGGTGGGAGGGCACACTCAACG
CGCGAGGTCAAGGC...
>seq3
CGGCCATTGCAGACCAGTGAGAAGTTAAGTTC
CTAGAAGAAATATGTTCT..." | awk '/^>/{f="seq_"++d} {print > f}'
for seq in $(ls -1 seq*); do
cat $seq |
grep -v ">" |
fold -w 3 |
sort |
uniq -c |
sort -r |
sed "s/^/$seq/";
done |
awk '{print $1"\t"$3"\t"$2}' |
datamash -s crosstab 1,2 sum 3 |
column
.. ... AAC AAG AAT ACA ACC AGT ATA ATC CAG CCA CCG CGC CGG CTA CTC CTT G GAA GAC GAG GC. GCA GGA GGC GTC TAA TAG TC TCC TCT TGA TGG TGT TTA TTG
seq_1 N/A 1 1 N/A 1 1 N/A N/A 2 1 N/A N/A 1 N/A N/A N/A N/A 1 N/A N/A 1 2 N/A 1 N/A N/A N/A 1 1 N/A 1 N/A 1 1 N/A N/A N/A
seq_2 1 N/A 1 1 N/A 1 N/A N/A N/A N/A N/A N/A N/A 1 N/A N/A 1 N/A 1 N/A N/A 2 1 N/A 1 2 1 N/A N/A N/A N/A N/A N/A 2 N/A N/A N/A
seq_3 N/A 1 N/A 1 N/A N/A 1 2 1 N/A 1 1 N/A N/A 1 1 N/A N/A N/A 2 N/A 1 N/A N/A N/A N/A N/A N/A N/A 1 N/A 1 N/A N/A 1 1 1
using linearize fasta, grep -o ...
and datamash:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa | while IFS=$'\t' read N S; do echo "${S}" | grep -o ... | awk -F '\t' -vN="${N}" '{printf("%s\t%s\n",N,$1);}' ; done | datamash crosstab 1,2
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
>1 length=282707 depth=1.00x 2 1 N/A 1 1 1 1 N/A 1 1 N/A N/A N/A N/A 1 N/A 1 1 1 1 N/A 1 1 1 N/A 1 1 N/A N/A 1 N/A 1 1 N/A N/A 1 1 N/A 1 2 1 1 N/A N/A 11 1 1 1 1 1 1 1 N/A 1 N/A N/A 1 N/A N/A 1 1
>2 length=26815 depth=3.45x 1 N/A 1 N/A 1 1 N/A 1 1 N/A 1 1 N/A 1 N/A N/A 1 1 2 1 1 1 1 1 1 1 N/A N/A N/A 1 N/A 1 1 N/A 1 1 1 N/A N/A 1 1 1 N/A N/A 1N/A 1 1 N/A 1 1 N/A 1 1 1 N/A 1 N/A N/A N/A N/A 1
>3 length=24243 depth=1.02x 1 1 1 1 1 1 1 N/A N/A 1 1 N/A 11 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 N/A 11 1 1 1 1 1 N/A N/A 1 1 1 1 1 N/A 2
Login before adding your answer.
Traffic: 2057 users visited in the last hour
Its easy enough to do, but are those sequences already in the reading frame of interest, or are you trying to find coding sequences within them?
Seems a bit like an XY problem.