Triplet frequencies in human genome
5.9 years ago
9606 ▴ 330

Hello,

Does anybody know whether there is a list of nucleotide triplets with their frequencies in the human genome (hg19 or GRCh38 would both be fine)?

Of course I can count them myself, but I would like to save some time.

genome counts • 1.8k views

Not to my knowledge. However, if you need some ideas on how to compute it: https://unix.stackexchange.com/questions/231213/count-number-of-a-substring-repetition-in-a-string


Jellyfish will do that efficiently.

5.9 years ago

Based on Ensembl's GRCh38 primary assembly (chromosomes 1-22, X and Y):

$ parallel -j8 "samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa {2} \
| seqkit seq -w0 \
| tail -n+2 \
| LC_ALL=C grep -io  {1} \
| wc -l \
| awk -v kmer={1} '{print kmer,\$0}'" ::: `echo {C,A,G,T}{C,A,G,T}{C,A,G,T}|tr " " "\n"` ::: {1..22} X Y \
| awk -v OFS="\t" 'BEGIN {print "kmer", "count"} {kmer[$1] += $2} END {for (k in kmer) {print k,kmer[k]}}' \
| sort -k1 > kmer_counts.tsv

kmer count
AAA 72422544
AAC 43498931
AAG 58395299
AAT 72538809
ACA 54592537
ACC 33744272
ACG 7570315
ACT 47439444
AGA 60011989
AGC 41032778
AGG 51662892
AGT 47366030
ATA 54637885
ATC 39037377
ATG 53500562
ATT 73243898
CAA 55177058
CAC 40478897
CAG 59726969
CAT 53819968
CCA 53234309
CCC 29406688
CCG 8009167
CCT 51819106
CGA 6500298
CGC 6757743
CGG 8211634
CGT 7625486
CTA 37635012
CTC 45987888
CTG 58974725
CTT 59291925
GAA 58945303
GAC 27703241
GAG 46077091
GAT 39520000
GCA 42437940
GCC 34453205
GCG 6814422
GCT 40634247
GGA 45967975
GGC 34430965
GGG 29496000
GGT 33746826
GTA 33239299
GTC 27434614
GTG 40909908
GTT 43157536
TAA 60310660
TAC 32854132
TAG 37928829
TAT 54564113
TCA 57749469
TCC 44865092
TCG 6701011
TCT 59950096
TGA 57710783
TGC 42118599
TGG 54271241
TGT 54768375
TTA 60122345
TTC 58852563
TTG 56718442
TTT 73414613

Please note that grep matches are non-overlapping. This means that a homopolymer stretch such as TTTTTT is counted as 2 occurrences of TTT, not 4.
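If overlapping occurrences are what you need, a sliding window over the sequence is one way to get them. The following is only a minimal sketch, not the method used for the table above: `count_triplets` is a hypothetical helper name, it assumes plain FASTA input with Unix line endings, counts the forward strand only, treats soft-masked (lowercase) bases like uppercase ones, and skips any triplet containing a character other than A, C, G or T (for example N).

```shell
# Count overlapping 3-mers in FASTA input (files as arguments, or stdin).
# count_triplets is a hypothetical helper name, not part of any tool.
count_triplets() {
  awk '
    /^>/ { prev = ""; next }        # new record: never join k-mers across sequences
    {
      seq = prev toupper($0)        # prepend the last 2 bases of the previous line
      n = length(seq)
      for (i = 1; i <= n - 2; i++) {
        k = substr(seq, i, 3)
        # keep only triplets made of A, C, G, T (skips N and other codes)
        if (k ~ /^[ACGT][ACGT][ACGT]$/) count[k]++
      }
      # remember the trailing 2 bases so wrapped lines are handled correctly
      prev = (n >= 2) ? substr(seq, n - 1) : seq
    }
    END { for (k in count) print k "\t" count[k] }
  ' "$@" | LC_ALL=C sort -k1,1
}
```

For example, `printf '>s\nTTTTTT\n' | count_triplets` reports TTT with a count of 4, whereas `grep -o TTT` finds only 2 matches on the same string. A single awk process will be slow on a whole genome, so a dedicated k-mer counter is probably the better choice for anything beyond a quick check.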


seqkit is not part of a standard Unix install and has to be installed separately.


Neither is samtools ;)
