Triplet frequencies in human genome
5.5 years ago
9606 ▴ 330

Hello,

Does anybody know whether there is an existing list of nucleotide triplets with their frequencies in the human genome (hg19 or GRCh38 are both fine)?

Of course I can count them myself, but I would like to save some time.

genome counts • 1.7k views

Not to my knowledge. However, if you need an idea of how to compute it: https://unix.stackexchange.com/questions/231213/count-number-of-a-substring-repetition-in-a-string


Jellyfish will do that efficiently.

5.5 years ago

Based on Ensembl's GRCh38 (hg38) primary assembly:

$ parallel -j8 "samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa {2} \
| seqkit seq -w0 \
| tail -n+2 \
| LC_ALL=C grep -io {1} \
| wc -l \
| awk -v kmer={1} '{print kmer,\$0}'" ::: $(echo {C,A,G,T}{C,A,G,T}{C,A,G,T} | tr " " "\n") ::: {1..22} X Y \
| awk -v OFS="\t" '{kmer[$1] += $2} END {for (k in kmer) {print k,kmer[k]}}' \
| LC_ALL=C sort -k1,1 \
| awk 'BEGIN {print "kmer\tcount"} {print}' > kmer_counts.tsv

Please note that grep matches do not overlap. In a homopolymer stretch such as TTTTTT, the triplet TTT is therefore counted 2 times rather than 4.
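If overlapping occurrences are what you need, a sliding window avoids the grep limitation. The following is only a minimal sketch and not part of the pipeline above: `count_triplets` is a hypothetical helper name, it assumes plain (uncompressed) FASTA input, counts one strand only, folds lowercase (soft-masked) bases to uppercase, and skips any window containing a character other than A, C, G or T (for example N).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): count overlapping 3-mers
# in FASTA read from the given files or from stdin.
count_triplets() {
  awk '
    /^>/ { prev = ""; next }          # new record: never join windows across sequences
    {
      seq = prev toupper($0)          # prepend the last 2 bases of the previous line
      n = length(seq)
      for (i = 1; i <= n - 2; i++) {
        k = substr(seq, i, 3)
        # keep only windows made of unambiguous bases
        if (k ~ /^[ACGT][ACGT][ACGT]$/) count[k]++
      }
      # remember the trailing 2 bases so wrapped lines are handled correctly
      prev = (n >= 2) ? substr(seq, n - 1) : seq
    }
    END { for (k in count) print k "\t" count[k] }
  ' "$@" | LC_ALL=C sort -k1,1
}

# Toy check: TTTTTT contains 4 overlapping TTT windows.
printf '>toy\nTTTTTT\n' | count_triplets   # -> TTT<TAB>4
```

On a whole genome you would call it as `count_triplets Homo_sapiens.GRCh38.dna.primary_assembly.fa > kmer_counts.tsv`. Plain awk will be slow at that scale, so a dedicated k-mer counter such as Jellyfish, mentioned in the comments, is probably the better choice for real runs.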


seqkit is not part of a standard Unix install and has to be downloaded separately.


Neither is samtools ;)
