twoBitToFa retrieving longer sequence than expected in hg19
1
Hi all!
I'm using twoBitToFa to retrieve a specific region of a chromosome in hg19. The command I'm using is:
twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit myfile.fa -seq=chr1 -start=891021 -end=125030866
I would expect there to be 125030866 - 891021 = 124139845 nucleotides. However, when I count the number of nucleotides with a bash command there seem to be 60 more positions.
So, my question is: how can I find those extra positions and delete them? Or are they supposed to be there and I'm doing something wrong/ not having something into account??
Thanks in advance!
twoBitToFa
hg19
• 1.2k views
•
link
updated 4.8 years ago by
wm
▴
570
•
written 4.8 years ago by
biosol
▴
170
You can post your bash code
here. the error could be from your script.
The output of twoBitToFa
is Correct for my test.
$ twoBitToFa hg19.2bit out.fa -seq=chr1 -start=891021 -end=125030866
$ wc -l out.fa
2482798 out.fa
$ head -n 2 out.fa
>chr1:891021-125030866
GAAAATCTGAGAGCAAAGGCAGCTGCACACGCCACACCTTCTGAGACTCT
Use bash
command to count the base (expect: 124139845 )
$ sed '1d' out.fa | awk '{s+=length($1)}END{print s}'
124139845
Use seqkit
command to stat
$ seqkit stat
file format type num_seqs sum_len min_len avg_len max_len
out.fa FASTA DNA 1 124,139,845 124,139,845 124,139,845 124,139,845
•
link
4.8 years ago by
wm
▴
570
Login before adding your answer.
Traffic: 2372 users visited in the last hour
Are there any N's?
Thanks to both of you, you were both right!! It turns out that my
bash command
was counting both theN's
and the header line of the sequence!!