twoBitToFa retrieving longer sequence than expected in hg19
1
0
Entering edit mode
4.8 years ago
biosol ▴ 170

Hi all!

I'm using twoBitToFa to retrieve a specific region of a chromosome in hg19. The command I'm using is:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit myfile.fa -seq=chr1 -start=891021  -end=125030866

I would expect there to be 125030866 - 891021 = 124139845 nucleotides. However, when I count the number of nucleotides with a bash command there seem to be 60 more positions.

So, my question is: how can I find those extra positions and delete them? Or are they supposed to be there and I'm doing something wrong/ not having something into account??

Thanks in advance!

twoBitToFa hg19 • 1.2k views
ADD COMMENT
1
Entering edit mode

Are there any N's?

ADD REPLY
0
Entering edit mode

Thanks to both of you, you were both right!! It turns out that my bash command was counting both the N's and the header line of the sequence!!

ADD REPLY
1
Entering edit mode
4.8 years ago
wm ▴ 570

You can post your bash code here. the error could be from your script.

The output of twoBitToFa is Correct for my test.

$ twoBitToFa  hg19.2bit out.fa -seq=chr1 -start=891021  -end=125030866
$ wc -l out.fa
2482798 out.fa
$ head -n 2 out.fa 
>chr1:891021-125030866
GAAAATCTGAGAGCAAAGGCAGCTGCACACGCCACACCTTCTGAGACTCT

Use bash command to count the base (expect: 124139845)

$ sed '1d' out.fa | awk '{s+=length($1)}END{print s}'
124139845

Use seqkit command to stat

$ seqkit stat 
file    format  type  num_seqs      sum_len      min_len      avg_len      max_len
out.fa  FASTA   DNA          1  124,139,845  124,139,845  124,139,845  124,139,845
ADD COMMENT

Login before adding your answer.

Traffic: 2365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6