Question

twobittofa -seqlist formatting problem

0

Entering edit mode

3.1 years ago

Laura ▴ 50

Hi everyone,

I'm working on extracting sequences from 2 bit files using twobittofa. My sequences are in a bed file, so I'm using:

twoBitToFa -bed=L1PA4.bed -udcDir=. https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit stdout > L1PA4.fa

However, when I do the extraction, I lose a lot of sequences! The resulting fasta file has 399 sequences, whereas the bed file has 1,340 sequences. I have read in documentation that using the -bed= option will "grab sequences specified by input.bed and will exclude introns."

I am interested in binding motifs, so it seems like excluding introns is a bad idea. I've tried to use -seqList instead of -bed, but can't figure out the formatting. Looking online, it says:

" -seqList=file File containing list of the desired sequence names in the format seqSpec[:start-end], e.g. chr1 or chr1:0-189 where coordinates are half-open zero-based, i.e. [start,end)."

So I've formatted my sequences like so:

head L1PA4_test.csv 
chr1:7471818-7477982
chr1:30966237-30972403
chr1:34137598-34143718
chr1:41641937-41648034
etc

but am getting errors I don't get using the same data in -bed format. Error:

twoBitReadSeqFrag in chr4 end (190891002) >= seqSize (190214555)

I am wondering if anyone knows:

1) why twobittofa excludes introns using -bed option 2) if the exclusion of introns is what is reducing the number of sequences retrieved in my output fasta 3) if I can use the -bed option but opt to not remove any sequences 4) why seq-list doesn't work 5) any workarounds that might be useful for me in this situation

As always, thank you so much for your consideration and help!

fasta bed 2bit twobittofa • 573 views

ADD COMMENT • link 3.1 years ago by Laura ▴ 50