Hi everyone,
I'm working on extracting sequences from 2 bit files using twobittofa. My sequences are in a bed file, so I'm using:
twoBitToFa -bed=L1PA4.bed -udcDir=. https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit stdout > L1PA4.fa
However, when I do the extraction, I lose a lot of sequences! The resulting fasta file has 399 sequences, whereas the bed file has 1,340 sequences. I have read in documentation that using the -bed= option will "grab sequences specified by input.bed and will exclude introns."
I am interested in binding motifs, so it seems like excluding introns is a bad idea. I've tried to use -seqList instead of -bed, but can't figure out the formatting. Looking online, it says:
" -seqList=file File containing list of the desired sequence names in the format seqSpec[:start-end], e.g. chr1 or chr1:0-189 where coordinates are half-open zero-based, i.e. [start,end)."
So I've formatted my sequences like so:
head L1PA4_test.csv
chr1:7471818-7477982
chr1:30966237-30972403
chr1:34137598-34143718
chr1:41641937-41648034
etc
but am getting errors I don't get using the same data in -bed format. Error:
twoBitReadSeqFrag in chr4 end (190891002) >= seqSize (190214555)
I am wondering if anyone knows:
1) why twobittofa excludes introns using -bed option 2) if the exclusion of introns is what is reducing the number of sequences retrieved in my output fasta 3) if I can use the -bed option but opt to not remove any sequences 4) why seq-list doesn't work 5) any workarounds that might be useful for me in this situation
As always, thank you so much for your consideration and help!