twobittofa -seqlist formatting problem
0
0
Entering edit mode
2.7 years ago
Laura ▴ 50

Hi everyone,

I'm working on extracting sequences from 2 bit files using twobittofa. My sequences are in a bed file, so I'm using:

twoBitToFa -bed=L1PA4.bed -udcDir=. https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit stdout > L1PA4.fa

However, when I do the extraction, I lose a lot of sequences! The resulting fasta file has 399 sequences, whereas the bed file has 1,340 sequences. I have read in documentation that using the -bed= option will "grab sequences specified by input.bed and will exclude introns."

I am interested in binding motifs, so it seems like excluding introns is a bad idea. I've tried to use -seqList instead of -bed, but can't figure out the formatting. Looking online, it says:

" -seqList=file File containing list of the desired sequence names in the format seqSpec[:start-end], e.g. chr1 or chr1:0-189 where coordinates are half-open zero-based, i.e. [start,end)."

So I've formatted my sequences like so:

head L1PA4_test.csv 
chr1:7471818-7477982
chr1:30966237-30972403
chr1:34137598-34143718
chr1:41641937-41648034
etc

but am getting errors I don't get using the same data in -bed format. Error:

twoBitReadSeqFrag in chr4 end (190891002) >= seqSize (190214555)

I am wondering if anyone knows:

1) why twobittofa excludes introns using -bed option 2) if the exclusion of introns is what is reducing the number of sequences retrieved in my output fasta 3) if I can use the -bed option but opt to not remove any sequences 4) why seq-list doesn't work 5) any workarounds that might be useful for me in this situation

As always, thank you so much for your consideration and help!

fasta bed 2bit twobittofa • 521 views
ADD COMMENT

Login before adding your answer.

Traffic: 2531 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6