I extracted sequences with fastaFromBed and have no complaints about BEDTools, which is a really awesome tool.
However, the extracted sequences look like this:
>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG
And my input file looks like this:
>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt
I was using this:
fastaFromBed -fi input -bed seq.bed -fo output
So shouldn't those sequences be written as wrapped FASTA (as NCBI says, "It is recommended that all lines of text be shorter than 80 characters in length"), or at least with the same line length as my input file?
What am I doing wrong that I get linearized (FASTA?) output from fastaFromBed?
What is the quickest way to turn those single-line sequences into nicely formatted fixed-width lines from the command line?
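One way to re-wrap the output is a few lines of Python. This is only a minimal sketch: the function name `wrap_fasta` and the default width of 60 are my own choices, not anything BEDTools provides, and it assumes every header line starts with `>`.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with each record's sequence re-wrapped to `width` columns.

    `lines` is any iterable of text lines (an open file, sys.stdin, a list).
    Header lines (starting with '>') are passed through unchanged.
    """
    chunks = []  # sequence pieces collected for the current record

    def wrapped():
        # Join the collected pieces and cut them into fixed-width lines.
        seq = "".join(chunks)
        return [seq[i:i + width] for i in range(0, len(seq), width)]

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            yield from wrapped()   # finish the previous record, if any
            chunks = []
            yield line
        elif line:
            chunks.append(line)
    yield from wrapped()           # finish the last record


# Small in-memory demonstration (hypothetical record, width of 4 for brevity).
for out in wrap_fasta([">demo\n", "ACGTACGTAC\n"], width=4):
    print(out)
```

To use it on real data, pass an open file handle (for example the fastaFromBed output file) instead of the list and write each yielded line to a new file.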
That is a terrible recommendation - but then NCBI is the King of Terrible Formats. I recommend NOT wrapping sequence data. 1) With the massive amount of sequence data we have these days, the "human readable" argument doesn't hold up. 2) With wrapped sequences, grep fails on matches that span line ends. 3) You can't easily random-access wrapped FASTA files. 4) It's a waste of precious newlines.
I disagree. I frequently "less" reads and genomic sequences/subsequences to get a feeling for lengths, repetitiveness, lowercase/uppercase patterns, etc. Sequences should be human readable no matter how long they are. We should not trust our programs too much. Human eyes are frequently more effective at identifying problems. As to your other concerns: 2) With a sequence on one line, grep will return the entire sequence, which is frequently pointless. 3) As long as the sequence lines are all the same length, you can use BioPerl's Bio::DB::Fasta or the faidx strategy for quick random access. There is only a little more work. 4) I am not sure why newlines are precious.
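To illustrate point 3, here is a rough sketch of the arithmetic behind faidx-style random access into wrapped FASTA. The helper names `byte_offset` and `fetch` are hypothetical; the three numbers they take correspond to what a `.fai` index stores per record (byte offset of the first base, bases per line, bytes per line including the newline). It assumes every sequence line in the record except the last has the same length.

```python
import io


def byte_offset(seq_start, pos, line_bases, line_width):
    """Byte position of 0-based base `pos` within a wrapped record.

    seq_start  : byte offset of the record's first base
    line_bases : bases per sequence line
    line_width : bytes per sequence line, newline included
    """
    return seq_start + (pos // line_bases) * line_width + pos % line_bases


def fetch(handle, seq_start, start, end, line_bases, line_width):
    """Return bases [start, end) (0-based, half-open) from a binary handle."""
    first = byte_offset(seq_start, start, line_bases, line_width)
    last = byte_offset(seq_start, end - 1, line_bases, line_width)
    handle.seek(first)
    raw = handle.read(last - first + 1)
    # Drop the line terminators that fall inside the requested span.
    return raw.translate(None, b"\r\n").decode("ascii")


# Toy record wrapped at 10 bases per line; the header ">chr19\n" is 7 bytes.
data = b">chr19\nACGTACGTAC\nGGGGGTTTTT\nAAAAA\n"
print(fetch(io.BytesIO(data), 7, 8, 13, 10, 11))  # prints ACGGG
```

The extra work compared with single-line FASTA is just the division and remainder in `byte_offset`.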
I totally agree with you and prefer to store my sequences in linear mode, but it was just a xxx database thing.