Question

How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns

1

Entering edit mode

12.6 years ago

PoGibas 5.1k

I extracted sequences with fastaFromBed and have no complains about the BEDTools which is really awesome thing.

Otherwise extracted sequences look like this:

>chr19:13985513-13985622   
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825  
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be formed in FASTA format (as ncbi says "It is recommended that all lines of text be shorter than 80 characters in length") or at least the same line length as my input file?

What I am doing wrong that I am getting linearized (fasta?) output with fastaFromBed?
What is the quickest way to turn those linear sequences to nicely formatted columns using command line?

fasta bedtools • 8.5k views

ADD COMMENT • link updated 12.6 years ago by lh3 33k • written 12.6 years ago by PoGibas 5.1k

0

Entering edit mode

That is a terrible recommendation - but then NCBI is the King of Terrible Formats. I recommend NOT wrapping sequence data. 1) With the massive amount of sequences we have these days, the "human readable" argument don't hold up. 2) With wrapped sequences grep fails at the line ends. 3) You cant easily random access wrapped FASTA a files, 4) It's waste of precious newlines.

ADD REPLY • link 12.6 years ago by Martin A Hansen 3.0k

1

Entering edit mode

I disagree. I frequently "less" reads and genomic sequences/subsequences to get a feeling about lengths, the repetitiveness, lowercase/uppercase patterns, etc. Sequences should be human readable no matter how long it is. We should not trust our programs too much. Human eyes are frequently more effective in identifying problems. As to your other concerns: 2) with sequences in one line, grep will return the entire sequence, which is frequently pointless. 3) as long as sequence lines are of the same length, you can use bioperl::DB::fasta or faidx strategy for quick random access. There is only a little more work. 4) I am not sure why newlines are precious.

ADD REPLY • link 12.6 years ago by lh3 33k

0

Entering edit mode

I totally agree with you and prefer my sequences store in linear mode, but it was just a xxx database thing.

ADD REPLY • link 12.6 years ago by PoGibas 5.1k

Ram · Answer 1 · 2013-01-27

The NCBI recommendation is that sequence lines in FASTA be 80 characters or less, but it is not enforced. Some software outputs sequence all on one line. Most, if not all, software that reads FASTA is able to cope with this, without errors.

There are lots of ways that you could reformat. One is to use seqret from the EMBOSS package, to simply read in the FASTA formatted sequences and write them out again in tidy FASTA format:

seqret -sequence myfasta.fa -outseq mytidyfasta.fa

Ram · Answer 2 · 2013-01-27

4

Entering edit mode

12.6 years ago

lh3 33k

Another option is seqtk:

seqtk seq -l 80 in.fa > out.fa

It is faster than seqret. Seqtk also implements some bedtools functionality:

seqtk subseq -l 60 in.fa reg.bed > out.fa  # fastaFromBed with the support of fastq and customized line length
seqtk seq -M reg.bed in.fa > out.fa        # maskFastaFromBed

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 12.6 years ago by lh3 33k

Ram · Answer 3 · 2013-01-27

1

Entering edit mode

12.6 years ago

Pierre Lindenbaum 166k

You can pipe the ouput into 'fold -w 80'

$ echo -e ">chr19:13985513-13985622\nGGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT" | fold -w 60
>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAG
GCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 12.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Assuming that the header lines are < 60 characters, of course :)

ADD REPLY • link 12.6 years ago by Neilfws 49k