How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns
3
1
Entering edit mode
11.9 years ago
PoGibas 5.1k

I extracted sequences with fastaFromBed and have no complains about the BEDTools which is really awesome thing.

Otherwise extracted sequences look like this:

>chr19:13985513-13985622   
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825  
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be formed in FASTA format (as ncbi says "It is recommended that all lines of text be shorter than 80 characters in length") or at least the same line length as my input file?

What I am doing wrong that I am getting linearized (fasta?) output with fastaFromBed?
What is the quickest way to turn those linear sequences to nicely formatted columns using command line?

fasta bedtools • 7.8k views
ADD COMMENT
0
Entering edit mode

That is a terrible recommendation - but then NCBI is the King of Terrible Formats. I recommend NOT wrapping sequence data. 1) With the massive amount of sequences we have these days, the "human readable" argument don't hold up. 2) With wrapped sequences grep fails at the line ends. 3) You cant easily random access wrapped FASTA a files, 4) It's waste of precious newlines.

ADD REPLY
1
Entering edit mode

I disagree. I frequently "less" reads and genomic sequences/subsequences to get a feeling about lengths, the repetitiveness, lowercase/uppercase patterns, etc. Sequences should be human readable no matter how long it is. We should not trust our programs too much. Human eyes are frequently more effective in identifying problems. As to your other concerns: 2) with sequences in one line, grep will return the entire sequence, which is frequently pointless. 3) as long as sequence lines are of the same length, you can use bioperl::DB::fasta or faidx strategy for quick random access. There is only a little more work. 4) I am not sure why newlines are precious.

ADD REPLY
0
Entering edit mode

I totally agree with you and prefer my sequences store in linear mode, but it was just a xxx database thing.

ADD REPLY
4
Entering edit mode
11.9 years ago
Neilfws 49k

The NCBI recommendation is that sequence lines in FASTA be 80 characters or less, but it is not enforced. Some software outputs sequence all on one line. Most, if not all, software that reads FASTA is able to cope with this, without errors.

There are lots of ways that you could reformat. One is to use seqret from the EMBOSS package, to simply read in the FASTA formatted sequences and write them out again in tidy FASTA format:

seqret -sequence myfasta.fa -outseq mytidyfasta.fa
ADD COMMENT
4
Entering edit mode
11.9 years ago
lh3 33k

Another option is seqtk:

seqtk seq -l 80 in.fa > out.fa

It is faster than seqret. Seqtk also implements some bedtools functionality:

seqtk subseq -l 60 in.fa reg.bed > out.fa  # fastaFromBed with the support of fastq and customized line length
seqtk seq -M reg.bed in.fa > out.fa        # maskFastaFromBed
ADD COMMENT
1
Entering edit mode
11.9 years ago

You can pipe the ouput into 'fold -w 80'

$ echo -e ">chr19:13985513-13985622\nGGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT" | fold -w 60
>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAG
GCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
ADD COMMENT
0
Entering edit mode

Assuming that the header lines are < 60 characters, of course :)

ADD REPLY

Login before adding your answer.

Traffic: 2025 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6