Convert .Txt File To Bed File
13.3 years ago
Bioscientist ★ 1.7k

I have txt file for genome gap assembly like below:

585    chr10    0    50000    1    N    50000    clone    no
78    chr10    5627110    5677110    51    N    50000    clone    yes
722    chr10    18014681    18064681    161    N    50000    clone    yes
881    chr10    38858841    38908841    337    N    50000    contig    no
884    chr10    39194941    39244941    340    N    50000    contig    no
13    chr10    39244941    41624941    341    N    2380000    centromere    no
902    chr10    41624941    41674941    342    N    50000    contig    no
904    chr10    41866693    41916693    344    N    50000    contig    no
116    chr10    45746970    45896970    375    N    150000    contig    no

The program says I should convert this to a BED file. So do I just run cat XXX.txt > XXX.bed? If so, why bother with BED at all instead of plain txt? What's the point of a BED file? Thanks.

bed • 47k views

BED is a simple text file. Tools such as BEDOPS will do all sorts of logic and other computations for you (what elements overlap between these N input files? What's the trimmed mean of all ChIP-seq scores falling in every 100 kb window across the genome? etc.). The actual BED format has a fairly strict definition, but various tool suites allow for a more relaxed set of constraints such that only the first 3 fields (chrom, start, end) need to be specified for many operations, while all other columns are essentially free to be whatever you need. This allows for interactions between a tool suite and standard unix commands to manipulate data on the fly without losing any information.

In fact, this very simple relaxation of the BED format can encode the information kept in any of the other 20 or so formats you'll commonly encounter in 'bioinformatics' (VCF, GFF, GTF, SAM, WIG, BEDGRAPH, etc). That is, a small extension to the usual BED format can represent anything that any of these other formats offer with no loss of data (see the conversion scripts offered in BEDOPS). However, conversions in the other direction often do not exist in the general case. For example, SAM/BAM is unable to hold signal data.

The better question, imo, is why do we have so many file formats and tool suites to operate on each kind of format, when these formats are hardly more than shuffled-column versions of each other?
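
To make the point about mixing BED with standard Unix commands concrete, here is a minimal sketch. It assumes a tab-separated file with chrom, start, end and one free-form fourth column; the file name gaps.bed and the three rows (taken from the question's table) are only for illustration.

```shell
tmp=$(mktemp -d)

# Three BED-like rows (chrom, start, end, free-form 4th column), deliberately unsorted.
printf 'chr10\t5627110\t5677110\tclone\nchr10\t0\t50000\tclone\nchr10\t39244941\t41624941\tcentromere\n' > "$tmp/gaps.bed"

# Sort by chromosome, then numerically by start coordinate.
sort -k1,1 -k2,2n "$tmp/gaps.bed" > "$tmp/gaps.sorted.bed"

# BED intervals are 0-based and half-open, so each length is end - start.
total=$(awk -F '\t' '{sum += $3 - $2} END {print sum}' "$tmp/gaps.sorted.bed")
echo "$total"   # prints 2480000
```

The extra fourth column rides along untouched through sort, which is the "no loss of information" property described above.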

13.3 years ago

FWIW, you could also just use cut:

cut -f 2,3,4,8 XXX.txt >XXX.bed

Same result.

If so, why should we bother to use bed, why not just use txt? What's the point of BED file

The program you're running on the bed file expects that certain values lie in certain columns. If you run it on the txt file, it will either crash or produce output that is not what you expect.
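
A rough way to see the difference is to check the columns yourself. The sketch below assumes the input is tab-separated; looks_like_bed is a hypothetical helper written for this example, not part of any tool, and it only checks that columns 2 and 3 are non-negative integers with start not greater than end.

```shell
tmp=$(mktemp -d)

# One row from the question's table, and the same row after cut.
printf '585\tchr10\t0\t50000\t1\tN\t50000\tclone\tno\n' > "$tmp/XXX.txt"
cut -f 2,3,4,8 "$tmp/XXX.txt" > "$tmp/XXX.bed"

# Hypothetical helper: exit status 0 only if every line has at least 3 fields,
# integer start/end in columns 2 and 3, and start <= end.
looks_like_bed() {
  awk -F '\t' '
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 { bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

if looks_like_bed "$tmp/XXX.txt"; then txt_ok=yes; else txt_ok=no; fi
if looks_like_bed "$tmp/XXX.bed"; then bed_ok=yes; else bed_ok=no; fi
echo "txt: $txt_ok, bed: $bed_ok"   # prints: txt: no, bed: yes
```

In the txt file column 2 holds the chromosome name, so a program that reads column 2 as the start coordinate gets "chr10" where it expects a number.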


One caveat with "cut" is that it always prints fields in the order they appear in the file, regardless of the order you list them in. In this case that's okay, because the fields you want are already in increasing order, but "cut -f 3,2,8,4" would not re-order anything if you needed the columns in a different order.
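
A small sketch of that caveat, assuming tab-separated input (the row is copied from the question; the variable names are made up for the example):

```shell
tmp=$(mktemp -d)
printf '585\tchr10\t0\t50000\t1\tN\t50000\tclone\tno\n' > "$tmp/gap_demo.txt"

# cut ignores the order of the field list: both calls print fields 2,3,4,8 in file order.
cut_sorted=$(cut -f 2,3,4,8 "$tmp/gap_demo.txt")
cut_shuffled=$(cut -f 3,2,8,4 "$tmp/gap_demo.txt")

# awk prints fields in whatever order you name them.
awk_shuffled=$(awk -F '\t' -v OFS='\t' '{print $3, $2, $8, $4}' "$tmp/gap_demo.txt")

printf '%s\n%s\n%s\n' "$cut_sorted" "$cut_shuffled" "$awk_shuffled"
```

The first two output lines are identical (chr10, 0, 50000, clone); only the awk line comes out as 0, chr10, clone, 50000.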


Sure, but in this example, the columns are already in the correct order, so why bother with the extra syntax of something like awk? Fewer characters typed = more actual work getting done.

13.3 years ago

See a specification for the BED format: http://genome.ucsc.edu/FAQ/FAQformat.html#format1

So you don't need the first column; you only have to select (and, if necessary, re-order) the other ones. AWK can be used to re-arrange the columns:

awk -F '\t' '{printf("%s\t%s\t%s\t%s\n",$2,$3,$4,$8);}' < XXX.txt > XXX.bed

That doesn't make any sense. You're converting to BED format, which has fields for chrom, start, end, name, score, strand (and a few other optional fields if you use the BED12 format). The program you're trying to run obviously doesn't care about all that other data, since it can't be represented easily in BED format.
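
If the downstream program does want six columns, one possible sketch is below. The UCSC column order is chrom, chromStart, chromEnd, name, score, strand. Using the gap type as the name, 0 as the score and "." (no strand) as the strand are assumptions made for this example, as is tab-separated input; check what your program actually expects.

```shell
tmp=$(mktemp -d)
printf '13\tchr10\t39244941\t41624941\t341\tN\t2380000\tcentromere\tno\n' > "$tmp/XXX.txt"

# chrom, start, end, name (gap type), then placeholder score 0 and strand "." (no strand).
awk -F '\t' -v OFS='\t' '{print $2, $3, $4, $8, 0, "."}' "$tmp/XXX.txt" > "$tmp/XXX.bed6"

bed6_line=$(cat "$tmp/XXX.bed6")
echo "$bed6_line"
```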


Why do we need $2,$3,$4,$8? I know $2, $3 and $4 are the three required fields and should go in the first three positions. But why do we need $8? Why not preserve columns $5, $6, $7 and $9 as well, since I actually don't know what additional information the program may need besides 2,3,4... Thx
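
For what it's worth, keeping everything except the first column is a one-liner. This sketch assumes tab-separated input and a program that only reads the first three columns and tolerates free-form extras. A strict BED parser would read column 5 as the score and column 6 as the strand, and "N" is not a valid strand, so this is not guaranteed to work everywhere.

```shell
tmp=$(mktemp -d)
printf '585\tchr10\t0\t50000\t1\tN\t50000\tclone\tno\n' > "$tmp/XXX.txt"

# Drop column 1 only; columns 2-4 become chrom, start, end and the rest ride along.
cut -f 2-9 "$tmp/XXX.txt" > "$tmp/XXX.relaxed.bed"

relaxed_line=$(cat "$tmp/XXX.relaxed.bed")
echo "$relaxed_line"
```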

10.5 years ago

You can also use the GenomeIntervals2BED.py script from the SeqGI framework.

Have a look: http://seqgi.sourceforge.net/Genomeintervals2bed.html
