A wide range of formats exists for representing comparisons between sequences: blast tabular, blast xml, psl, pslx, SAM/BAM and BED.
Most of these formats can be converted into one another. Some conversions are lossless: the original data can be converted without losing any information. Others are lossy: only part of the original data carries over, so some information is lost.
Here, I have compiled the tools or UNIX commands necessary for converting from one file format to another. As you can see, I still need to fill in some of the gaps, so please let me know of any tools that are missing.
The commands are shown in full below the table.
From (row) / To (column) | blast-xml | blast-tab | psl | pslx | SAM/BAM | BED |
blast-xml | N/A | blast2tsv.xsl | blastXmlToPsl | blastXmlToPsl -pslx | blast2bam | BLAST_to_BED.py |
blast-tab | perl blast2xml.pl | N/A | ?? | ?? | ?? | blast2bed |
psl | ?? | ?? | N/A | pslToPslx | psl2sam.pl | pslToBed or psl2bed |
pslx | ?? | ?? | ?? | N/A | ?? | pslToBed |
SAM/BAM | ?? | ?? | sam2psl.py | sam2psl.py -s | samtools view | bedtools bamtobed |
BED | ?? | ?? | bedToPsl | ?? | bedtools bedtobam | N/A |
Convert from SAM to psl using sam2psl.py
Available from: https://github.com/ndaniel/fusioncatcher/blob/master/bin/sam2psl.py
Example command:
python sam2psl.py -i test.sam -s -o test.psl
With the -s option this is a lossless format conversion; however, a psl file that carries the read sequence is no longer valid psl. To produce standard psl without the sequence, omit -s:
python sam2psl.py -i test.sam -o test_no_seq.psl
The help for sam2psl:
python sam2psl.py
Usage: sam2psl.py [options]
It takes as input a file in SAM format and it converts into a PSL format file.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILENAME, --input=INPUT_FILENAME
The input file in SAM format.
-4, --skip-conversion-cigar-1.3
By default if the CIGAR strings in the input SAM file
are in the format defined in SAM version 1.4 (i.e.
there are 'X' and '=') then the CIGAR string will be
first converted into CIGAR string, which is described
in SAM version 1.3, (i.e. there are no 'X' and '='
which are replaced with 'M') and afterwards into PSL
format. Default is 'False'.
-s, --read-seq It adds to the PSL output as column 22, the sequence
of the read. This is not anymore a valid PSL format.
-r REPLACE_READS_IDS, --replace-read-ids=REPLACE_READS_IDS
In the reads ids (also known as query name in PSL) the
string specified here will be replaced with '/' (which
is used in Solexa for /1 and /2).
-o OUTPUT_FILENAME, --output=OUTPUT_FILENAME
The output file in PSL format.
Convert from psl to SAM
Available from the samtools legacy scripts: https://github.com/lh3/samtools-legacy/blob/master/misc/psl2sam.pl
Example command:
psl2sam.pl test.psl
This ends up being a lossy conversion as the read sequence is not in the output.
Usage
psl2sam.pl
Usage: psl2sam.pl [-a 1] [-b 3] [-q 5] [-r 2] <in.psl>
The options are used to calculate a BLAST-like score; see the post "How To Use Psl2Sam.Pl From Samtools?".
Convert psl to pslx
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
pslToPslx test_no_seq.psl test.fa ref.fa test.pslx
This is a lossless conversion. For usage:
pslToPslx - Convert from psl to pslx format, which includes sequences
usage:
pslToPslx [options] in.psl qSeqSpec tSeqSpec out.pslx
qSeqSpec and tSeqSpec can be nib directory, a 2bit file, or a FASTA file.
FASTA files should end in .fa, .fa.gz, .fa.Z, or .fa.bz2 and are read into
memory.
Options:
-masked - if specified, repeats are in lower case cases, otherwise entire
sequence is loader case.
Convert SAM to fasta
awk '$1 !~ /^@/ {print ">"$1"\n"$10}' test.sam > test.fa
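To make the behaviour of the one-liner easier to check, here is a minimal sketch that wraps it in a function and runs it on a tiny made-up SAM file. The function name sam_to_fasta and the two example reads are my own illustration, not part of any tool. It uses the `!~` operator so that only header lines (those starting with '@') are skipped.

```shell
# Sketch: print every alignment record of a SAM file as a FASTA entry.
# Column 1 is the read name (QNAME) and column 10 is the sequence (SEQ).
sam_to_fasta() {
    awk -F '\t' '$1 !~ /^@/ { print ">" $1 "\n" $10 }' "$1"
}

# Made-up input: one header line and two reads.
tmp_sam=$(mktemp)
printf '@SQ\tSN:chr1\tLN:100\n' > "$tmp_sam"
printf 'readA\t0\tchr1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$tmp_sam"
printf 'read1\t16\tchr1\t9\t60\t3M\t*\t0\t0\tGGC\tIII\n' >> "$tmp_sam"

sam_to_fasta "$tmp_sam"
rm -f "$tmp_sam"
```

Note that SAM stores the sequence of reverse-strand reads (flag 16) as it aligns to the forward reference strand, so the FASTA contains the sequence as stored in the SAM file, and a read with several alignment records will appear several times.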
Convert psl to BED
Option 1:
Using pslToBed from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This conversion loses no sequence data: the standard psl does not carry the sequence, so the BED file does not either.
pslToBed test_no_seq.psl test.bed
Option 2, as suggested by Alex Reynolds:
Using psl2bed from http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html
This is also lossless when used with --keep-header.
Example:
psl2bed < in.psl > out.bed
As a bonus, it uses sort-bed to make a sorted BED file, so that it is ready to use with bedops, bedmap, etc.
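To show which psl columns end up where in a BED file, here is a rough awk sketch that writes a 6-column BED. It is not how pslToBed or psl2bed are implemented; the function name psl_to_bed6 is hypothetical, the score is simply set to 0, and for translated alignments with a two-character strand I assume the first character (the query strand) is the one wanted.

```shell
# Sketch: psl -> BED6 using the target coordinates.
# psl columns used: 9 strand, 10 qName, 14 tName, 16 tStart, 17 tEnd.
# psl coordinates are already 0-based and half-open, like BED.
psl_to_bed6() {
    awk -F '\t' -v OFS='\t' '
        # Data rows start with the match count; this skips the optional
        # psLayout header block.
        $1 ~ /^[0-9]+$/ && NF >= 21 {
            strand = substr($9, 1, 1)
            print $14, $16, $17, $10, 0, strand
        }' "$1"
}

# Made-up input: a header line and one alignment of read1 to chr1:100-150.
tmp_psl=$(mktemp)
printf 'psLayout version 3\n' > "$tmp_psl"
printf '50\t0\t0\t0\t0\t0\t0\t0\t+\tread1\t50\t0\t50\tchr1\t1000\t100\t150\t1\t50,\t0,\t100,\n' >> "$tmp_psl"

psl_to_bed6 "$tmp_psl"
rm -f "$tmp_psl"
```

The block structure (blockSizes, tStarts) is dropped here, which is why the real tools write more columns than this sketch does.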
Convert pslx to BED
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This is a lossy conversion as the sequence is lost
pslToBed test.pslx test.bed
Convert BAM to BED using bedtools
http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
This is a lossy conversion as the sequence data is lost.
bedtools bamtobed -i test.bam > test_bamtobed.bed
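For readers who want to see where the BED coordinates come from, the sketch below derives a 6-column BED line from a SAM record with awk alone. It is only an approximation of what I understand bedtools bamtobed to report (it does not add /1 or /2 to paired read names, for example), and the function name sam_to_bed6 is made up. The end coordinate is the 0-based start plus the number of reference bases consumed by the CIGAR string (operations M, D, N, = and X).

```shell
# Sketch: SAM -> BED6 (chrom, start, end, read name, MAPQ, strand).
sam_to_bed6() {
    awk -F '\t' -v OFS='\t' '
        /^@/ { next }                        # header lines
        $3 == "*" || $6 == "*" { next }      # unmapped reads or missing CIGAR
        {
            cigar = $6
            reflen = 0
            # Consume the CIGAR one operation at a time.
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                n  = substr(cigar, 1, RLENGTH - 1) + 0
                op = substr(cigar, RLENGTH, 1)
                if (op ~ /[MDN=X]/) reflen += n   # operations that consume reference
                cigar = substr(cigar, RLENGTH + 1)
            }
            # Flag bit 16 marks a reverse-strand alignment.
            strand = (int($2 / 16) % 2) ? "-" : "+"
            start = $4 - 1                   # SAM POS is 1-based, BED is 0-based
            print $3, start, start + reflen, $1, $5, strand
        }' "$1"
}

# Made-up input: one forward read with a deletion, insertion and soft clip,
# and one reverse-strand read.
tmp_sam2=$(mktemp)
printf '@SQ\tSN:chr1\tLN:1000\n' > "$tmp_sam2"
printf 'read1\t0\tchr1\t11\t60\t5M2D3M1I2M5S\t*\t0\t0\tACGTACGTACGTACGT\tIIIIIIIIIIIIIIII\n' >> "$tmp_sam2"
printf 'read2\t16\tchr1\t101\t30\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$tmp_sam2"

sam_to_bed6 "$tmp_sam2"
rm -f "$tmp_sam2"
```

The sequence and quality columns are never read, which is exactly where the information is lost.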
Convert BED to BAM
Create the genome file for bedtools (samtools faidx writes its index to ref.fa.fai):
samtools faidx ref.fa
awk -v OFS='\t' '{print $1, $2}' ref.fa.fai > ref.txt
Use the genome file and the BED file to produce the BAM file:
bedtools bedtobam -i test_bamtobed.bed -g ref.txt > test_bedtobam.bam
The sequence is not present in the BED file, so it is absent from the BAM file as well. This is a lossy format conversion. Additionally, the number of reads differs from the original file.
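The genome file is simply a two-column table of sequence name and length. If samtools is not at hand, the same table can be computed from the FASTA file with awk alone. This is a sketch, assuming Unix line endings and that the sequence ID is the header text up to the first whitespace; the function name fasta_to_genome is my own.

```shell
# Sketch: write "name<TAB>length" for every sequence in a FASTA file.
fasta_to_genome() {
    awk -v OFS='\t' '
        /^>/ {
            if (name != "") print name, len   # finish the previous record
            name = substr($1, 2)              # drop the leading ">"
            len = 0
            next
        }
        { len += length($0) }                 # sequence lines may be wrapped
        END { if (name != "") print name, len }
    ' "$1"
}

# Made-up reference with two sequences, the first wrapped over two lines.
tmp_fa=$(mktemp)
printf '>chr1 test sequence\nACGT\nAC\n>chr2\nGGG\n' > "$tmp_fa"

fasta_to_genome "$tmp_fa"
rm -f "$tmp_fa"
```

The output can be redirected to a file and passed to the -g option in the same way as the table derived from the faidx index.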
Convert BED to psl
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This is a lossless conversion as neither the BED nor psl contain sequence information
bedToPsl Longest_revcomp.txt test.bed test_bedtopsl.psl
Usage:
bedToPsl - convert bed format files to psl format
usage:
bedToPsl chromSizes bedFile pslFile
Convert a BED file to a PSL file. This the result is an alignment.
It is intended to allow processing by tools that operate on PSL.
If the BED has at least 12 columns, then a PSL with blocks is created.
Otherwise single-exon PSLs are created.
Options:
-keepQuery - instead of creating a fake query, create PSL with identical query and
target specs. Useful if bed features are to be lifted with pslMap and one
wants to keep the source location in the lift result.
Preparing blast-xml format
makeblastdb -dbtype nucl -in Longest_revcomp.fa
blastn -query test.fa -db Longest_revcomp.fa -out test.blastxml -outfmt 5
Preparing blast-tab format
blastn -query test.fa -db Longest_revcomp.fa -out test.blasttab -outfmt 6
Blast-xml to psl
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This is a lossy conversion
blastXmlToPsl test.blastxml test_blastxmltopsl.psl
However, if you use the -pslx option, you can get lossless conversion
blastXmlToPsl -pslx test.blastxml test_blastxmltopsl.pslx
Converting from blast-xml to SAM/BAM
Using https://github.com/guyduche/Blast2Bam
This is a lossless conversion with sequence and read quality introduced.
blast2bam -o test_blastxmltosam.sam test.blastxml ref.fa reads_1.fq reads_2.fq
Usage
Blast2Bam. Last compilation: Jun 27 2017 at 15:21:50.
Usage: blast2bam [options] <Blast XML output> <reference sequence dictionary> <FastQ_1> [FastQ_2]
Options:
--output | -o FILE Output file (default: stdout)
--interleaved | -p Interleaved data
--readGroup | -R STR Read group header line '@RG\tID:foo'
--minAlignLength | -W INT Discard alignments shorter than [INT]
--shortCigar | -c Short version of the CIGAR string ('M' instead of '=' and 'X')
--posOnChr | -z Adjust the alignment position to the first position of the reference
--help | -h Get help (this screen)
Subsequently converted to BAM using samtools
samtools view -b test_blastxmltosam.sam > test_blastxmltosam.bam
Blast-xml to BED
Using https://github.com/mojaveazure/BLAST_to_BED
Command
BLAST_to_BED.py -x test.blastxml -o test_blastxmltobed.bed
This is a lossy conversion as the sequence information is lost.
Converting Blast tabular to blast-xml
This is not completely possible because information such as the alignment itself is missing from the tabular format. See the post "Convert Blast Output Into Blast-Xml", or use the script from the blast2go google group: https://11302289526311073549.googlegroups.com/attach/ed2c446e1b1852a9/blast2xml.pl?part=0.1&view=1&vt=ANaJVrEJYYa7SZC-uvOtoKb6932qlMJWltc2p_5GrTK5Wi7jo-hw14zFroKEcLhdNcJUcQweoUJOuXk2H7wQB5q6mzDTTn211hC2OvwiWw0b5PZev-HQ7Qg
Command
perl blast2xml.pl -i test.blasttab -o test_blasttoblastxml
Usage
perl blast2xml.pl
- i|input : path of the input file (must be text blast file output)
- o|output : path of the output file (by default, the same as the input file)
- s|sequences : number of sequences by xml file (default inf)
- hit : number of hit to print for each sequences (default inf)
- hsp : number of hsp to print for each hit (default inf)
- help|h|? : print this help and exit
Convert blast-xml to blast tabular
Several approaches have been suggested in the post "Tools Parsing Ncbi Blast -M 7 Xml Output Format?", but the most straightforward I have found is the style sheet blast2tsv.xsl from here: https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2tsv.xsl. This is a lossless conversion, but the result is not standard blast-tabular output as it contains the sequence and the aligned site information in the last two columns.
Command:
xsltproc --novalid blast2tsv.xsl test.blastxml
Blast tabular to BED
Using https://github.com/nterhoeven/blast2bed
blast2bed test.blasttab
The output will be in test.blasttab.bed. This is a lossless conversion, as neither blast-tabular nor BED contains the sequence.
Usage
Usage: ./blast2bed <blastoutput.bls>
The blast file should be in blast outfmt 6 or 7.
See Readme.org for more details.
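The coordinate handling behind this conversion can be sketched in awk. This is my own illustration and not the blast2bed script: the function name blasttab_to_bed6 is hypothetical, the score column is set to 0, and I assume the default 12 columns of outfmt 6 (sseqid in column 2, sstart and send in columns 9 and 10). BLAST coordinates are 1-based and inclusive, and for blastn a hit on the minus strand has sstart greater than send, so the two values are swapped and the strand is set to "-".

```shell
# Sketch: BLAST tabular (outfmt 6 or 7) -> BED6 on the subject sequence.
blasttab_to_bed6() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }                          # comment lines of outfmt 7
        NF >= 12 {
            if ($9 + 0 <= $10 + 0) { start = $9 - 1;  end = $10; strand = "+" }
            else                   { start = $10 - 1; end = $9;  strand = "-" }
            print $2, start, end, $1, 0, strand
        }' "$1"
}

# Made-up input: one plus-strand hit and one minus-strand hit.
tmp_tab=$(mktemp)
printf 'q1\tchr1\t100.000\t50\t0\t0\t1\t50\t101\t150\t1e-20\t93.5\n' > "$tmp_tab"
printf 'q2\tchr1\t98.000\t50\t1\t0\t1\t50\t400\t351\t1e-18\t87.9\n' >> "$tmp_tab"

blasttab_to_bed6 "$tmp_tab"
rm -f "$tmp_tab"
```

Percent identity, e-value and bit score are not carried over in this sketch, so a real converter has to decide where to keep them if they matter.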
Converting SAM to BAM
Using samtools http://quinlanlab.org/tutorials/samtools/samtools.html This is a lossless format conversion.
Command
samtools view -bS test.sam > test.bam
Converting BAM to SAM using samtools
This is a lossless format conversion.
samtools view -h -o test.sam test.bam
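One way to convince yourself that a round trip really is lossless is to compare the alignment records before and after. The sketch below is a hypothetical helper (same_alignments is not part of samtools) that ignores header lines, since tools may add lines such as @PG, and ignores record order by sorting. The two small SAM files are made up for the demonstration.

```shell
# Sketch: succeed if two SAM files contain the same alignment records,
# ignoring header lines and record order.
same_alignments() {
    left=$(mktemp) && right=$(mktemp) || return 2
    awk '!/^@/' "$1" | LC_ALL=C sort > "$left"
    awk '!/^@/' "$2" | LC_ALL=C sort > "$right"
    cmp -s "$left" "$right"
    status=$?
    rm -f "$left" "$right"
    return $status
}

# Made-up files: same record, but the second has an extra header line.
before=$(mktemp)
after=$(mktemp)
printf '@SQ\tSN:chr1\tLN:100\n' > "$before"
printf 'read1\t0\tchr1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$before"
printf '@SQ\tSN:chr1\tLN:100\n@PG\tID:example\n' > "$after"
printf 'read1\t0\tchr1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$after"

if same_alignments "$before" "$after"; then
    echo "records identical"
else
    echo "records differ"
fi
rm -f "$before" "$after"
```

In practice the second file would be the SAM written back out from the BAM file, and the first would be the original test.sam.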
PS: I dedicate this tutorial to Sej, a great bioinformatician and friend ;)
PPS: Due to the limited number of characters, I had to remove the usage information for each of the tools. A more complete version is cross-posted here.