Hi, I am wondering if there were a BioPython or other tool to help me subset a multiple FASTA file?
Thank you!
Hi, I am wondering if there were a BioPython or other tool to help me subset a multiple FASTA file?
Thank you!
Here's a script that takes (potentially stranded) BED-formatted intervals from standard input. The output is FASTA for the input intervals, written to standard output.
This requires samtools
be findable from your shell. You also need to specify a folder containing whole-genome FASTA files and their associated index files, unless they are in your current working directory.
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
#
# bed2faidxsta.pl
# --
# Reads BED data from standard input, writes FASTA to standard output.
#
# Dependent on samtools and indexed FASTA data files located in $fastaDir
# variable. Set --fastaDir=dir to set custom directory containing a source
# of per-build FASTA and associated index (.fai) files, or leave unset
# to use data in current working directory.
#
# test if samtools is available
`samtools --version` || die "Error: The samtools application is required to run this script.\n";
# default FASTA input directory is present working directory
my $fastaDir = `pwd`; chomp $fastaDir;
# default is to assume input coordinates use zero-based index scheme
my $oneBased;
# default is to leave IDs alone
my $useIDPrefixAsStrand;
GetOptions ('fastaDir=s' => \$fastaDir, 'oneBased' => \$oneBased, 'useIDPrefixAsStrand' => \$useIDPrefixAsStrand) or die("Error: Problem with command line arguments\n");
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
if (!defined($chr) || !defined($start) || !defined($stop)) { die "Error: No chromosome name, start or stop position defined\n"; }
if (! -e "$fastaDir/$chr.fa.fai") { die "Error: Could not locate indexed FASTA file for chromosome $chr\n"; }
if (!defined($id)) { $id = "."; }
if (!defined($score)) { $score = "."; }
if (!defined($strand)) { $strand = "+"; } else { $strand = substr($strand, 0, 1); }
# adjust coordinates to one-based index, if necessary
my ($queryChr, $queryStart, $queryStop) = ($chr, $start, $stop);
if (!$oneBased) {
$queryStart++;
}
# adjust strand if required
if ($useIDPrefixAsStrand) {
$strand = substr($id, 0, 1);
}
# lookup
my $queryKey = "$queryChr:$queryStart-$queryStop";
my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
# linearize result
my @lines = split("\n", $queryResult);
my @seqs = @lines[1..(scalar @lines - 1)];
my $seq = join("", @seqs);
# handle reverse-stranded elements
if ($strand eq "-") {
$seq = rc_sequence($seq);
}
# print to standard output
my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
print STDOUT $header."\n".$seq."\n";
}
sub rc_sequence {
my $seq = shift @_;
my $reverse_complement = reverse($seq);
$reverse_complement =~ tr/ACGTacgt/TGCAtgca/;
return $reverse_complement;
}
Assuming you have your indexed FASTA in the current working directory, you might use it like so:
$ bed2faidxsta.pl < input.bed > output.fa
Or if you have indexed FASTA in another directory (/foo/bar/baz
):
$ bed2faidxsta.pl --fastaDir=/foo/bar/baz < input.bed > output.fa
If you have GFF, you can use BEDOPS gff2bed
to write GFF as BED to standard output, and pipe that to the standard input stream of bed2faidxsta.pl
:
$ gff2bed < input.gff | bed2faidxsta.pl --fastaDir=/foo/bar/baz > output.fa
The gffread utility, included with cufflinks, is a workhorse for gff,gff3,gtf files, and allows to extract from your fasta file the subsequences sequences indexed by your gff formatted intervals with a command like:
gffread your_intervals.gff -g your.fasta -w your_subset.fasta
The manual page details many options including
-g full path to a multi-fasta file with the genomic sequences for
all input mappings, OR a directory with single-fasta files (one
per genomic sequence, with file names matching sequence names)
-w write a fasta file with spliced exons for each GFF transcript
your_intervals do not have to be gene models to extract the underlying sequence. But, if the are, there are more nice options, such as:
-x write a fasta file with spliced CDS for each GFF transcript
-W for -w and -x options, also write for each fasta record the exon
coordinates projected onto the spliced sequence
-y write a protein fasta file with the translation of CDS for each
record
Highly recommended!
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Is there a reason you're not using
bedtools getfasta
?If they have a multifasta for a region across samples, say chr1 for 10 individuals, bedtools getfasta won't work unless the bed is modified accordingly, right? That said, something tells me that getfasta is what they want.
Correct, only a custom program would handle that.
I assume the fasta file contains entire chromosomes/contigs? It would for sure be possible to write a Biopython script for this, but I have the feeling that a tool like that probably already exists.
Doh!
What criteria do you want to use for subsetting? Names, length, size of file?Based on the title my guess is he wants to use intervals in a BED/GFF file :)