[Resolved] Local alternative to galaxy "Extract Genomic DNA using coordinates" tool
4
0
Entering edit mode
9.5 years ago
giroudpaul ▴ 70

Hello,

For a simple script I am writing, I need to extract the genomic data using coordinates, but I would need to do it locally on my computer.

Is the galaxy tool downloadable ? Is there an alternative ? It seems that bedtools can do something like this, but then I need the fasta for mm9 ? Where can I get this ?

Thanks

galaxy • 3.1k views
ADD COMMENT
0
Entering edit mode

Yes, getfasta of BEDtools can do it. mm9 FASTA sequence can be downloaded from UCSC.

ADD REPLY
0
Entering edit mode

Is it in the mm9.2bit file ? How do I extract it ? It say to use their twoBitToFa tool, but I don't get how to install it

ADD REPLY
0
Entering edit mode

No, you need the ChromFa.tar.gz file, which when uncompressed will give you one fasta file per chromosome. You can then create a master fasta file by concatenating all the files into one using 'cat' command.

ADD REPLY
2
Entering edit mode
9.5 years ago

To get mm9 FASTA files via the command-line:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/twoBitToFa
$ chmod +x ./twoBitToFa
$ for i in `seq 1 19` X Y M; do echo "converting chr$i"; ./twoBitToFa -seq=chr$i mm9.2bit chr$i.fa; done

If you are using Linux, get the twoBitToFa Kent tool with the following URL:

$ wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa

Install samtools. On OS X, if you have Homebrew installed, you could use brew install samtools. On Ubuntu, you might run sudo apt-get install samtools. Or on a RedHat-like Linux, you might run sudo yum install samtools.

Index the FASTA files with samtools faidx:

$ for i in `seq 1 19` X Y M; do echo "indexing chr$i"; samtools faidx chr$i.fa; done

Then query coordinates with samtools faidx. Here is a convenience Perl script I wrote that wraps around samtools, which reads stranded or unstranded BED from standard input and writes FASTA to standard output:

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
#
# bed2faidxsta.pl
# --
# Reads BED data from standard input, writes FASTA to standard output.
#
# Dependent on samtools and indexed FASTA data files located in $fastaDir
# variable. Set --fastaDir=dir to set custom directory containing a source
# of per-build, bgzip-compressed FASTA and associated index (fa.gz.fai)
# files, or leave unset to use data in current working directory. Use the
# --fastaIsUncompressed option if the FASTA files are not compressed.
#
# test if samtools is available
`samtools --version` || die "Error: The samtools application is required to run this script. Try 'module add samtools' or install a local copy of samtools.\n";
# default FASTA input is current working directory
my $fastaDir = `pwd`; chomp $fastaDir;
# default is to assume input coordinates use zero-based index scheme
my $oneBased;
# default is to leave IDs alone
my $useIDPrefixAsStrand;
# default is to assume FASTA files are bgzip-compressed
my $fastaIsUncompressed;
GetOptions ('fastaDir=s' => $fastaDir, 'oneBased' => $oneBased, 'useIDPrefixAsStrand' => $useIDPrefixAsStrand, 'fastaIsUncompressed' => $fastaIsUncompressed);
if (! -d $fastaDir) { die "Error: FASTA directory does not exist\n"; }
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
if (!defined($chr) || !defined($start) || !defined($stop)) { die "Error: No chromosome name, start or stop position defined\n"; }
if (!defined($id)) { $id = "."; }
if (!defined($score)) { $score = "."; }
if (!defined($strand)) { $strand = "+"; } else { $strand = substr($strand, 0, 1); }
# adjust coordinates to one-based index, if necessary
my ($queryChr, $queryStart, $queryStop) = ($chr, $start, $stop);
if (!$oneBased) {
$queryStart++;
}
# adjust strand if required
if ($useIDPrefixAsStrand) {
$strand = substr($id, 0, 1);
}
# lookup
my $queryFn = "$fastaDir/$chr.fa.gz";
if ($fastaIsUncompressed) {
$queryFn = "$fastaDir/$chr.fa";
}
my $queryKey = "$queryChr:$queryStart-$queryStop";
my $queryResult = `samtools faidx $queryFn $queryKey`; chomp $queryResult;
# linearize result
my @lines = split("\n", $queryResult);
my @seqs = @lines[1..(scalar @lines - 1)];
my $seq = join("", @seqs);
# handle reverse-stranded elements
if ($strand eq "-") {
$seq = rc_sequence($seq);
}
# print to standard output
my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
print STDOUT $header."\n".$seq."\n";
}
sub rc_sequence {
my $seq = shift @_;
my $reverse_complement = reverse($seq);
$reverse_complement =~ tr/ACGTacgt/TGCAtgca/;
return $reverse_complement;
}
view raw bed2faidxsta.pl hosted with ❤ by GitHub

To use this script, e.g.:

$ ./bed2faidxsta.pl < foo.bed > foo.fa
ADD COMMENT
0
Entering edit mode
9.5 years ago
Ian 6.1k

The following link should also be helpful: Perl To Retrieve Sequences From Ucsc Genome Browser

ADD COMMENT
0
Entering edit mode
9.5 years ago

pyfaidx has a script for this that is easy to install and works well: https://github.com/mdshw5/pyfaidx#cli-script-faidx

ADD COMMENT
0
Entering edit mode
9.5 years ago

samtools faidx can do it too.

ADD COMMENT

Login before adding your answer.

Traffic: 2548 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6