[Resolved] Local alternative to galaxy "Extract Genomic DNA using coordinates" tool

0

9.5 years ago

giroudpaul ▴ 70

Hello,

For a simple script I am writing, I need to extract genomic sequence using coordinates, and I need to do it locally on my computer.

Is the Galaxy tool downloadable? Is there an alternative? It seems that bedtools can do something like this, but then I need the FASTA for mm9. Where can I get it?

Thanks

galaxy • 3.1k views

0

Yes, the getfasta tool in BEDTools can do it. The mm9 FASTA sequence can be downloaded from UCSC.
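As a rough sketch of what a getfasta call looks like, here is a toy example. The directory `getfasta_demo` and the tiny stand-in genome are made up for illustration; with real data you would point `-fi` at the mm9 FASTA and `-bed` at your own coordinates. The bedtools step is skipped if bedtools is not installed.

```shell
# Toy stand-in for a genome FASTA and a BED file (hypothetical names).
mkdir -p getfasta_demo
printf '>chr1\nACGTACGTAC\n' > getfasta_demo/genome.fa
printf 'chr1\t2\t6\tfeat1\t0\t+\n' > getfasta_demo/regions.bed

# -fi: input FASTA, -bed: intervals to extract, -fo: output FASTA.
if command -v bedtools >/dev/null 2>&1; then
    bedtools getfasta -fi getfasta_demo/genome.fa \
        -bed getfasta_demo/regions.bed \
        -fo getfasta_demo/regions.fa
    cat getfasta_demo/regions.fa
else
    echo "bedtools not found; install it to run getfasta" >&2
fi
```

BED starts are zero-based, so the interval 2-6 above covers the third through sixth bases of chr1.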

updated 5.4 years ago by Ram 45k • written 9.5 years ago by Tej Sowpati ▴ 250

0

Is it in the mm9.2bit file? How do I extract it? It says to use their twoBitToFa tool, but I don't understand how to install it.

written 9.5 years ago by giroudpaul ▴ 70

0

No, you need the chromFa.tar.gz file, which, when uncompressed, will give you one FASTA file per chromosome. You can then create a master FASTA file by concatenating all the files into one using the 'cat' command.
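The concatenation step can be sketched like this. The directory `chromfa_demo` and the two tiny per-chromosome files are made up so the commands can be tried without the real download; with the real tarball you would unpack it first, as noted in the comment.

```shell
# Stand-ins for the per-chromosome files (hypothetical content).
mkdir -p chromfa_demo
printf '>chr1\nACGT\n' > chromfa_demo/chr1.fa
printf '>chr2\nTTGA\n' > chromfa_demo/chr2.fa

# With the real download you would first run: tar -xzf chromFa.tar.gz
# Concatenate every per-chromosome file into one master FASTA.
cat chromfa_demo/chr*.fa > chromfa_demo/genome.fa

# Count the sequence headers as a quick sanity check.
grep -c '^>' chromfa_demo/genome.fa
```

Note that the `chr*.fa` glob picks up every matching file in the directory, so check what the tarball unpacked before concatenating.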

written 9.5 years ago by Tej Sowpati ▴ 250

2

9.5 years ago

Alex Reynolds 36k

To get mm9 FASTA files via the command-line:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/twoBitToFa
$ chmod +x ./twoBitToFa
$ for i in `seq 1 19` X Y M; do echo "converting chr$i"; ./twoBitToFa -seq=chr$i mm9.2bit chr$i.fa; done

If you are using Linux, get the twoBitToFa Kent tool from the following URL instead:

$ wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa

Install samtools. On OS X, if you have Homebrew installed, you could use brew install samtools. On Ubuntu, you might run sudo apt-get install samtools. Or on a RedHat-like Linux, you might run sudo yum install samtools.

Index the FASTA files with samtools faidx:

$ for i in `seq 1 19` X Y M; do echo "indexing chr$i"; samtools faidx chr$i.fa; done

Then query coordinates with samtools faidx. Here is a convenience Perl script I wrote that wraps around samtools, which reads stranded or unstranded BED from standard input and writes FASTA to standard output:

	#!/usr/bin/env perl

	use strict;
	use warnings;
	use Getopt::Long;

	#
	# bed2faidxsta.pl
	# --
	# Reads BED data from standard input, writes FASTA to standard output.
	#
	# Dependent on samtools and indexed FASTA data files located in $fastaDir
	# variable. Set --fastaDir=dir to set custom directory containing a source
	# of per-build, bgzip-compressed FASTA and associated index (fa.gz.fai)
	# files, or leave unset to use data in current working directory. Use the
	# --fastaIsUncompressed option if the FASTA files are not compressed.
	#

	# test if samtools is available
	`samtools --version` || die "Error: The samtools application is required to run this script. Try 'module add samtools' or install a local copy of samtools.\n";

	# default FASTA input is current working directory
	my $fastaDir = `pwd`; chomp $fastaDir;
	# default is to assume input coordinates use zero-based index scheme
	my $oneBased;
	# default is to leave IDs alone
	my $useIDPrefixAsStrand;
	# default is to assume FASTA files are bgzip-compressed
	my $fastaIsUncompressed;

	# GetOptions needs references to the variables it should fill in
	GetOptions ('fastaDir=s' => \$fastaDir, 'oneBased' => \$oneBased, 'useIDPrefixAsStrand' => \$useIDPrefixAsStrand, 'fastaIsUncompressed' => \$fastaIsUncompressed);

	if (! -d $fastaDir) { die "Error: FASTA directory does not exist\n"; }

	while (<STDIN>) {
	    chomp;
	    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
	    if (!defined($chr) || !defined($start) || !defined($stop)) { die "Error: No chromosome name, start or stop position defined\n"; }
	    if (!defined($id)) { $id = "."; }
	    if (!defined($score)) { $score = "."; }
	    if (!defined($strand)) { $strand = "+"; } else { $strand = substr($strand, 0, 1); }
	    # adjust coordinates to one-based index, if necessary
	    my ($queryChr, $queryStart, $queryStop) = ($chr, $start, $stop);
	    if (!$oneBased) {
	        $queryStart++;
	    }
	    # adjust strand if required
	    if ($useIDPrefixAsStrand) {
	        $strand = substr($id, 0, 1);
	    }
	    # lookup
	    my $queryFn = "$fastaDir/$chr.fa.gz";
	    if ($fastaIsUncompressed) {
	        $queryFn = "$fastaDir/$chr.fa";
	    }
	    my $queryKey = "$queryChr:$queryStart-$queryStop";
	    my $queryResult = `samtools faidx $queryFn $queryKey`; chomp $queryResult;
	    # linearize result (drop the FASTA header line, join the sequence lines)
	    my @lines = split("\n", $queryResult);
	    my @seqs = @lines[1..(scalar @lines - 1)];
	    my $seq = join("", @seqs);
	    # handle reverse-stranded elements
	    if ($strand eq "-") {
	        $seq = rc_sequence($seq);
	    }
	    # print to standard output
	    my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
	    print STDOUT $header."\n".$seq."\n";
	}

	# return the reverse complement of a DNA sequence, preserving case
	sub rc_sequence {
	    my $seq = shift @_;
	    my $reverse_complement = reverse($seq);
	    $reverse_complement =~ tr/ACGTacgt/TGCAtgca/;
	    return $reverse_complement;
	}

To use this script with the default settings, e.g.:

$ ./bed2faidxsta.pl < foo.bed > foo.fa
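The coordinate handling in the script comes down to one shift: BED starts are zero-based, while samtools faidx regions are one-based and inclusive, so only the start is incremented. A minimal sketch of that conversion with awk follows. The directory `faidx_demo` and its files are made-up stand-ins, and the samtools query only runs if samtools is installed.

```shell
# Toy BED intervals (hypothetical data).
mkdir -p faidx_demo
printf 'chr1\t2\t6\nchr2\t0\t10\n' > faidx_demo/foo.bed

# Zero-based BED start -> one-based samtools region "chr:start-stop".
awk 'BEGIN { FS = "\t" } { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' \
    faidx_demo/foo.bed > faidx_demo/regions.txt
cat faidx_demo/regions.txt

# Query the first region against a toy per-chromosome FASTA.
printf '>chr1\nACGTACGTAC\n' > faidx_demo/chr1.fa
if command -v samtools >/dev/null 2>&1; then
    samtools faidx faidx_demo/chr1.fa
    samtools faidx faidx_demo/chr1.fa "$(head -n 1 faidx_demo/regions.txt)"
else
    echo "samtools not found; skipping the query" >&2
fi
```

Unlike the Perl script, this sketch does not reverse-complement minus-strand intervals.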

updated 5.4 years ago by Ram 45k • written 9.5 years ago by Alex Reynolds 36k

0

9.5 years ago

Ian 6.1k

The following link should also be helpful: Perl To Retrieve Sequences From Ucsc Genome Browser

updated 5.4 years ago by Ram 45k • written 9.5 years ago by Ian 6.1k

0

9.5 years ago

Matt Shirley 10k

pyfaidx has a script for this that is easy to install and works well: https://github.com/mdshw5/pyfaidx#cli-script-faidx
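For reference, a hedged sketch of how the pyfaidx command-line script might be used with a BED file. The directory `pyfaidx_demo` and its toy files are made up for illustration, and the `--bed` option is taken from the pyfaidx README as I remember it, so check `faidx --help` for your installed version. The call is skipped if the `faidx` script is not on your PATH.

```shell
# Toy genome and BED interval (hypothetical data).
mkdir -p pyfaidx_demo
printf '>chr1\nACGTACGTAC\n' > pyfaidx_demo/genome.fa
printf 'chr1\t2\t6\n' > pyfaidx_demo/regions.bed

# Assumes pyfaidx was installed with: pip install pyfaidx
if command -v faidx >/dev/null 2>&1; then
    faidx --bed pyfaidx_demo/regions.bed pyfaidx_demo/genome.fa
else
    echo "faidx not found; try: pip install pyfaidx" >&2
fi
```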

updated 5.4 years ago by Ram 45k • written 9.5 years ago by Matt Shirley 10k
