How to batch save short prespecified sequences from the human genome

0

Entering edit mode

5.2 years ago

jamesebarrett • 0

I would like to save sequences from, say, chr1:100-150, chr1:300-350 etc to a file. I have several hundred such genome regions that I want to do this for using GRCh37. I would be grateful if someone could point me to a quick way of doing this.

Edit: samtools faidx ref.fa chr1:100-150 works nicely. Many thanks to answers below.

sequence • 866 views

ADD COMMENT • link 5.2 years ago by jamesebarrett • 0

1

Entering edit mode

5.2 years ago

Alex Reynolds 36k

If you have a BED file of intervals (in.bed) and a directory of FASTA for your genome assembly of interest (e.g., hg19), you could use a script like bed2faidxsta.pl to write data to a FASTA file (out.fa):

$ ./bed2faidxsta.pl --fastaDir=/path/to/hg19/fasta < in.bed > out.fa

It relies on samtools and indexed FASTA files.

Here's the script:

	#!/usr/bin/env perl

	use strict;
	use warnings;
	use Getopt::Long;

	#
	# bed2faidxsta.pl
	# --
	# Reads BED data from standard input, writes FASTA to standard output.
	#
	# Dependent on samtools and indexed FASTA data files located in $fastaDir
	# variable. Set --fastaDir=dir to set custom directory containing a source
	# of per-build, bgzip-compressed FASTA and associated index (fa.gz.fai)
	# files, or leave unset to use data in current working directory. Use the
	# --fastaIsUncompressed option if the FASTA files are not compressed.
	#

	# test if samtools is available
	`samtools --version` \|\| die "Error: The samtools application is required to run this script. Try 'module add samtools' or install a local copy of samtools.\n";

	# default FASTA input is current working directory
	my $fastaDir = `pwd`; chomp $fastaDir;
	# default is to assume input coordinates use zero-based index scheme
	my $oneBased;
	# default is to leave IDs alone
	my $useIDPrefixAsStrand;
	# default is to assume FASTA files are bgzip-compressed
	my $fastaIsUncompressed;

	GetOptions ('fastaDir=s' => $fastaDir, 'oneBased' => $oneBased, 'useIDPrefixAsStrand' => $useIDPrefixAsStrand, 'fastaIsUncompressed' => $fastaIsUncompressed);

	if (! -d $fastaDir) { die "Error: FASTA directory does not exist\n"; }

	while (<STDIN>) {
	chomp;
	my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
	if (!defined($chr) \|\| !defined($start) \|\| !defined($stop)) { die "Error: No chromosome name, start or stop position defined\n"; }
	if (!defined($id)) { $id = "."; }
	if (!defined($score)) { $score = "."; }
	if (!defined($strand)) { $strand = "+"; } else { $strand = substr($strand, 0, 1); }
	# adjust coordinates to one-based index, if necessary
	my ($queryChr, $queryStart, $queryStop) = ($chr, $start, $stop);
	if (!$oneBased) {
	$queryStart++;
	}
	# adjust strand if required
	if ($useIDPrefixAsStrand) {
	$strand = substr($id, 0, 1);
	}
	# lookup
	my $queryFn = "$fastaDir/$chr.fa.gz";
	if ($fastaIsUncompressed) {
	$queryFn = "$fastaDir/$chr.fa";
	}
	my $queryKey = "$queryChr:$queryStart-$queryStop";
	my $queryResult = `samtools faidx $queryFn $queryKey`; chomp $queryResult;
	# linearize result
	my @lines = split("\n", $queryResult);
	my @seqs = @lines[1..(scalar @lines - 1)];
	my $seq = join("", @seqs);
	# handle reverse-stranded elements
	if ($strand eq "-") {
	$seq = rc_sequence($seq);
	}
	# print to standard output
	my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
	print STDOUT $header."\n".$seq."\n";
	}

	sub rc_sequence {
	my $seq = shift @_;
	my $reverse_complement = reverse($seq);
	$reverse_complement =~ tr/ACGTacgt/TGCAtgca/;
	return $reverse_complement;
	}

view raw bed2faidxsta.pl hosted with ❤ by GitHub

ADD COMMENT • link 5.2 years ago by Alex Reynolds 36k

0

Entering edit mode

Brilliant thank you. I am currently downloading chromFa.tar.gz from https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/. Could you please tell me how I can index these?

ADD REPLY • link 5.2 years ago by jamesebarrett • 0

1

Entering edit mode

5.2 years ago

Jean-Karim Heriche 27k

Use the Ensembl perl API with something like this:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;


my $input = $ARGV[0];

my ($chr,$coords) = split(/:/,$input);
my ($start,$end) = split(/-/,$coords);

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                             -port => 3337,
                             -user => 'anonymous',
                             -passwd => undef,
                             -db_version => '99');
my $slice_adaptor = $registry->get_adaptor( 'Homo sapiens', 'Core', 'Slice' );
# Obtain a slice covering the region
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $start, $end);
print ">$chr:$start-$end\n",$slice->seq(),"\n";

ADD COMMENT • link 5.2 years ago by Jean-Karim Heriche 27k

Login before adding your answer.