How to batch save short prespecified sequences from the human genome
2
0
Entering edit mode
5.2 years ago

I would like to save sequences from, say, chr1:100-150, chr1:300-350 etc to a file. I have several hundred such genome regions that I want to do this for using GRCh37. I would be grateful if someone could point me to a quick way of doing this.

Edit: samtools faidx ref.fa chr1:100-150 works nicely. Many thanks to answers below.

sequence • 866 views
ADD COMMENT
1
Entering edit mode
5.2 years ago

If you have a BED file of intervals (in.bed) and a directory of FASTA for your genome assembly of interest (e.g., hg19), you could use a script like bed2faidxsta.pl to write data to a FASTA file (out.fa):

$ ./bed2faidxsta.pl --fastaDir=/path/to/hg19/fasta < in.bed > out.fa

It relies on samtools and indexed FASTA files.

Here's the script:

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
#
# bed2faidxsta.pl
# --
# Reads BED data from standard input, writes FASTA to standard output.
#
# Dependent on samtools and indexed FASTA data files located in $fastaDir
# variable. Set --fastaDir=dir to set custom directory containing a source
# of per-build, bgzip-compressed FASTA and associated index (fa.gz.fai)
# files, or leave unset to use data in current working directory. Use the
# --fastaIsUncompressed option if the FASTA files are not compressed.
#
# test if samtools is available
`samtools --version` || die "Error: The samtools application is required to run this script. Try 'module add samtools' or install a local copy of samtools.\n";
# default FASTA input is current working directory
my $fastaDir = `pwd`; chomp $fastaDir;
# default is to assume input coordinates use zero-based index scheme
my $oneBased;
# default is to leave IDs alone
my $useIDPrefixAsStrand;
# default is to assume FASTA files are bgzip-compressed
my $fastaIsUncompressed;
GetOptions ('fastaDir=s' => $fastaDir, 'oneBased' => $oneBased, 'useIDPrefixAsStrand' => $useIDPrefixAsStrand, 'fastaIsUncompressed' => $fastaIsUncompressed);
if (! -d $fastaDir) { die "Error: FASTA directory does not exist\n"; }
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
if (!defined($chr) || !defined($start) || !defined($stop)) { die "Error: No chromosome name, start or stop position defined\n"; }
if (!defined($id)) { $id = "."; }
if (!defined($score)) { $score = "."; }
if (!defined($strand)) { $strand = "+"; } else { $strand = substr($strand, 0, 1); }
# adjust coordinates to one-based index, if necessary
my ($queryChr, $queryStart, $queryStop) = ($chr, $start, $stop);
if (!$oneBased) {
$queryStart++;
}
# adjust strand if required
if ($useIDPrefixAsStrand) {
$strand = substr($id, 0, 1);
}
# lookup
my $queryFn = "$fastaDir/$chr.fa.gz";
if ($fastaIsUncompressed) {
$queryFn = "$fastaDir/$chr.fa";
}
my $queryKey = "$queryChr:$queryStart-$queryStop";
my $queryResult = `samtools faidx $queryFn $queryKey`; chomp $queryResult;
# linearize result
my @lines = split("\n", $queryResult);
my @seqs = @lines[1..(scalar @lines - 1)];
my $seq = join("", @seqs);
# handle reverse-stranded elements
if ($strand eq "-") {
$seq = rc_sequence($seq);
}
# print to standard output
my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
print STDOUT $header."\n".$seq."\n";
}
sub rc_sequence {
my $seq = shift @_;
my $reverse_complement = reverse($seq);
$reverse_complement =~ tr/ACGTacgt/TGCAtgca/;
return $reverse_complement;
}
view raw bed2faidxsta.pl hosted with ❤ by GitHub

ADD COMMENT
0
Entering edit mode

Brilliant thank you. I am currently downloading chromFa.tar.gz from https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/. Could you please tell me how I can index these?

ADD REPLY
1
Entering edit mode
5.2 years ago

Use the Ensembl perl API with something like this:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;


my $input = $ARGV[0];

my ($chr,$coords) = split(/:/,$input);
my ($start,$end) = split(/-/,$coords);

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                             -port => 3337,
                             -user => 'anonymous',
                             -passwd => undef,
                             -db_version => '99');
my $slice_adaptor = $registry->get_adaptor( 'Homo sapiens', 'Core', 'Slice' );
# Obtain a slice covering the region
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $start, $end);
print ">$chr:$start-$end\n",$slice->seq(),"\n";
ADD COMMENT

Login before adding your answer.

Traffic: 1399 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6