Question

How to extract ~2M short sequences based on coordinates from a 3G fasta file?

4.7 years ago

kynnjo ▴ 70

(Sorry for the Bioinformatics 101 question!)

I have a file with ~2 million sets of coordinates (chromosome<tab>begin-position<tab>end-position), corresponding to short (~50nt) human genomic sequences (hg19). I want to extract the actual sequences from a human genome assembly 19 fasta file (~3.0G).

I imagine this is a relatively common task, and therefore, that there must be standard tools to carry it out efficiently.

Sadly, my Google fu has not been up to the task of finding them.

I would appreciate not only the name of a tool to use, but also the command line one would use, especially if there are important flags and options I should be aware of when using such a tool.

genome assembly sequence • 879 views

score 3 · Accepted Answer · 2020-07-11

4.7 years ago

kynnjo ▴ 70

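Shortly after posting this question, I found that samtools faidx -r <regions_file> ... does what I need. Note that the regions file must list one region per line in chr:from-to form, so a tab-separated coordinate file has to be converted first.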
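One way to build such a regions file from the tab-separated coordinates described in the question is sketched below in Python. The function name to_samtools_regions and the zero_based_start switch are hypothetical, introduced only for illustration. The sketch assumes the input has exactly the three columns from the question; whether the begin positions are 0-based (BED-style) or 1-based is not stated in the thread, so that choice is left to the caller. samtools region strings are 1-based and inclusive.

```python
def to_samtools_regions(lines, zero_based_start=False):
    """Yield 'chrom:start-end' region strings from tab-separated lines.

    Each input line is expected to hold chromosome, begin, end.
    Set zero_based_start=True if begin positions are 0-based (BED-style),
    since samtools regions are 1-based and inclusive. (Assumption: the
    thread does not say which convention the coordinate file uses.)
    """
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        if zero_based_start:
            start += 1  # shift a 0-based begin to 1-based
        yield f"{chrom}:{start}-{end}"


# Small in-memory demonstration with made-up coordinates.
sample = ["chr1\t10001\t10050\n", "chr2\t20001\t20050\n"]
for region in to_samtools_regions(sample):
    print(region)
```

With the regions written one per line to a file (regions.txt here is a placeholder name, as is hg19.fa), the extraction itself would then be along the lines of: samtools faidx hg19.fa -r regions.txt > sequences.fa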
