Question

Get part of sequence from genome, given a start and stop position with Java.

0

5.6 years ago

ahclugtenberg • 0

I've got VCF-like files with start, stop, REF and ALT columns. I need to check that the REF allele of each variant matches the base(s) at that position in the genome, to confirm that they come from the same build. I also need the nucleotides surrounding the given position. Some of the REF columns are empty, so the files are not valid VCF.

I've got a FASTA file containing the genome sequence for chromosome 1, and I was wondering if there's a library that returns part of the genome as nucleotides, given a start and stop position. For example, given the genome AACCGGTT, a start position of 1 and a stop position of 4, it would return AACC. I could write such a parser myself, but I'd rather use a library that has the edge cases covered.

I'd rather have something that runs locally than use the NCBI API, which can also do this.

genome java vcf • 1.3k views

ADD COMMENT • link updated 5.2 years ago by Biostar 20 • written 5.6 years ago by ahclugtenberg • 0

0

Hi, you can use bedtools getfasta.

Best

ADD REPLY • link 5.6 years ago by Titus ▴ 910

0

samtools faidx, pyfaidx and bedtools getfasta can all retrieve part of a FASTA sequence given a start and stop position. While they are not libraries, they may be options to consider.

@Pierre has his jvarkit, which may have something that will work (if you must use Java): http://lindenb.github.io/jvarkit/

ADD REPLY • link 5.6 years ago by GenoMax 147k

0

If it's anything like BioPython and you absolutely must use Java, there's no doubt something in BioJava which you could use.

I know less than nothing about Java specifically though so can't offer any practical code for this.

ADD REPLY • link 5.6 years ago by Joe 21k

score 1 · Answer 1 · 2019-05-07

1

5.6 years ago

Pierre Lindenbaum 164k

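Use the htsjdk library and its class IndexedFastaSequenceFile: https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/reference/IndexedFastaSequenceFile.html

(...)
// the FASTA file needs a .fai index next to it (e.g. created with samtools faidx)
IndexedFastaSequenceFile faidx = new IndexedFastaSequenceFile(fastaFile);
// start and stop are 1-based and inclusive
String sub = faidx.getSubsequenceAt("chr1", 10, 20).getBaseString();
(...)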
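
For the REF check and the surrounding nucleotides mentioned in the question, the coordinate handling could look roughly like the sketch below. It is only an illustration: it works on a sequence already held in memory as a String (for example the result of getBaseString()), and the class RefCheck and its methods slice, refMatches and withFlanks are hypothetical names, not part of htsjdk. It assumes 1-based, inclusive positions, compares case-insensitively because soft-masked genomes use lower case, and treats an empty REF as "cannot be checked".

```java
/** Hypothetical helper (not part of htsjdk): 1-based, inclusive coordinate handling. */
class RefCheck {

    /** Returns the bases start..stop (1-based, inclusive) of an in-memory sequence. */
    static String slice(String seq, int start, int stop) {
        if (start < 1 || stop < start || stop > seq.length()) {
            throw new IllegalArgumentException(
                "bad interval " + start + "-" + stop + " for sequence length " + seq.length());
        }
        // 1-based inclusive [start, stop] maps to 0-based half-open [start - 1, stop)
        return seq.substring(start - 1, stop);
    }

    /** True if the genome has `ref` at start..stop; an empty or missing REF returns false. */
    static boolean refMatches(String seq, int start, int stop, String ref) {
        if (ref == null || ref.isEmpty()) {
            return false; // assumption: nothing to compare, let the caller decide what to do
        }
        // ignore case so that soft-masked (lower-case) reference bases still match
        return slice(seq, start, stop).equalsIgnoreCase(ref);
    }

    /** Like slice, but extended by up to `flank` bases per side, clamped at the sequence ends. */
    static String withFlanks(String seq, int start, int stop, int flank) {
        int from = Math.max(1, start - flank);
        int to = Math.min(seq.length(), stop + flank);
        return slice(seq, from, to);
    }

    public static void main(String[] args) {
        String genome = "AACCGGTT"; // the example genome from the question
        System.out.println(slice(genome, 1, 4));            // AACC
        System.out.println(refMatches(genome, 3, 4, "cc")); // true
        System.out.println(withFlanks(genome, 4, 4, 2));    // ACCGG
    }
}
```

With a real genome, the String passed in would be a region fetched via getSubsequenceAt rather than a whole chromosome, in which case the positions have to be shifted relative to the start of the fetched region.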

ADD COMMENT • link 5.6 years ago by Pierre Lindenbaum 164k

0

Yes, thank you! I was just looking at this library, but couldn't find the right function.

ADD REPLY • link 5.6 years ago by ahclugtenberg • 0

0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

ADD REPLY • link 5.6 years ago by Pierre Lindenbaum 164k