I've got VCF-like files with start, stop, REF and ALT columns. I need to check that the REF alleles of the variants match the genome at those positions, to confirm that they're from the same build. I also need the nucleotides surrounding each position. Also, some of the REF columns are empty, so these are not valid VCF files.
I've got a FASTA file with the sequence of chromosome 1, and I was wondering whether there's a library that returns part of the genome as nucleotides, given a start and stop position. For example, for the genome AACCGGTT, a start position of 1 and a stop position of 4 should return AACC. I could write such a parser myself, but I'd rather use a library that has the edge cases covered.
I'd rather have something local than use the NCBI API, which also makes this possible.
Hi, you can use bedtools getfasta.
Best
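One thing to watch with bedtools getfasta: it reads intervals in BED format, which is 0-based and half-open, whereas the example in the question (start 1, stop 4 gives AACC) is 1-based and inclusive. Below is a minimal Python sketch of that conversion. The function name to_bed_line is hypothetical, and it assumes the chromosome name matches the FASTA header exactly. The resulting lines could be written to a file and passed along the lines of `bedtools getfasta -fi genome.fa -bed regions.bed -fo out.fa`, where the file names are placeholders.

```python
def to_bed_line(chrom: str, start: int, stop: int) -> str:
    """Convert a 1-based inclusive interval to a 0-based half-open BED line.

    Hypothetical helper for illustration; `chrom` is assumed to match
    the FASTA header name exactly.
    """
    if start < 1 or stop < start:
        raise ValueError(f"invalid interval: {start}-{stop}")
    # BED start is 0-based, so subtract 1; BED end is exclusive, so the
    # 1-based inclusive stop can be used unchanged.
    return f"{chrom}\t{start - 1}\t{stop}"


# The question's example: positions 1 to 4 of AACCGGTT (AACC).
print(to_bed_line("chr1", 1, 4))
```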
samtools faidx, pyfaidx, and bedtools getfasta can all retrieve part of a FASTA sequence given a start and stop position. Not all of them are libraries, but they may be options to consider.
@Pierre has his jvarkit, which may have something that will work (if you must use Java): http://lindenb.github.io/jvarkit/
If it's anything like BioPython, and you absolutely must use Java, there's no doubt something in BioJava which you could use. I know less than nothing about Java specifically, though, so I can't offer any practical code for this.
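To make the REF check from the question concrete, here is a rough, standard-library-only Python sketch. It is not how any of the tools above work internally: it loads the whole sequence into memory, which an indexed reader such as samtools faidx or pyfaidx avoids. All function names (read_fasta, fetch, check_ref) and the flank parameter are hypothetical, coordinates are assumed to be 1-based and inclusive as in the question, and an empty REF is reported as None rather than as a mismatch.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a {name: sequence} dict (whole file in memory)."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def fetch(seq, start, stop):
    """Return seq[start..stop], with 1-based inclusive coordinates."""
    if start < 1 or stop < start or stop > len(seq):
        raise ValueError(f"interval {start}-{stop} outside 1-{len(seq)}")
    return seq[start - 1:stop]


def check_ref(seq, start, ref, flank=2):
    """Compare REF with the genome and return (matches, context).

    `matches` is None when REF is empty; `context` is the REF span plus
    up to `flank` bases on each side, clipped at the sequence ends.
    """
    if ref:
        stop = start + len(ref) - 1
        matches = fetch(seq, start, stop).upper() == ref.upper()
    else:
        stop = start  # assumption: treat an empty REF as one position
        matches = None
    left = max(1, start - flank)
    right = min(len(seq), stop + flank)
    return matches, fetch(seq, left, right)


# Toy genome from the question, split over two lines as in a real FASTA file.
genome = read_fasta(io.StringIO(">chr1 toy example\nAACC\nGGTT\n"))["chr1"]
print(fetch(genome, 1, 4))        # AACC
print(check_ref(genome, 3, "C"))  # (True, 'AACCG')
```

If many variants disagree with the genome at their stated positions, that suggests the variants were called against a different build, or that the coordinates are 0-based rather than 1-based.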