How to retrieve/extract specific intergenic regions?

5.1 years ago

geizetomazetto • 0

Hi Folks,

A database was constructed for a set of target protein sequences, including protein_ID, start codon, stop codon, BioSample, and Assembly Accession. In other words, the data contain many “XX” protein sequences from different bacterial genomes.

Now I would like to use this information to retrieve/extract only the sequences after the stop codon. So far, I have only found ways to retrieve/extract all intergenic regions, but I need one specific region. I also tried to adapt the shell script named "upstream.sh" from https://www.ncbi.nlm.nih.gov/books/NBK179288/, without success.

Could anyone give me some tips?

python extract perl intergenic region • 1.1k views


For a bacterial genome with no introns, and if you have the gff / gtf genome annotation file, bedtools subtract or bedops --difference would do the trick.
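To show what that subtraction amounts to, here is a rough pure-Python sketch rather than the bedtools or bedops implementation. It assumes gene coordinates are 0-based, half-open (BED style) on a single replicon of known length; the function name `intergenic_regions` is made up for illustration.

```python
def intergenic_regions(genes, chrom_size):
    """Return the gaps between annotated genes on one replicon.

    genes: iterable of (start, end) tuples, assumed 0-based and half-open.
    chrom_size: length of the replicon in bases.
    """
    gaps = []
    cursor = 0  # end of the furthest gene seen so far
    for start, end in sorted(genes):
        if start > cursor:
            # Uncovered stretch between the previous genes and this one
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping or nested genes
    if cursor < chrom_size:
        # Trailing stretch after the last gene
        gaps.append((cursor, chrom_size))
    return gaps
```

This treats the replicon as linear, so on a circular chromosome the first and last gaps would really be one region.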

5.1 years ago by h.mon 35k

Hi,

I appreciate your hints. However, for E. coli alone there are 8,000 sequences. I am not sure that downloading the GFF files and extracting the regions from them would be that easy.

5.1 years ago by geizetomazetto • 0

Can you provide an example of your input table? Specifically, what are the start and stop coordinates you have? As long as you can get to a BED-like file with chromosome, start position, end position, and strand information, you can always download the entire genomes in FASTA format first and then use bedtools getfasta to extract the downstream sequences.
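The strand handling is the part that usually trips people up, so here is a minimal Python sketch of the idea. It is not a replacement for bedtools getfasta. It assumes BED-style coordinates (0-based, half-open) in which `end` is the position just past the stop codon on the plus strand, and that the genome is already loaded as a dict of sequence name to string. The function names and the default `length` of 100 bases are arbitrary choices for illustration.

```python
# Translation table for complementing DNA (N is left as N)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def downstream_interval(start, end, strand, length, chrom_size):
    """Return the (start, end) interval of up to `length` bases after the gene.

    On the plus strand "downstream" lies past `end`; on the minus strand
    it lies before `start`. The interval is clipped to the sequence bounds.
    """
    if strand == "+":
        return end, min(end + length, chrom_size)
    if strand == "-":
        return max(start - length, 0), start
    raise ValueError(f"unknown strand: {strand!r}")


def extract_downstream(genome, chrom, start, end, strand, length=100):
    """Extract the sequence following the stop codon, read 5'->3' on the gene's strand."""
    seq = genome[chrom]
    ds_start, ds_end = downstream_interval(start, end, strand, length, len(seq))
    region = seq[ds_start:ds_end]
    # Minus-strand genes are read from the opposite strand
    return reverse_complement(region) if strand == "-" else region
```

If your table stores 1-based GFF-style coordinates instead, subtract 1 from the start before calling these functions.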

5.1 years ago by vkkodali_ncbi ★ 3.8k
