Extract several sequences from genome in FASTA format with genomic coordinates.
3
4
Entering edit mode
9.8 years ago
Ander ▴ 50

Hi pals, I got some problems here.

I have an excel file with several genomic coordinates in the format "strand:start:end", and I also have access to my genome of interest in different formats (fna, gbk...). Now, what I want to do is to extract and obtain a file with all the sequences corresponding to my coordinates, I don't mind if the sequences have a header or a custom ID, I just want the sequences in the same order as I have them in my excel file. Is there any tool that can do this? It would be much appreciated since I have to extract more than 2000 sequences, and doing this manually is humanly impossible.

The next step is to perform a BLAST with another genome, also accessible in several formats. I'm using the blastn NCBI program and this script -task blastn -query "query" -subject "subject" -oytfmt 6 >name, how can I make the output file to only contain the most relevant matches? (E.g. only the matches that have an E value lower than 0.0001)

Thanks beforehand!

genome blast • 11k views
ADD COMMENT
4
Entering edit mode
9.8 years ago

If you're using Windows, save your excel file as text with tab delimiters, and first column chromosome name, second start, third end (this is BED format. Then install Python3, then open cmd.exe:

C:\Python34\Scripts\pip.exe install pyfaidx
C:\Python34\Scripts\faidx.exe genome.fna --bed regions.bed > out.fasta
ADD COMMENT
0
Entering edit mode

Hi Matt Shirley,

I tried your approach, and although it's confortable to work in Windows, I find Linux Bedtools more suitable for my current project. Thanks for you help!!

ADD REPLY
2
Entering edit mode
9.8 years ago

Save your excel data as tab delimited file.

Use bedtools getfasta to fetch sequences:

bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>
ADD COMMENT
0
Entering edit mode

Hi Geek_y, thanks for your answer!!

I'm pretty new to this and don't know much about LINUX based operating systems or bioinformatic tools. I will run ubuntu, try to do as you adviced and report later.

ADD REPLY
0
Entering edit mode

Hello again Geek_y,

I tried as you described and didn't worked well, I modified the order of the tabs so they were chromosome:start:end:name:0:strand and got best results, thanks for your help and introducing me to Bedtools!!

ADD REPLY
0
Entering edit mode

Position in bed files start from 0. I want to get the bases at the position range depending on the variant call format (VCF). Should I consider the position shift?

Many thanks.

ADD REPLY
0
Entering edit mode

I'm having trouble, using NCBI, which file goes in which parameter in the above command?

Maybe I am not looking in the right place for the files? I used the ftp directory: ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000409795.2_Chlorocebus_sabeus_1.1, and this is what I see when I "ls" in that directory:

 dr-xr-xr-x   4 ftp      anonymous     4096 Aug 19  2014
 GCF_000409795.2_Chlorocebus_sabeus_1.1_assembly_structure
 GCF_000409795.2_Chlorocebus_sabeus_1.1_cds_from_genomic.fna.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_feature_table.txt.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.gbff.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.gff.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_protein.faa.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_protein.gpff.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_rm.out.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_rm.run
 GCF_000409795.2_Chlorocebus_sabeus_1.1_rna.fna.gz
 GCF_000409795.2_Chlorocebus_sabeus_1.1_rna.gbff.gz
 -r--r--r--   1 ftp      anonymous 43521485 Jun 10 15:52  
 Chlorocebus_sabeus_1.1_rna_from_genomic.fna.gz 

README.txt

README.txt annotation_hashes.txt md5checksums.txt

**I'm sorry that the above file list is poorly formatted, it looks like a normal list when I type it.

I think GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.gff.gz is the gff file for the -bed parameter, but which of the files below is my -fi parameter?

When I run: ./bedtools2/bin/bedtools getfasta -fi GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna -bed GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.gff -fo test3.out

localhost:NCBIGenomes$ grep ">" test3.out |wc
 1815133 1815133 56029598

The aim is to obtain the ~20,000 protein coding sequences in fasta format for the above species.

ADD REPLY
0
Entering edit mode

That command is correct.

The GFF file contains separate lines for exons etc so you are going to need to first select only the CDS annotations from the GFF file.

grep "CDS" GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.gff > vervet_cds.gff
bedtools getfasta -fi GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna -bed vervet_cds.gff -fo out.fa

This still has 800K+ sequences since the GFF file contains a boatload of transcript variants.

ADD REPLY
0
Entering edit mode

Thank you so much. While I appreciate the reply, I'm obviously missing something in my understanding!

I ran the above commands, and I have the out.fa output file. In total, there are 800,763 sequences in the output file. I can see that the first "gene/transcript"(?) in this file is NC_023642.1. This gene name occurs 45,153 times in the file. Then, for this gene, I pulled out the longest sequence with this gene name in the file. It's 172,096 base pairs long, and contains both capital and small letters. This is obviously not the longest canonical transcript for this gene.

So then on NCBI, I searched for NC_023642.1 and I can see that it is Chlorocebus sabaeus isolate 1994-021 chromosome 1, Chlorocebus_sabeus 1.1, whole genome shotgun sequence; so it seems to be a section of DNA, rather than a gene. In addition, there is no gene names in the _genomic.fna file.

So then I wondered was I meant to change the -fi parameter to the cds_from_genomic.fna file, which has gene names like >lcl|NW_003943604.1_cds_XP_010335103.1_1 [gene=LOC101049931] [db_xref=GeneID:101049931] [protein=breast cancer type 2 susceptibility protein], which is more along the lines of what I was expecting the gene names to look like, instead of just NC_023642.1. Except this is obviously wrong, because when I ran the command:

bedtools getfasta -fi GCF_000409795.2_Chlorocebus_sabeus_1.1_cds_from_genomic.fna -bed vervet_cds.gff -fo out.fa

There is just a lot or errors (e.g. WARNING. chromosome (NC_023650.1) was not found in the FASTA file. Skipping.) and an empty file at the end.

Just to clarify, this is the FTP directory that I got the above files from in NCBI: /genomes/all/GCF_000409795.2_Chlorocebus_sabeus_1.1

So then I looked into the vervet_cds.gff file, and I noticed that it has the gene, protein and gene lengths. So I wondered was I meant to parse the gff file to obtain a set of longest transcripts from this file, and then add that newly parsed file to the bed tools command.

So I wrote a script that goes to the GFF file, and pulls out the gene name and the gene length (i.e. gene end - gene start). Then, for the transcript with the longest gene length per gene (i.e. GeneID), I extracted that line of the GFF file. So then I thought I would have a list of the longest transcripts, and I would just extract these from the .fna file.

So I ran the command:

./bedtools2/bin/bedtools getfasta -fi GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna -bed vervet_cds_test_out -fo test4.out

But my output makes no sense, it's just like this:

>NC_023642.1:847922-847995 GAGTCTTGTATGGCAGCCACTTGGCTTCTCCGTTGTTGGTGACCACCTCGTCCTGAGAGATCACAATTTTGTC
>NC_023642.1:3297687-3297772 GTGCTGTTTGAGACTTTCAACATACCTGCCCTGTATTTAGCTAACCAAGGAGTCCTTTCTCTGTACGCCTCTGGCCAGACCTCTG
>NC_023642.1:9147058-9147135 TGCTCAACGGAGCCATGTTCTTCAACCGCCTGTCCTTGGAGTTTCTGGCCATCGAGTACCGGGAGGAGAACCACTGA
>NC_023642.1:3250338-3250552 GGGTTGTGAACAAGTATCAGATCAACGGCCTGCAAGCCTGGCTCCTCACGCATCTGCTCTGGTTTGCAAACGCTCATCTCCTGTCCTGGTTCTCGCCCACCATCATCTTCGACAACTGGATCCCATTGCTGTGGTGTGCCAACATCCTTGGCTATGCCGTCTCCACCTTCGCCATGGTCAAGGGCTACTTCTTCCCCACCAGCGCCAGAGACTG
>NC_023642.1:8037595-8037731 CTGCGTGCTGTGGCCCAAATCTGGGGAGCGCTCGCTGTCCGAACTGTTGGTCCTCTCACTCAAGGGCTTGGAGAGCTGCCGCTCCTTCAGGGGCGTGCTCCTCTTCTGCCGGGGTGTGTGCACCCCTTCCACTTTG
>NC_023642.1:6341744-6341813 CGCCTTTCCTTCGAGGAGCTCAaggcatttttcaaaatctgACAGTttttctcacaacaaccctgtgag

Again, not an output I can work with and I've clearly gone wrong. I KNOW it has to be easy, or at least easier than I'm making it out to be, to extract a set of protein coding genes from NCBI, I know it's a really widely used database, I'm just not getting it.

If you could shed any light on what I'm doing wrong, or the exact steps I would use to obtain the set of protein coding sequences I would really appreciate it. This is just an example of a species, I need them for lots of mammals species, hence why I need to minimise what I do manually.

Thank you.

ADD REPLY
0
Entering edit mode

I also tried another way. You can see here: here this is a list of 20,620 protein coding genomic sequences for Cebus.

I pulled out the GI's for the first three protein coding genes, and put the first three GI's into this script that I found on BioStars here

I can see here the output is:

>GI  103249076  ABAAQ0183067 Homo sapiens 5' cDNA fragments, random primer derived CRL-2429 CCD-1112Sk RIKEN Cap Analysis Gene Expression (CAGE) library 
ttagaataggtagggacagt

>GI  103249075  ABAAQ0183066 Homo sapiens 5' cDNA fragments, random primer derived CRL-2429 CCD-1112Sk RIKEN Cap Analysis Gene Expression (CAGE) library 
ggcttggggcggcgggggt

>GI  103249072  ABAAQ0183063 Homo sapiens 5' cDNA fragments, random primer derived CRL-2429 CCD-1112Sk RIKEN Cap Analysis Gene Expression (CAGE) library 
gtcctgccagagcggcgctc

For 103249076, I then went here, and I can see from the Genomic Sequence section that the sequences should be longer. I also emailed NCBI, they suggested that I use Entrez Direct, which was what led me to my original question where I was having trouble using the Entrez Direct route: here.

So basically, I think at this point I am just completely confused, and I'm sure I've probably managed to confuse you too. To clarify, the aim is: download the set of longest canonical coding sequences (DNA, not protein), for the species Chlorocebus_sabeus_1.1 (although in reality, I have to do it for a set of species, this is just an example). The problem: Although I can get the tools like bed tools, entrez direct and python code described above to run, I just cannot get them to give me the answer that i want. I don't know if it's that I'm not using the right files (I obtained the data from /genomes/all/GCF_000409795.2_Chlorocebus_sabeus_1.1, maybe this isn't right), or I'm not parsing the right file, or in the right way. If you could help I would appreciate it.

ADD REPLY
0
Entering edit mode

Can we go back to the original thread for further discussion?

I think solutions 2 or 3 I had suggested in that thread would be the way to go. I am going to try them out myself later today. I will let you know what happens. We will get this sorted out.

ADD REPLY
0
Entering edit mode
9.6 years ago

Use samtools (http://samtools.sourceforge.net/). Install it. Save your excel file in a tab separated text file.

Index your sequences using

samtools faidx fasta.fa

This will generate a fasta.fai file (indexed file).

Now use the following perl script

#!/usr/bin/perl -w

open (COORDINATES, "yourtabseparated.txt");

while(<COORDINATES>){
    chomp;
    my ($name,$start,$end)=split(/\s/);
    open(PROGRAM, "full/path/to/samtools faidx full/path/to/fasta.fa $name:$start-$end |"),

    while(my $line = <PROGRAM>){
        print $line
    }

    close(PROGRAM);
}
close(COORDINATES);

Save this code in a file with .pl extension with modifications. And run as below

perl file.pl

This should do the job.

ADD COMMENT

Login before adding your answer.

Traffic: 2442 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6