Get cds using bedtools
3
0
Entering edit mode
8.1 years ago
User000 ▴ 710

Hello, I have a gff3 file. I'm using the following command line to extract sequences for each CDS:

bedtools getfasta -fi pseudomolecules.fasta -bed all.gff3 -fo CDS.fa

chr1A   rnaseq  CDS 992  1945   .   +   0   mRNA_1
chr1B   rnaseq  CDS 72477  25295    .   +   0   mRNA_7
chr2D   rnaseq  CDS 13  725     .   +   0   mRNA_6

1)Is there a way to get the last lines (mRNA_1) added to the header as well? my header is >chr1A:992-1945 and my wanted header is >chr1A:992-1945:mRNA_1

2) Is this command line going to take me the coordinates from each chromosome, because I have one fasta file containing all chromosomes.

bedtools • 6.9k views
ADD COMMENT
0
Entering edit mode

Hi,

I am also trying to use bedtools to extract CDS from a genome fasta. However, I don't undertstand what the bed file is? Do I have to modify my gff file to only contain the hits of CDS?

ADD REPLY
0
Entering edit mode

Do I have to modify my gff file to extract only CDS from my genome fasta? The bedtools getfasta manual is not informative to me.

ADD REPLY
0
Entering edit mode

If you only need CDS then yes but it looks like @Dario's answer below is doing that already. Your gff file will need to have CDS entries in column 3 for that to work.

ADD REPLY
3
Entering edit mode
8.1 years ago

Here's my suggestion:

  • Convert gff to bed where the name field (4th column) has the information you want

  • Run getfasta with -name option to assign fasta names from the name field

Not tested:

awk -v OFS='\t' -v FS='\t' '$3 == "CDS" {print $1, $4-1, $5, $1 ":" $4 "-" $5 ":" $9}' all.gff3 \
| bedtools getfasta -name -fi pseudomolecules.fasta -bed - -fo CDS.fa

Turning gff to bed should be just a matter extracting columns 1, 4, 5 and subtracting 1 from the 4th column since bed is 0-based while gff is 1-based.

ADD COMMENT
0
Entering edit mode

Seems to work, thanks! I have a Q, in the command line I used it give me headers like this i.e. for the 1st line (chr1A rnaseq CDS 992 1945 . + 0 mRNA_1) chr1A:991-1945 instead of 992, in your command line it is giving me chr1A:992-1945:mRNA_1. So it takes 1st nucleotide as 0, but in urs as 1, why is this difference? This is not so important as both cases are ok, I wanted just to understand.

ADD REPLY
2
Entering edit mode

That difference is explained in his last sentence. See also: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

ADD REPLY
0
Entering edit mode

you are right! thanks

ADD REPLY
2
Entering edit mode
8.1 years ago
venu 7.1k

1.After generating fasta file with getfasta, create another file with only mRNA field, mRNA.txt.

Linearize fasta | paste - - | paste - mRNA.txt | awk -F'\t' '{print $1":"$3"\n"$2}' | fold -w 60 > newFasta.fa

2.Would you be more specific?

P.S: Linearize fasta

ADD COMMENT
0
Entering edit mode

with only mRNA field using getbasta? how to do it? or just using a simple list of mRNA's? could you explain better Linearize fasta?

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode
8.1 years ago
biocyberman ▴ 870

1) Checkout the help for the command below. I think you can experiment with -name or -fullHeader to see if you get what you want.

bedtools getfasta -h                                                                               

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.26.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
        -fi     Input FASTA file
        -bed    BED/GFF/VCF file of ranges to extract from -fi
        -name   Use the name field for the FASTA header
        -split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
        -tab    Write output in TAB delimited format.
                - Default is FASTA format.

        -s      Force strandedness. If the feature occupies the antisense,
                strand, the sequence will be reverse complemented.
                - By default, strand information is ignored.

        -fullHeader     Use full fasta header.
                - By default, only the word before the first space or tab is used.

2) Not sure what you are asking, but the coordinates, for example chr1A:992-1945 is specific to chr1A.

ADD COMMENT
0
Entering edit mode

How can I experiment...? is not clear at all..

ADD REPLY

Login before adding your answer.

Traffic: 1839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6