Question

Get cds using bedtools

0

Entering edit mode

8.1 years ago

User000 ▴ 710

Hello, I have a gff3 file. I'm using the following command line to extract sequences for each CDS:

bedtools getfasta -fi pseudomolecules.fasta -bed all.gff3 -fo CDS.fa

chr1A   rnaseq  CDS 992  1945   .   +   0   mRNA_1
chr1B   rnaseq  CDS 72477  25295    .   +   0   mRNA_7
chr2D   rnaseq  CDS 13  725     .   +   0   mRNA_6

1)Is there a way to get the last lines (mRNA_1) added to the header as well? my header is >chr1A:992-1945 and my wanted header is >chr1A:992-1945:mRNA_1

2) Is this command line going to take me the coordinates from each chromosome, because I have one fasta file containing all chromosomes.

bedtools • 6.8k views

ADD COMMENT • link updated 7.1 years ago by molly77 ▴ 10 • written 8.1 years ago by User000 ▴ 710

0

Entering edit mode

Hi,

I am also trying to use bedtools to extract CDS from a genome fasta. However, I don't undertstand what the bed file is? Do I have to modify my gff file to only contain the hits of CDS?

ADD REPLY • link 7.1 years ago by molly77 ▴ 10

0

Entering edit mode

Do I have to modify my gff file to extract only CDS from my genome fasta? The bedtools getfasta manual is not informative to me.

ADD REPLY • link 7.1 years ago by molly77 ▴ 10

0

Entering edit mode

If you only need CDS then yes but it looks like @Dario's answer below is doing that already. Your gff file will need to have CDS entries in column 3 for that to work.

ADD REPLY • link 7.1 years ago by GenoMax 146k

2

Entering edit mode

8.1 years ago

venu 7.1k

1.After generating fasta file with getfasta, create another file with only mRNA field, mRNA.txt.

Linearize fasta | paste - - | paste - mRNA.txt | awk -F'\t' '{print $1":"$3"\n"$2}' | fold -w 60 > newFasta.fa

2.Would you be more specific?

P.S: Linearize fasta

ADD COMMENT • link 8.1 years ago by venu 7.1k

0

Entering edit mode

with only mRNA field using getbasta? how to do it? or just using a simple list of mRNA's? could you explain better Linearize fasta?

ADD REPLY • link 8.1 years ago by User000 ▴ 710

1

Entering edit mode

fasta lineraization A: Useful Bash Commands To Handle Fasta Files or here

ADD REPLY • link 8.1 years ago by Medhat 9.8k

0

Entering edit mode

8.1 years ago

biocyberman ▴ 870

1) Checkout the help for the command below. I think you can experiment with -name or -fullHeader to see if you get what you want.

bedtools getfasta -h                                                                               

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.26.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
        -fi     Input FASTA file
        -bed    BED/GFF/VCF file of ranges to extract from -fi
        -name   Use the name field for the FASTA header
        -split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
        -tab    Write output in TAB delimited format.
                - Default is FASTA format.

        -s      Force strandedness. If the feature occupies the antisense,
                strand, the sequence will be reverse complemented.
                - By default, strand information is ignored.

        -fullHeader     Use full fasta header.
                - By default, only the word before the first space or tab is used.

2) Not sure what you are asking, but the coordinates, for example chr1A:992-1945 is specific to chr1A.

ADD COMMENT • link 8.1 years ago by biocyberman ▴ 870

0

Entering edit mode

How can I experiment...? is not clear at all..

ADD REPLY • link 8.1 years ago by User000 ▴ 710

score 3 · Accepted Answer · 2016-10-14

3

Entering edit mode

8.1 years ago

dariober 15k

Here's my suggestion:

Convert gff to bed where the name field (4th column) has the information you want
Run getfasta with -name option to assign fasta names from the name field

Not tested:

awk -v OFS='\t' -v FS='\t' '$3 == "CDS" {print $1, $4-1, $5, $1 ":" $4 "-" $5 ":" $9}' all.gff3 \
| bedtools getfasta -name -fi pseudomolecules.fasta -bed - -fo CDS.fa

Turning gff to bed should be just a matter extracting columns 1, 4, 5 and subtracting 1 from the 4th column since bed is 0-based while gff is 1-based.

ADD COMMENT • link 8.1 years ago by dariober 15k

0

Entering edit mode

Seems to work, thanks! I have a Q, in the command line I used it give me headers like this i.e. for the 1st line (chr1A rnaseq CDS 992 1945 . + 0 mRNA_1) chr1A:991-1945 instead of 992, in your command line it is giving me chr1A:992-1945:mRNA_1. So it takes 1st nucleotide as 0, but in urs as 1, why is this difference? This is not so important as both cases are ok, I wanted just to understand.

ADD REPLY • link 8.0 years ago by User000 ▴ 710

2

Entering edit mode

That difference is explained in his last sentence. See also: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems