Question

Extract Sequences From [Gtf File] + [Genome Fasta File], combine matching XLOCs

1

Entering edit mode

8.9 years ago

Tom Koch ▴ 110

Hi,

I found someone, years ago, had a very similar (almost the exact same) question, but it involved using Galaxy, and I would like to use the command line. My goal: Take a sequence of XLOC exons, get their sequences from the genome, and concatenate them. Perhaps blast them afterward, but that can be done separately.

So far I've gotten very close, using the 'bioio' fasta program to have scriptable output.

while IFS=$'\t' read -r -a myArray
do
    tr -d '"'  | awk -v FASTAOUTPUT="`fasta $1":"$4"-"$5`" -F '[;\t ]' '$10 ~ /XLOC/ {printf(">%s\n",$10 $13); system("fasta ~/mm.fa " $1":"$4"-"$5)}'  
done < merged.gtf

The results are pretty good.

>XLOC_000010TCONS_00000036
GAAAACATGTTATCTGAAGAGAAGCAGAGGATCATGCTCCTAGAACGG
>XLOC_000010TCONS_00000036
CATTGCAGTTGAAAGAAGAAGAAAACAAGCGGTTAAATCAAAGACTG
>XLOC_000010TCONS_00000036
TGTCTCAGAGTTTGTCCTCAGTCTCTTCAAGGCATTCTGAAAAAATAGCCATTAGAGA
>XLOC_000010TCONS_00000036
TTTCAGGTGGGAGATTTGGTTCTCATCATCCTAGATGAGCGGCACGACAATTATGTATTGTTTACTGTTAGTCCTACTTTATATTTTCTGCACTCAGAGTCTCTTCCTGCCCTGGATCTCAAACCAGGTGAGGGAG
>XLOC_000010TCONS_00000036
TTCAGGTGCATCTAGAAGACCCTGGGTCCTTGGGAAAGTAATGGAAAAGGAATACTGTCAAGCCAAAAAG

However, the dream is to take those lines and turn them, as long as the XLOC and transcript_id (here, $13) match, into a long sequence.

The other reason I'm posting this is so people googling and who see the Galaxy post might also find this post.

Thanks for your time.

RNA-Seq blast genome bash • 4.8k views

ADD COMMENT • link updated 8.9 years ago by GouthamAtla 12k • written 8.9 years ago by Tom Koch ▴ 110

0

Entering edit mode

Can you post how one of your XLOC looks in GTF file ?

ADD REPLY • link 8.9 years ago by GouthamAtla 12k

0

Entering edit mode

XLOC in GTF: http://pastebin.com/WW5aNfZH

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by Tom Koch ▴ 110

Ram · Answer 1 · 2016-01-26

Are you trying to construct a transcriptome from GTF and a fasta ? Tophat could construct a transcriptome if you provide a genome (and index) and a gtf file.

Example GTF:

chr1    Cufflinks    exon    700    900    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "3"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1000    1200    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "4"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1500    1600    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "5"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1800    1900    .    -    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1905    1950    .    -    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "2"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1980    2000    .    -    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    2005    2050    .    -    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000004"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";

Command:

tophat2 -T -G example_gtf.gtf --transcriptome-index=transcriptome hg19mini

Output fasta:

>0 TCONS_00000003 chr1+ 700-900,1000-1200,1500-1600
.......
>1 TCONS_00000004 chr1- 1800-1900,1905-1950,1980-2000,2005-2050
.......

If your goal is to concatenate XLOC instead of TCONS, just remove the transcript_ids and replace gene_id with transcript _id

Example:

chr1    Cufflinks    exon    700    900    .    +    .    transcript_id "XLOC_000001"; exon_number "3"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1000    1200    .    +    .    transcript_id "XLOC_000001"; exon_number "4"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1500    1600    .    +    .    transcript_id "XLOC_000001"; exon_number "5"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1800    1900    .    -    .    transcript_id "XLOC_000002"; exon_number "1"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1905    1950    .    -    .    transcript_id "XLOC_000002"; exon_number "2"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1980    2000    .    -    .    transcript_id "XLOC_000002"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    2005    2050    .    -    .    transcript_id "XLOC_000003"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";

Output fasta:

>0 XLOC_000001 chr1+ 700-900,1000-1200,1500-1600
.......
>1 XLOC_000002 chr1- 1800-1900,1905-1950,1980-2000
.......
>2 XLOC_000003 chr1- 2005-2050
.......