Question

Extracting string from every other line

7.4 years ago

samuel ▴ 260

I have an amended fasta file like so:

>ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 [Source:HGNC Symbol;Acc:HGNC:42816]
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT
>ENST00000576449.1 ncrna chromosome:GRCh38:CHR_HSCHR18_1_CTG1_1:50319002:50319120:1 gene:ENSG00000262132.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP458 description:RNA, 5S ribosomal pseudogene 458 [Source:HGNC Symbol;Acc:HGNC:43358]
TTTCTATGGCATACCAACCTGAGTGTGCCCAGTCTCATCCAATCTCAGAACGTAAGCAGGATTGGGCCTGGTTAGAACTTGGATGGGAAAATGCCAGTTAAAATCTGTACTAAAAAATT

and an amended gtf file like so:

1       ENSEMBL gene    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; level 3;
1       ENSEMBL transcript      9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; level 3; transcript_support_level "NA"; tag "basic";
1       ENSEMBL exon    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; exon_number 1; exon_id "ENSE00002089424.1"; level 3; transcript_support_level "NA"; tag "basic";

I want to extract the gene_name from the gtf file, i.e. RNA5SP40, and the corresponding ENSG* from either the gtf or the fasta file, and then print the matching fasta sequence on the following line, i.e.:

RNA5SP40|ENSG00000252956.1
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT

I am a complete beginner at programming and don't really know where to start. I could probably use awk to extract the gene name and ENSG* from the same file but wouldn't know how to match this to print out the fasta sequence from the other file?? Please help!

sequencing alignment sequence • 1.5k views

updated 7.4 years ago by Matteo Schiavinato ★ 3.6k • written 7.4 years ago by samuel ▴ 260

score 2 · Answer 1 · 2017-07-19

sed -e 's/.*gene:/>/' -e 's/gene_biotype.*gene_symbol://' -e 's/description.*//' FILE.fasta | awk '{ if (substr($0, 1, 1) == ">") print $1 "|" $2; else print $0 }'

This will:

substitute everything which comes before "gene:" with just ">"
remove the part between "gene_biotype" and "gene_symbol"
remove everything from "description" on
join the two remaining fields with a pipe ("|") on the fasta header lines only (sequence lines stay as they are)

>ENSG00000262132.1|RNA5SP458
TTTCTATGGCATACCAACCTGAGTGTGCCCAGTCTCATCCAATCTCAGAACGTAAGCAGGATTGGGCCTGGTTAGAACTTGGATGGGAAAATGCCAGTTAAAATCTGTACTAAAAAATT

It is not guaranteed to work 100% of the time: some of your sequences might have the header fields in a different order (even though they usually don't). That is why it is useful to know a language with dictionary support (Python, Perl, etc.), so that you can store each header field as a key:value pair and retrieve only the values you want by key.
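For what it's worth, here is a minimal Python sketch of that dictionary idea. It assumes every header looks like the ones in the question (space-separated key:value fields, with the free-text description last). The function names parse_header and relabel are made up for illustration, and I have kept the leading ">" so the result is still valid fasta, which is also an assumption. Because the fields are looked up by key, their order in the header no longer matters, and the gene symbol comes first as the question asked.

```python
def parse_header(header):
    """Turn one fasta header line into a dict of its key:value fields."""
    # The description is free text with spaces and colons, so cut it off first.
    main, sep, description = header.partition(" description:")
    fields = {}
    if sep:
        fields["description"] = description
    tokens = main.lstrip(">").split()
    # tokens[0] is the transcript ID; the remaining tokens are key:value pairs.
    for token in tokens[1:]:
        key, colon, value = token.partition(":")
        if colon and key not in fields:
            fields[key] = value
    return fields


def relabel(lines):
    """Yield the input lines with each header rewritten as >symbol|gene_id."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = parse_header(line)
            symbol = fields.get("gene_symbol")
            gene_id = fields.get("gene")
            if symbol and gene_id:
                yield f">{symbol}|{gene_id}"
            else:
                # Header lacks one of the fields: leave it untouched.
                yield line
        else:
            yield line


# First record from the question, used here as example input.
example = [
    ">ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 "
    "gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA "
    "gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 "
    "[Source:HGNC Symbol;Acc:HGNC:42816]",
    "GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGG"
    "AGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT",
]

for out_line in relabel(example):
    print(out_line)
# First line printed: >RNA5SP40|ENSG00000252956.1
```

To run it on a real file you could pass an open file handle instead of the example list, e.g. relabel(open("FILE.fasta")). Headers that lack a gene or gene_symbol field are passed through unchanged rather than guessed at.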