Extracting string from every other line
7.4 years ago
samuel ▴ 260

I have an amended fasta file like so:

>ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 [Source:HGNC Symbol;Acc:HGNC:42816]
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT
>ENST00000576449.1 ncrna chromosome:GRCh38:CHR_HSCHR18_1_CTG1_1:50319002:50319120:1 gene:ENSG00000262132.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP458 description:RNA, 5S ribosomal pseudogene 458 [Source:HGNC Symbol;Acc:HGNC:43358]
TTTCTATGGCATACCAACCTGAGTGTGCCCAGTCTCATCCAATCTCAGAACGTAAGCAGGATTGGGCCTGGTTAGAACTTGGATGGGAAAATGCCAGTTAAAATCTGTACTAAAAAATT

and an amended gtf file like so:

1       ENSEMBL gene    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; level 3;
1       ENSEMBL transcript      9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; level 3; transcript_support_level "NA"; tag "basic";
1       ENSEMBL exon    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; exon_number 1; exon_id "ENSE00002089424.1"; level 3; transcript_support_level "NA"; tag "basic";

I want to extract the gene_name from the gtf file (i.e. RNA5SP40) and the corresponding ENSG* ID from either the gtf or the fasta file, and then print the matching fasta sequence on the following line, i.e.:

RNA5SP40|ENSG00000252956.1
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT

I am a complete beginner at programming and don't really know where to start. I could probably use awk to extract the gene name and ENSG* ID from the same file, but I wouldn't know how to match these to print out the fasta sequence from the other file. Please help!

sequencing alignment sequence • 1.5k views
7.4 years ago
sed -e 's/.*gene:/>/' -e 's/gene_biotype.*gene_symbol://' -e 's/description.*//' FILE.fasta | awk '{if (substr($0, 1, 1) == ">") {print $1"|"$2} else {print $0}}'

This will:

  1. substitute everything which comes before "gene:" with just ">"
  2. remove the part between "gene_biotype" and "gene_symbol"
  3. remove everything from "description" on
  4. concatenate the two strings you want with a pipe ("|") only in the fasta name (sequence stays as it is)

    >ENSG00000262132.1|RNA5SP458
    TTTCTATGGCATACCAACCTGAGTGTGCCCAGTCTCATCCAATCTCAGAACGTAAGCAGGATTGGGCCTGGTTAGAACTTGGATGGGAAAATGCCAGTTAAAATCTGTACTAAAAAATT

This is not guaranteed to work for every record: some of your sequences might have the header fields in a different order (even though they usually don't). That is why it helps to know a language with dictionary support (Python, Perl, etc.), so that you can parse each header into key:value pairs and retrieve only the fields you want by key.
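As a rough illustration of that dictionary idea, here is a minimal Python sketch. It assumes every header looks like the two shown in the question, i.e. space-separated `key:value` fields followed by a free-text `description:` at the end. The function names `parse_header` and `relabel_fasta` are made up for this example, and the symbol-first order follows what the question asked for.

```python
def parse_header(header):
    """Turn an Ensembl-style FASTA header into a dict of key:value fields.

    Assumption: fields are space-separated 'key:value' tokens and the
    free-text 'description:' field, if present, comes last.
    """
    main, sep, description = header.lstrip(">").partition(" description:")
    fields = {}
    for token in main.split():
        key, colon, value = token.partition(":")
        if colon:  # skip tokens without a colon, e.g. the transcript ID
            fields[key] = value
    if sep:
        fields["description"] = description
    return fields


def relabel_fasta(lines):
    """Yield FASTA lines with headers rewritten as '>gene_symbol|gene'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = parse_header(line)
            if "gene" in fields and "gene_symbol" in fields:
                yield ">" + fields["gene_symbol"] + "|" + fields["gene"]
            else:
                yield line  # leave headers we cannot parse untouched
        elif line:
            yield line  # sequence lines pass through unchanged


# First record from the question, used as in-memory input
example = [
    ">ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 "
    "gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA "
    "gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 "
    "[Source:HGNC Symbol;Acc:HGNC:42816]",
    "GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGG"
    "AGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT",
]

result = list(relabel_fasta(example))
for out in result:
    print(out)
```

Because the fields are looked up by name, this keeps working if `gene_symbol` appears before `gene` in some headers. To run it on a real file, pass an open file handle to `relabel_fasta` instead of the `example` list.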
