Entering edit mode
4.6 years ago
marcelolaia
▴
10
I try the script suggested here, but, it retrieve more fasta that is needed.
I have:
Exon-file.fa
>exon-EucgrC_t009-1 transcript=rna-EucgrC_t009 gene=gene-EucgrC_t009 name=trnE-UUC seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302717.1-11 transcript=rna-NM_001302717.1 gene=gene-MIOX name=MIOX seq_id=NW_010092440.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-2 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-4 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-5 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-6 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-7 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-1 transcript=rna-TRNAA-AGC gene=gene-TRNAA-AGC name=TRNAA-AGC seq_id=NW_010092438.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-10-1 transcript=rna-TRNAA-AGC-10 gene=gene-TRNAA-AGC-10 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-11-1 transcript=rna-TRNAA-AGC-11 gene=gene-TRNAA-AGC-11 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-12-1 transcript=rna-TRNAA-AGC-12 gene=gene-TRNAA-AGC-12 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-13-1 transcript=rna-TRNAA-AGC-13 gene=gene-TRNAA-AGC-13 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-14-1 transcript=rna-TRNAA-AGC-14 gene=gene-TRNAA-AGC-14 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-15-1 transcript=rna-TRNAA-AGC-15 gene=gene-TRNAA-AGC-15 name=TRNAA-AGC seq_id=NW_010092448.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAF-GAA-14-1 transcript=rna-TRNAF-GAA-14 gene=gene-TRNAF-GAA-14 name=TRNAF-GAA seq_id=NW_010092718.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-XM_010025031.2-2 transcript=rna-XM_010025031.2 gene=gene-LOC104414038 name=LOC104414038 seq_id=NW_010092445.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104451586-21 transcript=gene-LOC104451586 gene=nbis-pseudogene-2383 name=LOC104451586 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>nbis-exon-54 transcript=gene-EucgrC_p059 gene=nbis-gene-55 name=rpl16 seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
ListID-file.csv
ExonID
exon-EucgrC_t009-1
exon-NM_001302717.1-11
exon-NM_001302727.1-2
exon-TRNAA-AGC-1
exon-TRNAF-GAA-14-1
id-LOC104447927
id-LOC104451586-21
nbis-exon-54
Output.fa that I expect
>exon-EucgrC_t009-1 transcript=rna-EucgrC_t009 gene=gene-EucgrC_t009 name=trnE-UUC seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302717.1-11 transcript=rna-NM_001302717.1 gene=gene-MIOX name=MIOX seq_id=NW_010092440.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-2 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-1 transcript=rna-TRNAA-AGC gene=gene-TRNAA-AGC name=TRNAA-AGC seq_id=NW_010092438.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAF-GAA-14-1 transcript=rna-TRNAF-GAA-14 gene=gene-TRNAF-GAA-14 name=TRNAF-GAA seq_id=NW_010092718.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104451586-21 transcript=gene-LOC104451586 gene=nbis-pseudogene-2383 name=LOC104451586 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>nbis-exon-54 transcript=gene-EucgrC_p059 gene=nbis-gene-55 name=rpl16 seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
Output.fa that I see when I run @Mehmet script
>exon-EucgrC_t009-1 transcript=rna-EucgrC_t009 gene=gene-EucgrC_t009 name=trnE-UUC seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302717.1-11 transcript=rna-NM_001302717.1 gene=gene-MIOX name=MIOX seq_id=NW_010092440.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-2 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-4 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-5 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-6 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-7 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-1 transcript=rna-TRNAA-AGC gene=gene-TRNAA-AGC name=TRNAA-AGC seq_id=NW_010092438.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-10-1 transcript=rna-TRNAA-AGC-10 gene=gene-TRNAA-AGC-10 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-11-1 transcript=rna-TRNAA-AGC-11 gene=gene-TRNAA-AGC-11 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-12-1 transcript=rna-TRNAA-AGC-12 gene=gene-TRNAA-AGC-12 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-13-1 transcript=rna-TRNAA-AGC-13 gene=gene-TRNAA-AGC-13 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-14-1 transcript=rna-TRNAA-AGC-14 gene=gene-TRNAA-AGC-14 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-15-1 transcript=rna-TRNAA-AGC-15 gene=gene-TRNAA-AGC-15 name=TRNAA-AGC seq_id=NW_010092448.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAF-GAA-14-1 transcript=rna-TRNAF-GAA-14 gene=gene-TRNAF-GAA-14 name=TRNAF-GAA seq_id=NW_010092718.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-XM_010025031.2-2 transcript=rna-XM_010025031.2 gene=gene-LOC104414038 name=LOC104414038 seq_id=NW_010092445.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-2 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-3 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-4 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-5 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-6 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-7 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-8 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-9 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104451586-21 transcript=gene-LOC104451586 gene=nbis-pseudogene-2383 name=LOC104451586 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>nbis-exon-54 transcript=gene-EucgrC_p059 gene=nbis-gene-55 name=rpl16 seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
Please, could you help me? Thank you!
Hi,
I think if you change
$0 ~ pattern
to$1 ~ pattern
it will give you the right output.You can also try this one:
Please make sure there's no empty lines in your files. Also I suggest you remove the header (ExonID) from the text even though it's not causing any problems in your case.
If there are still extra sequences in the output, then you can try regular grep just to see what pattern is picking the extra lines:
Hi!
No! It do not solve that behavior .
It is effective in select only the desired IDs, but, return only the first line of the fasta sequence.
Thank you so much!
Are you sure? Because when I changed it to $1 and tried your sample data, I got correct results, but when it was $0 I got extra sequences as you mentioned.
If you'd like to use my command line, you need single line sequences. You can use the command line in this link to convert your multi-line sequences into single line:
C: Multiline Fasta To Single Line Fasta
You can remove the first line of the output (empty line)
Have you tried any other solutions? This is one of the top 5 probably requested tasks, so there are no shortage of pretty much ready to go solutions. Take a look at the bar on the right hand side which may contain relevant threads.
Yes! I did a search in the forum and found a lot of approaches here. @Mehmet script is the first one I tried. Second one was @emazep perl script. I started the perl script yesterday, at night, and it become in a loop until today. It run overnight without end. So, I decided to post a question to improve the first one script, @Mehmet. Thank you so much!