Extract fasta sequences based on ids file
3
0
Entering edit mode
4.6 years ago
marcelolaia ▴ 10

I try the script suggested here, but, it retrieve more fasta that is needed.

I have:

Exon-file.fa

>exon-EucgrC_t009-1 transcript=rna-EucgrC_t009 gene=gene-EucgrC_t009 name=trnE-UUC seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302717.1-11 transcript=rna-NM_001302717.1 gene=gene-MIOX name=MIOX seq_id=NW_010092440.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-2 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-4 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-5 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-6 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-7 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-1 transcript=rna-TRNAA-AGC gene=gene-TRNAA-AGC name=TRNAA-AGC seq_id=NW_010092438.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-10-1 transcript=rna-TRNAA-AGC-10 gene=gene-TRNAA-AGC-10 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-11-1 transcript=rna-TRNAA-AGC-11 gene=gene-TRNAA-AGC-11 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-12-1 transcript=rna-TRNAA-AGC-12 gene=gene-TRNAA-AGC-12 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-13-1 transcript=rna-TRNAA-AGC-13 gene=gene-TRNAA-AGC-13 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-14-1 transcript=rna-TRNAA-AGC-14 gene=gene-TRNAA-AGC-14 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-15-1 transcript=rna-TRNAA-AGC-15 gene=gene-TRNAA-AGC-15 name=TRNAA-AGC seq_id=NW_010092448.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAF-GAA-14-1 transcript=rna-TRNAF-GAA-14 gene=gene-TRNAF-GAA-14 name=TRNAF-GAA seq_id=NW_010092718.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-XM_010025031.2-2 transcript=rna-XM_010025031.2 gene=gene-LOC104414038 name=LOC104414038 seq_id=NW_010092445.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104451586-21 transcript=gene-LOC104451586 gene=nbis-pseudogene-2383 name=LOC104451586 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>nbis-exon-54 transcript=gene-EucgrC_p059 gene=nbis-gene-55 name=rpl16 seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG

ListID-file.csv

ExonID
exon-EucgrC_t009-1
exon-NM_001302717.1-11
exon-NM_001302727.1-2
exon-TRNAA-AGC-1
exon-TRNAF-GAA-14-1
id-LOC104447927
id-LOC104451586-21
nbis-exon-54

Output.fa that I expect

>exon-EucgrC_t009-1 transcript=rna-EucgrC_t009 gene=gene-EucgrC_t009 name=trnE-UUC seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302717.1-11 transcript=rna-NM_001302717.1 gene=gene-MIOX name=MIOX seq_id=NW_010092440.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-2 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-1 transcript=rna-TRNAA-AGC gene=gene-TRNAA-AGC name=TRNAA-AGC seq_id=NW_010092438.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAF-GAA-14-1 transcript=rna-TRNAF-GAA-14 gene=gene-TRNAF-GAA-14 name=TRNAF-GAA seq_id=NW_010092718.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104451586-21 transcript=gene-LOC104451586 gene=nbis-pseudogene-2383 name=LOC104451586 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>nbis-exon-54 transcript=gene-EucgrC_p059 gene=nbis-gene-55 name=rpl16 seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG

Output.fa that I see when I run @Mehmet script

>exon-EucgrC_t009-1 transcript=rna-EucgrC_t009 gene=gene-EucgrC_t009 name=trnE-UUC seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302717.1-11 transcript=rna-NM_001302717.1 gene=gene-MIOX name=MIOX seq_id=NW_010092440.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-2 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-4 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-5 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-6 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-7 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-1 transcript=rna-TRNAA-AGC gene=gene-TRNAA-AGC name=TRNAA-AGC seq_id=NW_010092438.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-10-1 transcript=rna-TRNAA-AGC-10 gene=gene-TRNAA-AGC-10 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-11-1 transcript=rna-TRNAA-AGC-11 gene=gene-TRNAA-AGC-11 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-12-1 transcript=rna-TRNAA-AGC-12 gene=gene-TRNAA-AGC-12 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-13-1 transcript=rna-TRNAA-AGC-13 gene=gene-TRNAA-AGC-13 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-14-1 transcript=rna-TRNAA-AGC-14 gene=gene-TRNAA-AGC-14 name=TRNAA-AGC seq_id=NW_010092444.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-15-1 transcript=rna-TRNAA-AGC-15 gene=gene-TRNAA-AGC-15 name=TRNAA-AGC seq_id=NW_010092448.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAF-GAA-14-1 transcript=rna-TRNAF-GAA-14 gene=gene-TRNAF-GAA-14 name=TRNAF-GAA seq_id=NW_010092718.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-XM_010025031.2-2 transcript=rna-XM_010025031.2 gene=gene-LOC104414038 name=LOC104414038 seq_id=NW_010092445.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-2 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-3 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-4 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-5 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-6 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-7 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-8 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104447927-9 transcript=gene-LOC104447927 gene=nbis-pseudogene-2213 name=LOC104447927 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>id-LOC104451586-21 transcript=gene-LOC104451586 gene=nbis-pseudogene-2383 name=LOC104451586 seq_id=NW_010092443.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>nbis-exon-54 transcript=gene-EucgrC_p059 gene=nbis-gene-55 name=rpl16 seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG

Please, could you help me? Thank you!

sequence fasta perl awk sed • 1.9k views
ADD COMMENT
1
Entering edit mode

Hi,

I think if you change $0 ~ pattern to $1 ~ pattern it will give you the right output.

You can also try this one:

cat id_file.txt | while read id; do grep -A 1 ">$id " seq-file.fasta  >> output_file.fasta ; done

Please make sure there's no empty lines in your files. Also I suggest you remove the header (ExonID) from the text even though it's not causing any problems in your case.

If there are still extra sequences in the output, then you can try regular grep just to see what pattern is picking the extra lines:

cat id_file.txt | while read id; do grep -o ">$id " seq-file.fasta --color=always ; done
ADD REPLY
0
Entering edit mode

Hi!

I think if you change $0 ~ pattern to $1 ~ pattern it will give you the right output.

No! It do not solve that behavior .

cat id_file.txt | while read id; do grep -A 1 ">$id " seq-file.fasta  >> output_file.fasta ; done

It is effective in select only the desired IDs, but, return only the first line of the fasta sequence.

>exon-XM_010027051.2-12 transcript=(.....)
tTCCCGTGTTGTTGACTTGATTGTCCACATGGCGTCCTGCGCTTGAATGGAAATCCGTCT

Thank you so much!

ADD REPLY
0
Entering edit mode

Are you sure? Because when I changed it to $1 and tried your sample data, I got correct results, but when it was $0 I got extra sequences as you mentioned.

If you'd like to use my command line, you need single line sequences. You can use the command line in this link to convert your multi-line sequences into single line:

C: Multiline Fasta To Single Line Fasta

You can remove the first line of the output (empty line)

ADD REPLY
0
Entering edit mode

Have you tried any other solutions? This is one of the top 5 probably requested tasks, so there are no shortage of pretty much ready to go solutions. Take a look at the bar on the right hand side which may contain relevant threads.

ADD REPLY
0
Entering edit mode

Yes! I did a search in the forum and found a lot of approaches here. @Mehmet script is the first one I tried. Second one was @emazep perl script. I started the perl script yesterday, at night, and it become in a loop until today. It run overnight without end. So, I decided to post a question to improve the first one script, @Mehmet. Thank you so much!

ADD REPLY
2
Entering edit mode
4.6 years ago
GenoMax 147k

Using seqkit (LINK or install with conda)

$ more id_file
exon-EucgrC_t009-1
exon-NM_001302717.1-11
exon-NM_001302727.1-2
exon-TRNAA-AGC-1

$ seqkit grep -f id_file seq_file
>exon-EucgrC_t009-1 transcript=rna-EucgrC_t009 gene=gene-EucgrC_t009 name=trnE-UUC seq_id=NC_014570.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302717.1-11 transcript=rna-NM_001302717.1 gene=gene-MIOX name=MIOX seq_id=NW_010092440.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-NM_001302727.1-2 transcript=rna-NM_001302727.1 gene=gene-EGM2 name=EGM2 seq_id=NW_010092442.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
>exon-TRNAA-AGC-1 transcript=rna-TRNAA-AGC gene=gene-TRNAA-AGC name=TRNAA-AGC seq_id=NW_010092438.1 type=exon
ACTG....ACTG....ACTG.....ACTG.....ACTG
ADD COMMENT
1
Entering edit mode
4.6 years ago
Juke34 8.9k

You can use gaas_fasta_extract_sequence_from_id.pl from GAAS.

ADD COMMENT
0
Entering edit mode
4.6 years ago
marcelolaia ▴ 10

I solved it run:

$ ./faSomeRecords in.exons.fna id-file.txt out.fa

faSomeRecords can be downloaded from here

Thank you!

ADD COMMENT

Login before adding your answer.

Traffic: 2241 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6