Dear All,
I have a big embl file that has over 1000 IDs, and I want to split it based on IDs. Each ID should be a separate file containing all information of related ID, and should have the ID name.
How to do that?
Dear All,
I have a big embl file that has over 1000 IDs, and I want to split it based on IDs. Each ID should be a separate file containing all information of related ID, and should have the ID name.
How to do that?
One option:
$ awk -v RS="ID " -v FS="\n" {'n = split($1, a, ";"); sub(/ /, "_", a[1]); a[1] = a[1]".embl"; outFn = a[1]; record = "ID "$0; printf("%s", record) > outFn;'} ./in.embl
Given test input:
$ more test.embl
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03519 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03520 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03521 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
Run the script:
$ awk -v RS="ID " -v FS="\n" {'n = split($1, a, ";"); sub(/ /, "_", a[1]); a[1] = a[1]".embl"; outFn = a[1]; record = "ID "$0; printf("%s", record) > outFn;'} ./test.embl
This creates four files, one for each record, named by the record's ID:
$ ls -al AA*.embl
-rw-r--r-- 1 areynolds wheel 519 Oct 26 15:01 AA03518_standard.embl
-rw-r--r-- 1 areynolds wheel 519 Oct 26 15:01 AA03519_standard.embl
-rw-r--r-- 1 areynolds wheel 519 Oct 26 15:01 AA03520_standard.embl
-rw-r--r-- 1 areynolds wheel 519 Oct 26 15:01 AA03521_standard.embl
For example:
$ more AA03520_standard.embl
ID AA03520 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
If IDs are not uniquely namable, then this could cause previous results to get overwritten. It may be safer to use >>
instead of >
within the awk
statement.
This awk
script will create how ever many files as there are ID lines, at most. See the use of redirection operator >
inside the awk
script. Each record is written to its own file (so long as the record ID is unique).
If you want to count ahead of time how many ID files you should expect to come out of this script, you could run grep /$ID/ in.embl | wc -l
.
The grep
statement returns lines that start with ID
, and wc -l
counts them.
It would help to look at your EMBL-formatted file and see what changes are needed to the awk script.
Can you post the first 10 ID lines to your question? You could run something like:
$ grep /$ID/ in.embl | head -10 > firstTenIDs.txt
Then edit your question and add the contents of firstTenIDs.txt
to it, so that we can see what you're working with.
This is one of embl files produced after the script.
ID scaffold00001 scaffold00001; ; ; DNA; ; UNC; 6279 BP. XX AC scaffold00001; XX DE scaffold00001 XX OS . OC . XX FH Key Location/Qualifiers FT CDS join(371..871,936..953) FT /source="EVM" FT /locus_tag="species1_s00001.1" FT CDS join(2234..2326,2407..2652,2880..3080) FT /source="EVM" FT /locus_tag="species1_s00001.2" FT CDS complement(join(5766..5891,5280..5603,4956..5138)) FT /source="EVM" FT /locus_tag="species1_s00001.3" SQ Sequence 6279 BP; 1875 A; 1232 C; 1136 G; 1736 T; 300 other; TATTCGTATt cAGAAGAAGA GCAGCTGAGT AGTCATCAAA GCGAAGAAAT AAAGTTGCCG 60 GCCCAAGTAG TATAGTGGTT AGTATTCCAC GTTGTGGCCG TGGCGACACA GGTTCGATTC 120 CTGTCTTGGG CAAGTGTtCT TTTTtGAGTT ACAgTtCGGG GGgAACTAAC CcAGAaTTGT 180 TTATTGAGGA TGGTCGTATT TtACTAtACT TTTTTtAAAA GACCGCGAAA ATTTATTTTC 240 AAATCTGTTG CAAAATTTTA TGAGATAACT TTTTCCAATT TCTTTtGGGG TTTTCAGCCC 300
But I would like to have as this one:
ID scaffold00001; SV 1; linear; unassigned DNA; STD; UNC; 6279 BP. XX FH Key Location/Qualifiers FH FT CDS 371..871 FT CDS 936..953 FT CDS 371..953 FT CDS 2234..2326 FT CDS 2407..2652 FT CDS 2880..3080 FT CDS complement(4956..5138) FT CDS complement(5280..5603) FT CDS complement(5766..5891) FT CDS 2234..3080 FT CDS complement(4956..5891) XX SQ Sequence 6279 BP; 1875 A; 1232 C; 1136 G; 1736 T; 300 other; TATTCGTATt cAGAAGAAGA GCAGCTGAGT AGTCATCAAA GCGAAGAAAT AAAGTTGCCG 60 GCCCAAGTAG TATAGTGGTT AGTATTCCAC GTTGTGGCCG TGGCGACACA GGTTCGATTC 120 CTGTCTTGGG CAAGTGTtCT TTTTtGAGTT ACAgTtCGGG GGgAACTAAC CcAGAaTTGT 180 TTATTGAGGA TGGTCGTATT TtACTAtACT TTTTTtAAAA GACCGCGAAA ATTTATTTTC 240 AAATCTGTTG CAAAATTTTA TGAGATAACT TTTTCCAATT TCTTTtGGGG TTTTCAGCCC 300
Can you post the first 10 ID lines to your question? You could run something like:
$ grep /$ID/ in.embl | head -10 > firstTenIDs.txt
Then edit your question and add the contents of firstTenIDs.txt to it, so that we can see what you're working with.
You can add four spaces in front of each line to put the text into “code” form.
@OP: please post first embl record for better parsing.
Note: Before proceeding, create a test directory, copy the original embl (test.embl below) and run the script and check the output for random files.
OP embl ID line:
ID scaffold00001; SV 1; linear; unassigned DNA; STD; UNC; 6279 BP.
based on OP ID line,
$ grep -i "ID" test.embl | awk '{gsub(";",""); print $2}' | parallel "awk '/{}/,/\/\//' test.embl > {}.embl"
Above script should create multiple embl files with ID as file name.
Example embl file from post Split one embl file into several (close to OP embl IDs):
$ cat test.embl
ID comp0_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE len=205 path=[1:0-135 1445:136-204]
XX
SQ Sequence 205 BP; 64 A; 54 C; 31 G; 56 T; 0 other;
GTATTGAACT GCAGAGCATT AAATGCTGCA ACTCAGTGCT TAGAATTCAT TAGATTCAGA 60
GCAACGAACC CTAAATACTG AGCTGTCCCA TTAAATACTC TGCAGTTCAA TACTTAGCAT 120
TCACCATTAA ACATAACACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC 180
CTGTAGGCTC TCTCCACGGT TACCT 205
//
ID comp0_c0_seq2; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE len=205 path=[4094:0-135 1445:136-204]
XX
SQ Sequence 205 BP; 59 A; 50 C; 35 G; 61 T; 0 other;
AGAGTATTAA ATGTTGCAGT TCAGTGCTTA AAATTTATTG GATTCAGAGA ATCTTCAAAT 60
TCAACGGACC CTAAACACTG AGCTGTCGCA TTAAATGCTC TGCAGTTCAA TGCTTAGCTT 120
TCACCATTAA GCATAGCACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC 180
CTGTAGGCTC TCTCCACGGT TACCT 205
//
ID comp1_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 244 BP.
XX
DE len=244 path=[3:0-88 875:89-243]
XX
SQ Sequence 244 BP; 71 A; 51 C; 63 G; 59 T; 0 other;
GCAGAATTTA AGGCTATGAA TCAGGAGGTT CATAATTCCT TAAGGAGGGG AGTATGATGC 60
GGAGCATCCA CGCTCACCTC CACTCCACCG CATTGTCTTC GAGCTGTGAC AGCCAGCGCA 120
TAATATTCAA GAGCTATTGA CAGGTGTTGA AACGCGGCAG CCTTGCATAC TATTGAAGGA 180
CCACGTTTCA TTATTGTGAT CTATAAGAAG ACAGCTGATG CGATCATGAG GAAGGAAGAA 240
GGCT 244
//
output:
$ ls *.embl
comp0_c0_seq1.embl comp0_c0_seq2.embl comp1_c0_seq1.embl test.embl
.
$ more comp0_c0_seq1.embl
ID comp0_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE len=205 path=[1:0-135 1445:136-204]
XX
SQ Sequence 205 BP; 64 A; 54 C; 31 G; 56 T; 0 other;
GTATTGAACT GCAGAGCATT AAATGCTGCA ACTCAGTGCT TAGAATTCAT TAGATTCAGA 60
GCAACGAACC CTAAATACTG AGCTGTCCCA TTAAATACTC TGCAGTTCAA TACTTAGCAT 120
TCACCATTAA ACATAACACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC 180
CTGTAGGCTC TCTCCACGGT TACCT 205
//
$ more comp1_c0_seq1.embl
ID comp1_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 244 BP.
XX
DE len=244 path=[3:0-88 875:89-243]
XX
SQ Sequence 244 BP; 71 A; 51 C; 63 G; 59 T; 0 other;
GCAGAATTTA AGGCTATGAA TCAGGAGGTT CATAATTCCT TAAGGAGGGG AGTATGATGC 60
GGAGCATCCA CGCTCACCTC CACTCCACCG CATTGTCTTC GAGCTGTGAC AGCCAGCGCA 120
TAATATTCAA GAGCTATTGA CAGGTGTTGA AACGCGGCAG CCTTGCATAC TATTGAAGGA 180
CCACGTTTCA TTATTGTGAT CTATAAGAAG ACAGCTGATG CGATCATGAG GAAGGAAGAA 240
GGCT 244
//
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Split one embl file into several
I know this. What I want is to split big embl file and write output of each ID in a output file based on ID.