How does one use rename in seqkit to change sequence labels?
2.2 years ago
shpak.max ▴ 50

As a follow-up to an earlier question, I am trying to figure out how to use seqkit to change duplicated read names.

Unfortunately, the documentation and help for rename are very limited, and I don't see how the example listed can be applied to a case like mine:

https://bioinf.shenwei.me/seqkit/usage/#rename

Specifically, if I have a FASTQ file in which two different reads have the same name, what do I do to append an _N to the second occurrence of a name?

In other words, given file.fastq, what input and output arguments do I need so that

seqkit rename <some arguments> file.fastq <some arguments> outfile.fastq

returns outfile.fastq with duplicate names changed to <name>_N?


What have you tried? Create a small test input from your FASTQ (about 10 unique reads plus 1-2 duplicates) and test against it; the manual is pretty straightforward on how to use the tool. Pro tip: use -n (--by-name), which checks for duplication by the full name line instead of just the ID, just in case. Run the examples below with and without -n to see the difference:

echo -e ">a comment\nacgt\n>b comment of b\nACTG\n>a comment2\naaaa\n>a comment\nbbbbb"  | seqkit rename
echo -e ">a comment\nacgt\n>b comment of b\nACTG\n>a comment2\naaaa\n>a comment\nbbbbb"  | seqkit rename -n

Note that seqkit rename will not give you <full_name_line>_N but <id_part>_N <rest_of_name_line>.

2.2 years ago
GenoMax 147k

Besides seqkit, you can also use rename.sh from the BBMap suite.

rename.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> addpairnum=t

Back to your seqkit question.

$ more test.fq
@HWI-D00:74:H1493:2:1101:1186:2441
TTTTCTGCTACACTTCGAAAAACATATGATTCGTCTTTTCAGTTAGTTAAATATGTCTATAAGATGCCATCAATTAAAAAAAGTAACTTCACTATAATCG
+
CCCFFFFFGHHHGIEIJIJJGIIJJJJJHIJIGGGIJJJIJIIIJJJJJJJIJJJGIJJEIIIJJJJFHGIFIIGHIHHHFDD@CDDEEDDDDDDDDEDD
@HWI-D00:74:H1493:2:1101:1186:2441
TTTTCTGCTACACTTCGAAAAACATATGATTCGTCTTTTCAGTTAGTTAAATATGTCTATAAGATGCCATCAATTAAAAAAAGTAACTTCACTATAATCG
+
CCCFFFFFGHHHGIEIJIJJGIIJJJJJHIJIGGGIJJJIJIIIJJJJJJJIJJJGIJJEIIIJJJJFHGIFIIGHIHHHFDD@CDDEEDDDDDDDDEDD

Now run seqkit rename on it. It appends _2 to the ID of the second sequence with the duplicated name, at the end of the FASTQ header:

$ seqkit rename test.fq
@HWI-D00:74:H1493:2:1101:1186:2441
TTTTCTGCTACACTTCGAAAAACATATGATTCGTCTTTTCAGTTAGTTAAATATGTCTATAAGATGCCATCAATTAAAAAAAGTAACTTCACTATAATCG
+
CCCFFFFFGHHHGIEIJIJJGIIJJJJJHIJIGGGIJJJIJIIIJJJJJJJIJJJGIJJEIIIJJJJFHGIFIIGHIHHHFDD@CDDEEDDDDDDDDEDD
@HWI-D00:74:H1493:2:1101:1186:2441_2 
TTTTCTGCTACACTTCGAAAAACATATGATTCGTCTTTTCAGTTAGTTAAATATGTCTATAAGATGCCATCAATTAAAAAAAGTAACTTCACTATAATCG
+
CCCFFFFFGHHHGIEIJIJJGIIJJJJJHIJIGGGIJJJIJIIIJJJJJJJIJJJGIJJEIIIJJJJFHGIFIIGHIHHHFDD@CDDEEDDDDDDDDEDD
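
To make the renaming rule above concrete, here is a minimal awk sketch of the same idea. It is not seqkit's implementation: rename_dups is a hypothetical helper name, it assumes strictly four lines per FASTQ record, and it does not check whether a generated name such as <id>_2 already exists in the file.

```shell
# Hypothetical helper (not part of seqkit): append _N to the ID of the Nth
# occurrence (N >= 2) of a read ID. Assumes exactly 4 lines per FASTQ record.
rename_dups() {
  awk '
    NR % 4 == 1 {
      id = $1                            # header text up to the first whitespace
      rest = substr($0, length(id) + 1)  # remainder of the header, if any
      n = ++seen[id]                     # how many times this ID has appeared
      if (n > 1) id = id "_" n           # first occurrence is left unchanged
      print id rest
      next
    }
    { print }                            # sequence, "+", and quality lines pass through
  ' "$@"
}

# Example: reads test.fq and writes the renamed records to outfile.fastq
# rename_dups test.fq > outfile.fastq
```

As for the output file in the original question: seqkit writes to standard output by default, so the result can be saved with seqkit rename test.fq > outfile.fastq, or with the global -o option as in seqkit rename test.fq -o outfile.fastq.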

Thanks, this should work for me.
