I just want to modify the read IDs in my FASTQ file.
I use grep to get the IDs I want to edit, then I use sed to make the change.
But I find there is no change in my original FASTQ file.
Here is my command:
cat test.fastq |grep '^@.*/1'| sed 's/@/@ILUMINA/g'
Simply run the sed command on your original file to modify it, omitting the grep part.
Keep in mind, though, that the original will then no longer be present (as you will have changed it). A better approach might be to redirect the output to a new file:
cat test.fastq | sed 's/@/@ILUMINA/g' > some-new_file
This might not be restrictive enough though, as it will also change all other occurrences of '@'.
I just want to change the ID of each of my reads.
I think the way you recommend will also change the quality lines.
The "grep '^@.*/1'" in my command just restricts the rows I want to change to the ID lines in my FASTQ file.
Anyway, thanks a lot.
Just an example, I just want to prefix the ID.
Because the stupid sequencing company gave me paired-end FASTQ files whose IDs look like this: @307/1
With IDs like that I can't run MarkDuplicates in GATK.
That really makes me mad :(
And adding "ILLUMINA" to the headers will make markduplicate work? Are you referring to Picard MarkDuplicates? I thought it was supposed to work on bam files, not on fastq files.
Did you ask the sequencing company why the headers are like this? Illumina headers follow a different naming convention.
Because Picard just told me "Value was put into PairInfoMap more than once",
and when I looked for a solution on the net, I found someone saying this error results from some lane ID in the FASTQ file being repeated.
So I just want to edit the read IDs to solve it.
This way really solves the problem, at least for now.
Maybe the way you told me works well, but I don't know how to do it. :(
I don't know if this is of any particular consequence for what you want to do, but you've missed an L out in ILLUMINA. You may also want to consider changing the substitution to:
/^@/@ILUMINA:/ since all the fields in the header lines are : delimited, and this might make it easier to separate out the string later on.
Use at own risk though, as messing with the FASTQ headers is liable to break other programs.
x~y is sed syntax for what is called an ‘address’ (the first~step form is a GNU sed extension). Here, 1~4 basically says: starting on the 1st line, and every 4th line thereafter (~4), make the substitution defined in the s/.../.../. This way sed knows to ignore the quality line even if it finds an @ at the start.
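The command itself seems to have gone missing from the post above, so here is a minimal sketch of what such a command could look like. It assumes GNU sed (for the 1~4 address), uses the corrected spelling ILLUMINA, and runs on a made-up two-record file called demo.fastq; none of these names come from the original thread.

```shell
# Build a tiny two-record FASTQ for illustration (hypothetical data).
# The quality line of the first record deliberately starts with '@'.
printf '%s\n' '@307/1' 'ACGT' '+' '@IIH' '@308/1' 'TTGA' '+' 'IIII' > demo.fastq

# 1~4 selects line 1 and every 4th line after it (GNU sed only),
# which are the header lines of a well-formed FASTQ file.
# The result goes to a new file, so demo.fastq itself is left untouched.
sed '1~4 s/^@/@ILLUMINA:/' demo.fastq > demo.renamed.fastq

cat demo.renamed.fastq
```

Only lines 1 and 5 should gain the prefix; the quality line on line 4 keeps its leading '@' as it was.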
You are such a nice person, thank you very much!
I think I should buy a more advanced book rather than a basic one to study Linux commands.
Thanks a lot again!
You don’t even really need a book; all you need is Google and a well-formulated question.
Take this question, for example. Once you really think about what needs to happen, you need to process all lines starting with “@”, right? Well, no: as Pierre and others mention, we can’t use @! Oh no, we need to think about the problem another way.
What else do we know about the FASTQ format? Well, every entry is always 4 lines (assuming the file isn’t malformed, but if it is, you have other, bigger problems). So all we really need to do is “edit every nth line of a file (with sed)”. And this right here is your Google search phrase.
Now the title of that thread might not seem immediately relevant, but it is. You’ve just found out the magic of how to edit every nth line, now you need only combine that with what you already know about how sed works (i.e. the substitution part) and you’re done!
Useful to know, but not needed here, is the sponge command from moreutils which can be used to perform in-place edits using any command even if it does not support -i for in-place edits. Example:
anyCommand test.fastq | sponge test.fastq
Here sponge soaks up all of its input before it writes test.fastq, so the file is not truncated while anyCommand is still reading it.
Never use '@' as a signal that the line is the header, because '@' is also a valid character in the FASTQ quality string.
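To illustrate, here is a small sketch in which the quality string itself begins with '@' ('@' is ASCII 64, a legal Phred+33 quality character). It picks the header by its position in the 4-line record rather than by its first character. The awk one-liner is just one portable way to do that, not something proposed in this thread, and the file names and data are made up.

```shell
# Hypothetical record whose quality line starts with '@'.
printf '%s\n' '@307/1' 'ACGT' '+' '@@II' > demo_q.fastq

# NR % 4 == 1 is true only for lines 1, 5, 9, ... (the header lines),
# so the '@' characters on line 4 are never touched.
awk 'NR % 4 == 1 { sub(/^@/, "@ILLUMINA:") } { print }' demo_q.fastq > demo_q.renamed.fastq

cat demo_q.renamed.fastq
```

This assumes every record is exactly 4 lines; a FASTQ file with wrapped sequence lines would need a real parser instead.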
re: "simply run the sed command" - note: you must pass -i to modify it in place (assuming GNU sed).
That's not really something I would advise to novice users. Great way to lose your input data.
Agreed. Don't use the -i switch unless you're really sure what the sed command does and you're sure you don't need the unmodified content later.
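If you do decide to edit in place, one way to keep a safety net is to give -i a backup suffix, so sed saves the original next to the edited file. A minimal sketch, assuming GNU sed and a made-up file name (demo_inplace.fastq):

```shell
# Hypothetical one-record file to edit in place.
printf '%s\n' '@307/1' 'ACGT' '+' 'IIII' > demo_inplace.fastq

# -i.bak edits demo_inplace.fastq in place, but first saves the
# original as demo_inplace.fastq.bak (the 1~4 address assumes GNU sed).
sed -i.bak '1~4 s/^@/@ILLUMINA:/' demo_inplace.fastq

# Show the first line of the edited file and of the backup.
head -n 1 demo_inplace.fastq demo_inplace.fastq.bak
```

Check the edited file before deleting the .bak copy.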
Your grep command wouldn’t have solved that issue anyway, as it would still match a quality line that begins with @.
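A quick way to see this for yourself is sketched below, using a made-up record whose quality line starts with '@' and also happens to contain '/1' (both '/' and '1' are legal quality characters). The file name is hypothetical.

```shell
# Hypothetical record: the quality line is '@I/1'.
printf '%s\n' '@307/1' 'ACGT' '+' '@I/1' > demo_grep.fastq

# -n prefixes each match with its line number, so you can see that
# both line 1 (the header) and line 4 (the quality line) match.
matches=$(grep -n '^@.*/1' demo_grep.fastq)
printf '%s\n' "$matches"
```

A quality line needs both the leading '@' and a later '/1' to slip through, which is rare in a small file but quite possible across millions of reads.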
Out of curiosity: why do you want to add "ILLUMINA" to every header?
If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.