Hello, I am trying to keep the complete header lines from my FASTA file when using BWA. Once I've mapped the reads to the reference genome, I want to extract the ones that mapped and output them to a FASTA file. I need the reads to have the complete header name they originally had.
After looking for a while I found the option: bwa mem -R '@RG\tID:foo\tSM:bar'. The problem is I don't understand the string I need to input, and I get an error every time I try to use it. I know the above string is just an example, but I would be very grateful if someone could explain this, or propose a different way to output the complete header line for the reads from bwa. Thanks
I'm a bit confused about what you're trying to do and why. Are you starting with a fasta file, and do you want to end up with a fasta file containing only the reads that map to the reference? What are you using read groups for? Are the read headers important to keep unchanged, or are you just trying to use them for extracting reads?
I assume you want to save the entire fasta header (which has spaces in the name)? If that is the case, you would need to convert those spaces to "_" so the header becomes one long string. By convention, most tools treat only the text up to the first whitespace in a fasta header as the sequence name and ignore the rest (which is how bwa is treating it, my guess).
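In case it helps, here is a minimal Python sketch of that underscore conversion. The function name `underscore_headers` and the demo record are made up for illustration, and it assumes a plain (uncompressed) fasta file in which header lines start with ">".

```python
import io


def underscore_headers(src, dst):
    """Copy fasta text from src to dst, joining header fields with '_'."""
    for line in src:
        if line.startswith(">"):
            # Split the header on any run of whitespace and rejoin with "_",
            # so the whole header becomes a single whitespace-free name.
            fields = line[1:].split()
            dst.write(">" + "_".join(fields) + "\n")
        else:
            # Sequence lines are copied through unchanged.
            dst.write(line)


# Hypothetical demo record; for a real file, pass open file handles instead.
demo = io.StringIO(">read1 sample=A length=8\nACGTACGT\n")
out = io.StringIO()
underscore_headers(demo, out)
print(out.getvalue(), end="")
```

If you later search with the original header lines, apply the same conversion to them first, otherwise they still won't match the names in the bwa output.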
Yes, this is exactly my question. I just want to be able to save the whole header line, but bwa is chopping some of the info off. Later I search with the original header lines to match against the reads bwa produced, and they don't match.
Note that the default behavior of BBMap is to NOT chop off the header after the first whitespace, and it can output directly to fasta, like this:
Either use BBMap or convert the spaces in the names to "_" like I said before, if you want to keep using bwa.