Hi all,
I have a set of adapters that a collaborator gave me in a plain text file (i5R.txt). I moved the file onto my institution's Linux HPC and tried to use it to pull sequences from a FASTQ file with grep -f, like so:
grep -f i5R.txt myseqs.fastq
This returned nothing, which was surprising: I know the adapters are there because I can match them in vim. Suspecting some pesky invisible characters, I retyped the sequences in vim into a new text file called i5R.seqs. This fixed the pattern-matching issue with grep.
Here is the diff of the two files, to show that they appear identical.
[geneticatt]$ diff i5R.txt i5R.seqs
1,8c1,8
< CCTGATAC
< TTAAGTTG
< CGGACAGT
< GCACTACA
< TGGTGCCT
< TCCACGGC
< ATGTCGTG
< CCACGACA
---
> CCTGATAC
> TTAAGTTG
> CGGACAGT
> CGACTACA
> TGGTGCCT
> TCCACGGC
> ATGTCGTG
> CCACGACA
What type of character could be the culprit? I searched for \r because I've had problems with that one before, but that wasn't it, so this must be a different invisible character. How does one go about hunting down and removing the invisible characters that plague their workflow? Further, what preventative measures can I take to make sure I don't get hung up on something like this again?
You could have looked at the file using
cat -vet i5R.txt
which shows every character in the file, printable or not: line ends appear as $, tabs as ^I, and a carriage return as ^M.
Another way to see hidden characters is to pipe the file through an octal dump:
cat infile | od -c
This prints the file byte by byte, so hidden characters show up as escapes such as \r or \t (or as octal codes), and newlines as \n.
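To go from spotting the characters to removing them, here is a minimal sketch. It assumes the adapter file should contain nothing but the letters A, C, G, T, N and newlines. The demo file (dirty.txt, with a carriage return and a trailing space planted in it) and the output name clean.txt are made up for illustration; they are not the contents of the real i5R.txt.

```shell
# Hypothetical demo file: two adapters, one ending in a carriage return and
# one ending in a space, standing in for whatever is hiding in i5R.txt.
tmpdir=$(mktemp -d)
printf 'CCTGATAC\r\nTTAAGTTG \n' > "$tmpdir/dirty.txt"

# Inspect: $ marks each line end, ^M is a carriage return, and a trailing
# space becomes visible as a gap before the $.
cat -vet "$tmpdir/dirty.txt"

# Clean: delete (-d) every byte that is NOT (-c) a base letter or a newline.
tr -cd 'ACGTN\n' < "$tmpdir/dirty.txt" > "$tmpdir/clean.txt"

# Verify: od -c should now show only base letters and \n.
od -c "$tmpdir/clean.txt"
```

On the real file this would be something like tr -cd 'ACGTN\n' < i5R.txt > i5R.clean.txt (the output name is my own choice), after which grep -f can be rerun with the cleaned file. Because this is a whitelist, it catches \r, tabs, trailing spaces and non-ASCII bytes alike, but it would also strip lowercase bases or IUPAC codes, so the allowed set should be adjusted if the adapters use them.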