Non-standard TCGA FASTQ files
8.5 years ago

I am looking at RNA-seq data from TCGA and noticed that some nucleotides are replaced by '.' in the FASTQ files. So far, I have only encountered this encoding in repeat-masked FASTA files. Could it cause mappers or other downstream software to fail on such files, and should I replace the '.' with N?

Please see the second read in this output:

zcat TCGA/RNAseq/OV/data/246cf927-a407-4c5a-9f60-0822bf34b208/D0W8YACXX_7_TGACCA_R1.fastq.gz | head -8
@HS2_288:7:1101:1487:2168/1 1:N:0:TGACCA
GGTCAGTCTGCTTTCCCCCTGTTTTATAATGTTGGTGGTTTTAATCCGTATTTCTTTGCAACTTCTGTCTGGGCA
+
BC?DFFFFHHHHHJJJJJJJJJJJJEIEIHIFHGIGHIFHIJIIJIGHHGHJJHIJJJJIJJJJJJJJJJIGHHH
@HS2_288:7:1101:1723:2143/1 1:N:0:TGACCA
.CTGGTTATGTGCTCCCTTCCACAGGGCTACATGACGGCACTTTATTTTAAATCCTTTAAACAAAATACATATGG
+
#1=DFDFFHGFHHJJJJJJJJJJJJJJJJEHHJID9DHGIJJJJIJJJJIJJJJJJJIIEIIJJGHHHHHGFFFF
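
In case replacing turns out to be the safer route, here is a minimal sketch of what I have in mind. It assumes strict four-line FASTQ records (no wrapped sequences) and only touches the sequence line, because '.' is also a valid Phred+33 quality character and must stay untouched on the quality line. The function names and the file paths passed to `convert_file` are hypothetical, and I have not checked how any particular mapper treats '.' versus N.

```python
import gzip
from typing import Iterable, Iterator


def replace_dots_with_n(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with '.' replaced by 'N' on sequence lines only.

    Assumes every record is exactly four lines: header, sequence, '+', quality.
    """
    for index, line in enumerate(lines):
        # The sequence is the second line of each four-line record.
        if index % 4 == 1:
            yield line.replace(".", "N")
        else:
            # Headers, '+' separators and quality strings pass through unchanged.
            yield line


def convert_file(src_path: str, dst_path: str) -> None:
    """Rewrite a gzipped FASTQ file; both paths are placeholders chosen by the caller."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        dst.writelines(replace_dots_with_n(src))
```

Filtering by line position rather than by content keeps the quality strings intact even when they contain '.' themselves.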

Best, Daniel

RNA-Seq TCGA FASTQ • 1.8k views