Question

grep value from html file

1


18 months ago

arshad1292 ▴ 110

I have 200 HTML files that contain information such as Filename, File type, Total Sequences, etc. Please see the attached screenshot.

I need to grep the Filename and Total Sequences from the Value column (in this screenshot I need IGM17-B_S162_read_1.fastq and the value 9237623) and save them in a separate .txt file.

Maybe with the grep or cat command. Again, these are HTML files.

I would really appreciate help from anyone who is an expert at writing command-line scripts.

cat script commandline shell grep • 898 views

updated 18 months ago by dariober 15k • written 18 months ago by arshad1292 ▴ 110

1


This can be done, but it seems that you wish to aggregate FastQC reports and possibly other log files. So maybe you want to try MultiQC first before coming up with your own solution?

18 months ago by Matthias Zepper 5.0k

0


This may be a fun read https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

18 months ago by dariober 15k

score 4 · Answer 1 · 2023-05-25

The HTML produced by FastQC is XML+HTML, so you can use an XPath expression to extract things.

$ xmllint --xpath '//tr[td[1]/text()="Filename"]/td[2]/text()' fastqc_report.html
jeter.fastq.gz

$ xmllint --xpath '//tr[td[1]/text()="Total Sequences"]/td[2]/text()' fastqc_report.html
147142898
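
To cover all 200 reports in one go, the two xmllint calls above could be wrapped in a loop. This is only a rough sketch: the function names and the output file name fastqc_summary.tsv are made up for illustration, and it assumes the reports sit in the current directory and match *.html.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the output file name are hypothetical.

# Print the Value cell of the table row whose first cell equals $1, in report $2.
# xmllint complaints (e.g. an empty XPath result) are silenced.
fastqc_value() {
    xmllint --xpath "//tr[td[1]/text()=\"$1\"]/td[2]/text()" "$2" 2>/dev/null
}

# Print one "filename<TAB>total sequences" line per report given as argument.
summarize_reports() {
    local report
    for report in "$@"; do
        [ -e "$report" ] || continue   # skip an unmatched glob pattern
        printf '%s\t%s\n' \
            "$(fastqc_value 'Filename' "$report")" \
            "$(fastqc_value 'Total Sequences' "$report")"
    done
}

# Assumption: all reports are in the current directory.
summarize_reports *.html > fastqc_summary.tsv
```

Each line of the resulting file should hold the FASTQ name and its sequence count, separated by a tab.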

FastQC also writes a text file, fastqc_data.txt (found inside the zipped output folder):

$ grep -E '(Filename|Total Sequences)'  fastqc_data.txt 
Filename    jeter.fastq.gz
Total Sequences 147142898
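
If you go the fastqc_data.txt route for many samples, something like the following could put each file name and its count on a single line. A sketch under assumptions: the FastQC zip archives have already been extracted so that each sample has its own folder containing fastqc_data.txt, the fields in that file are tab-separated, and the function and output file names are made up.

```shell
#!/usr/bin/env bash
# Sketch only: function name and output file name are hypothetical.

# For each fastqc_data.txt given, print "filename<TAB>total sequences".
summarize_fastqc_data() {
    local f
    for f in "$@"; do
        [ -e "$f" ] || continue   # skip an unmatched glob pattern
        awk -F'\t' '
            $1 == "Filename"        { name = $2 }            # remember the FASTQ name
            $1 == "Total Sequences" { print name "\t" $2 }   # emit name and count
        ' "$f"
    done
}

# Assumption: one extracted FastQC folder per sample in the current directory.
summarize_fastqc_data */fastqc_data.txt > fastqc_data_summary.tsv
```

This avoids parsing HTML altogether, which is usually the less fragile option.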