Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> How is it technically fair? The answer is objectively wrong - you can tokenize XHTML using regexes.

Yes, but that you can tokenize XHTML using regular expressions is not the same thing as being able to use a single regular expression to extract XHTML tokens. Remember that context free languages are a superset of regular expressions. I don't personally know enough about the XHTML syntax to say off the bat whether the syntax can be described with a regular expression, but generally a recursive definition of valid syntax is not possible to express with regular expressions.

> You cannot use a parser, since a parser does not emit tokens but emit the element tree and abstracts away syntactic details like the difference between <x></x> and <x />.

You can use a parser, just not any XHTML parser. The parser would need to be constructed with the objectives in mind, to parse into a data structure that doesn't abstract these details away.

That said, maybe an even simpler solution exists, such as to use several regular expressions to first remove comment and CDATA before matching. I'm not immediately aware of any other cases that would cause problems for the trivial match suggested in the question post.





Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: