Beautiful work! I'm not even gonna wonder if any of it was AI-generated, because the code is clearly crafted meticulously by an experienced C engineer, very readable, and shorter than I expected.
> Flex or Bison generated code is also hard to maintain plus it complicates builds.
This is, in all honesty, a solved problem in any reasonable build system. (And I have little patience left for people making life hard for themselves through their own choices.)
It's been about five years since I've used Flex and Bison, but if I recall I just didn't check in the generated files and had a Makefile that built everything all together.
If I'm remembering that correctly, this should never have been an issue (at least once basic version control and make were available). Curious if I'm missing something.
It seems like it would be easy to create such a parser for PEGs, but the semantics here don’t seem like a good fit for context-free grammars where alternation is completely symmetrical.
So many parser combinators operate on bytes assuming ASCII input only. I'd be more interested in a parser combinator lib that has UTF-8 decoding already abstracted away, operating on `wchar_t`, or even polymorphic input stream element types.
I’d still use a byte slice for that. Some formats may mix encodings, or have a text header and binary payload. To match a multi-byte character in those cases, one would use memchr for the first byte, then compare the remaining few bytes. So I don’t think it would be a huge performance impact.
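To make the memchr idea concrete, here is a rough sketch in C of searching a byte slice for one encoded code point without decoding anything. The name `find_seq` and its signature are made up for illustration and are not taken from the library under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the first occurrence of a multi-byte sequence
 * (for example one UTF-8 encoded code point) in a byte slice.
 * Returns a pointer to the match, or NULL if there is none. */
static const unsigned char *find_seq(const unsigned char *hay, size_t hay_len,
                                     const unsigned char *seq, size_t seq_len)
{
    assert(seq_len > 0);
    const unsigned char *p = hay;
    const unsigned char *end = hay + hay_len;

    while ((size_t)(end - p) >= seq_len) {
        /* Let memchr scan for the first byte, but only over positions
         * where the whole sequence still fits. */
        p = memchr(p, seq[0], (size_t)(end - p) - seq_len + 1);
        if (p == NULL)
            return NULL;
        /* Candidate found: compare the remaining few bytes. */
        if (memcmp(p + 1, seq + 1, seq_len - 1) == 0)
            return p;
        p++; /* false positive, resume right after it */
    }
    return NULL;
}
```

The hot loop stays inside memchr, which is typically heavily optimized, and the extra memcmp only runs on candidate positions.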
Isn't working with the UTF-8 stream sufficient? Especially if you only have ASCII keywords/operators/brackets, I feel an ASCII parser should work with UTF-8 out of the box.
Yeah, a parser has no need to understand what a string or glyph is, let alone ASCII or UTF-8. The point is to take a stream of arbitrary data and process it into something that can be reasoned about. Unless you know your input stream is regular in some way, processing it at the finest level of granularity (usually bytes) is probably the only thing to do.
I'd rather not. Most of the time, you don't need it, and when you do, it's for a very small part of the input. And `wchar_t` is an abomination (it's UTF-32 on Linux, UTF-16 on Windows, and all of that is allowed); you probably really want `char32_t`, and again, not for the whole of the input; streaming such data a single rune/codepoint at a time is probably fine as well for most uses.
On the other hand, if your parser combinators process char-by-char, then maintaining a small "is this valid UTF-8 so far" context on the side should be pretty simple, so providing it would be a useful option, but actually decoding? Please don't.
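For what it's worth, that side context can be as small as three bytes of state. Below is a minimal sketch in C, assuming the usual well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF). All the names (`utf8_check`, `utf8_check_feed`, and so on) are invented for this example and do not come from any particular library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validation-only context: it never produces code points. */
typedef struct {
    unsigned char need; /* continuation bytes still expected */
    unsigned char lo;   /* lowest allowed value for the next continuation byte */
    unsigned char hi;   /* highest allowed value for the next continuation byte */
} utf8_check;

static void utf8_check_init(utf8_check *c)
{
    c->need = 0;
    c->lo = 0x80;
    c->hi = 0xBF;
}

/* Feed one byte. Returns false if the stream is invalid at this byte. */
static bool utf8_check_feed(utf8_check *c, unsigned char b)
{
    assert(c != NULL);
    if (c->need == 0) {
        if (b <= 0x7F)
            return true;                 /* plain ASCII */
        if (b >= 0xC2 && b <= 0xDF) {
            c->need = 1;                 /* 2-byte sequence */
        } else if (b >= 0xE0 && b <= 0xEF) {
            c->need = 3 - 1;             /* 3-byte sequence */
            if (b == 0xE0) c->lo = 0xA0; /* reject overlong forms */
            if (b == 0xED) c->hi = 0x9F; /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            c->need = 3;                 /* 4-byte sequence */
            if (b == 0xF0) c->lo = 0x90; /* reject overlong forms */
            if (b == 0xF4) c->hi = 0x8F; /* reject values above U+10FFFF */
        } else {
            return false;                /* stray continuation or invalid lead */
        }
        return true;
    }
    if (b < c->lo || b > c->hi) {
        utf8_check_init(c);              /* reset so the caller may resync */
        return false;
    }
    c->lo = 0x80;                        /* only the first continuation byte */
    c->hi = 0xBF;                        /* has a restricted range */
    c->need--;
    return true;
}

/* True when the input does not end in the middle of a sequence. */
static bool utf8_check_complete(const utf8_check *c)
{
    return c->need == 0;
}

/* Convenience wrapper: validate a whole buffer byte by byte. */
static bool utf8_valid(const unsigned char *s, size_t n)
{
    utf8_check c;
    utf8_check_init(&c);
    for (size_t i = 0; i < n; i++)
        if (!utf8_check_feed(&c, s[i]))
            return false;
    return utf8_check_complete(&c);
}
```

A combinator library could call something like `utf8_check_feed` on every byte it consumes and only consult the result where a grammar rule actually cares about text validity.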
I'm not familiar with parser combinators. The parser generators that I'm familiar with (YACC, ANTLR3,5) parse a stream of lexemes/tokens, not characters. Is there a reason why combinators don't operate on lexemes?