parsing in rust with the ability to refill

Posted on November 28, 2018 by dario

parsing in rust

a parser in what i mean here is a piece of software that interprets a sequence of bytes into a more meaningful, structured representation. there are various flavors, some of them are yacc-style generators of large state machines, others are parsec-style combinator libraries, where you program your parser as a combination of smaller functions.

preserving the work done

rust has all of them. so far i know of nom and chomp from the combinators camp, pest and lalrpop (generated state machines) and rust-peg (similar to combinators, but different). what they all vary in (and i want to focus on that here) is their behaviour when the input is incomplete (i.e. the input might be interpreted into a valid object if the right bytes were to follow the current end of input). some just fail the parse (reasonable, since the current input is not valid), some report back that this input is incomplete (reasonable, since that is a different thing from being an invalid/non-interpretable byte sequence), but all of them return a (somewhat definitive) result of parsing the input that in some way encodes that we must try again with more input. what i’ve been missing so far is a way to keep the work that was already done and continue, once we have more bytes to parse.

owning things

in rust, questions about scope are a lot more obvious than in other languages, mostly because you need to be explicit about which scope owns an object. this sometimes leads to having to pass data to a library’s user (bundled up in an object that doesn’t allow them to change or even safely access it, possibly) to have said object live as long as the user wants to use the current session/action/handle.

the question about who owns what part of the data that i’ve been asking myself here is: shold the caller own the input and the way to refill it, or the parser? in the first case, we would pass a reference to a buffer (owned by the caller) to the parser code, and the parser code would return a closure to continue parsing, accepting more input and returning a new ParseResult (i.e. done, or a new closure). in the second case, we’d be handing the parser a way to refill its parse buffer, i.e. a closure containing the source in its environment and calling receive_new_bytes(source) under the hood.

interruptible, resumable computations

i’ve written the above paragraphs one and a half years ago, and since thought back to this problem sometimes, but largely put it aside. now that i’m re-reading them (and rephrasing some of it for clarity), i’m thinking: resumable computations? rust has async-await now, can’t we use that to build computations (Futures) that could yield back to the caller on incomplete input, and wait for more?

TODO