more crazy idea for benchmarking: compare against optimal AST structure (found by applying simplex) #3
I'm not sure what your first paragraph means, sorry. I would be happy to consider what you are talking about; it's just that you are referring to things I don't know about. It sounds to me like Simplex is used to mathematically optimize a system of equations? I'm not sure how that is applicable to an AST.

Is non-recursive tree traversal better? I think it depends. Regardless, I'm mainly focusing on producing the AST as the next part of this project. How someone consumes it is up to them; they could use recursion or simulate it in a loop. I could provide an iterator helper if desired.

As far as data being encoded by layout goes, the answer is yes, definitely. The reason why this hasn't been done yet is that you have to be smart about how you optimize the construction of the AST. Generating an implicit or semi-implicit AST laid out in pre-order DFS is a little difficult to do efficiently. The problem mainly has to do with how operator precedence determines the AST hierarchy and thus how it should be laid out in memory. E.g. the infix expression

The conventional algorithm to convert infix expressions to a more amenable form is called Shunting Yard when implemented with explicit stacks. The Zig compiler uses an equivalent technique called Pratt parsing, which uses the program stack via recursion instead. The standard forms of this algorithm can only get you a postfix ordering, which is how the Zig AST is currently laid out in memory. In order to get a prefix ordering, we have to do one of the following:
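Since the comment above leans on Shunting Yard, here is a minimal, hedged sketch of the classic stack-based algorithm producing a postfix ordering. This illustrates the standard textbook technique only, not this project's implementation; the token set and precedence table are made up for the example.

```python
# Classic shunting-yard sketch: convert infix tokens to postfix (RPN).
# PREC is a hypothetical precedence table for the example.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def to_postfix(tokens):
    out, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Pop operators of greater or equal precedence (left-assoc).
            while ops and PREC.get(ops[-1], 0) >= PREC[tok]:
                out.append(ops.pop())
            ops.append(tok)
        else:
            out.append(tok)  # operand goes straight to the output
    while ops:
        out.append(ops.pop())
    return out
```

For example, `to_postfix(['a', '+', 'b', '*', 'c'])` yields `['a', 'b', 'c', '*', '+']`, the postfix ordering the comment says the current Zig AST layout corresponds to.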
If you think about doing string concatenation in terms of ropes or cords (data structures that represent strings as trees in order to avoid copying the string data somewhere else every time one concatenates), I think if you squint you can see how methods 3 and 4 are very similar. However, 4 has less rigid prescriptions for the implementation, and therefore more optimizations are available to us with method 4. I have a few tricks up my sleeve for method 4, although, again, method 1 could be better if we can pull it off.
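To make the rope analogy concrete, here is a toy sketch. The class and names are hypothetical, purely to show why rope concatenation is O(1): it allocates one tree node instead of copying either string's data.

```python
# Toy rope sketch: a leaf holds a plain str; an internal node holds two
# child ropes. Concatenation builds a node and copies no character data.
class Rope:
    def __init__(self, left, right=None):
        self.left, self.right = left, right  # leaf when right is None

    def __str__(self):
        if self.right is None:
            return self.left  # leaf: the stored string itself
        return str(self.left) + str(self.right)  # flatten on demand

def concat(a, b):
    return Rope(a, b)  # O(1): no string data is copied
```

Flattening (`str`) is deferred until the full string is actually needed, which is the "squint and it looks like method 3/4" part of the analogy.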
The provided Stack Overflow example is only to speed up tree traversal, i.e. for DFS. I have no experience with how much the bit-packing of the "from which parent is the node" information costs. Assume a given AST:

The optimal representation, assuming optimal is when each node has at most 4 children,

One can pack "E is a child of D" as (10 | 30 bits of E) into E, which means:

Under the assumption that this does not affect performance (I suspect this is

Are there (more recent) benchmarks with performance numbers for SWAR and SIMD on these things in publications, or does this mean tinkering with educated guesses?
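For illustration, the "(10 | 30 bits of E)" packing described above — a 2-bit child slot (enough for at most 4 children) alongside a 30-bit node index in one 32-bit word — could look like the following sketch. The function names are made up for the example.

```python
# Hypothetical packing of a 2-bit "which child of my parent am I" slot
# together with a 30-bit node index in a single 32-bit word.
SLOT_BITS = 2          # max. 4 children -> 2 bits for the child slot
INDEX_BITS = 30        # remaining bits hold the node index
INDEX_MASK = (1 << INDEX_BITS) - 1

def pack(child_slot, index):
    assert 0 <= child_slot < (1 << SLOT_BITS)
    assert 0 <= index < (1 << INDEX_BITS)
    return (child_slot << INDEX_BITS) | index

def unpack(word):
    return word >> INDEX_BITS, word & INDEX_MASK
```

E.g. `pack(0b10, index_of_E)` stores binary `10` (child slot 2) in the top bits, matching the "(10 | 30 bits of E)" notation in the comment.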
I think the idea you talked about is rendered unnecessary by better AST representations. With an implicit AST, you don't "need" pointers at all, unless you want to jump around/skip over parts of the tree. I am planning on a compile-time boolean that allows one to generate an extra buffer where you can look up level-order pointers for this purpose, but in the compiler we should not need pointers in the AST, because, so far as I can tell, we have to look at the entire AST in the compiler every time, and we can just go in the normal order. I worked on roblox-ts heavily for 2 years, and we pretty much had no choice but to traverse the AST in the normal, DFS order no matter what happened. I haven't studied all of the AstGen code, but from what I have looked at, Zig just traverses the whole AST in the same order no matter what. Allowing for arbitrary skips and hops is only necessary for use cases that exist outside of a compiler.

We can use an implicit representation because all nodes either 1. have a fixed arity or 2. could be delimited by an

An alternative to
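As a hedged illustration of why an implicit layout needs no child pointers when arities are fixed, here is a toy evaluator over a flat pre-order array: each node's children are found purely by position, because consuming a subtree tells you where the next sibling starts. This is an illustration of the general idea only, not the representation this project will actually use.

```python
# Toy implicit AST: nodes stored in DFS pre-order in a flat list.
# Child locations are implied by fixed arities; no pointers are stored.
ARITY = {'+': 2, '*': 2, 'neg': 1}  # operands (numbers) have arity 0

def eval_preorder(nodes, i=0):
    """Evaluate the subtree starting at nodes[i]; return (value, next_i)."""
    tag = nodes[i]
    if tag not in ARITY:          # leaf: a literal number
        return tag, i + 1
    args, j = [], i + 1
    for _ in range(ARITY[tag]):   # children follow contiguously
        v, j = eval_preorder(nodes, j)
        args.append(v)
    if tag == '+':
        return args[0] + args[1], j
    if tag == '*':
        return args[0] * args[1], j
    return -args[0], j            # 'neg'
```

E.g. `['+', 2, '*', 3, 4]` is the pre-order encoding of `2 + 3 * 4`; evaluating it walks the array front to back, exactly the "normal order" traversal described above.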
I do not know how speculative execution could make your idea worse, unless maybe you are afraid that increasing the number of necessary speculative loads would worsen cache coherency. However, in general, I would expect that this could be true even without speculation occurring. In practice, bitstrings are normally pretty good for cache coherency though. A 64-byte cache line is 512 bits! That's kind of a lot of bits, that is, if you can make efficient use of them.
Assuming that we 1. don't want a post-order layout in memory, 2. don't want to traverse Zig tokens backwards to construct the AST (which might work, tbh; I'd need to write it to tell), and 3. don't want to traverse a post-ordered AST backwards, option 4 is obviously the best option, simply because performance tuning actually is available to us in ways that it simply isn't for the other options. Yes, this involves "educated guesses" about what I think is going to be faster, but, come on, what doesn't involve "educated guesses" when it comes to optimization? I usually have a pretty good sense of what is faster for hardware to do, and so I try to think about what will be best on average. This project, for example, gets somewhere between a 2x and 2.5x speed improvement on real-world files, even though you could look at the code and make a file that specifically only has cases where one could maybe expect worse performance. E.g. I just tested a file filled entirely with a repeating sequence of
Turns out we still have a performance win though, probably due to other things being optimized. The bloat that we have for extreme edge cases in the real-world benchmark, where we do this maximum of 3 characters of backtracking, is very minimal. My point is, it is designed so that the typical case is faster; specifically, matching variable-length tokens faster. It's a balancing act, but it's typically not that difficult to guess the general thing you need to optimize for. For some problems, you probably should collect some data on the common case to figure out what to optimize, but for this problem it's obvious to me that parsing variable-length tokens is going to be the bottleneck, so I optimize for that.

In the case of option 4, we can use SWAR/SIMD for a few things, and it's relatively obvious how I can get performance wins over the naive implementation. It is highly improbable that option 3 is actually faster, since it is effectively one of the naive ways to do option 4. The other naive way is to do concatenations by making a new allocation every time, copying all the data over, and deallocating the old string. (This would technically make expression parsing O(n²), by the way.) It's "obvious" to me that copying into a fixed-size buffer, and using a linked list in the case of intentionally awful file inputs, will be most efficient. I could be wrong, of course, but I will be pretty surprised if someone thinks of a better technique. (I have not fully spelled out my technique here, but you'll see once I publish it that I have a few tricks up my sleeve.) There will be some tinkering involved with trying out different buffer sizes, but that's literally me just changing a variable in the code, re-running it, and seeing what's faster (for my machine; in the future we could do comptime switches for different targets).
I think the optimization work would primarily be in getting the compiler to actually emit the best instructions for my algorithm, not in thinking of a fundamentally different algorithm. Actually, I am expecting that the compiler will give me awful emit for certain operations and be unable to vectorize the code I want it to, and I might have to temporarily dip into assembly until the Zig compiler gets support for more vector operations (I actually already do this in the utf8 validator, mostly written by travisstaloch, ported from simdjson).
Hey @matu3ba and other interested parties, I just wanted to share that I did, in fact, make some progress this week. I mostly finished a lot of the SWAR stuff for the tokenizer (although the utf8 validator could still use some love), did a few more optimizations to the tokenizer, and now I am at the point where the tokenizer is mostly done in the sense that it works how it's going to work. I might try to make a slightly faster algorithm for the operator parsing at some point, but on the whole, it's pretty much as good as I can make it, at least until we get automatic profile-guided optimization. Obviously there is a little tidying up of the code related to comments and assertions in order too, but it's getting there. Thank you so much for your interest, feedback, and suggestions! I am sorry things haven't been moving faster, but I am plugging along, and within the next 2 weeks I should have something to show on the AST front.

For now, there's some SWAR stuff to look at. I will recombine some of the functions I had separated during testing of different strategies, but other than that it looks pretty close to how the end result will look. One thing you might find interesting is a trick I learned in the course of studying SWAR: it turns out you can do a SWAR mov_mask operation using a single multiply and a shift! Here is my implementation, which is based on this article. For reference, in SWAR we read bytes into a normal register and then try to operate on each byte separately. We can mask out the highest bit of each byte and then we can add

Today I made a working shunting-yard implementation that sort of proves that the strategy I plan to use for AST generation is feasible, and hopefully it will be efficient in the end. Admittedly, at this stage, it is kind of like a proof-of-concept of a proof-of-concept, but I am optimistic that everything can be implemented in an extremely efficient manner in the coming days.
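The multiply-and-shift mov_mask trick mentioned above can be sketched as follows. This is a Python model of the well-known 64-bit technique (the multiplier places each byte's high bit at a distinct result bit, so no carries interfere), not the project's exact code; names are made up.

```python
# SWAR "movemask" sketch: gather the high bit of each of the 8 bytes in
# a 64-bit word into one byte, using a single multiply and a shift.
U64 = (1 << 64) - 1                 # model 64-bit wraparound in Python
HIGH_BITS = 0x8080808080808080      # the top bit of each byte
MAGIC = 0x0002040810204081          # moves byte i's flag to bit 56 + i

def swar_movemask(x):
    # After masking, flags sit at bits 7, 15, ..., 63. The multiply
    # shifts each flag to a unique position; bits 56..63 collect them.
    return ((x & HIGH_BITS) * MAGIC & U64) >> 56
```

E.g. a word whose bytes 0 and 2 have their high bit set maps to the mask `0b101`; a word with all eight flags set maps to `0xFF`.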
A relatively optimal AST structure can be found by computing simplex with the constraints derived from the first AST parse, then copying things over.
See also https://en.wikipedia.org/wiki/Simplex_algorithm. The general idea of non-recursive tree traversal is given here: https://stackoverflow.com/questions/28544980/data-oriented-tree-traversal-without-recursion/28616278#28616278.
I got the idea from user Prokop on Discord, who asked "is the adjacency of AST nodes in the node list used somehow to store information?".