= Technical Notes Kein-Hong Man 2011-09-13 == Lexer Notes The lexer (`llex.lua`) is a version of the native 5.1.x lexer from Yueliang 0.4.0, with significant modifications. It does have several limitations: * The decimal point must be `.` (period). There is no localized decimal point replacement magic. * There is no support for nested `[[`...`]]` long strings (no `LUA_COMPAT_LSTR`). * The lexer may not properly lex source code with characters beyond the normal ASCII character set. Identifiers with accented characters (or any character beyond a byte value of 127) cannot be recognized. Instead of returning one token on each call, `llex.lua` processes an entire string (all data from an entire file) and returns. Two lists (tokens and semantic information items) are set up in the module for use by the caller. For maximum flexibility during processing, the lexer returns non-grammar lexical elements as tokens too. Non-grammar elements, such as comments, whitespace, line endings, are classified along with “normal” tokens. The lexer classifies 7 kinds of grammar tokens and 4 kinds of non-grammar tokens, as follows: [cols="m,d"] |=== | Grammar Token | Description | TK_KEYWORD | keywords | TK_NAME | identifiers | TK_NUMBER | numbers (unconverted, kept in original form) | TK_STRING | strings (no translation is done, includes delimiters) | TK_LSTRING | long strings (no translation is done, includes delimiters) | TK_OP | operators and punctuation (most single-char, some double) | TK_EOS | end-of-stream (there is only one for each file/stream) |=== [cols="m,d"] |=== | Whitespace Token | Description | TK_SPACE | whitespace (generally, spaces, \t, \v and \f) | TK_COMMENT | comments (includes delimiters, also includes special first line shbang, which is handled specially in the optimizer) | TK_LCOMMENT | block comments (includes delimiters) | TK_EOL | end-of-lines (excludes those embedded in strings) |=== A list of tokens can be generated by using the _--dump-lexer_ option, like this: [source, sh] lua LuaSrcDiet.lua --dump-lexer llex.lua > dump_llex.dat == Lexer Optimizations We aim to keep lexer-based optimizations free of parser considerations, i.e. we allow for generalized optimization of token sequences. The table below considers the requirements for all combinations of significant tokens (except `TK_EOS`). Other tokens are whitespace-like. Comments can be considered to be a special kind of whitespace, e.g. a short comment needs to have a following EOL token, if we do not want to optimize away short comments. [cols="h,6*m", options="header"] |=== | _1st \ 2nd Token_ | Keyword | Name | Number | String | LString | Oper | Keyword | [S] | [S] | [S] | – | – | – | Name | [S] | [S] | [S] | – | – | – | Number | [S] | [S] | [S] | – | – | [1] | String | – | – | – | – | – | – | LString | – | – | – | – | – | – | Oper | – | – | [1] | – | – | [2] |=== A dash (`-`) in the above means that the first token can abut the second token. `*[S]*`:: Need at least one whitespace, set as either a space or kept as an EOL. `*[1]*`:: Need a space if operator is a `.`, all others okay. A `+` or `-` is used as part of a floating-point spec, but there does not appear to be any way of creating a float by joining with number with a `+` or `-` plus another number. Since an `e` has to be somewhere in the first token, this can’t be done. `*[2]*`:: Normally there cannot be consecutive operators, but we plan to allow for generalized optimization of token sequences, i.e. even sequences that are grammatically illegal; so disallow adjacent operators if: * the first is in `[=<>]` and the second is `=` * disallow dot sequences to be adjacent, but `...` first okay * disallow `[` followed by `=` or `[` (not optimal) Also, a minus `-` cannot preceed a Comment or LComment, because comments start with a `--` prefix. Apart from that, all Comment or LComment tokens can be set abut with a real token. == Local Variable Renaming The following discusses the problem of local variable optimization, specifically _local variable renaming_ in order to reduce source code size. === TK_NAME Token Considerations A `TK_NAME` token means a number of things, and some of these cannot be renamed without analyzing the source code. We are interested in the use of `TK_NAME` in the following: [loweralpha] . global variable access, . local variable declaration, including `local` statements, `local` functions, function parameters, implicit `self` locals, . local variable access, including upvalue access. `TK_NAME` is also used in parts of the grammar as constant strings – these tokens cannot be optimized without user assistance. These include usage as: [loweralpha, start=4] . keys in `key=value` pairs in table construction, . field or method names in `a:b` or `a.b` syntax forms. For the local variable name optimization scheme used, we do not consider (d) and (e), and while global variables cannot be renamed without some kind of user assistance, they need to be considered or tracked as part of Lua’s variable access scheme. === Lifetime of a Local Variable Consider the following example: [source, lua] local string, table = string, table In the example, the two locals are assigned the values of the globals with the same names. When Lua encounters the declaration portion: [source, lua] local string, table the parser cannot immediately make the two local variable available to following code. In the parser and code generator, locals are inactive when entries are created. They are activated only when the function `adjustlocalvars()` is called to activate the appropriate local variables. NOTE: The terminology used here may not be identical to the ones used in the Dragon Book – they merely follow the LuaSrcDiet code as it was written before I have read the Dragon Book. In the example, the two local variables are activated only after the whole statement has been parsed, that is, after the last `table` token. Hence, the statement works as expected. Also, once the two local variables goes out of scope, `removevars()` is called to deactivate them, allowing other variables of the same name to become visible again. Another example worth mentioning is: [source, lua] local a, a, a, = 1, 2, 3 The above will assign 3 to `a`. Thus, when optimizing local variable names, (1) we need to consider accesses of global variable names affecting the namespace, (2) for the local variable names themselves, we need to consider when they are declared, activated and removed, and (3) within the “live” time of locals, we need to know when they are accessed (since locals that are never accessed don’t really matter.) === Local Variable Tracking Every local variable declaration is considered an object to be renamed. From the parser, we have the original name of the local variable, the token positions for declaration, activation and removal, and the token position for all the `TK_NAME` tokens which references this local. All instances of the implicit `self` local variable are also flagged as such. In addition to local variable information, all global variable accesses are tabled, one object entry for one name, and each object has a corresponding list of token positions for the `TK_NAME` tokens, which is where the global variables were accessed. The key criteria is: *Our act of renaming cannot change the visibility of any of these locals and globals at the time they are accessed*. However, _their scope of visibility may be changed during which they are not accessed_, so someone who tries to insert a variable reference somewhere into a program that has its locals renamed may find that it now refers to a different variable. Of course, if every variable has a unique name, then there is no need for a name allocation algorithm, as there will be no conflict. But, in order to maximize utilization of short identifier names to reduce the final code size, we want to reuse the names as much as possible. In addition, fewer names will likely reduce symbol entropy and may slightly improve compressibility of the source code. LuaSrcDiet avoids the use of non-ASCII letters, so there are only 53 single-character variable names. === Name Allocation Theory To understand the renaming algorithm, first we need to establish how different local and global variables can operate happily without interfering with each other. Consider three objects, local object A, local object B and global object G. A and B involve declaration, activation and removal, and within the period it is active, there may be zero or more accesses of the local. For G, there are only global variable accesses to look into. Assume that we have assigned a new name to A and we wish to consider its effects on other locals and globals, for which we choose B and G as examples. We assume local B has not been assigned a new name as we expect our algorithm to take care of collisions. A’s lifetime is something like this: ---- Decl Act Rem + +-------------------------------+ ------------------------------------------------- ---- where “Decl” is the time of declaration, “Act” is the time of activation, and “Rem” is the time of removal. Between “Act” and “Rem”, the local is alive or “live” and Lua can see it if its corresponding `TK_NAME` identifier comes up. ---- Decl Act Rem + +-------------------------------+ ------------------------------------------------- * * * * (1) (2) (3) (4) ---- Recall that the key criteria is to not change the visibility of globals and locals during when they are accessed. Consider local and global accesses at (1), (2), (3) and (4). A global G of the same name as A will only collide at (3), where Lua will see A and not G. Since G must be accessed at (3) according to what the parser says, and we cannot modify the positions of “Decl”, “Act” and “Rem”, it follows that A cannot have the same name as G. ---- Decl Act Rem + +-----------------------+ --------------------------------- (1)+ +---+ (2)+ +---+ (3)+ +---+ (4)+ +---+ --------- --------- --------- --------- ---- For the case of A and B having the same names and colliding, consider the cases for which B is at (1), (2), (3) or (4) in the above. (1) and (4) means that A and B are completely isolated from each other, hence in the two cases, A and B can safely use the same variable names. To be specific, since we have assigned A, B is considered completely isolated from A if B’s activation-to-removal period is isolated from the time of A’s first access to last access, meaning B’s active time will never affect any of A’s accesses. For (2) and (3), we have two cases where we need to consider which one has been activated first. For (2), B is active before A, so A cannot impose on B. But A’s accesses are valid while B is active, since A can override B. For no collision in the case of (2), we simply need to ensure that the last access of B occurs before A is activated. For (3), B is activated before A, hence B can override A’s accesses. For no collision, all of A’s accesses cannot happen while B is active. Thus position (3) follows the “A is never accessed when B is active” rule in a general way. Local variables of a child function are in the position of (3). To illustrate, the local B can use the same name as local A and live in a child function or block scope if each time A is accessed, Lua sees A and not B. So we have to check all accesses of A and see whether they collide with the active period of B. If A is not accessed during that period, then B can be active with the same name. The above appears to resolve all sorts of cases where the active times of A and B overlap. Note that in the above, the allocator does not need to know how locals are separated according to function prototypes. Perhaps the allocator can be simplified if knowledge of function structure is utilized. This scheme was implemented in a hurry in 2008 — it could probably be simpler if Lua grammar is considered, but LuaSrcDiet mainly processes various index values in tables. === Name Allocation Algorithm To begin with, the name generator is mostly separate from the name allocation algorithm. The name generator returns the next shortest name for the algorithm to apply to local variables. To attempt to reduce symbol entropy (which benefit compression algorithms), the name generator follows English frequent letter usage. There is also an option to calculate an actual symbol entropy table from the input data. Since there are 53 one-character identifiers and (53 * 63 - 4) two-character identifiers (minus a few keywords), there isn’t a pressing need to optimally maximize name reuse. The single-file version of LuaSrcDiet 0.12.0, at just over 3000 SLOC and 156 kiB in size, currently allocates around 55 unique local variable names. In theory, we should need no more than 260 local identifiers by default. Why? Since `LUAI_MAXVARS` is 200 and `LUAI_MAXUPVALUES` is 60, at any block scope, there can be at most `(LUAI_MAXVARS + LUAI_MAXUPVALUES)` locals referenced, or 260. Also, those from outer scopes not referenced in inner scopes can reuse identifiers. The net effect of this is that a local variable name allocation method should not allocate more than 260 identifier names for locals. The current algorithm is a simple first-come first-served scheme: [loweralpha] . One local object that use the most tokens is named first. . Any other non-conflicting locals with respect to the first object are assigned the same name. . Assigned locals are removed from consideration and the procedure is repeated for objects that have not been assigned new names. . Steps (a) to (c) repeats until no local objects are left. In addition, there are a few extra issues to take care of: [loweralpha, start=5] . Implicit `self` locals that have been flagged as such are already “assigned to” and so they are left unmodified. . The name generator skips `self` to avoid conflicts. This is not optimal but it is unlikely a script will use so many local variables as to reach `self`. . Keywords are also skipped for the name generator. . Global name conflict resolution. For (h), global name conflict resolution is handled just after the new name is generated. The name can still be used for some locals even if it conflicts with other locals. To remove conflicts, global variable accesses for the particular identifier name is checked. Any local variables that are active when a global access is made is marked to be skipped. The rest of the local objects can then use that name. The algorithm has additional code for handling locals that use the same name in the same scope. This extends the basic algorithm that was discussed earlier. For example: [source, lua] ---- local foo = 10 -- <1> ... local foo = 20 -- <2> ... print(e) ---- Since we are considering name visibility, the first `foo` does not really cease to exist when the second `foo` is declared, because if we were to make that assumption, and the first `foo` is removed before (2), then I should be able to use `e` as the name for the first `foo` and after (2), it should not conflict with variables in the outer scope with the same name. To illustrate: [source, lua] ---- local e = 10 -- 'foo' renamed to 'e' ... local t = 20 -- error if we assumed 'e' removed here ... print(e) ---- Since `e` is a global in the example, we now have an error as the name as been taken over by a local. Thus, the first `foo` local must have its active time extend to the end of the current scope. If there is no conflict between the first and second `foo`, the algorithm may still assign the same names to them. The current fix to deal with the above chains local objects in order to find the removal position. It may be possible to handle this in a clean manner – LuaSrcDiet handles it as a fix to the basic algorithm. == Ideas The following is a list of optimization ideas that do not require heavy-duty source code parsing and comprehension. === Lexer-Based Optimization Ideas * Convert long strings to normal strings, vice versa. + _A little desperate for a few bytes, can be done, but not real keen on implementing it._ * Special number forms to take advantage of constant number folding. + _For example, 65536 can be represented using 2^16^, and so on. An expression must be evaluated in the same way, otherwise this seems unsafe._ * Warn if a number has too many digits. + _Should we warn or “test and truncate”? Not really an optimization that will see much use._ * Warn of opportunity for using a `local` to zap a bunch of globals. + _Current recommendation is to use the HTML plugin to display globals in red. The developer can then visually analyze the source code and make the appropriate fixes. I think this is better than having the program guess the intentions of the developer._ * Spaces to tabs in comments, long comments, or long strings. + _For long strings, need to know user’s intention. Would rather not implement._ === Parser-Based Optimization Ideas Heavy-duty optimizations will need more data to be generated by the parser. A full AST may eventually be needed. The most attractive idea that can be quickly implemented with a significant code size “win” is to reduce the number of `local` keywords. * Remove unused ``local``s that can be removed in the source. + _Need to consider unused ``local``s in multiple assignments._ * Simplify declaration of ``local``s that can be merged. + _From:_ + [source, lua] ---- -- separate locals local foo local bar -- separate locals with assignments local foo = 123 local bar = "pqr" ---- + _To:_ + [source, lua] ---- -- merged locals local foo,bar -- merged locals with assignments local foo,bar=123,"pqr" ---- * Simplify declarations using `nil`. + _From:_ [source, lua] local foo, bar = nil, nil + _To:_ [source, lua] local foo,bar * Simplify ``return``s using `nil`. + _How desirable is this? From Lua list discussions, it seems to be potentially unsafe unless all return locations are known and checked._ * Removal of optional semicolons in statements and removal of commas or semicolons in table constructors. + _Yeah, this might save a few bytes._ * Remove table constructor elements using `nil`. + _Not sure if this is safe to do._ * Simplify logical or relational operator expressions. + _This is more suitable for an optimizing compiler project._