= Features and Usage Kein-Hong Man 2011-09-13 == Features LuaSrcDiet features include the following: * Predefined default, _--basic_ (token-only) and _--maximum_ settings. * Avoid deleting a block comment with a certain message with _--keep_; this is for copyright or license texts. * Special handling for `#!` (shbang) lines and in functions, `self` implicit parameters. * Dumping of raw information using _--dump-lexer_ and _--dump-parser_. See the `samples` directory. * A HTML plugin: outputs files that highlights globals and locals, useful for eliminating globals. See the `samples` directory. * An SLOC plugin: counts significant lines of Lua code, like SLOCCount. * Source and binary equivalence testing with _--opt-srcequiv_ and _--opt-binequiv_. List of optimizations: * Line endings are always normalized to LF, except those embedded in comments or strings. * _--opt-comments_: Removal of comments and comment blocks. * _--opt-whitespace_: Removal of whitespace, excluding end-of-line characters. * _--opt-emptylines_: Removal of empty lines. * _--opt-eols_: Removal of unnecessary end-of-line characters. * _--opt-strings_: Rewrite strings and long strings. See the `samples` directory. * _--opt-numbers_: Rewrite numbers. See the `samples` directory. * _--opt-locals_: Rename local variable names. Does not rename field or method names. * _--opt-entropy_: Tries to improve symbol entropy when renaming locals by calculating actual letter frequencies. * _--opt-experimental_: Apply experimental optimizations. LuaSrcDiet tries to allow each option to be enabled or disabled separately, but they are not completely orthogonal. If comment removal is disabled, LuaSrcDiet only removes trailing whitespace. Trailing whitespace is not removed in long strings, a warning is generated instead. If empty line removal is disabled, LuaSrcDiet keeps all significant code on the same lines. Thus, a user is able to debug using the original sources as a reference since the line numbering is unchanged. String optimization deals mainly with optimizing escape sequences, but delimiters can be switched between single quotes and double quotes if the source size of the string can be reduced. For long strings and long comments, LuaSrcDiet also tries to reduce the `=` separators in the delimiters if possible. For number optimization, LuaSrcDiet saves space by trying to generate the shortest possible sequence, and in the process it does not produce “proper” scientific notation (e.g. 1.23e5) but does away with the decimal point (e.g. 123e3) instead. The local variable name optimizer uses a full parser of Lua 5.1 source code, thus it can rename all local variables, including upvalues and function parameters. It should handle the implicit `self` parameter gracefully. In addition, local variable names are either renamed into the shortest possible names following English frequent letter usage or are arranged by calculating entropy with the _--opt-entropy_ option. Variable names are reused whenever possible, reducing the number of unique variable names. For example, for `LuaSrcDiet.lua` (version 0.11.0), 683 local identifiers representing 88 unique names were optimized into 32 unique names, all which are one character in length, saving over 2600 bytes. If you need some kind of reassurance that your app will still work at reduced size, see the section on verification below. == Usage LuaSrcDiet needs a Lua 5.1.x (preferably Lua 5.1.4) binary to run. On Unix machines, one can use the following command line: [source, sh] LuaSrcDiet myscript.lua -o myscript_.lua On Windows machines, the above command line can be used on Cygwin, or you can run Lua with the LuaSrcDiet script like this: [source, sh] lua LuaSrcDiet.lua myscript.lua -o myscript_.lua When run without arguments, LuaSrcDiet prints a list of options. Also, you can check the `Makefile` for some examples of command lines to use. For example, for maximum code size reduction and maximum verbosity, use: [source, sh] LuaSrcDiet --maximum --details myscript.lua -o myscript_.lua === Output Example A sample output of LuaSrcDiet 0.11.0 for processing `llex.lua` at _--maximum_ settings is as follows: ---- Statistics for: LuaSrcDiet.lua -> sample/LuaSrcDiet.lua *** local variable optimization summary *** ---------------------------------------------------------- Variable Unique Decl. Token Size Average Types Names Count Count Bytes Bytes ---------------------------------------------------------- Global 10 0 19 95 5.00 ---------------------------------------------------------- Local (in) 88 153 683 3340 4.89 TOTAL (in) 98 153 702 3435 4.89 ---------------------------------------------------------- Local (out) 32 153 683 683 1.00 TOTAL (out) 42 153 702 778 1.11 ---------------------------------------------------------- *** lexer-based optimizations summary *** -------------------------------------------------------------------- Lexical Input Input Input Output Output Output Elements Count Bytes Average Count Bytes Average -------------------------------------------------------------------- TK_KEYWORD 374 1531 4.09 374 1531 4.09 TK_NAME 795 3963 4.98 795 1306 1.64 TK_NUMBER 54 59 1.09 54 59 1.09 TK_STRING 152 1725 11.35 152 1717 11.30 TK_LSTRING 7 1976 282.29 7 1976 282.29 TK_OP 997 1092 1.10 997 1092 1.10 TK_EOS 1 0 0.00 1 0 0.00 -------------------------------------------------------------------- TK_COMMENT 140 6884 49.17 1 18 18.00 TK_LCOMMENT 7 1723 246.14 0 0 0.00 TK_EOL 543 543 1.00 197 197 1.00 TK_SPACE 1270 2465 1.94 263 263 1.00 -------------------------------------------------------------------- Total Elements 4340 21961 5.06 2841 8159 2.87 -------------------------------------------------------------------- Total Tokens 2380 10346 4.35 2380 7681 3.23 -------------------------------------------------------------------- ---- Overall, the file size is reduced by more than 9 kiB. Tokens in the above report can be classified into “real” or actual tokens, and “fake” or whitespace tokens. The number of “real” tokens remained the same. Short comments and long comments were completely eliminated. The number of line endings was reduced by 59, while all but 152 whitespace characters were optimized away. So, token separators (whitespace, including line endings) now takes up just 10 % of the total file size. No optimization of number tokens was possible, while 2 bytes were saved for string tokens. For local variable name optimization, the report shows that 38 unique local variable names were reduced to 20 unique names. The number of identifier tokens should stay the same (there is currently no optimization option to optimize away non-essential or unused “real” tokens). Since there can be at most 53 single-character identifiers, all local variables are now one character in length. Over 600 bytes was saved. _--details_ will give a longer report and much more information. A sample output of LuaSrcDiet 0.12.0 for processing the one-file `LuaSrcDiet.lua` program itself at _--maximum_ and _--opt-experimental_ settings is as follows: ---- *** local variable optimization summary *** ---------------------------------------------------------- Variable Unique Decl. Token Size Average Types Names Count Count Bytes Bytes ---------------------------------------------------------- Global 27 0 51 280 5.49 ---------------------------------------------------------- Local (in) 482 1063 4889 21466 4.39 TOTAL (in) 509 1063 4940 21746 4.40 ---------------------------------------------------------- Local (out) 55 1063 4889 4897 1.00 TOTAL (out) 82 1063 4940 5177 1.05 ---------------------------------------------------------- *** BINEQUIV: binary chunks are sort of equivalent Statistics for: LuaSrcDiet.lua -> app_experimental.lua *** lexer-based optimizations summary *** -------------------------------------------------------------------- Lexical Input Input Input Output Output Output Elements Count Bytes Average Count Bytes Average -------------------------------------------------------------------- TK_KEYWORD 3083 12247 3.97 3083 12247 3.97 TK_NAME 5401 24121 4.47 5401 7552 1.40 TK_NUMBER 467 494 1.06 467 494 1.06 TK_STRING 787 7983 10.14 787 7974 10.13 TK_LSTRING 14 3453 246.64 14 3453 246.64 TK_OP 6381 6861 1.08 6171 6651 1.08 TK_EOS 1 0 0.00 1 0 0.00 -------------------------------------------------------------------- TK_COMMENT 1611 72339 44.90 1 18 18.00 TK_LCOMMENT 18 4404 244.67 0 0 0.00 TK_EOL 4419 4419 1.00 1778 1778 1.00 TK_SPACE 10439 24475 2.34 2081 2081 1.00 -------------------------------------------------------------------- Total Elements 32621 160796 4.93 19784 42248 2.14 -------------------------------------------------------------------- Total Tokens 16134 55159 3.42 15924 38371 2.41 -------------------------------------------------------------------- * WARNING: before and after lexer streams are NOT equivalent! ---- The command line was: [source, sh] lua LuaSrcDiet.lua LuaSrcDiet.lua -o app_experimental.lua --maximum --opt-experimental --noopt-srcequiv The important thing to note is that while the binary chunks are equivalent, the source lexer streams are not equivalent. Hence, the _--noopt-srcequiv_ makes LuaSrcDiet report a warning for failing the source equivalence test. `LuaSrcDiet.lua` was reduced from 157 kiB to about 41.3 kiB. The _--opt-experimental_ option saves an extra 205 bytes over standard _--maximum_. Note the reduction in `TK_OP` count due to a reduction in semicolons and parentheses. `TK_SPACE` has actually increased a bit due to semicolons that are changed into single spaces; some of these spaces could not be removed. For more performance numbers, see the <> page. == Verification Code size reduction can be quite a hairy thing (even I peer at the results in suspicion), so some kind of verification is desirable for users who expect processed files to _not_ blow up. Since LuaSrcDiet has been talked about as a tool to reduce code size in projects such as WoW add-ons, `eLua` and `nspire`, adding a verification step will reduce risk for all users of LuaSrcDiet. LuaSrcDiet performs two kinds of equivalence testing as of version 0.12.0. The two tests can be very, very loosely termed as _source equivalence testing_ and _binary equivalence testing_. They are controlled by the _--opt-srcequiv_ and _--opt-binequiv_ options and are enabled by default. Testing behaviour can be summarized as follows: * Both tests are always executed. The options control the resulting actions taken. * Both options are normally enabled. This will make any failing test to throw an error. * When an option is disabled, LuaSrcDiet will at most print a warning. * For passing results, see the following subsections that describe what the tests actually does. You only need to disable a testing option for experimental optimizations (see the following section for more information on this). For anything up to and including _--maximum_, both tests should pass. If any test fail under these conditions, then something has gone wrong with LuaSrcDiet, and I would be interested to know what has blown up. === _--opt-srcequiv_ Source Equivalence The source equivalence test uses LuaSrcDiet’s lexer to read and compare the _before_ and _after_ lexer token streams. Numbers and strings are dumped as binary chunks using `loadstring()` and `string.dump()` and the results compared. If your file passes this test, it means that a Lua 5.1.x binary should see the exact same token streams for both _before_ and _after_ files. That is, the parser in Lua will see the same lexer sequence coming from the source for both files and thus they _should_ be equivalent. Touch wood. Heh. However, if you are _cross-compiling_, it may be possible for this test to fail. Experienced Lua developers can modify `equiv.lua` to handle such cases. === _--opt-binequiv_ Binary Equivalence The binary equivalence test uses `loadstring()` and `string.dump()` to generate binary chunks of the entire _before_ and _after_ files. Also, any shbang (`#!`) lines are removed prior to generation of the binary chunks. The binary chunks are then run through a fake `undump` routine to verify the integrity of the binary chunks and to compare all parts that ought to be identical. On a per-function prototype basis (where _ignored_ means that any difference between the two binary chunks is ignored): * All debug information is ignored. * The source name is ignored. * Any line number data is ignored. For example, `linedefined` and `lastlinedefined`. The rest of the two binary chunks must be identical. So, while the two are not binary-exact, they can be loosely termed as “equivalent” and should run in exactly the same manner. Sort of. You get the idea. This test may also cause problems if you are _cross-compiling_. == Experimental Stuff The _--opt-experimental_ option applies experimental optimizations that generally, makes changes to “real” tokens. Such changes may or may not lead to the result failing binary chunk equivalence testing. They would likely fail source lexer stream equivalence testing, so the _--noopt-srcequiv_ option needs to be applied so that LuaSrcDiet just gives a warning instead of an error. For sample files, see the `samples` directory. Currently implemented experimental optimizations are as follows: === Semicolon Operator Removal The semicolon (`;`) operator is an optional operator that is used to separate statements. The optimization turns all of these operators into single spaces, which are then run through whitespace removal. At worst, there will be no change to file size. * _Fails_ source lexer stream equivalence. * _Passes_ binary chunk equivalence. === Function Call Syntax Sugar Optimization This optimization turns function calls that takes a single string or long string parameter into its syntax-sugar representation, which leaves out the parentheses. Since strings can abut anything, each instance saves 2 bytes. For example, the following: [source, lua] fish("cow")fish('cow')fish([[cow]]) is turned into: [source, lua] fish"cow"fish'cow'fish[[cow]] * _Fails_ source lexer stream equivalence. * _Passes_ binary chunk equivalence. === Other Experimental Optimizations There are two more of these optimizations planned, before focus is turned to the Lua 5.2.x series: * Simple `local` keyword removal. Planned to work for a few kinds of patterns only. * User directed name replacement, which will need user input to modify names or identifiers used in table keys and function methods or fields.