If you care less about space efficiency and more about maintainability of the script, you can also encode the binary as base64 and put a line like this in your script:

  echo '...base64 data...' | base64 -d > somefile

Or add compression to reclaim at least some of the wasted space:

  echo '...base64 gzipped data...' | base64 -d | gunzip > somefile
Also note that bash accepts line breaks in quoted strings, and the base64 utility has an "ignore garbage" option (-i in GNU coreutils) that lets it skip over e.g. whitespace in its input. You can use those to break up the base64 over multiple lines:

  echo '
    ...base64 gzipped data...
    ...more data...
    ...even more data...
  ' | base64 -di | gunzip > somefile
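For anyone who wants to try the round trip end to end, here is a minimal sketch. It assumes GNU coreutils base64 (other implementations may spell the decode and ignore-garbage flags differently), and the ./b64-demo directory and file names are made up for the example:

```shell
#!/bin/sh
# Round trip: compress, base64-encode, then decode an indented copy.
set -eu
work=./b64-demo            # hypothetical scratch directory
mkdir -p "$work"

# A small "binary" file containing a NUL byte.
printf 'hello\000binary world\n' > "$work/original.bin"

# Producing the text you would paste into the script:
# compress first, then base64 the compressed bytes.
gzip -c "$work/original.bin" | base64 > "$work/payload.b64"

# Indent every line to mimic a payload pasted into an indented quoted string.
sed 's/^/    /' "$work/payload.b64" > "$work/indented.b64"

# -d decodes, -i skips the bytes outside the base64 alphabet (the spaces).
base64 -di < "$work/indented.b64" | gunzip > "$work/restored.bin"

cmp "$work/original.bin" "$work/restored.bin" && echo "round trip ok"
```

The order matters: gzip before base64 when encoding, base64 -d before gunzip when decoding.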


If you care about maintainability, you keep the binary data out of the source file and have a build process.


I meant "maintainability" in the sense that you can open it in a text editor without either corrupting the file or having the editor crash.

It's also useful if you want to check the file into a git repo and want to keep using the diff and merge tools.


For something small, I would take a data.bin file rather than a build process. But yes.


The "build process" in this case is concatenation.


You can also use here-documents to avoid hitting any argv length limits:

    { base64 -d | gunzip > output; } <<EOF12345
    ...data...
    EOF12345
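A sketch of how such a script could be generated rather than pasted together by hand. The ./heredoc-demo directory, the unpack.sh name, and the optional output-path argument are all made up for the example, and it assumes a base64 that decodes with -d:

```shell
#!/bin/sh
# Generate a script that carries its payload in a here-document, then run it.
set -eu
work=./heredoc-demo        # hypothetical scratch directory
mkdir -p "$work"
printf 'payload to embed\n' > "$work/input"

{
  printf '%s\n' '#!/bin/sh'
  # The quoted delimiter stops the shell from expanding anything in the body.
  printf '%s\n' "{ base64 -d | gunzip > \"\${1:-output}\"; } <<'EOF12345'"
  gzip -c "$work/input" | base64
  printf '%s\n' 'EOF12345'
} > "$work/unpack.sh"

# Run the generated script; its first argument is where to write the data.
sh "$work/unpack.sh" "$work/output"
cmp "$work/input" "$work/output" && echo "here-document payload ok"
```

If you are worried about the delimiter colliding with a line of data, pick one containing a character outside the base64 alphabet, such as an underscore.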


An even simpler way would be to include a marker to denote the end of the shell script, and the start of the data. For example, if you put this in extract.sh

    #!/bin/sh
    sed -E '1,/^START-OF-TAR-DATA$/d' "$0" | tar xvzf -
    exit
    START-OF-TAR-DATA
and then run:

    cat extract.sh ../foobar.tar.gz > foobar.tar.gz.sh
You can then run foobar.tar.gz.sh to self-extract. You can still modify the shell script without needing to count lines or characters, and you don't sacrifice any compression.
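Here is the whole procedure as one runnable sketch, in case anyone wants to test it. The ./sfx-demo directory and the greeting.txt file are invented for the example, and I dropped the v flag from tar to keep the output quiet:

```shell
#!/bin/sh
# Build a self-extracting archive with a marker line, then run it.
set -eu
work=./sfx-demo            # hypothetical scratch directory
rm -rf "$work"
mkdir -p "$work/src" "$work/out"

# Something to archive.
printf 'hello from the archive\n' > "$work/src/greeting.txt"
tar -czf "$work/foobar.tar.gz" -C "$work/src" greeting.txt

# The stub: delete everything up to and including the marker line,
# and feed the rest (the raw tar.gz bytes) to tar.
cat > "$work/extract.sh" <<'STUB'
#!/bin/sh
sed '1,/^START-OF-TAR-DATA$/d' "$0" | tar xzf -
exit
START-OF-TAR-DATA
STUB

# The "build process": plain concatenation.
cat "$work/extract.sh" "$work/foobar.tar.gz" > "$work/foobar.tar.gz.sh"

# Run it from the directory where the files should land.
(cd "$work/out" && sh ../foobar.tar.gz.sh)
cat "$work/out/greeting.txt"
```

The exit line matters: without it the shell would carry on and try to interpret the archive bytes as commands.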


You've already got a marker to denote the end of the shell script baked in there: it's ^exit$


There may be more than one exit.


There will be only one that is neither preceded by indentation nor followed by an exit code, so that it could match ^exit$, unless you contrive some hypothetical nonsense purely for the sake of contrarianism.

Any reasonable person will indent a conditional exit within the block testing its condition, and more than one unconditional exit doesn't make sense.


It would still make me uncomfortable that a second exit or a missing indent would break my script. And it's less clear than an explicit marker.


Ah fair enough. I didn’t think about using the indent, that’s pretty clever.


Is there an encoding that is less wasteful than base64 but not vulnerable to text editor corruption issues? I think avoiding 0x0 to 0x20 should be enough to not get corrupted by text editors, though base64 avoids a lot more than that.


If you can count on every printable ascii character being not-mangled, you can use ascii85/base85/Z85 (5 "ascii characters" to 4 bytes) instead of base64.


There's also base91, with an efficiency of 6.5 bits of data per printable character, compared to 6.4 with ascii85 and 6.0 with base64.
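Those figures can be checked with a bit of arithmetic. This is a rough sketch: bits_per_char is a helper name made up here, and log2 of the alphabet size is only a ceiling for a given alphabet, not what a particular encoder achieves:

```shell
#!/bin/sh
# bits_per_char N: information one symbol from an N-symbol alphabet can carry.
bits_per_char() {
  awk -v n="$1" 'BEGIN { printf "%.2f\n", log(n) / log(2) }'
}

echo "base64 : $(bits_per_char 64) bits/char"
echo "base91 : $(bits_per_char 91) bits/char (ceiling for 91 symbols)"

# Ascii85 packs 4 bytes (32 bits) into 5 characters.
awk 'BEGIN { printf "ascii85: %.2f bits/char\n", 32 / 5 }'

# Size overhead relative to raw bytes: 8 bits per byte / bits per char.
awk 'BEGIN {
  printf "overhead: base64 %.0f%%, ascii85 %.0f%%\n",
         (8 / 6 - 1) * 100, (8 / 6.4 - 1) * 100
}'
```

So base64 costs about 33% extra and ascii85 about 25%, before any compression.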


There's probably a base(bigger number) with Unicode chars today


base65536, and look who the author is :-D

https://github.com/qntm/base65536


Who is the author?




But you need to make sure to use utf-16 or utf-32 instead of utf-8, or you may be worse off.


Those get mangled by text editors that don't support them.


While a couple of people suggested Base65536, that encoding isn't particularly compact, and it can't be as elegant as 65536 would suggest because it has to dodge special cases in Unicode.

It's almost always the case that either Base32768 is denser, or encodings with 2^17 or 2^20 characters are denser.


At that point you're basically doing yEnc.


If you mean the thing you want to encode is mostly ASCII, then there's https://en.wikipedia.org/wiki/Quoted-printable. It's a real throwback; I've not seen it in the wild since the 90s, but it's in the Python standard library (quopri), Perl (MIME::QuotedPrint), etc.


Base85, also called Ascii85. Also yEnc.


base16


Whoa, why the downvotes?


Beware that you often need to pass -n to echo so that it doesn't include a trailing newline when you base64 encode/decode something.


I think the -i flag when decoding should handle this and cause the decoder to skip the newlines.
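A quick sketch of where the newline actually bites. It uses printf instead of echo -n, since echo -n is not portable across shells, and the variable names are only for illustration. As far as I know, GNU base64 accepts newlines when decoding even without -i, so the problem is mostly on the encoding side:

```shell
#!/bin/sh
# echo appends a newline, and that newline becomes part of the encoded data.
with_newline=$(echo hi | base64)            # encodes the bytes "hi\n"
without_newline=$(printf '%s' hi | base64)  # encodes the bytes "hi"

echo "$with_newline"      # aGkK
echo "$without_newline"   # aGk=

# When decoding, the newline that echo adds after the base64 text is harmless.
decoded=$(echo 'aGk=' | base64 -d)
echo "$decoded"           # hi
```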


Just to be sure I’m following you correctly, what is the advantage of zipping the base64 data vs having the original binary, zipped if you like?


As I understood, you base64 the zipped data on input and the other way around on output.

The reasoning is that the base64'd binary data is safe from being corrupted when the file is edited in text editors, as a response to the warning stated in the last paragraph of the original post.


The idea is to first zip the binary, then base64 the zipped data. Conversely, the script first decodes the base64 to a zipped binary, then unzips the binary.

It's just to mitigate the wastefulness of base64 encoding. You end up with a file that is text editor friendly and not quite as bloated as directly encoding the binary would be - but of course the file is still larger than simply appending the binary directly like in the OP.

Also, if you don't care about text editor friendliness, you could indeed just zip the binary and then append it to the script for an even smaller file.


The build process would produce the concatenated file.



