If you care less about space efficiency and more about maintainability of the sc...

mike_hock · on July 5, 2023

If you care about maintainability, you keep the binary data out of the source file and have a build process.

xg15 · on July 6, 2023

I meant "maintainability" in the sense that you can open it in a text editor without either corrupting the file or having the editor crash.

It's also useful if you want to check the file into a git repo and want to keep using the diff and merge tools.

dopidopHN · on July 6, 2023

For something small, I would take a data.bin file rather than a build process. But yes.

mike_hock · on July 6, 2023

The "build process" in this case is concatenation.

saltcured · on July 5, 2023

You can also use here-documents to avoid hitting any argv length limits:

    { base64 -d | gunzip > output; } <<EOF12345
    ...data...
    EOF12345
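As a rough sketch of how such a script could be generated and run end to end: the file names `payload.txt` and `selfextract.sh` and the scratch directory are made up for illustration, and `base64 -d` assumes a GNU-style `base64`.

```shell
# Hypothetical file names, kept inside a scratch directory.
workdir=$(mktemp -d)
printf 'hello from the payload\n' > "$workdir/payload.txt"

# Assemble the script: shebang, the decoding command that opens the
# here-document, the gzip+base64 payload as its body, then the delimiter.
{
  printf '%s\n' '#!/bin/sh'
  printf '%s\n' '{ base64 -d | gunzip > output; } <<EOF12345'
  gzip -c "$workdir/payload.txt" | base64
  printf '%s\n' 'EOF12345'
} > "$workdir/selfextract.sh"

# Run it; the script writes "output" into its current directory.
(cd "$workdir" && sh selfextract.sh)
extracted=$(cat "$workdir/output")
echo "$extracted"

rm -rf "$workdir"
```

Quoting the delimiter (`<<'EOF12345'`) would switch off expansion inside the body. Base64 output contains no `$`, backtick, or backslash, so the unquoted form behaves the same here.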

sudobash1 · on July 5, 2023

An even simpler way would be to include a marker to denote the end of the shell script, and the start of the data. For example, if you put this in extract.sh

    #!/bin/sh
    sed -E '1,/^START-OF-TAR-DATA$/d' "$0" | tar xvzf -
    exit
    START-OF-TAR-DATA

and then run:

    cat extract.sh ../foobar.tar.gz > foobar.tar.gz.sh

You can then run foobar.tar.gz.sh to self-extract. You can still modify the shell script without needing to count lines or characters, and without sacrificing any compression.
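A self-contained way to try the marker approach, as a sketch only: the `foobar` directory, its `hello.txt` file, and the scratch directory are invented for the demo. Two details differ from the snippet above: `LC_ALL=C` is added as a precaution so that `sed` treats the binary tail as plain bytes, and `tar` runs without `v` to keep the output quiet.

```shell
workdir=$(mktemp -d)

# Build a small tarball to embed (hypothetical contents).
mkdir "$workdir/foobar"
printf 'some file contents\n' > "$workdir/foobar/hello.txt"
tar -czf "$workdir/foobar.tar.gz" -C "$workdir" foobar

# The extractor stub; its last line is the marker.
cat > "$workdir/extract.sh" <<'STUB'
#!/bin/sh
LC_ALL=C sed '1,/^START-OF-TAR-DATA$/d' "$0" | tar xzf -
exit
START-OF-TAR-DATA
STUB

# The "build step": stub first, tarball bytes right after the marker line.
cat "$workdir/extract.sh" "$workdir/foobar.tar.gz" > "$workdir/foobar.tar.gz.sh"

# Unpack into a separate directory and read one file back.
mkdir "$workdir/out"
(cd "$workdir/out" && sh ../foobar.tar.gz.sh)
unpacked=$(cat "$workdir/out/foobar/hello.txt")
echo "$unpacked"

rm -rf "$workdir"
```

The shell never parses the binary tail because `exit` runs before it gets that far, and `sed` deletes everything up to and including the marker line before `tar` sees the stream.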

oxygen_crisis · on July 6, 2023

You've already got a marker to denote the end of the shell script baked in there, it's ^exit$

withinboredom · on July 6, 2023

There may be more than one exit.

oxygen_crisis · on July 6, 2023

There will be only one that is neither preceded by indentation nor followed by an exit code, so that it could match ^exit$, unless you contrive some hypothetical nonsense purely for the sake of contrarianism.

Any reasonable person will indent a conditional exit within the block testing its condition, and more than one unconditional exit doesn't make sense.

8n4vidtmkvmk · on July 6, 2023

It would still make me uncomfortable that a second exit or a missing indent would break my script. And it's less clear than an explicit marker.

withinboredom · on July 6, 2023

Ah fair enough. I didn’t think about using the indent, that’s pretty clever.

dheera · on July 5, 2023

Is there an encoding that is less wasteful than base64 but not vulnerable to text editor corruption issues? I think avoiding 0x0 to 0x20 should be enough to not get corrupted by text editors, though base64 avoids a lot more than that.

ElectricalUnion · on July 5, 2023

If you can count on every printable ascii character being not-mangled, you can use ascii85/base85/Z85 (5 "ascii characters" to 4 bytes) instead of base64.

balou23 · on July 6, 2023

there's also base91, with an efficiency of 6.5 bits of data per printable character, compared to 6.4 with ascii85, 6.0 with base64
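The figures quoted above can be reproduced with a little arithmetic. This is a sketch that assumes the usual group sizes: 3 bytes to 4 characters for base64, 4 bytes to 5 characters for ascii85, and 13 bits to 2 characters for base91 (the grouping that yields 6.5 bits per character). The `row` helper is just a name chosen for this example.

```shell
report=$(awk 'BEGIN {
  # name, payload bits per group, output characters per group
  row("base64",  24, 4)
  row("ascii85", 32, 5)
  row("base91",  13, 2)
}
function row(name, bits, chars,    per_char) {
  per_char = bits / chars
  # Each output character occupies 8 bits on disk, so the size ratio is 8 / per_char.
  printf "%-8s %.1f bits/char, %.1f%% larger than the input\n", name, per_char, (8 / per_char - 1) * 100
}')
echo "$report"
```

So moving from base64 to ascii85 or base91 trims the overhead from roughly a third to roughly a quarter, which is a modest gain for a larger alphabet.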

raverbashing · on July 5, 2023

There's probably a base(bigger number) with Unicode chars today

bashinator · on July 5, 2023

base65536, and look who the author is :-D

https://github.com/qntm/base65536

cassianoleal · on July 5, 2023

Who is the author?

yccs27 · on July 5, 2023

https://www.qntm.org https://news.ycombinator.com/from?site=qntm.org

CaptainFever · on July 5, 2023

Base65536? https://github.com/qntm/base65536

sudobash1 · on July 5, 2023

But you need to make sure to use utf-16 or utf-32 instead of utf-8, or you may be worse off.
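To put rough numbers on the UTF-8 concern: only code points below U+0800 fit in one or two UTF-8 bytes, and there are just 2048 of those, so an alphabet of 65536 symbols has to consist mostly of 3-byte and 4-byte code points. The sketch below measures one of each. U+3400 and U+20000 are picked purely as examples of those two sizes; whether they belong to Base65536's actual alphabet is not checked here.

```shell
# UTF-8 bytes of U+3400 (E3 90 80) and U+20000 (F0 A0 80 80), written as octal escapes.
bmp_bytes=$(printf '\343\220\200' | wc -c | tr -d ' ')
astral_bytes=$(printf '\360\240\200\200' | wc -c | tr -d ' ')

# Each symbol carries 16 payload bits; compare with base64 at 6 bits per byte.
summary=$(awk -v b="$bmp_bytes" -v a="$astral_bytes" 'BEGIN {
  printf "3-byte code point: %.2f payload bits per stored byte\n", 16 / b
  printf "4-byte code point: %.2f payload bits per stored byte\n", 16 / a
  printf "base64:            %.2f payload bits per stored byte\n", 6
}')
echo "$summary"
```

In UTF-16 a code point from the Basic Multilingual Plane takes 2 bytes, which is where the density advantage would come from.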

dheera · on July 6, 2023

Those get mangled by text editors that don't support them.

Dylan16807 · on July 6, 2023

While a couple of people suggested Base65536, that encoding isn't particularly compact, and it can't be as elegant as 65536 would suggest because it has to dodge special cases in unicode.

It's almost always the case that either Base32768 is denser, or encodings with 2^17 or 2^20 characters are denser.

delusional · on July 5, 2023

At that point you're basically doing yEnc.

bazzargh · on July 6, 2023

if you mean the thing you want to encode is mostly-ascii, then https://en.wikipedia.org/wiki/Quoted-printable ... it's a real throwback, I've not seen this in the wild since the 90s, but it's there in the python standard library (quopri), perl (MIME::QuotedPrint) etc

Tepix · on July 6, 2023

Base85, also called Ascii85. Also yEnc.

doublerabbit · on July 5, 2023

base16

doublerabbit · on July 6, 2023

Woah, why the downvotes.

snowstormsun · on July 6, 2023

Beware that you often need to pass -n to echo so that it does not add a trailing newline when you base64 encode/decode something.

xg15 · on July 6, 2023

I think the -i flag when decoding should handle this and cause the decoder to skip the newlines.
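The newline difference is easy to see on a short string. This sketch uses `printf '%s'` in place of `echo -n`, since `echo -n` is not portable across shells, and it assumes a GNU-style `base64` for the decoding step.

```shell
# echo appends a newline, which ends up inside the encoded data.
with_newline=$(echo hello | base64)
# printf '%s' writes exactly the five bytes of "hello".
without_newline=$(printf '%s' hello | base64)
echo "$with_newline"
echo "$without_newline"

# Decoding side: line breaks inside the encoded text do not change the result.
decoded=$(printf 'aGVs\nbG8=\n' | base64 -d)
echo "$decoded"
```

With GNU coreutils, the decoder accepts newlines in its input whether or not `-i` is given; that flag is for skipping other non-alphabet characters.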

AlDante2 · on July 5, 2023

Just to be sure I’m following you correctly, what is the advantage of zipping the base64 data vs having the original binary, zipped if you like?

newx · on July 6, 2023

As I understood, you base64 the zipped data on input and the other way around on output.

The reasoning being that the base64'd binary data is safe from being corrupted when the file is edited in text editors, as a response to the warning stated on the last paragraph of the original post.

xg15 · on July 6, 2023

The idea is to first zip the binary, then base64 the zipped data. Conversely, the script first decodes the base64 to a zipped binary, then unzips the binary.

It's just to mitigate the wastefulness of base64 encoding. You end up with a file that is text editor friendly and not quite as bloated as directly encoding the binary would be - but of course the file is still larger than simply appending the binary directly like in the OP.

Also, if you don't care about text editor friendliness, you could indeed just zip the binary and then append it to the script for an even smaller file.
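The size trade-off can be measured directly. This sketch uses a deliberately repetitive sample file (invented for the demo) so that gzip has something to compress. With input that is already compressed or random, gzip saves nothing and the ordering shown here would not hold.

```shell
workdir=$(mktemp -d)

# 2000 identical lines: highly compressible sample input.
i=0
while [ "$i" -lt 2000 ]; do
  echo "the quick brown fox jumps over the lazy dog"
  i=$((i + 1))
done > "$workdir/sample.txt"

raw_size=$(wc -c < "$workdir/sample.txt" | tr -d ' ')
b64_size=$(base64 < "$workdir/sample.txt" | wc -c | tr -d ' ')
gz_size=$(gzip -c "$workdir/sample.txt" | wc -c | tr -d ' ')
gz_b64_size=$(gzip -c "$workdir/sample.txt" | base64 | wc -c | tr -d ' ')

echo "raw:           $raw_size bytes"
echo "base64 only:   $b64_size bytes"
echo "gzip only:     $gz_size bytes"
echo "gzip + base64: $gz_b64_size bytes"

rm -rf "$workdir"
```

For compressible input like this, gzip alone is smallest, gzip followed by base64 comes next, and base64 of the uncompressed file is the largest of the four.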

floomk · on July 6, 2023

The build process would produce the concatenated file.