I'm doing my PhD in compression-based machine learning and wanted to contribute a few clarifying points.
The relationship between probabilistic modeling and lossless compression is very direct. A model that predicts the next symbol with probability p can, on average, losslessly compress that symbol to -log2(p) bits with the help of an entropy coder (e.g. arithmetic coding). Therefore improved probabilistic models immediately translate into improved lossless compressors.
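As a minimal sketch of this correspondence, the Shannon information content of a symbol predicted with probability p is exactly the code length an ideal entropy coder would assign to it:

```python
import math

def code_length_bits(p: float) -> float:
    """Ideal code length (in bits) for a symbol predicted with probability p.

    An arithmetic coder approaches this bound on average; a sharper
    prediction (p closer to 1) costs fewer bits.
    """
    return -math.log2(p)

print(code_length_bits(0.5))   # a 50/50 guess costs exactly 1 bit
print(code_length_bits(0.99))  # a confident, correct prediction costs ~0.014 bits
```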
There are two main ways people have used compressors for ML. The first is based on the Minimum Description Length (MDL) principle [0], which says the best model is the one that provides the shortest description of the data, counting the size of the model itself. This is similar to the technique in the blog post (argmin of code length across class-conditioned compressors), except that the post does not count the model size. Basically, you train one compressor per class and then pick the class whose compressor best compresses the test data. This is a maximum-likelihood argument via Shannon's source coding theorem: a code length of L bits corresponds to a probability of 2^-L. The second is the Normalized Compression Distance (NCD) [1], which uses code lengths to compute information-theoretic distances that can then be plugged into distance-based algorithms like kNN. So MDL interprets compressed lengths as likelihoods, while NCD interprets them as distances. The theoretical foundation for both is Kolmogorov complexity, the ideal (and uncomputable) lossless compressor, which is used to define information distance, an "ideal" distance metric based on algorithmic similarity.
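Both ideas fit in a few lines using an off-the-shelf compressor. This is a toy sketch (zlib as the code-length proxy, made-up two-class corpora), not the blog post's exact setup:

```python
import zlib

def clen(data: bytes) -> int:
    """Compressed length in bytes: a crude proxy for code length."""
    return len(zlib.compress(data, 9))

def classify(train: dict, test: bytes) -> str:
    """MDL-flavoured classification: pick the class whose corpus best
    'explains' the test item, i.e. minimises the marginal code length
    of the test data appended to that class's training data."""
    return min(train, key=lambda c: clen(train[c] + test) - clen(train[c]))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

train = {
    "en": b"the cat sat on the mat the dog ate the food ",
    "fr": b"le chat est assis sur le tapis le chien mange ",
}
print(classify(train, b"the cat sat on the mat"))  # the English corpus should win
print(ncd(b"abab" * 20, b"abab" * 20))  # near 0: identical strings are "close"
```

The NCD values here could feed directly into a kNN classifier as the distance function.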
Data compression is not in principle restricted to syntactic similarity. As others have mentioned, the new wave of neural compressors (e.g. NNCP [2], CMIX [3], which lead the Large Text Compression Benchmark [4]) outperform their traditional counterparts (e.g. gzip) because neural networks can learn complex semantic patterns. Their improved ability to predict the next token means improved lossless compression. This has also been shown to hold for pre-trained LLMs [5].
I think it's neat that improving compression improves machine learning, and improving machine learning improves compression!
Looking forward to hearing other thoughts.
[0] https://arxiv.org/abs/math/0406077
[1] https://arxiv.org/abs/cs/0111054
[2] https://bellard.org/nncp/nncp_v2.pdf
[3] https://www.byronknoll.com/cmix.html
[4] https://www.mattmahoney.net/dc/text.html
[5] https://arxiv.org/abs/2309.10668