Lempel Ziv Markov Algorithm and 7-Zip

I’ve been working with the LZMA compression algorithm and code for a little while playing around with rewriting it in C and just understanding the algorithm and how it works. Certainly it’s attracted some attention with its impressive compression ratios, especially for just random bits of one’s disk. However, I’ve also found a lot of confusion about what is LZMA and what is 7-Zip. As in some ways 7-Zip is LZMA’s interface and in some ways and it’s completely orthogonal to a discussion about LZMA. So hopefully these questions and answers can clarify some things for folks just starting to look into this cool compression algorithm.

What is LZMA and what’s it have to do with 7za?

Lempel-Ziv-Markov-Algorithm (LZMA) is a very nifty compression algorithm recently designed by Igor Pavlov. The algorithm is implemented in a few different software packages, but it’s most often discussed for being in the 7-Zip archiver 7za(1).

The 7-Zip archiver, by Pavlov as well, was originally written for Windows but later ported to Unix. The 7-Zip archiver is an archiver (like tar(1) or cpio(1)), which outputs in the 7z Format. The format is open and supports multiple compression algorithms. LZMA compression is the default for the format, however, it also supports bzip2 and deflate (gzip), amongst others. The format supports (AES-256) encryption, but not being designed for a Unix system originally UID/GID pairs aren’t stored, similarly directories aren’t (or at least don’t seem) supported on Unix systems – a significant drawback for an archive format.

Why would I use LZMA?

LZMA is a kick-butt compression algorithm for size’s sake. It does a super job of compressing files usually providing a 10% addition to bzip2 compression, but not always. For example, a SPARC binary /usr/bin/gimp-2.4 on my Ultra 45 is 4,997KB and LZMA compressed it to 1,411KB for a savings of 72%, versus bzip2 -9 which resulted in a 1,923KB file size for a savings of 62% only. Again, a SPARC core dump lying around from some application test was 3,588KB which LZMA got to 2,051KB (a 43% savings) and bzip2 -9 got 2,307KB (for a 36% savings). Meanwhile, for a 130KB English LaTeX document in ASCII encoding using LZMA compression compressed it to 11KB (91.5% savings), but bzip2 -9 achieved 10KB (92.3% savings). (Similar results occur for my /usr/sfw/man/man1/gcc.1 where I got for the original,LZMA, and bzip2 -9 514K,104K,99K respectively.) Oh well, can’t win them all. Still, on random chunks of my disk using dd(1) LZMA seems to do closer to the binary performance than anything (likely due to the composition of my disk being largely binary files).

Well LZMA looks cool, tell me more:

LZMA though new, and very little documented, is an adaptation of LZ77 with the goal of large compression and fast decompression. It uses range encoding (to deflate’s (gzip) Huffman coding), and uses a large dictionary as necessary (up to ~1GB) which is searched with various hashing algorithms stored applied to various binary tree algorithms or a hashed array of lists. There is an on-disk format for LZMA. The LZMA on-disk format is different from the 7z Format which is not solely a wrapper for the LZMA on-disk format – one must transcode/convert them. The LZMA format is very straight-forward: first, the properties used by the LZMA encoder, then, the dictionary size, and uncompressed size, followed by the actual compressed data.

The compression algorithm has been primarily written in C++ (but C# and Java ports are available too). A C decompression example is provided by the LZMA SDK, however, no non-OO port of the compression algorithm seems yet available.

So why do I care about 7za(1) anyways?

Though some other tools support LZMA and have been written to take advantage of it, few tools allow one to compress a file with LZMA other than 7za(1), which currently is a feature rich “LZMAzip” akin to gzip and bzip2. The easiest way to use 7za(1) for compression is with the standard in/out -si and -so flags to make it part of a tar(1) or cpio(1) pipeline as previous compression utilities have been used for years, just remember your data will be compressed not just in LZMA, but LZMA and the 7z Format too (not an overhead worry – just something to understand). If you’d like to play with 7za(1) now, you can. If you’re running Nevada build 79 or newer, p7zip version 4.55 went back into the SFW Consolidation on November 16th making /usr/bin/7za available.

6 responses to “Lempel Ziv Markov Algorithm and 7-Zip”

  1. mohammad tahghghi

    Hi Dear sir

    Could you help me to know more about LZMA algorithm ?

    Any resource (paper or note) exist to help me ?

    I need detail of implementation of LZMA , please help me.

  2. diana

    me too..

    i need to analyze LZMA algorythm for my final project..

    thx :)

  3. daphne

    Hi Sir,

    I am currently studying LZMA SDK as I need to enhance it for my final year project.I would like to ask you is that LZMA written by the writer in C# version does not support folder compression?I have tried quite many commands but failed.If it does, can you tell me what command is that?

    Thank you very much for your help.=)

  4. Clayton Baenziger

    Hi Mohammad and Diana,
    My apologies for not getting back sooner; however, I used the book _Data Compression: The Complete Reference_ by David Salomon for a better overview of the LZMA algorithm. It’s not a very thorough analysis but it’s pretty reasonable. Igor Pavlov was very helpful in the C implementation and of course my coworker Alok Aggarwal was did a significant amount of work too.

  5. Clayton Baenziger

    Hi Daphne,

    Have you thought about approaching folder compression differently. For example, in traditional unix form the command tar(1) is often used to "package" a folder into a single file. Perhaps using such a method would help you out? Certainly, one can use` tar -czf – <directory> | lzma -dc > directory.tar.lzma` perhaps this insight will be of use. Good luck.

  6. Yagnesh Dave

    hello sir,
    i i want to know the working of LZM 7- algorithm.how the compression occurs by this algorithm.i want the details of this algorithm.the how it works in compression-decompression. so plz sir send me information in my this email id:-yagneshdave86@gmail.com

Leave a Reply