Project Description

Directly Compress Generic Lists of Data with Custom Codecs, Transforms, and Finishers.

Changes

Version 1.4 includes a few breaking changes that are not yet covered in the documentation, but I wanted to get this release out because it adds support for two additional types, Char and String. This brings the total number of supported simple types to 17. It also adds support for directly compressing generic Dictionaries (Maps). When I get a chance I'll add a full set of MapTests covering typical key-value combinations (e.g. DateTime-Double), but performance should be almost exactly the same as for the other multi-field equivalents.

Summary

This library makes it extremely easy to compress data for cache, database, or file storage, and for very fast network transmission. It includes codecs for a variety of existing algorithms:
  • MS Deflate
  • DotNetZip (Deflate, Zlib, BZip2)
  • SharpZipLib (Deflate, BZip2)
  • QuickLZ
  • LZ4
  • SevenZip LZMA (in the next release)
Additional algorithms can be "wrapped" into the framework with only a few lines of code.
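For example, wrapping a stream-based algorithm generally just means implementing a finisher around it. The following is only a rough sketch (the Encode/Decode signatures are assumptions, not the framework's actual finisher contract), using the standard GZipStream from the .NET base class library:

// Rough sketch only: the Encode/Decode signatures below are assumptions about
// the finisher contract; adapt them to the framework's actual interface.
using System.IO;
using System.IO.Compression;

public class GZipFinisher
{
    public byte[] Encode(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(data, 0, data.Length);
            return output.ToArray();
        }
    }

    public byte[] Decode(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}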

The simplest way to use a codec implemented in this framework looks something like this:

var list = new List<DateTime> { DateTime.Now.Date, DateTime.Now.Date.AddDays(1) };
var codec = DeflateCodec.Instance;
var bytes = codec.Encode(list);               // compress the list to a byte array
var listOut = codec.Decode<DateTime>(bytes);  // decompress back to a list of DateTime

As you can see, lists of the basic data types are first-class citizens, and it is very simple to create derivations that handle more complex data structures. Of course, a time series will usually include at least two fields, so we also include support for several multi-field data types (Dictionaries, Tuples, etc.).

We define one example data type called Struple (structure tuple) that can handle up to 17 fields. The number of fields was chosen because this is the number of fundamental types supported, and this allows them to all be encoded at once for testing purposes.
var list = new List<Struple<DateTime, Double>>
   { new Struple<DateTime, Double> { Item1 = DateTime.Now.Date, Item2 = 2345.25 }, ... };
var args = new StrupleEncodingArgs<DateTime, Double>(list, level: CompressionLevel.Fastest)
   { Granularity1 = DateTime.MinValue.AddDays(1), Granularity2 = 0.25 };
var bytes = codec.Encode(args);   // same codec instance as in the first example

Obviously, for each field we may want to specify different arguments (hence StrupleEncodingArgs). For custom types these will usually be preconfigured, but for ad hoc use you can simply specify global arguments in the constructor and individual field arguments, such as granularity, in the object initializer.

Struple is handy for ad hoc usage, but we highly recommend defining strongly typed methods on a derived codec for your own custom data types. This is quite easy to do if you just follow the same pattern; a rough sketch is shown below. Tuples are also supported for up to 7 fields, and this can easily be extended to use the logic of an eighth field ("TRest") for daisy-chaining tuples.
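
As an illustration of that pattern, here is a sketch of a strongly typed wrapper for a hypothetical Quote type. The Quote class, the wrapper, and the granularity values are all made up for this example, and the decode side (which would use the corresponding multi-field Decode method) is omitted; the built-in Struple methods in the source are the authoritative template.

// Hypothetical custom data type.
using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;

public class Quote
{
    public DateTime Time { get; set; }
    public double Price { get; set; }
}

// Hypothetical strongly typed wrapper built on the Struple support shown above.
public static class QuoteCodec
{
    public static byte[] Encode(IList<Quote> quotes)
    {
        // Project the custom type onto a Struple so the codec can encode it.
        var list = quotes
            .Select(q => new Struple<DateTime, Double> { Item1 = q.Time, Item2 = q.Price })
            .ToList();

        var args = new StrupleEncodingArgs<DateTime, Double>(list, level: CompressionLevel.Fastest)
            { Granularity1 = DateTime.MinValue.AddSeconds(1), Granularity2 = 0.01 };

        return DeflateCodec.Instance.Encode(args);
    }
}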

Design

The design of the framework is quite simple, but it is also very flexible.

A DeltaCodec is a combination of a transform and a finisher. In the example above we are using a trivial codec that uses a NullTransform to simply pass the data through to a DeflateFinisher. The latter uses built-in DeflateStream compression, but any other general purpose finisher can be used instead. The codec itself is only responsible for handling the serialization of results with header information required for decoding.

The framework not only simplifies usage, it also allows us to mix and match transforms and finishers to create custom codecs in endless variety.
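
In code, that composition might look something like the sketch below. The base-class constructor shown here is an assumption rather than the framework's actual wiring; the built-in DeflateDeltaCodec used in the next section is the real example to study.

// Rough sketch: pairing DeltaTransform with DeflateFinisher in a custom codec.
// The base constructor signature is an assumption, not the framework's actual API.
public class MyDeltaCodec : DeltaCodec
{
    public static readonly MyDeltaCodec Instance = new MyDeltaCodec();

    private MyDeltaCodec()
        : base(new DeltaTransform(), new DeflateFinisher())   // assumed wiring
    {
    }
}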

Performance

To give you an idea of what is available out-of-the-box, we'll show some output from a few of the many included performance tests that compare different codecs, data types, and parameter settings. We'll start with the codec described above, and then make a few simple adjustments to see how we can improve the results.

My company, Stability Systems LLC, develops highly optimized commercial codecs such as RandomWalkCodec (RWC). We use that as a benchmark since it shows just how high the bar can be raised in pure C# implementations.
Encoding IList<DateTime> 1,000,000 Elements (Granularity = seconds):
[Image: DateTimeBySeconds_SerialOptimal.png]

Now we're going to replace DeflateCodec with DeflateDeltaCodec. This replaces NullTransform with DeltaTransform. The data will now be differenced before being passed to the finisher.
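
Assuming DeflateDeltaCodec exposes the same Instance/Encode pattern as DeflateCodec above, the only change to the first example is the codec class:

var codec = DeflateDeltaCodec.Instance;
var bytes = codec.Encode(list);               // data is differenced, then deflated
var listOut = codec.Decode<DateTime>(bytes);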

[Image: DateTimeBySeconds_SerialDeltaNoFactorOptimal.png]

The Ratios/Multiples are all greatly improved, simply because we differenced the data. Unfortunately, the encoding and decoding speeds still leave much to be desired. Now let's throw a little parallelism into the mix and see what happens:

[Image: DateTimeBySeconds_ParallelDeltaGranularOptimal.png]

Notice that the field called Parts (partitions) has increased from 1 to 4. That indicates the number of blocks we are encoding in parallel. The speeds are now roughly 3X better. Not bad!
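
For illustration only, reusing the Struple example from above: the partition count is supplied along with the other arguments. The numBlocks parameter name below is hypothetical; check the encoding args class for the actual name of the setting.

// "numBlocks" is a hypothetical parameter name for the number of parallel partitions.
var args = new StrupleEncodingArgs<DateTime, Double>(list, numBlocks: 4, level: CompressionLevel.Fastest)
   { Granularity1 = DateTime.MinValue.AddDays(1), Granularity2 = 0.25 };
var bytes = codec.Encode(args);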

There are, of course, a tremendous variety of other possible optimizations, as evidenced by the performance of the benchmark, RWC. But this is just a framework and we leave it to your imagination to come up with clever transforms.

Complex Data Types

I'll show one more example here dealing with multi-field encoding.

[Image: Struple_15.png]

This shows a list of generic Struples (structure tuples) that has fields for most of the supported data types (support for char and string fields was just added, so those don't yet appear in the example). There is nothing much to say about this except that it works and is handy as a template for creating strongly-typed custom codec methods for your own complex data types.

Finally

Be sure to visit the Documentation wiki pages, and look over the docs included in the solution folder. The docs include discussion of Architecture and Usage, as well as extensive Performance results to sift through. And, of course, you can play around with the actual tests in Visual Studio.

NOTE: The output of a performance test is best viewed in NUnit or ReSharper; it doesn't format very well in the default Visual Studio display.
