Parquet.Net 3.2.0 Released

  • By Ivan Gavryliuk
  • In Apache Parquet  |  C#  |  Performance
  • Posted 25/10/2018

Parquet.Net 3.2.0 has been released, marking a new stage in its ability to serialize C# classes to Parquet files. Class serialization is one of the original Parquet.Net features that no other Parquet implementation supports today. It lets you take a C# class and simply write it to a file, or read a file into an array of classes. Some documentation is available here.

Making this work involves a lot of engineering challenges. One could always use reflection to walk over the class, fetch the data and put it into a file. That approach works, but it is extremely slow, especially with very large amounts of data. The only choice I had for making it extremely fast was to handcraft MSIL on the fly using System.Reflection.Emit. Although this guarantees extreme performance when written correctly, it takes a lot of time to prepare even a small amount of code. It's like coding in assembler in the old days, which is essentially what MSIL is: an assembly language for the .NET runtime.
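To make the trade-off concrete, here is a minimal sketch of both approaches for reading a single property. The Point class and the GetterDemo helper are made up for illustration and are not part of Parquet.Net; the emitted method is built once and then called through a delegate.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical sample type, not part of Parquet.Net.
public class Point
{
    public int X { get; set; }
}

public static class GetterDemo
{
    // Slow path: looks the property up and boxes the result on every call.
    public static int ReadWithReflection(Point p)
    {
        PropertyInfo prop = typeof(Point).GetProperty(nameof(Point.X));
        return (int)prop.GetValue(p);
    }

    // Fast path: emit a three-instruction method once and reuse the delegate.
    public static Func<Point, int> BuildGetter()
    {
        MethodInfo getX = typeof(Point).GetProperty(nameof(Point.X)).GetMethod;
        var dm = new DynamicMethod("GetX", typeof(int), new[] { typeof(Point) }, typeof(GetterDemo).Module);
        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);          // push the Point instance
        il.Emit(OpCodes.Callvirt, getX);   // call get_X, leaving an int on the stack
        il.Emit(OpCodes.Ret);              // return that int
        return (Func<Point, int>)dm.CreateDelegate(typeof(Func<Point, int>));
    }
}
```

Both paths return the same value; the difference only shows up when the getter is called millions of times.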

For instance, just to express the following trivial piece of C# code

for(int i = 0; i < 10; i++)
{
    Console.WriteLine(i);
}

you would have to write the following piece of MSIL manually:

IL_0000:  nop
IL_0001:  ldc.i4.0
IL_0002:  stloc.0     // i
IL_0003:  br.s        IL_0012
IL_0005:  nop
IL_0006:  ldloc.0     // i
IL_0007:  call        System.Console.WriteLine
IL_000C:  nop
IL_000D:  nop
IL_000E:  ldloc.0     // i
IL_000F:  ldc.i4.1
IL_0010:  add
IL_0011:  stloc.0     // i
IL_0012:  ldloc.0     // i
IL_0013:  ldc.i4.s    0A
IL_0015:  clt
IL_0017:  stloc.1
IL_0018:  ldloc.1
IL_0019:  brtrue.s    IL_0005
IL_001B:  ret

Now imagine that the whole serialization logic is written in MSIL: it's literally pages and pages of instructions.
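For comparison, this is roughly what emitting that loop through System.Reflection.Emit looks like. It is a sketch rather than code from Parquet.Net: the LoopEmitter class is a made-up name, and it uses a single blt branch instead of the clt/brtrue pair from the listing above.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical helper, not part of Parquet.Net.
public static class LoopEmitter
{
    // Builds a method equivalent to: for (int i = 0; i < 10; i++) Console.WriteLine(i);
    public static Action BuildLoop()
    {
        MethodInfo writeLine = typeof(Console).GetMethod(nameof(Console.WriteLine), new[] { typeof(int) });
        var dm = new DynamicMethod("PrintZeroToNine", typeof(void), Type.EmptyTypes, typeof(LoopEmitter).Module);
        ILGenerator il = dm.GetILGenerator();

        LocalBuilder i = il.DeclareLocal(typeof(int));
        Label check = il.DefineLabel();
        Label body = il.DefineLabel();

        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, i);          // i = 0
        il.Emit(OpCodes.Br, check);         // jump straight to the condition

        il.MarkLabel(body);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Call, writeLine);   // Console.WriteLine(i)
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, i);          // i = i + 1

        il.MarkLabel(check);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldc_I4, 10);
        il.Emit(OpCodes.Blt, body);         // repeat while i < 10
        il.Emit(OpCodes.Ret);

        return (Action)dm.CreateDelegate(typeof(Action));
    }
}
```

Calling LoopEmitter.BuildLoop()() runs the generated method.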

Another challenge with MSIL is that the compiler never warns you about mistakes. They surface only at runtime, which makes things extremely complicated to design. Loading a wrong instruction or leaving one stray value on the evaluation stack crashes the whole program instantly!
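A small sketch of that failure mode, using a made-up BrokenIlDemo class: the C# below compiles cleanly, yet the method it emits is invalid because it has to return an int and never pushes one. I would expect the runtime to reject it with an InvalidProgramException once the method is compiled or invoked, so the sketch wraps both steps in the same try block.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical demo, not part of Parquet.Net.
public static class BrokenIlDemo
{
    // Returns the name of the exception raised by the broken method, or "none".
    public static string TryRunBrokenMethod()
    {
        try
        {
            var dm = new DynamicMethod("Broken", typeof(int), Type.EmptyTypes, typeof(BrokenIlDemo).Module);
            ILGenerator il = dm.GetILGenerator();

            // Bug on purpose: the method must return an int,
            // but nothing is on the evaluation stack when ret executes.
            il.Emit(OpCodes.Ret);

            var broken = (Func<int>)dm.CreateDelegate(typeof(Func<int>));
            broken();
            return "none";
        }
        catch (InvalidProgramException ex)
        {
            return ex.GetType().Name;
        }
    }
}
```

Nothing in the build output hints at the problem; it only appears when the generated code is actually used.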

Despite that, writing MSIL can be fun. You really come to appreciate the level of detail involved, and you can save time by throwing out instructions that are not required and making micro-optimizations that are just crazy.

Repeatable Fields

Back to Parquet.Net: in v3.2.0 I've added support for repeatable fields. This means your C# class can now declare an array as a property:

public class MyClass
{
    public int[] Areas { get; set; }
}

and Parquet.Net will automatically pick it up and serialize it as a repeatable field, i.e. the whole array is put into a single cell. It sounds simple, but making it fast gets tricky. First of all, MSIL has no built-in way to enumerate a collection. It's just an assembly language that pushes values onto the evaluation stack, pops them off, performs basic operations like addition and multiplication, and puts the results back on the stack. Also, unlike in C#, you can't enumerate an array and a collection with the same piece of code (think of the foreach statement), because they are implemented differently in the CLR, so you have to make decisions. Enumerating a collection inside another collection is hard too: in this case I need to enumerate the classes first, access the class property, and then enumerate the array to expand it into flat Parquet values and repetition levels.
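Written in plain C# rather than MSIL, the nested enumeration looks roughly like the sketch below. RepeatedFlattener is a made-up name and this is not the Parquet.Net implementation; it assumes every Areas array is non-null and non-empty, since empty and null arrays also need definition levels, which are left out here. The first element of each class gets repetition level 0 and every following element gets level 1.

```csharp
using System.Collections.Generic;

public class MyClass
{
    public int[] Areas { get; set; }
}

// Hypothetical helper, not part of Parquet.Net.
public static class RepeatedFlattener
{
    // Expands the Areas arrays of all rows into one flat list of values
    // plus a parallel list of repetition levels.
    public static void Flatten(IEnumerable<MyClass> rows, List<int> values, List<int> repetitionLevels)
    {
        foreach (MyClass row in rows)          // outer enumeration: the classes
        {
            int level = 0;                     // 0 marks the start of a new row
            foreach (int area in row.Areas)    // inner enumeration: the array property
            {
                values.Add(area);
                repetitionLevels.Add(level);
                level = 1;                     // 1 marks a continuation of the same row
            }
        }
    }
}
```

For two rows holding { 1, 2 } and { 3 }, the flat values are 1, 2, 3 and the repetition levels are 0, 1, 0. The generated MSIL has to do the same work, one instruction at a time.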

The Parquet.Net codebase now includes some MSIL helpers. For instance, this is a short version of generating an array enumerator (i.e. the for(int i = 0; i < max; i++) line in C#):

   // Opcode names are unqualified here, which assumes
   // "using static System.Reflection.Emit.OpCodes;" at the top of the file.
   public static IDisposable ForLoop(this ILGenerator il, LocalBuilder max, out LocalBuilder loopCounter)
   {
      loopCounter = il.DeclareLocal(typeof(int)); //loop counter
#if DEBUG
      il.EmitWriteLine("for-begin");
#endif
      //counter = 0 (locals are passed as LocalBuilder so the operand is encoded with the right size)
      il.Emit(Ldc_I4_0);
      il.Emit(Stloc, loopCounter);

      Label lBody = il.DefineLabel();     //loop body start
      Label lExit = il.DefineLabel();     //final and return

      il.MarkLabel(lBody); //loop body starts here

      //load counter and max and compare them
      il.Emit(Ldloc, loopCounter);
      il.Emit(Ldloc, max);
      //il.Emit(Ldc_I4_3);
      il.Emit(Clt);   //1 - less, 0 - equal or more (exit)
      il.Emit(Brfalse, lExit);  //jump to exit on 0

#if DEBUG
      il.EmitWriteLine("  for-loop");
#endif

      // loop body executes here

      //an out parameter can't be captured by a lambda, so copy it to a local first
      LocalBuilder loopCounterAfter = loopCounter;
      return il.After(() =>
      {
         //increment loop counter
         il.Emit(Ldc_I4_1);
         il.Emit(Ldloc, loopCounterAfter);
         il.Emit(Add);
         il.Emit(Stloc, loopCounterAfter);

         //loop again
         il.Emit(Br, lBody);

         il.MarkLabel(lExit);
#if DEBUG
         il.EmitWriteLine("for-end");
#endif
      });
   }
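The After helper is not shown in this post. The sketch below assumes it simply returns an IDisposable that runs the given action on Dispose, so that a using block can wrap the emitted loop body. The ScopeDemo class, its After stand-in and the compact ForLoop variant are all illustrative, not the Parquet.Net code.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical sketch, not part of Parquet.Net.
public static class ScopeDemo
{
    private sealed class CallOnDispose : IDisposable
    {
        private readonly Action _action;
        public CallOnDispose(Action action) { _action = action; }
        public void Dispose() { _action(); }
    }

    // Assumed behaviour of After: run emitTail when the using block ends.
    public static IDisposable After(this ILGenerator il, Action emitTail)
    {
        return new CallOnDispose(emitTail);
    }

    // Compact variant of ForLoop without the debug output.
    public static IDisposable ForLoop(this ILGenerator il, LocalBuilder max, out LocalBuilder loopCounter)
    {
        LocalBuilder i = il.DeclareLocal(typeof(int));
        loopCounter = i;
        Label head = il.DefineLabel();
        Label exit = il.DefineLabel();

        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, i);            // i = 0
        il.MarkLabel(head);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldloc, max);
        il.Emit(OpCodes.Clt);
        il.Emit(OpCodes.Brfalse, exit);       // leave the loop when i >= max

        return il.After(() =>
        {
            il.Emit(OpCodes.Ldloc, i);
            il.Emit(OpCodes.Ldc_I4_1);
            il.Emit(OpCodes.Add);
            il.Emit(OpCodes.Stloc, i);        // i = i + 1
            il.Emit(OpCodes.Br, head);
            il.MarkLabel(exit);
        });
    }

    // Builds: int Sum(int n) { int sum = 0; for (int i = 0; i < n; i++) sum += i; return sum; }
    public static Func<int, int> BuildSum()
    {
        var dm = new DynamicMethod("Sum", typeof(int), new[] { typeof(int) }, typeof(ScopeDemo).Module);
        ILGenerator il = dm.GetILGenerator();
        LocalBuilder max = il.DeclareLocal(typeof(int));
        LocalBuilder sum = il.DeclareLocal(typeof(int));

        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Stloc, max);          // max = n
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, sum);          // sum = 0

        using (il.ForLoop(max, out LocalBuilder i))
        {
            // Everything emitted here becomes the loop body.
            il.Emit(OpCodes.Ldloc, sum);
            il.Emit(OpCodes.Ldloc, i);
            il.Emit(OpCodes.Add);
            il.Emit(OpCodes.Stloc, sum);      // sum += i
        }

        il.Emit(OpCodes.Ldloc, sum);
        il.Emit(OpCodes.Ret);
        return (Func<int, int>)dm.CreateDelegate(typeof(Func<int, int>));
    }
}
```

The using block reads almost like the C# loop it generates, which is the point of returning IDisposable: the loop header is emitted on entry and the increment, jump and exit label are emitted when the block is disposed.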

or a foreach loop:

   public static IDisposable ForEachLoop(this ILGenerator il, Type elementType, LocalBuilder collection, out LocalBuilder currentElement)
   {
      //resolve the non-generic enumeration methods via reflection
      TypeInfo iEnumerable = typeof(IEnumerable).GetTypeInfo();
      TypeInfo iEnumerator = typeof(IEnumerator).GetTypeInfo();

      MethodInfo getEnumeratorMethod = iEnumerable.GetDeclaredMethod(nameof(IEnumerable.GetEnumerator));
      MethodInfo moveNextMethod = iEnumerator.GetDeclaredMethod(nameof(IEnumerator.MoveNext));
      MethodInfo getCurrentMethod = iEnumerator.GetDeclaredProperty(nameof(IEnumerator.Current)).GetMethod;

      Label lMoveNext = il.DefineLabel();
      Label lWork = il.DefineLabel();
      currentElement = il.DeclareLocal(elementType);

      //get collection enumerator
      LocalBuilder enumerator = il.DeclareLocal(typeof(IEnumerator));
#if DEBUG
      il.EmitWriteLine("foreach-begin");
#endif
      il.Emit(Ldloc, collection);
      il.Emit(Callvirt, getEnumeratorMethod);
      il.Emit(Stloc, enumerator);

      //immediately move to "move next" to start enumeration
      il.Emit(Br, lMoveNext);

      //iteration work block
      il.MarkLabel(lWork);
#if DEBUG
      il.EmitWriteLine("  foreach-loop");
#endif

      //get current element; Current returns object, so unbox or cast it to the element type
      il.Emit(Ldloc, enumerator);
      il.Emit(Callvirt, getCurrentMethod);
      il.Emit(Unbox_Any, elementType);
      il.Emit(Stloc, currentElement);

      return il.After(() =>
      {
         //"move next" block
         il.MarkLabel(lMoveNext);
         il.Emit(Ldloc, enumerator);  //load enumerator as an argument
         il.Emit(Callvirt, moveNextMethod);
         il.Emit(Brtrue, lWork);   //if result is true, go to lWork

         //otherwise, dispose enumerator and exit
         //il.Emit(Ldloc, enumerator);
         //il.Emit(Callvirt, disposeMethod);
#if DEBUG
         il.EmitWriteLine("foreach-end");
#endif
      });
   }
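To see the same GetEnumerator / MoveNext / Current sequence in isolation, here is a standalone sketch that emits a method summing any non-generic IEnumerable of boxed ints. ForEachDemo is a made-up class, and the code is a simplified illustration rather than what Parquet.Net generates.

```csharp
using System;
using System.Collections;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical demo, not part of Parquet.Net.
public static class ForEachDemo
{
    // Builds: int SumAll(IEnumerable items) { int sum = 0; foreach (int x in items) sum += x; return sum; }
    public static Func<IEnumerable, int> BuildSum()
    {
        MethodInfo getEnumerator = typeof(IEnumerable).GetMethod(nameof(IEnumerable.GetEnumerator));
        MethodInfo moveNext = typeof(IEnumerator).GetMethod(nameof(IEnumerator.MoveNext));
        MethodInfo getCurrent = typeof(IEnumerator).GetProperty(nameof(IEnumerator.Current)).GetMethod;

        var dm = new DynamicMethod("SumAll", typeof(int), new[] { typeof(IEnumerable) }, typeof(ForEachDemo).Module);
        ILGenerator il = dm.GetILGenerator();
        LocalBuilder enumerator = il.DeclareLocal(typeof(IEnumerator));
        LocalBuilder sum = il.DeclareLocal(typeof(int));
        Label lMoveNext = il.DefineLabel();
        Label lWork = il.DefineLabel();

        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, sum);                  // sum = 0

        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Callvirt, getEnumerator);
        il.Emit(OpCodes.Stloc, enumerator);           // enumerator = items.GetEnumerator()
        il.Emit(OpCodes.Br, lMoveNext);               // start with MoveNext

        il.MarkLabel(lWork);
        il.Emit(OpCodes.Ldloc, sum);
        il.Emit(OpCodes.Ldloc, enumerator);
        il.Emit(OpCodes.Callvirt, getCurrent);        // object Current
        il.Emit(OpCodes.Unbox_Any, typeof(int));      // unbox to int
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, sum);                  // sum += (int)Current

        il.MarkLabel(lMoveNext);
        il.Emit(OpCodes.Ldloc, enumerator);
        il.Emit(OpCodes.Callvirt, moveNext);
        il.Emit(OpCodes.Brtrue, lWork);               // keep going while MoveNext() is true

        il.Emit(OpCodes.Ldloc, sum);
        il.Emit(OpCodes.Ret);
        return (Func<IEnumerable, int>)dm.CreateDelegate(typeof(Func<IEnumerable, int>));
    }
}
```

The same delegate accepts both an int[] and a List<int>, because both implement IEnumerable. The price is a boxed object per element, which is why a dedicated index-based loop is worth having for arrays.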

In general, repeatable fields were the biggest hurdle to clear before starting the design of more complicated types such as Lists, Maps and Structs, so stay tuned for those in future releases.

Don't forget to subscribe to this blog for future updates and my other ramblings on technology as well.

