What's coming in Parquet.Net 3.1.2

  • By Ivan Gavryliuk
  • In C#  |  Apache Parquet  |  Parquet.Net  |  Big Data
  • Posted 09/10/2018

v3.1.2 will be the next minor release of Apache Parquet for .NET, and it mostly focuses on improving the row-based utilities. It also takes the first steps towards integrating the library with JSON: the .ToString() method of the Table and Row classes now produces multiline JSON output by default instead of the internal format.

The internal format was fine for most simple scenarios, but large structures were hard to read. For instance, try to make sense of the following output:

{[London;Derby;Paris;New York];1}

It is clearly a list of cities and a number, probably an ID of some sort, but even in this simplest case nothing tells you which fields you are looking at. v3.1.2 prints the following instead:

{'cities': ['London', 'Derby', 'Paris', 'New York'], 'id': 1}

which makes it clear that the list comes from a column called cities, and that the number is indeed an id field with a value of 1.

More complicated structures were even harder to read. For instance, the following row, shown here in the new format, would have been completely cryptic in previous versions:

{'addresses': [{'line1': 'Dante Road', 'name': 'Head Office', 'openingHours': [9, 10, 11, 12, 13, 14, 15, 16, 17, 18], 'postcode': 'SE11'}, {'line1': 'Somewhere Else', 'name': 'Small Office', 'openingHours': [6, 7, 19, 20, 21, 22, 23], 'postcode': 'TN19'}], 'cities': ['London', 'Derby'], 'comment': 'this file contains all the permunations for nested structures and arrays to test Parquet parser', 'id': 1, 'location': {'latitude': 51.2, 'longitude': 66.3}, 'price': {'lunch': {'max': 2, 'min': 1}}}

or reformatted:

{
   "addresses":[
      {
         "line1":"Dante Road",
         "name":"Head Office",
         "openingHours":[
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16,
            17,
            18
         ],
         "postcode":"SE11"
      },
      {
         "line1":"Somewhere Else",
         "name":"Small Office",
         "openingHours":[
            6,
            7,
            19,
            20,
            21,
            22,
            23
         ],
         "postcode":"TN19"
      }
   ],
   "cities":[
      "London",
      "Derby"
   ],
   "comment":"this file contains all the permunations for nested structures and arrays to test Parquet parser",
   "id":1,
   "location":{
      "latitude":51.2,
      "longitude":66.3
   },
   "price":{
      "lunch":{
         "max":2,
         "min":1
      }
   }
}

As you can see, this is much more understandable.

Note that by default Parquet.Net uses single quotes, which makes the output invalid JSON. This is mostly fine, as the output is intended to be readable rather than valid. The quote character is the only thing that makes it invalid, and you can override the behavior by passing a format string, .ToString("j"), which produces output parseable by JSON.NET or any other JSON reader. The only reason for defaulting to single quotes is that they are easier to embed in tests when working in C#, where double quotes would otherwise have to be escaped all the time.
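To make the trade-off concrete, here is a minimal sketch that does not touch Parquet.Net at all. It compares the two quote styles as C# string literals and checks each against a strict JSON parser. The class name QuoteStyleDemo and the helper IsStrictJson are made up for illustration, and System.Text.Json is my choice of strict parser here, not something the library uses or requires.

```csharp
using System;
using System.Text.Json;

public static class QuoteStyleDemo
{
    // Hypothetical helper: returns true when the text parses as strict JSON.
    public static bool IsStrictJson(string text)
    {
        try
        {
            // JsonDocument is disposable, so parse inside a using statement.
            using (JsonDocument.Parse(text)) { return true; }
        }
        catch (JsonException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Default style: pastes into a C# string literal without any escaping.
        string readable = "{'cities': ['London', 'Derby'], 'id': 1}";

        // Strict JSON style: every double quote has to be escaped by hand.
        string strict = "{\"cities\": [\"London\", \"Derby\"], \"id\": 1}";

        Console.WriteLine(IsStrictJson(readable)); // False: single quotes are rejected
        Console.WriteLine(IsStrictJson(strict));   // True
    }
}
```

In a unit test you would typically compare the default .ToString() output against a literal like readable, and switch to .ToString("j") only when the text has to be handed to a real JSON parser.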

We are also working on adding a convert command to the .NET Core Global tool that can pretty-print Parquet files as JSON documents (the screenshot below is from a preview version and the formatting is a little bit broken, but you'll get the idea):

You can also launch the tool with the --no-color switch to produce multi-line JSON that is readable by Apache Spark or other big data systems working with JSON:


Thanks for reading. If you would like to follow future posts, please subscribe to my RSS feed and/or follow me on Twitter.