- By Ivan Gavryliuk
- Posted 07/03/2018
Apache Parquet for .NET has come a long, long way since the original idea in June 2017 (the first commit dates back to June 5).
The first version was a truly naive attempt to implement a native, fully managed Apache Parquet reader and writer to use for many of our clients. It started as my personal project but has since been moved/mirrored to another company's repo for various reasons, while remaining open-source. It was almost complete, except for design limitations that didn't allow us to implement more complicated data structures like Maps, Structs or Lists, especially with crazy nesting schemes. That's how V2 started: it maintained compatibility for the most part, but was totally rewritten internally to support anything we like.
Version 2 is almost perfect in this respect and brings Parquet support to a totally new level. We support literally everything, the whole Parquet specification. Moreover, we maintain compatibility with Cloudera Impala and Apache Drill, which is sometimes quite challenging, to be honest, as they don't always conform to the spec. Although we made it, this is not the end of the project, as it turns out it's not perfect (from my pedantic point of view):
- Parquet .NET uses a lot of boxing. The original idea was to abstract the user from Parquet's internal structure as much as possible and expose the data in a convenient way, i.e. as a structure of rows. Although it worked out nicely, we've realised that rows cannot be represented efficiently, as every row holds a mixture of data types not known beforehand, so value types end up boxed. This is going to change dramatically in V3, but I'll keep it a secret for now.
- Parquet .NET is way too abstract. It's actually perfect for people who don't want to get into Parquet internals and just want to read/write a file, but it imposes quite an inconvenient limitation on those who want to squeeze out maximum performance. At the end of the day, Parquet is all about performance. Again, this is going to change, but in a good way.
- Although it's too abstract, some tasks are actually hard to achieve. The dictatorship of an "easy-to-use" domain model imposes some usability restrictions. Most of the tasks (as we have realised by now) have to do with converting predefined POCOs to Parquet or the other way around. At the moment, although possible, that's quite a challenging task, and one may even resort to reflection in doing so. Reflection is actually quite slow, even if you know how to squeeze performance out of it, so you are losing the battle again. V3 (or maybe even V4, I don't know) will dramatically simplify this by allowing you to serialise/deserialise native C# classes to/from Parquet files. In doing so we're using some MSIL code generation magic, which has proven to be around 95 times faster than reflection, although this figure may change, probably for the better.
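To make the reflection point above a bit more concrete, here is a minimal sketch of the two ways to set a property on a POCO: through reflection, and through a setter compiled once from an expression tree. This is not how Parquet .NET does it (the library's MSIL generation is not shown in this post); expression trees are swapped in as a simpler stand-in because `Compile()` also ends up emitting IL. The `Reading` class and all method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical POCO, used only for this illustration.
public class Reading
{
    public int Id { get; set; }
    public double Value { get; set; }
}

public static class SetterDemo
{
    // Reflection path: the value travels as object (boxing value types)
    // and the property is looked up and invoked dynamically on every call.
    public static void SetViaReflection(object target, string propertyName, object value)
    {
        PropertyInfo property = target.GetType().GetProperty(propertyName);
        property.SetValue(target, value);
    }

    // Compiled path: build "target.Property = value" once as an expression tree;
    // Compile() turns it into a strongly typed delegate backed by generated IL.
    public static Action<T, TValue> CompileSetter<T, TValue>(string propertyName)
    {
        ParameterExpression target = Expression.Parameter(typeof(T), "target");
        ParameterExpression value = Expression.Parameter(typeof(TValue), "value");
        MemberExpression property = Expression.Property(target, propertyName);
        BinaryExpression assign = Expression.Assign(property, value);
        return Expression.Lambda<Action<T, TValue>>(assign, target, value).Compile();
    }

    // Copies a typed column array into one property of each row object.
    // No boxing happens here because TValue stays strongly typed throughout.
    public static void FillColumn<T, TValue>(T[] rows, TValue[] column, Action<T, TValue> setter)
    {
        for (int i = 0; i < rows.Length; i++)
        {
            setter(rows[i], column[i]);
        }
    }
}
```

The idea is to pay the compilation cost once per property, e.g. `var setId = SetterDemo.CompileSetter<Reading, int>("Id");`, and then reuse the delegate for every row with `SetterDemo.FillColumn(rows, ids, setId);`. How much faster this is than reflection depends on the runtime and workload, so measure it rather than assuming a particular figure.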
If you have any thoughts on Parquet in general or our C# library, please feel free to comment here or on our GitHub repo.
Thanks for reading. If you would like to follow up with future posts, please subscribe to my RSS feed and/or follow me on Twitter.