Apache Parquet on .NET

  • By Ivan Gavryliuk
  • In Big Data  |  Apache Parquet
  • Posted 28/07/2017


If you work in Big Data, you know about the Apache Parquet format. It's the de facto standard for storing enormous amounts of data on big data processing clusters such as Apache Spark. What makes it so good is that all data is stored in binary form, in the most efficient way possible. Unlike formats such as JSON, CSV or XML, which are mostly human-readable, Parquet prioritises data compaction and speed of access over visibility.

I'm not even going to compare it with JSON or CSV, as it's tens of times smaller and faster, hence its huge adoption by the big data community.

As big data solutions mostly run on JVM platforms, Parquet has never been available to the .NET/C# community (at least until now).

Why Parquet?

Why do we need it on the .NET platform if big data solutions don't support .NET?


First of all, scientific algorithms may run on products like Spark, written in Scala, Python or Java, but they still need to get data from somewhere. Those languages are not dominant in traditional software development environments because they are not designed for many scenarios (imagine writing a web site in Scala: it's a nightmare!). If you need to export data for analysis, today's solution is to write it out in a more traditional format like CSV or JSON, import it into Spark (or convert it to Parquet first) and then process it. Considering that Spark resources cost considerably more than traditional compute resources, this is just not cost-effective and often really slow.

Data cleansing

Spark and its peers are really good at solving scientific problems, but they are just not designed or suited for preparing data for processing, a step known as cleansing. It is much easier to do that in C#, with its rich capabilities and straightforward approach, not to mention the speed. Wouldn't it be nice if you could also write the data out in the native Parquet format, skipping the long conversion steps?

Bringing .NET to Big Data

Although .NET was late to this game, some good solutions already exist; take, for instance, Azure Data Lake or Azure Data Factory from Microsoft. Neither of them supports Parquet for data processing yet.


Recently, together with another company I was working with on other projects, I decided to tackle this problem. I have created a complete solution in the form of an open-source .NET library. What's great is that it has no external dependencies, works both on classic desktop .NET and .NET Core, and supports the Parquet format almost fully.
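To give you a quick taste, here is a minimal sketch of writing a Parquet file with the library. It is based on the library's early API; the class names (DataSet, SchemaElement, ParquetWriter) and the exact signatures may differ in later releases, so treat this as an illustration rather than reference documentation:

```csharp
using System.IO;
using Parquet;
using Parquet.Data;

// A minimal sketch: build a two-column dataset and write it out as Parquet.
// NOTE: these type names reflect an early version of the library and
// may have changed in current releases.
var ds = new DataSet(
   new SchemaElement<int>("id"),
   new SchemaElement<string>("city"));

ds.Add(1, "London");
ds.Add(2, "Derby");

using (Stream fs = File.OpenWrite("cities.parquet"))
{
   using (var writer = new ParquetWriter(fs))
   {
      writer.Write(ds);
   }
}
```

The resulting file can then be read straight into Spark or any other Parquet-aware tool, with no intermediate CSV/JSON conversion step.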

See GitHub project for more details.

The possibilities are endless. We are already using it in real projects for data transformation and cleansing, and it saves a lot of money in cases where Apache Spark is simply too expensive or unsuitable for those kinds of tasks.

Thanks for reading. If you would like to follow future posts, please subscribe to my RSS feed and/or follow me on Twitter.