How to Extract a ZIP Archive in Parallel

  • By Ivan Gavryliuk
  • In C#
  • Posted 03/01/2018

These days the .NET platform has built-in support for ZIP archives in the System.IO.Compression namespace. I find it exciting, as there is no need to depend on a popular third-party library, and native support from Microsoft is more encouraging.

One limitation that we've hit using this library is that you cannot extract files in parallel: ZipArchive is not thread safe. It's sort of understandable, as the .zip format wasn't designed for parallel processing. Internally an archive is a sequence of per-file records read through a single stream, and jumping between them from multiple threads is not a good option, as a stream has only one position and file handles are not thread safe by nature. In fact, this is the primary limitation.

I've tried many answers on StackOverflow which either never worked for me or were overcomplicated/overengineered, so I'm attaching a code fragment that solves this problem here:

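   using System;
   using System.Collections.Concurrent;
   using System.Collections.Generic;
   using System.IO;
   using System.IO.Compression;
   using System.Linq;
   using System.Text;
   using System.Threading;
   using System.Threading.Tasks;

   public class ParallelZipArchive : IDisposable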
   {
      private readonly string _filePath;

      public ParallelZipArchive(string filePath)
      {
         _filePath = filePath;
      }

      public IReadOnlyCollection<string> GetEntries()
      {
         using (FileStream fs = File.OpenRead(_filePath))
         {
            using (var archive = new ZipArchive(fs, ZipArchiveMode.Read, true))
            {
               return archive.Entries.Select(e => e.FullName).ToList();
            }
         }
      }

      public Dictionary<string, string> Extract(IEnumerable<string> entries, int maxDop, int maxFilesPerThread, CancellationToken cancellationToken)
      {
         var result = new ConcurrentDictionary<string, string>();

         //each batch is extracted sequentially by a single worker
         IEnumerable<IEnumerable<string>> batched = entries.Chunk(maxFilesPerThread);

         try
         {
            Parallel.ForEach(
               batched,
               new ParallelOptions { MaxDegreeOfParallelism = maxDop, CancellationToken = cancellationToken },
               batch => ExtractSequential(batch, result, cancellationToken));
         }
         catch(OperationCanceledException)
         {
            //when the task is cancelled it's fine to ignore it and return whatever was extracted so far
            //(log is the application's own logger, not shown here)
            log.Trace("zip extraction cancelled");
         }

         return new Dictionary<string, string>(result);
      }

      private void ExtractSequential(IEnumerable<string> entries, ConcurrentDictionary<string, string> result, CancellationToken cancellationToken)
      {
         //every worker opens its own stream and archive, so no state is shared between threads
         using (FileStream fs = File.Open(_filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
         {
            using (var archive = new ZipArchive(fs, ZipArchiveMode.Read, true))
            {
               foreach (string entry in entries)
               {
                  if (cancellationToken.IsCancellationRequested) return;

                  ZipArchiveEntry ze = archive.GetEntry(entry);
                  if (ze == null) continue; //GetEntry returns null when the archive has no such entry

                  using (Stream es = ze.Open())
                  {
                     //entries are assumed to contain UTF-8 text
                     byte[] data = es.ToByteArray();
                     string s = Encoding.UTF8.GetString(data);

                     result[entry] = s;
                  }
               }
            }
         }
      }

      public void Dispose()
      {
         //nothing to release: every call opens and disposes its own stream
      }
   }

This code uses the following extension methods:

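/// <summary>
/// Split sequence in batches of specified size
/// </summary>
/// <typeparam name="T">Element type</typeparam>
/// <param name="source">Enumeration source</param>
/// <param name="chunkSize">Size of the batch chunk, must be positive</param>
/// <returns>Batches of at most chunkSize elements, in source order</returns>
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> source, int chunkSize)
{
      if(source == null) throw new ArgumentNullException(nameof(source));
      // a non-positive size would otherwise never make progress
      if(chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

      // walk the source once instead of re-enumerating it with Skip for every batch
      var chunk = new List<T>();
      foreach(T item in source)
      {
            chunk.Add(item);
            if(chunk.Count == chunkSize)
            {
                  yield return chunk;
                  chunk = new List<T>();
            }
      }

      // the last batch may be smaller than chunkSize
      if(chunk.Count > 0) yield return chunk;
}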
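
A side note on the batching helper: if the project targets .NET 6 or later, System.Linq already ships an Enumerable.Chunk method that splits a sequence into arrays of a given size, and a hand-written extension with the same name may clash with it. A minimal sketch of the built-in one, assuming .NET 6+ (the ChunkDemo class and the BatchSizes helper are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkDemo
{
    // Returns the size of every batch produced by the built-in Enumerable.Chunk (.NET 6+).
    public static List<int> BatchSizes(IEnumerable<string> names, int chunkSize)
    {
        // Each batch is a string[]; the last one may be shorter than chunkSize.
        return names.Chunk(chunkSize).Select(batch => batch.Length).ToList();
    }

    public static void Main()
    {
        string[] names = { "a.txt", "b.txt", "c.txt", "d.txt", "e.txt", "f.txt", "g.txt" };
        Console.WriteLine(string.Join(", ", BatchSizes(names, 3))); // prints "3, 3, 1"
    }
}
```

On older frameworks the built-in method does not exist, so an extension like the one above is still needed there.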

/// <summary>
/// Reads all stream in memory and returns as byte array
/// </summary>
public static byte[] ToByteArray(this Stream stream)
{
      if(stream == null) return null;
      using(var ms = new MemoryStream())
      {
            stream.CopyTo(ms);
            return ms.ToArray();
      }
}
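
Since the entries are decoded as UTF-8 text right after being read, an alternative to ToByteArray plus Encoding.UTF8.GetString is to wrap the entry stream in a StreamReader. This is not what the class above does, just a sketch of the same step; the EntryTextDemo class, the RoundTrip helper and the note.txt entry name are made up, and the archive is built in memory only to keep the example self-contained:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class EntryTextDemo
{
    // Writes `text` into a one-entry in-memory archive and reads it back as a string.
    public static string RoundTrip(string text)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen keeps the MemoryStream usable after the archive is disposed
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
            using (var writer = new StreamWriter(archive.CreateEntry("note.txt").Open()))
            {
                writer.Write(text); // StreamWriter defaults to UTF-8 without a BOM
            }

            buffer.Position = 0;
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Read))
            using (var reader = new StreamReader(archive.GetEntry("note.txt").Open(), Encoding.UTF8))
            {
                // decodes the decompressed bytes as UTF-8 without an intermediate byte array
                return reader.ReadToEnd();
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("hello zip"));
    }
}
```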

The idea here is really simple: every worker thread opens the same file through its own FileStream and ZipArchive, so nothing is shared between threads and the extraction becomes thread safe. maxDop is the maximum degree of parallelism, i.e. the maximum number of threads to use, and maxFilesPerThread specifies how many entries go into each batch that a thread extracts from one open archive. The second value exists because opening a zip file is itself not a cheap operation, so it pays to reuse the same open archive for a few files. I found that setting those numbers to something like 20, 20 achieves the maximum performance on an average server.
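
To try the idea in isolation, here is a self-contained sketch of the same technique that builds a throwaway archive and extracts it in parallel. It is a simplified stand-in, not the class above: ParallelZipDemo, CreateSampleZip, ExtractAll and the sample file names are made up for illustration, cancellation is left out, and the values 4 and 10 are arbitrary rather than tuned.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class ParallelZipDemo
{
    // Builds a temporary archive with `count` small text entries (demo data only).
    public static string CreateSampleZip(int count)
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".zip");
        using (FileStream fs = File.Create(path))
        using (var archive = new ZipArchive(fs, ZipArchiveMode.Create))
        {
            for (int i = 0; i < count; i++)
            {
                using (var writer = new StreamWriter(archive.CreateEntry($"file{i}.txt").Open()))
                {
                    writer.Write($"content {i}");
                }
            }
        }
        return path;
    }

    // Extracts every entry as text; batchSize must be positive.
    public static Dictionary<string, string> ExtractAll(string zipPath, int maxDop, int batchSize)
    {
        List<string> names;
        using (FileStream fs = File.OpenRead(zipPath))
        using (var archive = new ZipArchive(fs, ZipArchiveMode.Read))
        {
            names = archive.Entries.Select(e => e.FullName).ToList();
        }

        // Split the entry names into batches of at most batchSize.
        var batches = new List<List<string>>();
        for (int i = 0; i < names.Count; i += batchSize)
        {
            batches.Add(names.GetRange(i, Math.Min(batchSize, names.Count - i)));
        }

        var result = new ConcurrentDictionary<string, string>();
        Parallel.ForEach(
            batches,
            new ParallelOptions { MaxDegreeOfParallelism = maxDop },
            batch =>
            {
                // Each batch gets its OWN stream and archive, so no state is shared.
                using (FileStream fs = File.Open(zipPath, FileMode.Open, FileAccess.Read, FileShare.Read))
                using (var archive = new ZipArchive(fs, ZipArchiveMode.Read))
                {
                    foreach (string name in batch)
                    {
                        using (var reader = new StreamReader(archive.GetEntry(name).Open(), Encoding.UTF8))
                        {
                            result[name] = reader.ReadToEnd();
                        }
                    }
                }
            });

        return new Dictionary<string, string>(result);
    }

    public static void Main()
    {
        string path = CreateSampleZip(50);
        try
        {
            Dictionary<string, string> files = ExtractAll(path, maxDop: 4, batchSize: 10);
            string sample = files["file7.txt"];
            Console.WriteLine($"{files.Count} entries extracted, file7.txt contains: {sample}");
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The file is opened with FileShare.Read so that several readers can hold it open at the same time. The best values for the two numbers depend on the archive and the machine, so they are worth measuring rather than copying.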

