
Using dotnet to read gzip streams from S3

TL;DR - There's a bug in the dotnet gzip compression library, and it doesn't seem like it's going to be fixed anytime soon.

So I came across a bug recently when reading gzip streams.

I was working on a project where the logs from an ALB were being stored in S3. Periodically, my code would call S3, read the streams and process them into Elasticsearch. What I found was that the official dotnet gzip library would only read about the first 6 or 7 lines of each file. I was pulling my hair out trying different things, convinced I had somehow forgotten to flush a stream or copied the decompressed stream incorrectly.

My code was very simple:

// Download the object and decompress it with System.IO.Compression.GZipStream
var s3Object = _s3Client.GetObjectAsync(s3Notification.S3.Bucket.Name, s3Notification.S3.Object.Key).GetAwaiter().GetResult();
using (var decompressionStream = new GZipStream(s3Object.ResponseStream, CompressionMode.Decompress))
using (var decompressedStream = new MemoryStream())
{
    decompressionStream.CopyTo(decompressedStream);
    decompressionStream.Flush();
    decompressedStream.Flush();
    decompressedStream.Position = 0;

    // Read the decompressed log file line by line
    string line;
    var lines = new List<string>();
    using (var sr = new StreamReader(decompressedStream))
        while ((line = sr.ReadLine()) != null)
            lines.Add(line);
}

It had me scratching my head for a long time, and it turns out there's a bug in the dotnet decompression library. It seems like there are a lot of people out there struggling with this.
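One explanation that comes up a lot (I can't say for certain it's exactly what was happening with my ALB files) is that GZipStream stops after the first member of a multi-member gzip file - a .gz file is allowed to be several gzip members concatenated together. Here's a small, self-contained sketch you can use to check the behaviour on your own runtime; the member contents and class name are just made up for illustration:

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

class GzipMemberRepro
{
    // Compress a single string into one standalone gzip member
    static byte[] GzipMember(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            using (var writer = new StreamWriter(gz))
                writer.Write(text);
            return buffer.ToArray();
        }
    }

    static void Main()
    {
        // Two independent gzip members concatenated into one payload -
        // still a valid .gz file as far as RFC 1952 is concerned
        var payload = GzipMember("first member\n")
            .Concat(GzipMember("second member\n"))
            .ToArray();

        using (var gz = new GZipStream(new MemoryStream(payload), CompressionMode.Decompress))
        using (var reader = new StreamReader(gz))
            Console.WriteLine(reader.ReadToEnd());
        // Affected runtimes print only "first member"; if you see both
        // lines, your runtime handles multi-member gzip files correctly.
    }
}

If that prints both lines for you, this particular issue has been addressed on your framework version and your problem may lie elsewhere.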

What I did was use the SharpZipLib NuGet library instead, which worked like a charm, and I didn't actually have to change the code too much:

// Stream the object from S3 and decompress it with SharpZipLib's GZipInputStream
using (var response = _s3Client.GetObjectStreamAsync(s3Object.BucketName, s3Object.Key, null).GetAwaiter().GetResult())
{
    byte[] dataBuffer = new byte[4096];
    using (var gzipStream = new GZipInputStream(response))
    using (var decodedStream = new MemoryStream())
    {
        StreamUtils.Copy(gzipStream, decodedStream, dataBuffer);
        gzipStream.Flush();
        decodedStream.Flush();
        decodedStream.Position = 0;

        using (var reader = new StreamReader(decodedStream))
        {
            // Read every line, parse it and index the results into Elasticsearch,
            // then delete the processed object from S3
            var lines = new List<string>();
            while (!reader.EndOfStream)
                lines.Add(reader.ReadLine());

            var data = lines.Select(x => _parser.Parse(x));
            _elasticSearchAdapter.CreateElasticRecord(data);
            _s3Client.DeleteObjectAsync(s3Object.BucketName, s3Object.Key).GetAwaiter().GetResult();
        }
    }
}
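In case it's useful, SharpZipLib is a single package on NuGet, and the types used above live in ICSharpCode.SharpZipLib.GZip (GZipInputStream) and ICSharpCode.SharpZipLib.Core (StreamUtils):

dotnet add package SharpZipLib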

I will blog about the solution I created for writing logs to Elasticsearch using dotnet and serverless Lambda in a later post. It was a proof of concept, but it seems to work quite well!

In the past I've used Logstash to process logs into Elasticsearch, but I wanted something that I could run serverless and only process when I wanted to, which is why I went down this route.

Written on November 21, 2018.