Claus Witt

We had some code that read partial responses from a server. This code was tested thoroughly and worked as intended. We got the number of bytes we needed from the head of a file, and were then able to parse that data as a binary blob.

However, when we put this into production, containers started crashing randomly.

Or rather, the code had been in production for years without a crash, but now we were using this code path many times an hour instead of a couple of times per day.

The culprit was hard to find, but it all relates to how Ruby works, how networking works, and how I thought it worked more like it does in C.

The old code did something like this:

bytes = nil

uri = URI(url)
begin
  http = Net::HTTP.new(uri.host, uri.port)
  if url.start_with?('https')
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  end
  http.start do |h|
    request = Net::HTTP::Get.new(uri.request_uri)
    h.request(request) do |response|
      bytes = response.socket.read(count)
    end
  end
rescue IOError => e
  # ignore
end

bytes

The new code does something far simpler: it uses the Range header.

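# Passing headers to Net::HTTP.get_response requires Ruby 3.0 or newer.
# Byte ranges are inclusive, so this asks for limit + 1 bytes.
headers = {'Range' => "bytes=0-#{limit}"}
uri = URI(url)
response = Net::HTTP.get_response(uri, headers)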

The former code retrieved the complete file from the server and then read the requested number of bytes into a variable; the latter requests only the bytes that are needed, provided the server supports range requests.
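A server is free to ignore the Range header and answer with a plain 200 and the whole file, so it can be worth checking the status before trusting the body. The following is a minimal sketch of that check, not the code we actually run: the helper names `range_header_for`, `leading_bytes` and `fetch_head` are made up for illustration, and `url` and `count` are assumed to come from the caller. It also assumes Ruby 3.0 or newer, where `Net::HTTP.get_response` accepts a headers hash.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: Range header for the first `count` bytes.
# HTTP byte ranges are inclusive, so the last position is count - 1.
def range_header_for(count)
  unless count.is_a?(Integer) && count.positive?
    raise ArgumentError, 'count must be a positive integer'
  end

  { 'Range' => "bytes=0-#{count - 1}" }
end

# Hypothetical helper: pick the leading bytes out of a response.
# 206 means the server honoured the range. 200 means it ignored the
# header and sent the whole file, so we trim locally. Anything else
# yields nil.
def leading_bytes(status_code, body, count)
  return nil unless %w[200 206].include?(status_code.to_s)

  body.to_s.byteslice(0, count)
end

# Assumed usage: fetch the first `count` bytes of `url`.
def fetch_head(url, count)
  response = Net::HTTP.get_response(URI(url), range_header_for(count))
  leading_bytes(response.code, response.body, count)
end
```

Note that in the 200 case the whole body has still been downloaded into memory before it is trimmed, so this only protects the parser, not the container. If that case matters, the status is worth logging so that servers without range support can be spotted.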

For small files the difference is negligible, but the larger the files get, the larger the problems become. With the former code, response times go up and memory usage goes up with them. Since video files can be more than 100 MB in our usage, the slow downloads combined with the large memory usage cause the OOM killer to destroy the container before the process finishes. The same job is then re-enqueued, which increases the chance that multiple jobs of this type get handled by the same container at "the same time", which in turn increases the chance of the container crashing again.

