Processing large amounts of data in Ruby

We tend to hear that Ruby is slow, although for most applications I’d say it’s fast enough. It can be sufficient even for processing large amounts of data, which we currently do (yes, with Ruby too) at Boostcom. Here are some notes on the obstacles I ran into on my way to decent performance with Ruby, without introducing another language.

The Problem

Let’s say you have logs from the web server behind your website. It is visited by many people who usually engage, meaning that they follow internal links and so on. They also come back from time to time. You want to store the last page each user visited, along with the date it happened. Let’s also add one extra condition: if the page is /say_goodbye, you forget about that user.

Now, you might think that this is some weird, non-real-world problem to solve. But it’s actually pretty much precisely the task I had to solve some time ago. Of course, the business domain was different, but it’s easier to explain what I wanted to achieve this way.

For simplicity, let’s say the log files are in JSON format and each entry has a user_id unique per user.
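For illustration, a single log line could look like the one below. The exact field names are my assumption, not taken from the real logs:

```ruby
require 'json'

# One hypothetical log line; the field names are assumed for illustration.
line = '{"user_id": 42, "timestamp": "2017-03-01T12:34:56Z", "url": "/products/7"}'

event = JSON.parse(line)
event['user_id'] # => 42
event['url']     # => "/products/7"
```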

First approach - Ruby hash

My first idea was very simple: use a Ruby hash. The problem practically begs for it: we want one entry per user and we want to easily overwrite that entry. So the code would look like this:

require 'json'

user_last_visits = {}

File.open('logfile.log').each_line do |line|
  event = JSON.parse(line)
  user_id = event['user_id']
  timestamp = event['timestamp']
  url = event['url']

  if url == '/say_goodbye'
    user_last_visits.delete(user_id)
  else
    user_last_visits[user_id] = { url: url, timestamp: timestamp }
  end
end

ResultFormatter.dump(user_last_visits)

The code is pretty straightforward and, most importantly, it does the job.

… until it doesn’t. When a Ruby hash grows, it takes more and more memory, causing the garbage collector to kick in very often. With something between one and two million keys in the hash (in my case), it became unusably slow. Every operation would wake up the GC, and even though each individual run is fast, the cumulative pauses slowed the whole script to a crawl.
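You can observe this pressure yourself with GC.stat. A rough sketch (the exact counts will differ between machines and Ruby versions):

```ruby
# Rough sketch: watch GC activity while a hash grows.
# The exact numbers depend on your machine and Ruby version.
gc_before = GC.stat(:count)

h = {}
500_000.times do |i|
  h[i] = { url: "/page/#{i}", timestamp: Time.now.to_i }
end

gc_after = GC.stat(:count)
puts "GC ran #{gc_after - gc_before} times while filling the hash"
```

With millions of keys instead of half a million, both the GC count and the total pause time climb steeply.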

This is a no-go. So what can we improve?

Optimized hashes

My second idea was to try something more efficient than Ruby’s Hash. The most promising thing I found was the google_hash gem, which wraps Google’s sparse and dense hash maps. They are written in C++, so they should be at least as fast. Also, since they make assumptions about the type of data stored in the hash, they can be much more memory-efficient. If the README was to be believed, even when GC kicks in, its run should be 50 times faster.

But I did not believe it, as the project is quite old and Ruby’s garbage collector has improved a lot since then. So I ran the benchmark provided in the repository myself:

# ruby
"each gc now takes"
"0.025354643000127908"

# google hash (dense, but same result for sparse)
"each gc now takes"
"0.005538118999993458"

OK, maybe it’s only 5 times faster, but that can still make a huge difference. The code looks pretty much the same, too. In my case user_id is a long integer, so I can take advantage of a type-specialized hash.

require 'json'
require 'google_hash'

user_last_visits = GoogleHashSparseLongToRuby.new

File.open('logfile.log').each_line do |line|
  event = JSON.parse(line)
  user_id = event['user_id'].to_i
  timestamp = event['timestamp']
  url = event['url']

  if url == '/say_goodbye'
    user_last_visits.delete(user_id)
  else
    user_last_visits[user_id] = { url: url, timestamp: timestamp }
  end
end

ResultFormatter.dump(user_last_visits)

Unfortunately, while this allowed the script to take significantly more data before choking, it still wasn’t enough.

Redis to the rescue?

When a Ruby (especially Rails) programmer has a storage-efficiency problem, they usually offload it to Redis. It has become the solution of choice for… well… pretty much everything.

And yes, Redis might have helped here, but it has one pesky trait: it’s in-memory only. I was already worried about fitting in RAM with the pure Ruby solutions, and if I was going to introduce an external tool, I wanted to be sure it would take that worry away.

The winner is: RocksDB

RocksDB is a key-value database developed by Facebook. It already powers CockroachDB and a new storage engine for MySQL called MyRocks. Among its features is being optimized for SSD storage, so I don’t need to worry about memory issues: the database spills to disk for me.

Unfortunately, the gem called rocksdb-ruby was quite outdated and did not want to compile against new versions of RocksDB. That’s why we came up with our own fork that works with the most recent versions of the database.

Here is how the code looks:

require 'json'
require 'rocksdb'

database = RocksDB::DB.new("/tmp/user_visits_db")

File.open('logfile.log').each_line do |line|
  event = JSON.parse(line)
  user_id = event['user_id'].to_s # RocksDB keys must be strings
  timestamp = event['timestamp']
  url = event['url']

  if url == '/say_goodbye'
    database.delete(user_id)
  else
    database.put(user_id, JSON.dump(url: url, timestamp: timestamp))
  end
end

ResultFormatter.dump(database)

A few things to note here:

  • You can only store strings in RocksDB (both keys and values), therefore you have to serialize the hash to a JSON string
  • RocksDB writes (puts) are very fast because they only append: new values go to an in-memory buffer backed by a write-ahead log (WAL) and are merged to disk in the background, which makes overwrites cheap. The flip side of this log-structured design is that reads (gets) are relatively slow. We don’t do any reads in our example, but it’s a thing to keep in mind.
  • Physical storage on disk is a directory in the filesystem. You need to make sure that /tmp exists; RocksDB will take care of creating the user_visits_db directory itself.
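The string-only constraint from the first point means every read needs a JSON round-trip. This can be sketched without RocksDB itself:

```ruby
require 'json'

# RocksDB stores plain strings, so the value hash is serialized on write...
stored = JSON.dump(url: '/products/7', timestamp: '2017-03-01T12:34:56Z')
stored.class # => String

# ...and parsed again on read; note that symbol keys come back as strings.
visit = JSON.parse(stored)
visit['url'] # => "/products/7"
```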

As for the ResultFormatter implementation: the efficient way to iterate over a RocksDB database is the each_with_index method:

database.each_with_index do |key, value|
  value = JSON.parse(value)
  # do something with key and value
end
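ResultFormatter itself is never shown in this post. A minimal sketch of what I assume it could look like (the module name and CSV output format are my invention): it only requires that the store yields key/value pairs, so a plain Hash behaves like the RocksDB iteration.

```ruby
require 'json'

# Hypothetical sketch of a ResultFormatter: prints one CSV-ish line
# per user. RocksDB values arrive as JSON strings and get parsed;
# plain hash values pass through untouched.
module ResultFormatter
  def self.dump(store, io: $stdout)
    store.each do |key, value|
      visit = value.is_a?(String) ? JSON.parse(value) : value
      io.puts "#{key},#{visit['url'] || visit[:url]},#{visit['timestamp'] || visit[:timestamp]}"
    end
  end
end
```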

This approach allowed us to process our data in reasonable time and, most importantly, with a reasonable memory footprint. If your workload looks similar (lots of writes and overwrites), I strongly recommend RocksDB for the task.
