
Import CSV Files 100x Faster in Rails 3

I recently tasked myself with helping a good friend import a relatively large (1 million+ records) CSV file into a database powering a Rails application.
The database and code were hosted on a 512MB Chunk at Chunkhost, which is also where I once hosted Quotir (I now use DigitalOcean, they really rock!).
As any well-intentioned Railsist would, I embarked on the journey armed with my usual import toolbox, which consists of:

  • A Rake Task to handle the import
  • FasterCSV (which is the default CSV import tool in Rails 3)
  • ActiveRecord, because life is so much more beautiful when we don’t have to deal with the database directly

The initial code (pre-optimization) looked a bit like this:

namespace :foo do
  desc "Import bar data"
  task bar: :environment do
    require 'csv'

    headers = [
      "xxx",
      "yyy",
      "zzz"      
    ]

    CSV.foreach("#{path}/bar_data.csv", {headers: :first_row}) do |row|
      bar = Bar.new

      headers.each_with_index do |key, idx|
        bar.send("#{key}=", row[idx])
      end

      # ... do some cool stuff, update some records in other tables

      bar.save

      # ... do some more cool stuff, which happens to also create some new records in other tables
    end
  end
end

In my specific case, the CSV.foreach loop iterated about once per second. Dear me, I was not ready to accept waiting 12 days for this script to finish running!

You see, the beautiful thing about constraints is how they remind you of what is absolutely necessary, and what is well, extraneous.
One doesn’t have to look very far to realize that ActiveRecord is a commodity, a luxury one has to part with when the carriage becomes too heavy for the horse.

So we parted… Me and ActiveRecord bid each other safe travels, and promised we’d meet again.
Now friend, the reason you’re still reading this is because you too, at some point, might have to leave AR (That’s its nickname) by the curb to get to your destination faster.

How did I do it?
Without further ado, here’s what the code above looks like after optimization:

namespace :foo do
  desc "Import bar data"
  task bar: :environment do
    require 'csv'

    headers = [
      "xxx",
      "yyy",
      "zzz"      
    ]

    ActiveRecord::Base.establish_connection

    CSV.foreach("#{path}/bar_data.csv", {headers: :first_row}) do |row|
      sql_keys = []
      sql_vals = []

      created_at = Time.now.strftime("%Y-%m-%d %H:%M:%S")

      headers.each_with_index do |key, idx|
        val = row[idx]

        sql_keys << key
        sql_vals << ActiveRecord::Base.connection.quote(val)
      end

      sql = "
        INSERT INTO bars (#{sql_keys.join(', ')}, created_at, updated_at) 
        VALUES (#{sql_vals.join(', ')}, '#{created_at}', '#{created_at}')
      "

      res = ActiveRecord::Base.connection.execute(sql)
      bar_id = ActiveRecord::Base.connection.last_inserted_id(res)

      # do some cool stuff, like create records in other tables that reference bar_id
      # use ActiveRecord::Base.connection.execute(your_sql) in subsequent calls.
      # no need to close the connection, or reopen it before calling execute.      
    end
  end
end


It all starts with a call to
ActiveRecord::Base.establish_connection
This line establishes a direct connection to the database configured for your environment in the config/database.yml file.

Two important details to note in the code above:
ActiveRecord::Base.connection.quote(val)
This is one of the proper ways to escape values in your SQL. Use it. If you favor prepared statements, then by all means look at sanitize_sql_array.
The second detail is the timestamps. If you’re working in Rails, you will want to set created_at/updated_at manually: raw SQL bypasses ActiveRecord’s automatic timestamping, and those columns cannot be null.
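To make both details concrete, here is a minimal, self-contained sketch of how a single CSV row becomes an INSERT statement. The quote helper below is a simplified stand-in for ActiveRecord::Base.connection.quote (which you should use in real code, since the real adapter also handles encodings, binary data, and more); the headers and row values are made up for illustration.

```ruby
# Simplified stand-in for ActiveRecord::Base.connection.quote.
# It handles NULLs and embedded quotes, nothing more.
def quote(val)
  return "NULL" if val.nil?
  "'#{val.to_s.gsub("'", "''")}'"
end

headers = ["xxx", "yyy", "zzz"]
row     = ["42", "O'Brien", nil]   # note the apostrophe and the NULL

sql_keys = headers.dup
sql_vals = row.map { |val| quote(val) }

# created_at/updated_at must be set by hand -- raw SQL bypasses the
# ActiveRecord callbacks that normally fill them in.
created_at = Time.now.strftime("%Y-%m-%d %H:%M:%S")

sql = "INSERT INTO bars (#{sql_keys.join(', ')}, created_at, updated_at) " \
      "VALUES (#{sql_vals.join(', ')}, '#{created_at}', '#{created_at}')"

puts sql
```

Note how the apostrophe in O'Brien is doubled and the nil becomes a bare NULL; skipping that escaping step is exactly how SQL injection (or a crashed import) happens.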

And this is how my import script went from taking 12 days to run to taking only 3 hours!

See that? A tavern! This is where me and this post will rest for a while, right by the “Post Comment” button, so how about you stop by and share your thoughts on our journey and yours, fellow traveler?

Note: This code should work with Rails 2 with minimal changes.

EDIT: As mentioned by “Taladar” on the Reddit thread for this post, using LOAD DATA INFILE can drastically reduce import times. Depending on the case at hand (and specifically if it’s a simple record insert per CSV row), this might be the optimal solution.
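For reference, a LOAD DATA INFILE statement for an import like this one might look like the sketch below. The file path is a placeholder, the table and column names mirror the hypothetical examples above, and depending on your MySQL setup you may need the LOCAL keyword, the FILE privilege, or the local_infile setting enabled; you would run the statement through ActiveRecord::Base.connection.execute.

```ruby
# Hypothetical table/column names matching the earlier examples.
# The SET clause fills the timestamps LOAD DATA won't populate.
load_sql = <<-SQL
  LOAD DATA LOCAL INFILE '/path/to/bar_data.csv'
  INTO TABLE bars
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\\n'
  IGNORE 1 LINES
  (xxx, yyy, zzz)
  SET created_at = NOW(), updated_at = NOW()
SQL

puts load_sql
```

The trade-off: you lose the per-row Ruby hook (the “do some cool stuff” steps above), so this fits best when each CSV row maps to exactly one inserted record.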

By jonaphin

Sr. Software Engineer / Project Lead at Adobe Systems, Inc.

8 replies on “Import CSV Files 100x Faster in Rails 3”

Good Job,
I’m using CSV imports in many of my apps, and I have to say it’s quite slow. I wouldn’t recommend it for dynamically creating files in real time.
It’s great for activities running in the background.

Or you could use ActiveRecord::Fixtures.create_fixtures. Although it has problems with empty columns (NULLs) in CSV.

Very nice. You’re using this for a very large data set; I only had to deal with a max of about 10k rows per import (catalog data). So I batched up the SQL and ran one large query. Went from 5.7k rows in 17 seconds to 1 second. Thanks for the push in the right direction.
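The batching this commenter describes can be sketched as follows: collect one quoted value tuple per row, then emit a single multi-row INSERT instead of one statement per row. As before, the quote helper is a simplified stand-in for ActiveRecord::Base.connection.quote, and the data is made up for illustration.

```ruby
# Simplified stand-in for the adapter's quote method.
def quote(val)
  val.nil? ? "NULL" : "'#{val.to_s.gsub("'", "''")}'"
end

headers = ["xxx", "yyy", "zzz"]
rows = [
  ["1", "foo", "bar"],
  ["2", "baz", nil],
  ["3", "qux", "quux"]
]

# One INSERT with many VALUES tuples instead of one INSERT per row.
tuples = rows.map { |row| "(#{row.map { |v| quote(v) }.join(', ')})" }
sql = "INSERT INTO bars (#{headers.join(', ')}) VALUES #{tuples.join(', ')}"

puts sql
```

In practice you would flush a batch every few hundred or thousand rows rather than building one giant statement, since MySQL caps the packet size (max_allowed_packet).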

I’m curious about these two lines:
res = ActiveRecord::Base.connection.execute(sql)
bar_id = ActiveRecord::Base.connection.last_inserted_id(res)

What guarantees that you’ll get bar_id from the previous statement? Couldn’t it be possible that you get the id from insert that happened when some user inserts the data at almost exactly the same time like you?

This is the third time I’ve read this page while searching about Rails optimization, and I am still wondering: why did you choose this illustration (at the top of your article)? Either I’m missing a cultural reference or there’s something I didn’t spot in the picture, but I can’t find any relation to the actual topic.

This solution worked great for me in Rails 4.0, but Rails 4.1 has removed the deprecated ActiveRecord::Base#connection method.

The change log suggests accessing via the class. What code change did / would you make to get this solution to work in Rails 4.1?