Import CSV Files 100x Faster in Rails 3

I recently tasked myself with helping a good friend import a relatively large (1 million+ records) CSV file into a database powering a Rails application. The database and code were […]

I recently tasked myself with helping a good friend import a relatively large (1 million+ records) CSV file into a database powering a Rails application.
The database and code were hosted on a 512MB Chunk at Chunkhost, which is also where I host once hosted Quotir (I now use DigitalOcean, they really rock!).
As any well-intentioned Railsist would, I embarked on the journey armed with my usual import toolbox, which consists of:

  • A Rake Task to handle the import
  • FasterCSV (which is the default CSV import tool in Rails 3)
  • ActiveRecord, because life is so much more beautiful when we don’t have to deal with the database directly

The initial code (pre-optimization) looked a bit like this:

namespace :foo do
  desc "Import bar data"
  task bar: :environment do
    require 'csv'

    headers = [
      "xxx",
      "yyy",
      "zzz"      
    ]

    CSV.foreach("#{path}/bar_data.csv", {headers: :first_row}) do |row|
      bar = Bar.new

      headers.each_with_index do |key, idx|
        bar.send("#{key}=", row[idx])
      end

      # ... do some cool stuff, update some records in other tables

      bar.save

      # ... do some more cool stuff, which happens to also create some new records in other tables
    end
  end
end

In my specific case, the CSV.foreach loop would iterate about once per second. Dear, was I not ready to accept waiting 12 days for this script to finish running!

You see, the beautiful thing about constraints is how they remind you of what is absolutely necessary, and what is well, extraneous.
One doesn’t have to look very far to realize that ActiveRecord is a commodity, a luxury one has to part with when the carriage becomes too heavy for the horse.

So we parted… Me and ActiveRecord bid each other safe travels, and promised we’d meet again.
Now friend, the reason you’re still reading this is because you too, at some point, might have to leave AR (That’s its nickname) by the curb to get to your destination faster.

How did I do it?
Without further ado, here’s what the code above looks like after optimization:

namespace :foo do
  desc "Import bar data"
  task bar: :environment do
    require 'csv'

    headers = [
      "xxx",
      "yyy",
      "zzz"      
    ]

    ActiveRecord::Base.establish_connection

    CSV.foreach("#{path}/bar_data.csv", {headers: :first_row}) do |row|
      sql_keys = []
      sql_vals = []

      created_at = Time.now.strftime("%Y-%m-%d %H:%M:%S")

      headers.each_with_index do |key, idx|
        val = row[idx]

        sql_keys << key
        sql_vals << ActiveRecord::Base.connection.quote(val)
      end

      sql = "
        INSERT INTO bars (#{sql_keys.join(', ')}, created_at, updated_at) 
        VALUES (#{sql_vals.join(', ')}, '#{created_at}', '#{created_at}')
      "

      res = ActiveRecord::Base.connection.execute(sql)
      bar_id = ActiveRecord::Base.connection.last_inserted_id(res)

      # do some cool stuff, like create records in other tables that reference bar_id
      # use ActiveRecord::Base.connection.execute(your_sql) in subsequent calls.
      # no need to close the connection, or reopen it before calling execute.      
    end
  end
end

 

It all starts with a call to
ActiveRecord::Base.establish_connection
The line above will establish a direct connection with the database set by your environment in the config/database.yml file.

Two important details to note in the code above:
ActiveRecord::Base.connection.quote(val)
This is one of the proper ways to escape your SQL. Use it. If you favor prepared statements, then by all means look at sanitize_sql_array
The second detail is the time stamps. If you’re working in Rails you will want to add the time stamps for created_at/updated_at manually, as they cannot be null.

And this is how my import script went from taking 12 days to run to taking 3 hours only!

See that? A tavern! This is where me and this post will rest for a while, right by the “Post Comment” button, so how about you stop by and share your thoughts on our journey and yours, fellow traveler?

Note: This code should work with Rails 2 with minimal changes.

EDIT: As mentioned by “Taladar” on the Reddit thread for this post, using LOAD DATA INFILE┬ácan drastically reduce import times. Depending on the case at hand (and specifically if it’s a simple record insert per CSV row), this might be the optimal solution.

About jonaphin

Sr. Software Engineer at Adobe Systems, Inc.