I recently tasked myself with helping a good friend import a relatively large (1 million+ records) CSV file into a database powering a Rails application.
The database and code were hosted on a 512MB Chunk at Chunkhost, which is also where I once hosted Quotir (I now use DigitalOcean, they really rock!).
As any well-intentioned Railsist would, I embarked on the journey armed with my usual import toolbox, which consists of:
- A Rake Task to handle the import
- FasterCSV (merged into Ruby 1.9 as the standard CSV library, and thus the default CSV tool in Rails 3)
- ActiveRecord, because life is so much more beautiful when we don’t have to deal with the database directly
The initial code (pre-optimization) looked a bit like this:
namespace :foo do
  desc "Import bar data"
  task bar: :environment do
    require 'csv'

    headers = ["xxx", "yyy", "zzz"]

    CSV.foreach("#{path}/bar_data.csv", headers: :first_row) do |row|
      bar = Bar.new
      headers.each_with_index do |key, idx|
        bar.send("#{key}=", row[idx])
      end
      # ... do some cool stuff, update some records in other tables
      bar.save
      # ... do some more cool stuff, which happens to also create some new records in other tables
    end
  end
end
In my specific case, the CSV.foreach loop iterated about once per second. At that rate, a million-plus records meant roughly 12 days of runtime, and dear, was I not ready to accept that!
You see, the beautiful thing about constraints is how they remind you of what is absolutely necessary, and what is, well, extraneous.
One doesn’t have to look very far to realize that ActiveRecord is a commodity, a luxury one has to part with when the carriage becomes too heavy for the horse.
So we parted… Me and ActiveRecord bid each other safe travels, and promised we’d meet again.
Now friend, the reason you’re still reading this is that you too, at some point, might have to leave AR (that’s its nickname) by the curb to get to your destination faster.
How did I do it?
Without further ado, here’s what the code above looks like after optimization:
namespace :foo do
  desc "Import bar data"
  task bar: :environment do
    require 'csv'

    headers = ["xxx", "yyy", "zzz"]

    ActiveRecord::Base.establish_connection

    CSV.foreach("#{path}/bar_data.csv", headers: :first_row) do |row|
      sql_keys = []
      sql_vals = []
      created_at = Time.now.strftime("%Y-%m-%d %H:%M:%S")

      headers.each_with_index do |key, idx|
        val = row[idx]
        sql_keys << key
        sql_vals << ActiveRecord::Base.connection.quote(val)
      end

      sql = "
        INSERT INTO bars (#{sql_keys.join(', ')}, created_at, updated_at)
        VALUES (#{sql_vals.join(', ')}, '#{created_at}', '#{created_at}')
      "

      res = ActiveRecord::Base.connection.execute(sql)
      bar_id = ActiveRecord::Base.connection.last_inserted_id(res)

      # do some cool stuff, like create records in other tables that reference bar_id
      # use ActiveRecord::Base.connection.execute(your_sql) in subsequent calls.
      # no need to close the connection, or reopen it before calling execute.
    end
  end
end
It all starts with a call to
ActiveRecord::Base.establish_connection
The line above establishes a direct connection to the database configured for your current environment in config/database.yml.
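Worth knowing: inside a Rake task that depends on :environment, calling it with no arguments simply reuses the current environment’s settings. A more explicit, equivalent form looks roughly like this (a sketch, not a requirement):

ActiveRecord::Base.establish_connection(
  Rails.application.config.database_configuration[Rails.env]
)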
Two important details to note in the code above:
ActiveRecord::Base.connection.quote(val)
This is one of the proper ways to escape values headed into raw SQL. Use it. If you favor the placeholder style of prepared statements, then by all means look at sanitize_sql_array.
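To make that concrete, here is roughly what each looks like (a sketch; the exact quoting output depends on your adapter, and sanitize_sql_array is protected on ActiveRecord::Base, hence the send):

conn = ActiveRecord::Base.connection

# quote returns a SQL-ready literal, using the adapter's escaping rules:
conn.quote("O'Brien")  # => "'O\\'Brien'" on MySQL ("'O''Brien'" on PostgreSQL)
conn.quote(nil)        # => "NULL"

# placeholder style, interpolated and escaped for you:
ActiveRecord::Base.send(:sanitize_sql_array, ["xxx = ? AND yyy = ?", "O'Brien", 42])
# => "xxx = 'O\\'Brien' AND yyy = 42"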
The second detail is the timestamps. Raw SQL bypasses ActiveRecord’s automatic timestamping, so if you’re working in Rails you will want to set created_at/updated_at manually, as those columns cannot be null.
And this is how my import script went from taking 12 days to run to taking only 3 hours!
See that? A tavern! This is where this post and I will rest for a while, right by the “Post Comment” button, so how about you stop by and share your thoughts on our journey and yours, fellow traveler?
Note: This code should work with Rails 2 with minimal changes.
EDIT: As mentioned by “Taladar” on the Reddit thread for this post, using LOAD DATA INFILE can drastically reduce import times. Depending on the case at hand (and specifically if it’s a simple record insert per CSV row), this might be the optimal solution.
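For the curious, a minimal sketch of that approach (MySQL-specific; assumes the CSV columns map 1:1 onto xxx, yyy, zzz and that the server allows LOCAL infile):

sql = <<-SQL
  LOAD DATA LOCAL INFILE '#{path}/bar_data.csv'
  INTO TABLE bars
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\\n'
  IGNORE 1 LINES
  (xxx, yyy, zzz)
  SET created_at = NOW(), updated_at = NOW()
SQL
ActiveRecord::Base.connection.execute(sql)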
8 replies on “Import CSV Files 100x Faster in Rails 3”
Good Job,
I’m using CSV imports in many of my apps, and I have to say it’s quite slow. I wouldn’t recommend it for dynamically creating files in real time.
It’s great for activities running in the background.
Or you could use ActiveRecord::Fixtures.create_fixtures. Although it has problems with empty columns (NULLs) in CSV.
Very nice. You’re using this for a very large data set. I only had to deal with a max of about 10k rows per import (catalog data), so I batched up the SQL and ran one large query. Went from 5.7k rows in 17 seconds to 1 second. Thanks for the push in the right direction.
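For anyone wanting to try the same batching, a single multi-row INSERT might look like this (a sketch, reusing the headers and bars table from the post; very large files should be sliced into chunks):

conn = ActiveRecord::Base.connection
now  = Time.now.strftime("%Y-%m-%d %H:%M:%S")

values = CSV.read("#{path}/bar_data.csv", headers: :first_row).map do |row|
  quoted = headers.map { |key| conn.quote(row[key]) }
  "(#{quoted.join(', ')}, '#{now}', '#{now}')"
end

# One round trip instead of one per row (use values.each_slice(1000) for huge files):
conn.execute("INSERT INTO bars (#{headers.join(', ')}, created_at, updated_at) VALUES #{values.join(', ')}")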
I’m curious about these two lines:
res = ActiveRecord::Base.connection.execute(sql)
bar_id = ActiveRecord::Base.connection.last_inserted_id(res)
What guarantees that you’ll get bar_id from the previous statement? Couldn’t you get the id from an insert made by some other user at almost exactly the same time as you?
This is the third time I read this page while I’m searching about Rails optimization and I am stil wondering : why did you choose this illustration (on top of your article) ? Either I miss a cultural reference or there’s something I didn’t spot on the picture but I don’t find any relation with the actual topic.
@Viktor Fonic: That can only happen if the other insert goes through the same connection; last_insert_id is tracked per database connection, so inserts from other users’ connections won’t interfere.
Because of this, you may want to look into connection pools, how to check them out and release them yourself: http://apidock.com/rails/ActiveRecord/ConnectionAdapters/ConnectionPool.
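A sketch of that checkout pattern, which keeps the insert and the id read on a single connection for the duration of the block:

ActiveRecord::Base.connection_pool.with_connection do |conn|
  res    = conn.execute(sql)
  bar_id = conn.last_inserted_id(res) # per-connection on MySQL, so other users can't interfere
end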
@Paul A: Very good question! Looks like you found us…
@jonaphin: oh, ok then
This solution worked great for me in Rails 4.0, but Rails 4.1 has removed the deprecated ActiveRecord::Base#connection method.
The change log suggests accessing via the class. What code change did / would you make to get this solution to work in Rails 4.1?
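(A note for later readers: the snippets in the post already access the connection via the class, which still works in Rails 4.1; what was removed is the deprecated instance-level #connection. A sketch of both forms, with a hypothetical method name for illustration:)

# Class-level access, as used throughout the post — unchanged in Rails 4.1:
ActiveRecord::Base.connection.execute(sql)

# Inside a model instance, go through the class rather than the removed
# instance-level `connection`:
class Bar < ActiveRecord::Base
  def raw_connection_example
    self.class.connection
  end
end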