-
Notifications
You must be signed in to change notification settings - Fork 7
How rails sharding connection handling works
First of all, to understand what the rails-sharding gem does, it is necessary to understand how ActiveRecord handles its connection to the database, and how we can use its existing interface to connect to shards.
ActiveRecord v5 defines the following class hierarchy:
ConnectionHandler -[has many]-> ConnectionPool -[has many]-> Connection (Adapter-specific classes)
Let's understand what each one of these classes do.
This is a high-level interface to database connection management. A common running Rails app has only one instance ConnectionHandler, that can be accessed via ActiveRecord::Base.connection_handler. The actual connection_handler getter is defined in ActiveRecord::Core#L134, which is included into ActiveRecord::Base.
The ConnectionHandler can manage several ConnectionPool, usually one per database. In a vanilla Rails app the ConnectionHandler has a single ConnectionPool. Besides holding several ConnectionPool, the ConnectionHandler also has the feature of automatically dealing with forked processes. If a Rails app is forked, the child (new) process cannot use the same ConnectionPool of the parent process. The ConnectionHandler detects the fork automatically and instanciates a new ConnectionPool for the child process, with the same specification as the parent's. This is completely transparent to the ConnectionHandler user.
To create a new ConnectionPool you need to call ConnectionHandler#establish_connection, which is a poorly named method for me, as it doesn't actually establish a connection to the database. It simply creates an empty ConnectionPool associated with a spec_name (just a plain string, which is "primary" by default).
Once the pool is created, you can retrieve it by spec_name using ConnectionHandler#retrieve_connection_pool, or you can even retrieve a DB connection directly just passing the spec_name using ConnectionHandler#retrieve_connection
Of course, Rails users don't usually know about the ConnectionHandler, and they also don't know things such as the spec_name of the database they're using. If you needed to get a connection or connection_pool through the ConnectionHandler, you'd call:
ActiveRecord::Base.connection_handler.retrieve_connection_pool("primary")
ActiveRecord::Base.connection_handler.retrieve_connection("primary")However, Rails sugar-coats this process by including the ConnectionHandling module to ActiveRecord::Base, which makes accessing DB connections much simpler:
ActiveRecord::Base.connection_pool
ActiveRecord::Base.connection
ActiveRecord::Base.with_connection { |connection| ... }Also, when you have different models being saved in different databases, the ConnectionHandling already does the trick for you of returning the connection or the connection pool to the right database, depending on which model you're calling the methods:
ModelA::Base.connection_pool #=> connection pool of the DB where ModelA is
ModelB::Base.connection_pool #=> connection pool of the DB where ModelB isThe case above is one of those cases where a single ConnectionHandler will have multiple ConnectionPool.
The ConnectionPool is responsible for managing and allocating DB connections to threads. When you define your connection in the config/database.yml you specify the pool size of for each database using the pool option (if not, the default is 5). This is the maximum number of database connections a ConnectionPool supports.
The ConnectionPool is lazy. It starts with 0 available connections, and as connections are checked-out, they are created and removed from the available pool. When the pool runs out of connections it will put any thread that is requesting another connection to wait for a given timeout. If it timeouts, it will then raise a ConnectionTimeoutError exception here. You can control the timeout time with the checkout_timeout option in your database specification in config/database.yml
If you're into this kind of stuff you can play with the ConnectionPool directly using the the checkout and checkin methods:
connection_pool = ActiveRecord::Base.connection_pool
connection_1 = connection_pool.checkout
connection_2 = connection_pool.checkout
connection_pool.checkin connection_1
connection_pool.checkin connection_2But be warned! If you forget to check-in the connection it will stay reserved and unavailable for checkout, and you run the risk of running out of connections in your pool.
This is one of the reasons you should probably not use the checkout and checkin methods directly, but rather the higher-level methods connection and release_connection (these are the methods used by Rails to retrieve a connection). The main difference is that connection and release_connection will associate a checked-out connection to the current thread, and will avoid reserving more than one connection per thread. This means that if you forget to release_connection after you have finished using the connection, at least, the next time you request for a connection you will get the same one as before, instead of checking-out a new one:
connection_pool = ActiveRecord::Base.connection_pool
connection_1 = connection_pool.connection
connection_2 = connection_pool.connection
connection_1 == connection_2 #=> trueIn fact, that is how ActiveRecord::Base uses the ConnectionPool. When you load some record like User.first ActiveRecord will request a connection to the ConnectionPool, and it will not call release_connection later. This is actually a sensible thing to do when you have a single database, but might be a problem if you're accessing thousands of different databases randomly.
In terms of using the ConnectionPool directly, it is more memory-efficient to use the with_connection method instead of the pair connection, release_connection. This ensures the connection is checked-in to the poll after the block runs, even if the block raised some exception:
connection_pool = ActiveRecord::Base.connection_pool
connection_pool.with_connection do |connection|
# do your work here
end
# connection has been checked-in again when we get hereJust remember that a checked-in connection is not a closed connection, it just means the connection has returned to the pool of available connections to be checked-out by another thread. This leads to better memory utilization but does not free memory directly.
Another interesting feature of the ConnectionPool is the Reaper. It runs periodically (with frequency configurable through the reaping_frequency config) and it identifies connections checked-out by threads that are no longer alive, and checks them in again. The Reaper does not run by default, only if you specify the reaping_frequency option.
As of now, the ConnectionPool doesn't have any methods to actually disconnect and clear stale connections sitting in the pool for too long. This might be relevant if you are accessing a large number of databases, as every connection ever opened for any database ever queried will be kept open forever. This results in a larger memory footprint of your Rails process and might result in your DB running out of open connection slots (usually in the order of hundreds to thousands for a database). Of course, you will only have this kind of problem if you're accessing a very large number of databases or if you have very large connection pools.
The connection class is defined for each of the DB adapters (MySQL, Postgres, etc) and it holds a raw_connection, which is a DB client defined by some external gem (mysql2 for MySQL, pg for Postgres). You can run queries using the Connection#execute and you can do some simple connection management with Connection#disconnect!, Connection#active? and Connection#reconnect.
If you want to actually close/disconnect a connection, it is important to check it in first to the ConnectionPool, then remove it from the ConnectionPool, and finally you can call disconnect! on it. If you do remove it from the pool first, it will eventually be served to some thread, and it will crash your thread when used.
Now that you understand what ActiveRecord offers us, you're ready to understand how the rails-sharding gem manages connections to different shards.
Based on a configuration file (config/shards.yml) the gem defines a new (and static) ConnectionHandler just to deal with shard connections. In other words, a completely separate handler from the one Rails uses.
ActiveRecord::Base.connection_handler # => ConnectionHandler for Rails
Rails::Sharding::ConnectionHandler # => static ConnectionHandler for shardsAt Rails initialization, the gem creates a ConnectionPool for each shard (in all shard groups) by calling Rails::Sharding::ConnectionHandler#establish_connection for each shard, the same way ActiveRecord does for the primary database.
You can then retrieve connections or connection pools for shards using:
Rails::Sharding::ConnectionHandler.connection_pool(shard_group, shard_name)
Rails::Sharding::ConnectionHandler.retrieve_connection(shard_group, shard_name)
Rails::Sharding::ConnectionHandler.with_connection(shard_group, shard_name) { |connection| ... }And the connection pools will work as usual, opening connections when necessary and allocating them to threads that request them.
Finally, the missing piece to make it all work is the module Rails::Sharding::ShardableModel, that will work as a ConnectionPool switcher, making the model access the primary ConnectionPool or some of the shards' ConnectionPools, depending on the scope.
Usually it is not. You can manage very large Rails applications with or without shards and never have to know any of this. However, if you're connecting to a very large number of shards, or your databases are already near the limit of open connections, you will probably have to deal with that your self, and learn to manage connection pools and connections manually.