How Row Locking Makes Taskling Concurrency Controls Possible

Taskling jobs can be configured with concurrency limits, and Taskling guarantees that no more than the configured number of executions of a job will be running at any one time.

Some batch and micro-batch jobs need to be singletons, with only one execution running at any point in time. This may be to avoid data consistency issues when persisting results, or because only a single session can be opened to a third-party service, and so on. Other batch processes need more than one execution running at the same time in order to cope with the data volume, but need a concurrency limit so as not to overwhelm downstream systems or third-party services.

If you integrate Taskling into your configuration system, you can make this concurrency control a real-time knob that can be adjusted at any time.

In this post we'll look at how Taskling makes guarantees about concurrency limits.

Using Row Locking in SQL Server

In Taskling, each batch job has a row in the Taskling.TaskDefinition table. Concurrency limits are managed through the granting of tokens, and the tokens are stored in the TaskDefinition table. When a Taskling execution starts it requests a token, and the Taskling token logic must

  • Ensure that the number of tokens matches the configured concurrency limit (if they differ, it either creates new available tokens or removes tokens)

  • Check if there is an available token and

    • if so, then grant it to the execution requesting it

    • if not, then check whether any of the unavailable tokens belongs to a dead task, and if so grant that token to the requesting execution

But in order to make guarantees about the number of executions, this logic must run serially, so that two executions cannot be granted the same token at the same time and other race conditions are avoided. Locks in the C# code are not enough, as executions can run concurrently across many servers.

So row locking in SQL Server is used to block other executions from running the same logic at the same time. The TaskDefinition table has a column, HoldLockTaskExecutionId, which is used to acquire a lock on the row. Acquiring a token works as follows:

  1. Open a serializable transaction

  2. Perform an update on the column HoldLockTaskExecutionId, setting its value to the Id of the requesting execution. The row is now exclusively locked, and other executions are blocked until the transaction is committed. The update command locates the row by its primary key, so only that single row is locked and deadlocks between different jobs are avoided.

  3. Retrieve the Tokens value from the row into memory. This is a string such as "I:ccc7b851-2404-4b2b-84e3-6898399a976d,S:1,G:153532,I:747ebc4e-add4-4d95-9022-0d7a544af813,S:0,G:153533", which contains two tokens.

    1. I is the token id

    2. S is the status (0 = unavailable, 1 = available)

    3. G is the id of the grantee execution

  4. Adjust the number of tokens to match the configured concurrency limit (the limit may have been changed since the last execution started)

  5. If a token is available, assign it to the requesting execution by setting its G value to the id of the execution and its S value to 0 (unavailable)

  6. Perform an update command to persist the changes to the tokens

  7. Commit the transaction
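The acquisition steps above can be sketched in T-SQL. The parameter names and exact schema details here are illustrative assumptions, not Taskling's actual code:

```sql
-- Sketch of the token-acquisition transaction (illustrative names)
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

-- Write first: the update locates the row by its primary key and
-- takes an exclusive lock that is held until commit
UPDATE Taskling.TaskDefinition
SET HoldLockTaskExecutionId = @TaskExecutionId
WHERE TaskDefinitionId = @TaskDefinitionId;

-- Now read the token string; no other execution can get past the
-- update above until this transaction commits
SELECT @Tokens = Tokens
FROM Taskling.TaskDefinition
WHERE TaskDefinitionId = @TaskDefinitionId;

-- (In the application) parse @Tokens, adjust the token count to the
-- configured limit, and grant an available token by setting its
-- S value to 0 and its G value to @TaskExecutionId

-- Persist the modified token string
UPDATE Taskling.TaskDefinition
SET Tokens = @NewTokens
WHERE TaskDefinitionId = @TaskDefinitionId;

COMMIT TRANSACTION;
```

The token-return transaction follows the same write-first shape, differing only in the in-memory step (setting the returned token's S value back to 1).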

Then, when the execution completes, it returns the token:

  1. Open a serializable transaction

  2. Perform an update on the column HoldLockTaskExecutionId, setting its value to the Id of the execution returning the token

  3. Retrieve the token into memory

  4. Set the status of the token to available (S = 1)

  5. Perform an update command to persist the change

  6. Commit the transaction

How Deadlocks Are Avoided and First-Come-First-Serve Guarantees Are Delivered

To avoid deadlocks and to guarantee that executions gain access to the tokens on a first-come-first-serve basis, the following access patterns are required

  • All accesses, whether reads or writes, use the primary key. This avoids key-range locking, which can cause two queries to interfere with each other. Using the primary key means an index seek is used, so different batch jobs cannot tread on each other.

  • Serializable isolation level. In fact, because we use primary keys, the Repeatable Read isolation level would suffice. The effect that both provide is that once a lock has been granted on a row, that lock is not relinquished until the transaction completes. That is important because of the next point.

  • All "read then write" operations are preceded by a write, making them write-read-write operations. If we allowed the read to come first, lock conversion could cause two competing transactions to each attempt to promote their shared (read) lock to an exclusive lock. Neither can succeed, because each already holds a shared lock that is incompatible with the other's requested exclusive lock, so they are deadlocked. By writing first, a transaction gains an exclusive lock on the row which is not relinquished until the transaction completes. This means that while an execution is reading the tokens and making changes, it is the only execution that can perform those activities.
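The difference between the deadlock-prone ordering and the write-first ordering can be sketched like this (again with illustrative parameter names):

```sql
-- DEADLOCK-PRONE: read first under SERIALIZABLE / REPEATABLE READ.
-- Two transactions can both take a shared lock on the same row...
BEGIN TRANSACTION;
SELECT Tokens FROM Taskling.TaskDefinition
WHERE TaskDefinitionId = @TaskDefinitionId;   -- shared lock, held to commit
-- ...and then each blocks trying to convert it to an exclusive lock,
-- which the other's shared lock prevents: deadlock
UPDATE Taskling.TaskDefinition SET Tokens = @NewTokens
WHERE TaskDefinitionId = @TaskDefinitionId;
COMMIT TRANSACTION;

-- SAFE: write first. Only one transaction can take the exclusive lock;
-- the other simply blocks on the UPDATE and proceeds after the commit
BEGIN TRANSACTION;
UPDATE Taskling.TaskDefinition
SET HoldLockTaskExecutionId = @TaskExecutionId
WHERE TaskDefinitionId = @TaskDefinitionId;   -- exclusive lock
SELECT Tokens FROM Taskling.TaskDefinition
WHERE TaskDefinitionId = @TaskDefinitionId;   -- read under that lock
COMMIT TRANSACTION;
```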

So SQL Server is critical to Taskling. You could not simply rip out SQL Server and put Cassandra in its place. Other relational databases such as Oracle, MySQL and PostgreSQL would work equally well, as they also have row-locking capabilities.

Because reliable transaction behaviour is a fundamental part of SQL Server, we can leverage that reliability to give Taskling an equally reliable level of concurrency control.