Case definitions often have a temporal condition, e.g.,
two or more physician visits at least 30 days apart within two years.
if_date() (for a vector of dates) and
restrict_date() (for data frames and database tables)
decide whether a client’s dates can satisfy such a condition. This
vignette explains how they work under the hood: why the
within part is easy, why the apart part needs
a search algorithm, and how that algorithm gets translated into SQL — a
language with no loops — using metaprogramming and ‘dbplyr’.
The question to answer is: among a client’s dates, is there
any set of n dates whose adjacent gaps are all at
least apart days and whose overall span is at most
within days?
The direct approach is to try every possible set of n
dates:
brute_force <- function(x, n, apart = 0, within = Inf) {
any(utils::combn(
unique(x), n,
function(s) all(diff(sort(s)) >= apart) & (max(s) - min(s) <= within)
))
}This is correct, and it is exactly what the package’s tests compare
against. But the number of sets explodes: a client with 20 dates has
choose(20, 3) = 1140 sets of three, and
choose(20, 5) = 15,504 sets of five. Across millions of
clients this is painful in R — and impossible on a database, because SQL
has no way to enumerate combinations at all. Both conditions need a
smarter route.
withinIf the only condition is “any n dates within
within days”, sorting solves it. After sorting, the
n dates with the smallest span are always n
consecutive dates — skipping over a date in between can never
shrink a span. So instead of all combinations, we only check each run of
n consecutive sorted dates: is the gap between a date and
the (n − 1)-th date after it at most
within?
x <- as.Date("2020-01-01") + c(0, 200, 380, 390)
# any 3 of these within one year? check consecutive triples:
# (0, 200, 380) spans 380; (200, 380, 390) spans 190 -- yes
if_date(x, n = 3, within = 365)
#> [1] TRUEThis is a rolling window over sorted dates — one pass, no
combinations. On a database, the same idea is one window function:
compare each date with lead(date, n - 1) (or
lag(), depending on flag_at) within the
client.
apartNow the condition “any n dates pairwise at least
apart days apart”. Sorting alone does not help here: the
qualifying dates need not be consecutive — in fact they are usually
spread out, and there are combinatorially many ways to pick a spread-out
set. We need a search, but one that avoids enumerating sets.
The key observation: if any qualifying set exists, then a qualifying set that uses the client’s earliest and latest dates also exists.
Why? Take any qualifying set and swap its earliest member for the client’s overall earliest date. That swap moves the first pick further away from (or keeps the same distance to) the second pick, so the first gap can only grow — the set still qualifies. The same argument applies at the other end with the latest date.
This means we never have to wonder which dates to start and end with:
anchor at the two extremes, and the problem shrinks to finding the
middle n − 2 dates. And the middle dates
can’t be just anywhere — each must be at least apart days
after the overall minimum and at least apart days before
the overall maximum. In other words, they must fall inside the
window
[ min + apart, max − apart ].
Inside that window, the same logic repeats: anchor at the window’s
own earliest and latest dates, shrink the window by apart
on both sides, and look for n − 4 dates in the new, smaller
window. Every pass peels two dates off n, so the search
finishes in n %/% 2 passes — regardless of how many dates
the client has:
n, the last pass needs to find
just one date in the innermost window.n, the last two picks are the
innermost window’s own earliest and latest dates, so the final check is
whether their gap is at least apart.n − 2i dates, the answer is no — the search stops
early.Take a client whose dates are days 0, 5, 9, 16, 23, and 30 (relative
to their first visit), and ask for n = 4 dates at least 7
days apart:
n is now down to 0, so apply
the even-n ending: the earliest and latest dates inside the
window are days 9 and 23, and 23 − 9 = 14 ≥ 7.
Qualifies — the witness set is {0, 9, 23, 30} with gaps
9, 14, 7.Now shift two dates so the middle collapses — days 0, 5, 9, 12, 30:
When apart and within are both supplied,
the two tricks combine. Any qualifying set spans at most
within days, so it must live entirely inside the
within-day window that starts at its own earliest date —
and that earliest date is one of the client’s dates. So:
within days of it (an overlap join —
data.table::foverlaps() locally, a non-equi
left_join() on the database);apart search from the previous section inside
each of these windows;This is also where the flag_at/align
argument gets its meaning: it decides whether the window anchored at a
date extends forward (“left”, the date is the start of the period) or
backward (“right”, the date is the end), which in turn decides which
record carries the flag.
The apart search is naturally a loop: for pass i,
compute window i from the dates that survived window i − 1. R can
write that loop in a few lines, and data.table runs it for
the local method. But restrict_date() also has to run on
database tables, and SQL has no for loop.
The way out: the loop does not have to run in SQL — it only
has to run while writing the SQL. Each pass of the
search is expressible as ordinary ‘dplyr’ verbs on grouped data: window
i is a mutate() that uses min(),
max(), and sum() over the client group. So the
package loops in R at query-building time, manufacturing one
pair of columns per pass with ‘rlang’ metaprogramming:
## simplified from healthdb's internals
for (i in 1:(n %/% 2)) {
win_i <- rlang::sym(paste0("in_win_", i)) # column name for this pass
win_prev <- rlang::sym(paste0("in_win_", i - 1))
sum_i <- rlang::sym(paste0("sum_win_", i))
data <- data %>%
mutate(
# is the date inside this pass's window (computed from the
# dates that were inside the previous pass's window)?
!!win_i := case_when(
between(
date,
min(date[!!win_prev == 1L]) + apart,
max(date[!!win_prev == 1L]) - apart
) ~ 1L,
.default = 0L
),
# are there enough dates left in the window?
!!sum_i := case_when(sum(!!win_i) >= n - i * 2L ~ 1L, .default = 0L)
)
}rlang::sym() turns a generated name like
"in_win_2" into a column reference, and !!
injects it into the mutate(). The loop runs
n %/% 2 times in R, leaving behind a chain of
mutate() calls — a fully unrolled version of the
loop. ‘dbplyr’ then translates each mutate() into a layer
of SQL: the aggregate functions become window functions
(MIN(...) OVER (PARTITION BY clnt_id) and friends), and
each pass becomes a nested subquery.
You can see the generated SQL for yourself:
db <- dbplyr::memdb_frame(
clnt_id = 1,
uid = 1:6,
dates = as.Date("2020-01-01") + c(0, 5, 9, 16, 23, 30)
)
restrict_date(db, clnt_id, dates, n = 4, apart = 7, uid = uid) %>%
show_query()
#> ℹ Apply restriction that each client must have 4 records that were at least 7
#> days apart. Records that met the condition were flagged.
#> <SQL>
#> SELECT
#> `clnt_id`,
#> `uid`,
#> `dates`,
#> COALESCE(`temp_nm_flag_apart`, 0) AS `flag_restrict_date`
#> FROM (
#> SELECT
#> `q01`.*,
#> CASE WHEN ((`final_win_gap` * `sum_win_1`) >= 7) THEN 1 ELSE 0 END AS `temp_nm_flag_apart`
#> FROM (
#> SELECT
#> `q01`.*,
#> MAX(CASE WHEN (`in_win_1` = 1) THEN (`dates`) END) OVER `win1` - MIN(CASE WHEN (`in_win_1` = 1) THEN (`dates`) END) OVER `win1` AS `final_win_gap`
#> FROM (
#> SELECT
#> `q01`.*,
#> CASE
#> WHEN (SUM(`in_win_1`) OVER (PARTITION BY `clnt_id`) >= (4 - 1 * 2)) THEN 1
#> ELSE 0
#> END AS `sum_win_1`
#> FROM (
#> SELECT
#> `dbplyr_utv0lI4zz2`.*,
#> CASE
#> WHEN (`dates` BETWEEN MIN(`dates`) OVER `win1` + 7 AND MAX(`dates`) OVER `win1` - 7) THEN 1
#> ELSE 0
#> END AS `in_win_1`
#> FROM `dbplyr_utv0lI4zz2`
#> WINDOW `win1` AS (PARTITION BY `clnt_id`)
#> ) AS `q01`
#> ) AS `q01`
#> WINDOW `win1` AS (PARTITION BY `clnt_id`)
#> ) AS `q01`
#> ) AS `q01`Reading from the innermost subquery outward, you can follow the
passes of the search: in_win_1 (the first window),
sum_win_1 (enough dates left?), then the final gap check —
exactly the worked example above, written by a loop SQL never sees.
Asking for a larger n simply generates more layers; no R
code changes.
This pattern — looping in R to generate set-based SQL — is worth
knowing beyond this package. Any per-group algorithm with a bounded
number of passes, where each pass is expressible as
mutate()/summarise() over the group, can be
unrolled the same way and pushed to the database instead of downloading
the data.
Because the brute force version is unquestionably correct, it makes a good referee. The package’s tests compare both implementations against it; here is the flavor:
set.seed(44)
referee <- replicate(200, {
x <- sample(seq(as.Date("2020-01-01"), as.Date("2021-12-31"), by = 1), 8)
identical(
if_date(x, n = 3, apart = 30, within = 365),
brute_force(x, n = 3, apart = 30, within = 365)
)
})
all(referee)
#> [1] TRUEThe same checks run for the database method against SQLite, PostgreSQL, and SQL Server, so the SQL that the metaprogramming writes is held to the same standard as the R that is easy to read.
For readers who want to relate the algorithm to established ideas (and for the record of what we could and could not find elsewhere):
The selection problem itself is classical — when solved
sequentially. “Pick n points with pairwise gaps at
least g” has a textbook greedy solution: sort the dates,
take the earliest, then repeatedly take the next date at least
g after the last pick, and check whether n
picks were reached. It is linear after sorting, and its correctness
rests on an exchange argument — the same style of proof used in
the anchoring step above (competitive programmers may recognize the
feasibility check inside the “aggressive cows” /
maximize-the-minimum-gap problem). The catch: this greedy is inherently
sequential — each pick depends on the previous one — which makes it a
poor fit for SQL. Expressing it would require recursion (recursive
common table expressions), which is not portable across database
dialects and not supported by ‘dbplyr’.
The shrinking-window formulation removes the sequential
dependency. Each pass needs only min(),
max(), and a count over the client’s group — all ordinary
window aggregates — and the number of passes is fixed at
n %/% 2 before seeing any data. That is what makes the
algorithm expressible in plain, non-recursive SQL. The nearest
well-known SQL patterns solve different problems: gaps-and-islands
classifies consecutive rows into runs, and
LEAD()/LAG() threshold checks (as used for the
within condition here) also only compare
consecutive rows — but a qualifying apart set is
generally not consecutive. Academic work on threshold queries studies a
related family of “are there at least n rows such that …” questions, but
not this gap-constrained subset selection. We have not found a published
set-based equivalent of the shrinking-window search; pointers are
welcome on the issue
tracker.
The condition itself is everywhere in epidemiology, but implemented sequentially. Validation studies routinely define cases by patterns like “two or more claims at least 30 days apart” — for example, in case-finding algorithms for HIV and chronic dialysis — and the published implementations are typically SAS or Stata programs that scan records row by row. Running the same logic as one set-based query inside the database appears to be uncommon.
Generating SQL from a host language is an old idea; this use of it is not well documented. Query builders and ‘dbplyr’ itself are code generators, and the ‘rlang’ injection mechanics are covered in the ‘dbplyr’ translation vignette and community posts such as this one on passing functions to ‘dbplyr’. What we have not seen documented elsewhere is the pattern used here: running a loop at query-building time to unroll a bounded-pass, per-group algorithm into nested window-function subqueries, as a general way to push iterative logic to the database.