Salesforce data engineering interview questions lean SQL-heavy with a serious analytics edge: five SQL primitives (subquery + aggregation for Nth-largest values, cohort aggregation for retention, self-join with date arithmetic for consecutive-day activity, window functions with LAG for month-over-month growth, and aggregation with string manipulation for volume reports) plus two Python primitives that test data-engineering fundamentals (hash-table design for atomic transaction state, and closures with higher-order functions for pipeline composition). The framings are enterprise SaaS analytics—salaries, retention, activity cohorts, coupon revenue, warehouse throughput, transaction state, and ETL function composition.
This guide walks through the seven topic clusters Salesforce actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 7-problem Salesforce set (1 easy, 5 medium, 1 hard)—a medium-heavy hub that rewards window-function fluency and cohort-style SQL over algorithm-puzzle Python.
Top Salesforce data engineering interview topics
From the Salesforce data engineering practice set, the seven numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–7) | Why it shows up at Salesforce |
|---|---|---|
| 1 | SQL subqueries for top-N salary queries | Second Largest Salary—MAX(x) WHERE x < (SELECT MAX(x)) or LIMIT 1 OFFSET 1. |
| 2 | SQL cohort aggregation for user retention rate | User Retention Rate—define cohort with MIN(login_date), divide retained over cohort size. |
| 3 | SQL self-join with date arithmetic for consecutive-day activity | Users Active 3 Consecutive Days—triple self-join with t2.date = t1.date + 1 invariants. |
| 4 | SQL window functions for month-over-month growth | Coupon Sales Analysis With MoM Growth—LAG(units) OVER (ORDER BY month) over a DATE_TRUNC partition. |
| 5 | SQL aggregation with string manipulation for volume reports | Warehouse Volume Metrics Report—SPLIT_PART to parse composite keys, then aggregate. |
| 6 | Python hash tables for atomic transaction state | Atomic Transaction Dictionary—dict-as-state with snapshot/commit/rollback. |
| 7 | Python closures and higher-order functions for pipelines | Function Composition Pipeline—compose(f, g)(x) == f(g(x)) plus functools.reduce. |
Enterprise-analytics framing rule: Salesforce's prompts span B2B SaaS analytics—salaries, retention, activity cohorts, coupon revenue, warehouse throughput, transaction state. The interviewer is grading whether you map each business framing to the right primitive: Nth-largest → subquery; retention → cohort + division; consecutive days → self-join + date arithmetic; MoM growth → LAG over a DATE_TRUNC partition; volume report → aggregate over a parsed dimension; transactional state → hash-table snapshot; ETL pipeline → closure-based composition. State the mapping out loud.
1. SQL Subqueries for Top-N Salary Queries
Subqueries for Nth-largest values in SQL for data engineering
"Find the second-largest salary" is the classic SQL subquery interview prompt. The mental model: the second-largest value is the maximum of all values strictly less than the global maximum—a one-line nested aggregate. Three idioms work; choose for clarity and edge-case handling: nested MAX, LIMIT 1 OFFSET N-1, or DENSE_RANK(). Each has subtly different tie semantics.
Pro tip: Watch the empty-set edge case. If the table has only one distinct salary, the "second-largest" is conventionally NULL. The nested MAX form returns NULL automatically; LIMIT/OFFSET returns no row, and your application has to handle "0 rows" vs "1 row with NULL."
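If the consumer needs a concrete value rather than NULL, one option is to wrap the whole expression in COALESCE. A minimal sketch, assuming -1 is an acceptable "no second distinct salary" sentinel for your schema:
SELECT COALESCE(
  (SELECT MAX(salary) FROM employee
   WHERE salary < (SELECT MAX(salary) FROM employee)),
  -1  -- hypothetical sentinel: "no second distinct salary exists"
) AS second_max_salary;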
Why MAX alone fails for second-largest
MAX(salary) returns one value—the global maximum. To get the second-largest, you need values strictly below the maximum. The invariant: second-largest = MAX of the subset where value < MAX(all). Drop < for <= and you'll silently return the global max again on tied top values.
- MAX(salary) alone is the global max—not the answer.
- MAX(salary) WHERE salary < global_max is the answer; the predicate filter is the active ingredient.
- Tied tops: if two employees share the top salary, the strict < filters both out, so the nested MAX still returns the next distinct value down—the same "second distinct" semantics DENSE_RANK gives you; only LIMIT/OFFSET without DISTINCT is tripped up by ties.
- Worked example: salaries [100, 200, 300, 300]. MAX = 300; filter < 300 leaves [100, 200]; MAX of that is 200 → second-largest distinct = 200.
-- WRONG: returns the global max (300)
SELECT MAX(salary) FROM employee;
Subquery pattern: MAX(x) WHERE x < (SELECT MAX(x))
The canonical form for second-largest is a scalar subquery in the WHERE clause. The outer query computes MAX(salary) over the filtered subset; the inner scalar subquery computes the global maximum. The invariant: the scalar subquery returns one value; the outer MAX aggregates the filtered rows.
- Scalar subquery — one row, one column; usable in any expression slot.
- < operator — strict, so tied top rows are excluded.
- Empty-set behavior — if the filtered subset is empty (only one distinct salary), MAX returns NULL.
- Worked example: for [100, 200, 300, 300], the scalar subquery returns 300; the outer MAX(salary) WHERE salary < 300 returns 200.
| step | expression | result |
|---|---|---|
| 1 | (SELECT MAX(salary) FROM employee) | 300 |
| 2 | salary < 300 filter | rows with salary 100, 200 |
| 3 | MAX(salary) over filtered | 200 |
SELECT MAX(salary) AS second_max_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
LIMIT 1 OFFSET N-1 and DENSE_RANK() alternatives
Two alternative idioms cover broader cases. LIMIT 1 OFFSET N-1 on a sorted distinct list returns the Nth value—generalizes cleanly to "third-largest," "fifth-largest," etc. DENSE_RANK() solves the "what counts as second" question explicitly: ties at the top compress into rank 1, so rank 2 is the next distinct value.
- SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1 — Nth-distinct generalization (use OFFSET N-1).
- DENSE_RANK() OVER (ORDER BY salary DESC) = 2 — explicit rank semantics; ties share rank, no gap.
- Empty-set caveat — LIMIT/OFFSET returns 0 rows on insufficient data; the scalar-subquery form returns 1 row with NULL. Choose based on what the application expects.
- Worked example: [100, 200, 300, 300] → DISTINCT becomes [100, 200, 300]; ORDER BY DESC → [300, 200, 100]; LIMIT 1 OFFSET 1 → 200.
-- LIMIT/OFFSET form (returns 0 rows if no second salary)
SELECT salary AS second_max_salary
FROM (SELECT DISTINCT salary FROM employee) d
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
-- DENSE_RANK form (returns 0 rows if no second salary)
SELECT salary AS second_max_salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
FROM employee
) ranked
WHERE rk = 2
LIMIT 1;
Common beginner mistakes
- Using MAX(salary) WHERE salary <= MAX(salary)—returns the global max because <= lets the top tie through.
- Forgetting DISTINCT in the LIMIT/OFFSET form on tied data—returns the same top salary twice.
- Confusing RANK and DENSE_RANK for "second-largest distinct"—RANK skips numbers on ties, DENSE_RANK doesn't.
- Running two separate queries (max, then "exclude max")—slower and verbose; the nested subquery is one round-trip.
- Returning a row count of zero and forgetting to handle it in application code—the SQL works, the consumer breaks.
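To see why the RANK vs. DENSE_RANK confusion bites, here is a small sketch over the same [100, 200, 300, 300] salaries; the column aliases are illustrative:
-- RANK vs DENSE_RANK on tied top salaries
SELECT salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,       -- 1, 1, 3, 4
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk  -- 1, 1, 2, 3
FROM employee;
-- Filtering WHERE rnk = 2 returns 0 rows (the tie skips rank 2);
-- filtering WHERE dense_rnk = 2 returns salary 200, the second distinct value.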
SQL interview question on second-largest salary
Table employee(id INT, salary INT). Return the second-highest distinct salary. If there is no second salary (e.g. only one distinct value), return NULL. Output a single column second_max_salary.
Solution using a scalar subquery with MAX
SELECT MAX(salary) AS second_max_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
Step-by-step trace (input: salaries [100, 200, 300, 300]):
- Inner query first — (SELECT MAX(salary) FROM employee) scans employee, computes MAX = 300. Returns one scalar value.
- Outer WHERE filter — applies salary < 300 to every row of employee. Rows with salary 300 (both of them) are excluded; rows with 100 and 200 survive.
- Outer MAX — aggregates the surviving rows: MAX(100, 200) = 200.
- Final result — single row, single column.
- Empty-set edge case — if every row had salary 300, step 2 leaves zero rows; MAX over an empty set returns NULL, which the outer query emits as the answer (one row, one NULL value).
Output:
| second_max_salary |
|---|
| 200 |
Why this works — concept by concept:
- Scalar subquery — (SELECT MAX(salary) FROM employee) returns exactly one value (the global maximum); usable in any expression slot, including a WHERE predicate.
- Strict less-than filter — WHERE salary < global_max excludes every row at the top tier, leaving values strictly below.
- Outer MAX — over the filtered subset, the maximum is by definition the second-largest distinct value.
- Empty-set semantics — if only one distinct salary exists, the filter leaves zero rows; MAX over an empty set returns NULL — exactly the expected behavior, no special-case code needed.
- Cost — two scans (or one scan plus an aggregate cache) on employee, both O(N).
2. SQL Cohort Aggregation for User Retention Rate
Cohort-style retention with subqueries in SQL for data engineering
User retention is the canonical SaaS KPI: of users who first appeared in week W, what fraction came back in week W+1? The mental model: retention = retained_users / cohort_size. The numerator filters users who logged in during the retention window; the denominator counts the cohort itself. Two aggregates, one ratio—the structure is identical for "next-day retention," "30-day retention," "monthly retention."
Pro tip: Always pin the cohort in a CTE (or scalar subquery) and compute the numerator on top of it. Mixing cohort definition and retention filter into a single WHERE clause with OR produces ambiguous semantics that an interviewer will pick apart.
Define a cohort: MIN(login_date) GROUP BY user_id
A cohort is the set of users who first did something on a specific date or in a specific bucket. The first-login date per user is the per-user MIN(login_date). The invariant: MIN(login_date) GROUP BY user_id defines the cohort key—truncate to a week / month / quarter to bucket.
- MIN(login_date) per user — the cohort definition.
- DATE_TRUNC('week', MIN(login_date)) — bucket users into weekly cohorts.
- HAVING MIN(login_date) >= '2026-04-01' — restrict the cohort window directly in the same query.
- Worked example: three users with first-login dates April 1, April 1, April 2.
| user_id | first_login | week_cohort |
|---|---|---|
| 1 | 2026-04-01 | 2026-03-30 |
| 2 | 2026-04-01 | 2026-03-30 |
| 3 | 2026-04-02 | 2026-03-30 |
All three are in the same week cohort starting Monday 2026-03-30.
SELECT
user_id,
MIN(login_date) AS first_login,
DATE_TRUNC('week', MIN(login_date))::date AS week_cohort
FROM logins
GROUP BY user_id;
Retention formula: retained_users / cohort_size
The retention formula is fraction of cohort users who came back during the retention window. The shape: numerator = COUNT(DISTINCT user_id) matching the retention predicate; denominator = COUNT(DISTINCT user_id) in the cohort. Convert to a percentage with 100.0 * num / denom and NULLIF to guard against an empty cohort.
- Numerator: users in the cohort who logged in during the retention window.
- Denominator: total users in the cohort.
- Boundary semantics: "next-day retention" usually means login_date = first_login + 1; "7-day retention" means login_date BETWEEN first_login + 1 AND first_login + 7, or the equivalent half-open form login_date > first_login AND login_date <= first_login + 7—pick one convention and state it.
- Worked example: 100 users in the cohort, 60 returned the following day → 60% retention.
WITH cohort AS (
SELECT user_id, MIN(login_date) AS first_login
FROM logins
GROUP BY user_id
)
SELECT
ROUND(
100.0 * COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1)
/ NULLIF(COUNT(DISTINCT c.user_id), 0),
2
) AS next_day_retention_pct
FROM cohort c
JOIN logins l ON l.user_id = c.user_id;
Subquery vs. CTE for cohort sizing
You can pin the cohort in a CTE (named, readable, reusable) or a scalar subquery (compact, single-use). The decision rule: CTE for multi-step queries, scalar subquery for one-off filters. The invariant: the cohort definition must be one consistent set everywhere it's referenced—mixing two slightly different definitions in numerator and denominator silently corrupts the retention number.
- CTE form — WITH cohort AS (...) SELECT ... FROM cohort JOIN logins .... Readable; supports multi-step.
- Subquery form — (SELECT user_id FROM logins GROUP BY user_id HAVING MIN(login_date) = '2026-04-01'). Compact for single-use cases.
- Most engines — CTEs and subqueries plan equivalently; pick by readability, not performance.
- Worked example: CTE-based 7-day retention reusing the same cohort set in numerator and denominator.
WITH cohort AS (
SELECT user_id, MIN(login_date) AS first_login
FROM logins
WHERE login_date >= '2026-04-01' AND login_date < '2026-04-08'
GROUP BY user_id
)
SELECT
c.first_login AS cohort_date,
COUNT(DISTINCT c.user_id) AS cohort_size,
COUNT(DISTINCT l.user_id) FILTER (
WHERE l.login_date > c.first_login
AND l.login_date <= c.first_login + INTERVAL '7 days'
) AS retained_7d,
ROUND(
100.0 * COUNT(DISTINCT l.user_id) FILTER (
WHERE l.login_date > c.first_login
AND l.login_date <= c.first_login + INTERVAL '7 days'
) / NULLIF(COUNT(DISTINCT c.user_id), 0),
2
) AS retention_7d_pct
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
GROUP BY c.first_login
ORDER BY c.first_login;
Common beginner mistakes
- Counting COUNT(*) instead of COUNT(DISTINCT user_id)—double-counts users who logged in multiple times in the window.
- Mixing two slightly different cohort definitions between numerator and denominator—silently wrong retention numbers.
- Forgetting NULLIF(denom, 0)—divide-by-zero on empty cohorts.
- Using BETWEEN for "within 7 days" when you want half-open semantics—off-by-one boundary inclusion.
- Counting users who never left as "retained"—the retention predicate must be login_date > first_login, not >=.
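The boundary pitfall in one picture: a hedged sketch that reuses the cohort CTE defined above and PostgreSQL date + integer arithmetic. The two predicates below count different windows:
-- Inclusive BETWEEN starting at first_login: the first-login day itself counts as "retained"
SELECT COUNT(DISTINCT l.user_id) AS retained_loose
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
WHERE l.login_date BETWEEN c.first_login AND c.first_login + 7;

-- Explicit boundaries: exclude day 0, include days 1 through 7
SELECT COUNT(DISTINCT l.user_id) AS retained_7d
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
WHERE l.login_date > c.first_login
  AND l.login_date <= c.first_login + 7;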
SQL interview question on user retention rate
Tables logins(user_id, login_date). Return per-cohort next-day retention rate: for each first-login date, the percentage of users who logged in again the following day. Round to 2 decimals, sort by cohort date.
Solution using cohort CTE and conditional aggregation
WITH cohort AS (
SELECT user_id, MIN(login_date) AS first_login
FROM logins
GROUP BY user_id
)
SELECT
c.first_login AS cohort_date,
COUNT(DISTINCT c.user_id) AS cohort_size,
COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1) AS retained_users,
ROUND(
100.0 * COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1)
/ NULLIF(COUNT(DISTINCT c.user_id), 0),
2
) AS next_day_retention_pct
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
GROUP BY c.first_login
ORDER BY c.first_login;
Step-by-step trace (input: 3 users, login table below):
| user_id | login_date |
|---|---|
| 1 | 2026-04-01 |
| 1 | 2026-04-02 |
| 2 | 2026-04-01 |
| 3 | 2026-04-01 |
| 3 | 2026-04-03 |
- Build the CTE cohort — group by user, take MIN(login_date). Result: (1, 2026-04-01), (2, 2026-04-01), (3, 2026-04-01). Three users, one cohort date.
- Join cohort to logins on user_id. Each cohort row is joined to every login row for that user — user 1 → 2 rows, user 2 → 1 row, user 3 → 2 rows. Total: 5 joined rows.
- GROUP BY c.first_login — collapses to one row per cohort date (2026-04-01).
- Cohort-size aggregate — COUNT(DISTINCT c.user_id) = 3.
- Retained-user aggregate — COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1) counts users who have any joined row with login_date = 2026-04-02. Only user 1 qualifies → 1.
- Percentage — 100.0 * 1 / NULLIF(3, 0) = 33.33. Rounded to 2 decimals.
- Final result — single cohort row.
Output:
| cohort_date | cohort_size | retained_users | next_day_retention_pct |
|---|---|---|---|
| 2026-04-01 | 3 | 1 | 33.33 |
Why this works — concept by concept:
- Cohort CTE — WITH cohort AS (... MIN(login_date) ... GROUP BY user_id) pins each user once with their first-login date; reused as the canonical cohort definition.
- Join back to logins — JOIN logins l ON l.user_id = c.user_id brings every login per user into the row set so the retention predicate can see all return visits.
- GROUP BY + denominator — COUNT(DISTINCT c.user_id) collapsed by c.first_login is the cohort size — distinct so duplicate joined rows don't inflate it.
- Conditional aggregate (numerator) — COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1) counts only users who returned the next day; same DISTINCT discipline.
- NULLIF divide-by-zero guard — NULLIF(denom, 0) turns an empty cohort into NULL instead of raising.
- ROUND(..., 2) — clamps the percentage to 2 decimals, removing float noise.
- Cost — one scan of logins for the CTE + one hash join + one aggregate → O(N + M).
3. SQL Self-Join with Date Arithmetic for Consecutive-Day Activity
Self-joins with date filters for consecutive-day patterns in SQL for data engineering
"Find users active 3 consecutive days" is the canonical self-join + date-arithmetic interview prompt. The mental model: a triple self-join with date constraints t2.date = t1.date + 1 and t3.date = t1.date + 2 returns user-ids that have all three consecutive days in the table. Self-join is the structural primitive; date arithmetic is the constraint glue.
Self-join basics: alias the same table three times
A self-join uses one physical table as multiple logical references. For 3-consecutive-day activity, alias logins three times: t1, t2, t3. The aliases are required to disambiguate columns; the ON clauses define the relationships. The invariant: aliasing makes the consecutive-day claim explicit—t1 is day 1, t2 is day 2, t3 is day 3.
- logins t1, logins t2, logins t3 — three logical aliases on the same physical table.
- Each ON clause carries one relationship constraint: same user, +1 day, +2 days.
- SELECT DISTINCT t1.user_id — deduplicate across multiple matching streaks.
- Worked example: user 42 has logins on April 1, 2, 3.
| t1.user_id | t1.date | t2.date | t3.date |
|---|---|---|---|
| 42 | 2026-04-01 | 2026-04-02 | 2026-04-03 |
The triple self-join produces one row → user 42 qualifies.
SELECT DISTINCT t1.user_id
FROM logins t1
JOIN logins t2 ON t2.user_id = t1.user_id AND t2.date = t1.date + 1
JOIN logins t3 ON t3.user_id = t1.user_id AND t3.date = t1.date + 2;
Consecutive-day invariant: t2.date = t1.date + 1 AND t3.date = t1.date + 2
The structural constraint is the date-arithmetic invariant: the second login is exactly one day after the first; the third is exactly two days after. Both are equality predicates on date columns plus integer offsets. The invariant: strict equality on the offset—= +1, not >= +1—or you'll match non-consecutive streaks.
- t2.date = t1.date + 1 — exact next-day; equality, not inequality.
- t3.date = t1.date + 2 — exact two-days-later; same shape.
- Generalizes to N days: add tN.date = t1.date + N - 1 for an N-day streak.
- Worked example: if user 42 logs in on April 1 and 3 but not April 2, the t2.date = t1.date + 1 constraint fails and the row is dropped.
| user_id | dates present | t2 match | t3 match | qualifies? |
|---|---|---|---|---|
| 42 | Apr 1, 2, 3 | yes (Apr 2) | yes (Apr 3) | yes |
| 99 | Apr 1, 3, 5 | no (no Apr 2) | — | no |
Worked-example solution. The query that encodes the invariant for a 3-day streak — +1 and +2 offsets are the load-bearing predicates:
SELECT DISTINCT t1.user_id
FROM logins t1
JOIN logins t2
ON t2.user_id = t1.user_id
AND t2.login_date = t1.login_date + INTERVAL '1 day' -- exact +1
JOIN logins t3
ON t3.user_id = t1.user_id
AND t3.login_date = t1.login_date + INTERVAL '2 days' -- exact +2
WHERE t1.user_id IN (42, 99);
-- → only user 42 returned; user 99 is dropped because Apr 2 is missing.
Gaps-and-islands alternative with ROW_NUMBER and date subtraction
The classic "gaps and islands" trick uses ROW_NUMBER() partitioned by user, ordered by date; date - ROW_NUMBER is constant inside one consecutive run. Group by (user_id, date - rn) and filter COUNT(*) >= 3. The invariant: for any user, date - ROW_NUMBER OVER (PARTITION BY user_id ORDER BY date) is constant within a single consecutive streak.
- ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) — sequential per user.
- date - rn::int — subtract the row number to get a streak-id.
- GROUP BY user_id, streak_id HAVING COUNT(*) >= 3 — filter streaks of length ≥ 3.
- Performance: one window pass + one aggregate; usually faster than a triple self-join on large tables.
- Worked example: user 42 with April 1, 2, 3, 5.
| date | rn | date − rn | streak id |
|---|---|---|---|
| 2026-04-01 | 1 | 2026-03-31 | A |
| 2026-04-02 | 2 | 2026-03-31 | A |
| 2026-04-03 | 3 | 2026-03-31 | A |
| 2026-04-05 | 4 | 2026-04-01 | B |
Streak A has 3 rows → user 42 qualifies.
WITH numbered AS (
SELECT
user_id,
login_date,
login_date - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date))::int AS streak_id
FROM (SELECT DISTINCT user_id, login_date FROM logins) d
)
SELECT user_id
FROM numbered
GROUP BY user_id, streak_id
HAVING COUNT(*) >= 3;
Common beginner mistakes
- Using >= instead of = in the date-offset constraint—matches non-consecutive streaks (April 1, 5, 10).
- Forgetting t2.user_id = t1.user_id—joins across users, returns nonsense.
- Forgetting DISTINCT on the input or the output—double-counts users with multiple overlapping 3-day streaks.
- Using BETWEEN on the date offset—silently allows zero-day or off-by-one windows.
- For the gaps-and-islands form, forgetting to deduplicate (user_id, login_date) first—duplicate rows inflate ROW_NUMBER and break the date − rn streak constant.
Drill self-join SQL problems →
SQL interview question on three-consecutive-day activity
Table logins(user_id, login_date)—each row is one login event; (user_id, login_date) may appear multiple times. Return distinct user_ids who logged in on three consecutive calendar days at any point in the data. Sort ascending.
Solution using a triple self-join with date arithmetic
SELECT DISTINCT t1.user_id
FROM logins t1
JOIN logins t2
ON t2.user_id = t1.user_id
AND t2.login_date = t1.login_date + INTERVAL '1 day'
JOIN logins t3
ON t3.user_id = t1.user_id
AND t3.login_date = t1.login_date + INTERVAL '2 days'
ORDER BY t1.user_id;
Step-by-step trace (input: logins rows below):
| user_id | login_date |
|---|---|
| 42 | 2026-04-01 |
| 42 | 2026-04-02 |
| 42 | 2026-04-03 |
| 99 | 2026-04-01 |
| 99 | 2026-04-03 |
- Pick t1 candidate — the engine iterates each logins row as t1. Take t1 = (42, 2026-04-01).
- Probe t2 — look for a row with t2.user_id = 42 AND t2.login_date = 2026-04-02. Found → (42, 2026-04-02).
- Probe t3 — look for t3.user_id = 42 AND t3.login_date = 2026-04-03. Found → (42, 2026-04-03). Triple matched — emit t1.user_id = 42.
- Next t1 for user 99 — t1 = (99, 2026-04-01). t2 would need (99, 2026-04-02) — not in table → row dropped.
- t1 = (99, 2026-04-03) — t2 would need (99, 2026-04-04) — not present → dropped. User 99 never qualifies.
- DISTINCT on output — even if user 42 had a longer streak (e.g. 4 consecutive days), multiple t1 starting points would emit duplicate 42s; DISTINCT deduplicates.
- Final result — one row.
Output:
| user_id |
|---|
| 42 |
Why this works — concept by concept:
- Triple alias — logins t1, logins t2, logins t3 give three logical "day 1," "day 2," "day 3" references on one physical table.
- Same-user constraint — t2.user_id = t1.user_id (and t3.user_id = t1.user_id) keep the join within a single user; without it the join cross-products users.
- Strict-equality date offsets — t2.login_date = t1.login_date + INTERVAL '1 day' and t3.login_date = t1.login_date + INTERVAL '2 days' enforce exact consecutiveness; >= would match non-consecutive streaks.
- DISTINCT — deduplicates users who have multiple overlapping 3-day streaks (e.g. 4-day runs produce two starting points).
- ORDER BY t1.user_id — deterministic output ordering for grading.
- Cost — a hash join on (user_id, login_date + 1) plus another on (user_id, login_date + 2); both index-supported, O(N) on a (user_id, login_date) index.
4. SQL Window Functions for Month-over-Month Growth
LAG over DATE_TRUNC partitions for MoM growth in SQL for data engineering
Month-over-month growth is the canonical analytics KPI and the Hard-tier problem in the Salesforce set. The mental model: bucket events by month with DATE_TRUNC, sum the metric per month, use LAG(metric, 1) to pull the previous month's value into the current row, and compute (curr - prev) * 100.0 / prev as percent growth. Three steps, one window function, one query—no self-join required.
Pro tip: LAG returns NULL on the very first row of the partition—there's no previous month for January's growth. Wrap the denominator in NULLIF(prev, 0) so a zero-prior-month doesn't crash the divide; the first-row NULL flows through naturally.
DATE_TRUNC('month', sale_date) for monthly bucketing
DATE_TRUNC('month', ts) truncates a timestamp to the first of its month at 00:00:00. The result is a left-edge timestamp that groups all rows from one month onto a single key. The invariant: DATE_TRUNC('month', ...) returns the first-of-month timestamp—rows in April 2026 all collapse to 2026-04-01 00:00:00.
- DATE_TRUNC('month', sale_date) — primary monthly bucketing.
- DATE_TRUNC('week', ts) — Monday-start in PostgreSQL (ISO 8601).
- DATE_TRUNC('quarter', ts) — Jan / Apr / Jul / Oct quarter starts.
- Worked example: four sales bucketed to month.
| sale_id | sale_date | DATE_TRUNC('month', sale_date) |
|---|---|---|
| 1 | 2026-01-15 | 2026-01-01 |
| 2 | 2026-01-28 | 2026-01-01 |
| 3 | 2026-02-03 | 2026-02-01 |
| 4 | 2026-02-19 | 2026-02-01 |
SELECT
DATE_TRUNC('month', sale_date)::date AS month,
SUM(units) AS total_units
FROM coupon_sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month;
LAG(metric) OVER (ORDER BY month) for prior-period lookup
LAG(expr, n) returns the value of expr from n rows back in the window's ordered sequence. With ORDER BY month, LAG(total_units, 1) returns the previous month's total_units inline on the current row. The invariant: LAG aligns prior-period values onto the current row without a self-join.
- LAG(total_units, 1) OVER (ORDER BY month) — pulls the previous month into the current row.
- LAG(total_units, 12) — generalizes to YoY growth (12 months back).
- First-row behavior — LAG returns NULL when there's no prior row in the partition.
- PARTITION BY product_id — compute MoM independently per product.
- Worked example: four months of unit sales.
| month | total_units | LAG(total_units, 1) |
|---|---|---|
| 2026-01-01 | 1,200 | NULL |
| 2026-02-01 | 1,560 | 1,200 |
| 2026-03-01 | 1,820 | 1,560 |
| 2026-04-01 | 2,210 | 1,820 |
WITH monthly AS (
SELECT
DATE_TRUNC('month', sale_date)::date AS month,
SUM(units) AS total_units
FROM coupon_sales
GROUP BY 1
)
SELECT
month,
total_units,
LAG(total_units, 1) OVER (ORDER BY month) AS prev_month_units
FROM monthly
ORDER BY month;
MoM growth formula: (curr - prev) * 100.0 / prev
The MoM growth formula is (curr − prev) × 100 / prev—straightforward arithmetic, with the same float-division and divide-by-zero guards as percentage calculations. The invariant: 100.0 * (curr - prev) / NULLIF(prev, 0) produces a clean percentage with NULL for the first row and any zero-prior-month.
- 100.0 — float literal forces float division.
- NULLIF(prev, 0) — defends against zero-prior-month divide-by-zero.
- ROUND(..., 2) — clamps decimal noise.
- COALESCE(..., 0) — only if the consumer prefers 0% over NULL for the first month; usually NULL is more honest.
- Worked example: February's MoM = (1560 − 1200) * 100.0 / 1200 = 30.00%.
WITH monthly AS (
SELECT
DATE_TRUNC('month', sale_date)::date AS month,
SUM(units) AS total_units
FROM coupon_sales
GROUP BY 1
),
with_lag AS (
SELECT
month,
total_units,
LAG(total_units, 1) OVER (ORDER BY month) AS prev_units
FROM monthly
)
SELECT
month,
total_units,
prev_units,
ROUND(
100.0 * (total_units - prev_units) / NULLIF(prev_units, 0),
2
) AS mom_growth_pct
FROM with_lag
ORDER BY month;
Common beginner mistakes
- Using LAG without ORDER BY in the OVER clause—the engine raises an error or returns non-deterministic results.
- Self-joining the table to itself by month - 1—works but slow and hard to maintain compared to LAG.
- Forgetting NULLIF(prev, 0) on the divide—divide-by-zero crash on the first month or any zero-units month.
- Forgetting 100.0 for percentage—integer division returns 0 for fractional growth.
- Using LEAD (the opposite direction) when the prompt asks for previous-period growth—LEAD looks forward.
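A quick direction check, as a hedged sketch that reuses the monthly CTE defined earlier: LAG looks backward and is what MoM growth needs; LEAD looks forward.
SELECT
  month,
  total_units,
  LAG(total_units)  OVER (ORDER BY month) AS prev_month_units,  -- backward: previous period (MoM)
  LEAD(total_units) OVER (ORDER BY month) AS next_month_units   -- forward: next period
FROM monthly
ORDER BY month;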
Practice window-function problems →
SQL interview question on month-over-month growth
Table coupon_sales(sale_id, sale_date, units, revenue). Return per-month total units sold, total revenue, and month-over-month growth percentage for both metrics. Round percentages to 2 decimals; sort by month ascending.
Solution using DATE_TRUNC + LAG over an ordered window
WITH monthly AS (
SELECT
DATE_TRUNC('month', sale_date)::date AS month,
SUM(units) AS total_units,
SUM(revenue) AS total_revenue
FROM coupon_sales
GROUP BY DATE_TRUNC('month', sale_date)
),
with_lag AS (
SELECT
month,
total_units,
total_revenue,
LAG(total_units, 1) OVER (ORDER BY month) AS prev_units,
LAG(total_revenue, 1) OVER (ORDER BY month) AS prev_revenue
FROM monthly
)
SELECT
month,
total_units,
total_revenue,
ROUND(
100.0 * (total_units - prev_units) / NULLIF(prev_units, 0),
2
) AS units_mom_pct,
ROUND(
100.0 * (total_revenue - prev_revenue) / NULLIF(prev_revenue, 0),
2
) AS revenue_mom_pct
FROM with_lag
ORDER BY month;
Step-by-step trace (input: 4 sales rows aggregating to 2 months):
| sale_id | sale_date | units | revenue |
|---|---|---|---|
| 1 | 2026-01-15 | 500 | 1000 |
| 2 | 2026-01-28 | 700 | 1400 |
| 3 | 2026-02-03 | 900 | 1800 |
| 4 | 2026-02-19 | 660 | 1320 |
- monthly CTE — DATE_TRUNC('month', sale_date) collapses Jan rows to 2026-01-01, Feb rows to 2026-02-01. SUM(units) GROUP BY month produces (2026-01-01, 1200, 2400) and (2026-02-01, 1560, 3120).
- with_lag CTE — LAG(total_units, 1) OVER (ORDER BY month) reads each row's previous-month value: row 1 has prev_units = NULL (no prior); row 2 has prev_units = 1200. Same for prev_revenue: NULL, then 2400.
- Outer SELECT row 1 (2026-01-01) — prev_units = NULL. The expression 100.0 * (1200 - NULL) / NULLIF(NULL, 0) collapses to NULL. Both growth columns emit NULL for the first month.
- Outer SELECT row 2 (2026-02-01) — 100.0 * (1560 - 1200) / NULLIF(1200, 0) = 100.0 * 360 / 1200 = 30.00. Revenue: 100.0 * (3120 - 2400) / 2400 = 30.00. Rounded to 2 decimals.
- Final output, sorted ascending.
Output:
| month | total_units | total_revenue | units_mom_pct | revenue_mom_pct |
|---|---|---|---|---|
| 2026-01-01 | 1200 | 2400 | NULL | NULL |
| 2026-02-01 | 1560 | 3120 | 30.00 | 30.00 |
Why this works — concept by concept:
- DATE_TRUNC('month', sale_date) — collapses every row of a calendar month onto the first-of-month timestamp, the bucketing key.
- SUM(units) / SUM(revenue) aggregate — one row per month after GROUP BY on the truncated date.
- LAG(metric, 1) OVER (ORDER BY month) — pulls the previous month's value inline onto the current row; no self-join, no second pass over the data.
- MoM growth formula — 100.0 * (curr - prev) / NULLIF(prev, 0); the 100.0 float literal forces float division, NULLIF defends against zero/NULL prior-month values.
- First-row NULL — LAG returns NULL for the first row of the partition; the formula propagates NULL cleanly so January has no MoM number — honest, not zero.
- ROUND(..., 2) — display precision; removes trailing float noise like 30.0000000001.
- Cost — one scan of coupon_sales, one hash aggregate by month, one window pass → O(N + K log K) where K is months.
5. SQL Aggregation with String Manipulation for Volume Reports
Aggregation over parsed dimensions in SQL for data engineering
Real warehouse and operations data often arrives with composite keys packed into a single string column—SKU-WAREHOUSE-CATEGORY or 2026-04-29:WH02:HEAVY. The mental model: parse the composite string into structured columns with SPLIT_PART or SUBSTRING, then aggregate. The string parsing is a one-shot transform; the aggregation is the standard SUM / COUNT / AVG you already know.
Pro tip: Always pin the string-parsing into a CTE (or subquery) before aggregating. Doing both in the same SELECT makes the query unreadable, and a bad split-character mid-string breaks the entire downstream aggregate silently.
SPLIT_PART / SUBSTRING for parsing composite keys
SPLIT_PART(str, delim, n) returns the n-th field of str split by delim. For positional parsing of fixed-format strings, SPLIT_PART is faster and clearer than regex. The invariant: SPLIT_PART is 1-indexed—the first field is n = 1, not n = 0.
- SPLIT_PART('SKU01-WH02-HEAVY', '-', 1) → 'SKU01'.
- SPLIT_PART('SKU01-WH02-HEAVY', '-', 2) → 'WH02'.
- SUBSTRING(s, start, len) — for fixed-width formats where the column always has the same offsets.
- POSITION('-' IN s) — locate a delimiter for variable-width parsing.
- Worked example: parse a composite product code.
| raw_code | part 1 (sku) | part 2 (warehouse) | part 3 (category) |
|---|---|---|---|
| SKU01-WH02-HEAVY | SKU01 | WH02 | HEAVY |
| SKU07-WH01-LIGHT | SKU07 | WH01 | LIGHT |
| SKU03-WH02-HEAVY | SKU03 | WH02 | HEAVY |
SELECT
SPLIT_PART(raw_code, '-', 1) AS sku,
SPLIT_PART(raw_code, '-', 2) AS warehouse,
SPLIT_PART(raw_code, '-', 3) AS category
FROM warehouse_events;
SUM and COUNT over a parsed dimension
Once the string is parsed, aggregation is the standard GROUP BY you already know. The shape is CTE that parses → SELECT that aggregates. The invariant: the grain of the result is determined by the GROUP BY clause; the parsing CTE's job is only to expose the dimension columns.
- GROUP BY warehouse, category — two-level aggregation.
- COUNT(*) — event count per group.
- SUM(units), AVG(units), MIN/MAX(units) — volume metrics.
- Worked example: three events grouped by warehouse + category.
| warehouse | category | event_count | total_units |
|---|---|---|---|
| WH01 | LIGHT | 1 | 50 |
| WH02 | HEAVY | 2 | 220 |
WITH parsed AS (
SELECT
SPLIT_PART(raw_code, '-', 2) AS warehouse,
SPLIT_PART(raw_code, '-', 3) AS category,
units
FROM warehouse_events
)
SELECT
warehouse,
category,
COUNT(*) AS event_count,
SUM(units) AS total_units
FROM parsed
GROUP BY warehouse, category
ORDER BY warehouse, category;
Multi-metric reports with FILTER / CASE WHEN aggregates
Volume reports usually need many metrics in one query: total units, heavy-only units, light-only units, percentage of heavy events. Combine GROUP BY with conditional aggregation. The invariant: conditional aggregation lets you compute multiple per-group metrics in a single pass—each metric is one expression in the SELECT.
- SUM(CASE WHEN category = 'HEAVY' THEN units ELSE 0 END) — portable form.
- SUM(units) FILTER (WHERE category = 'HEAVY') — PostgreSQL/standard SQL form (cleaner).
- COUNT(*) FILTER (WHERE category = 'HEAVY') — count of matching rows.
- 100.0 * heavy_count / total_count — percentage with float division and a NULLIF guard.
- Worked example: multi-metric per warehouse.
| warehouse | total_units | heavy_units | light_units | heavy_pct |
|---|---|---|---|---|
| WH01 | 50 | 0 | 50 | 0.00 |
| WH02 | 220 | 220 | 0 | 100.00 |
WITH parsed AS (
SELECT
SPLIT_PART(raw_code, '-', 2) AS warehouse,
SPLIT_PART(raw_code, '-', 3) AS category,
units
FROM warehouse_events
)
SELECT
warehouse,
SUM(units) AS total_units,
SUM(units) FILTER (WHERE category = 'HEAVY') AS heavy_units,
SUM(units) FILTER (WHERE category = 'LIGHT') AS light_units,
ROUND(
100.0 * COUNT(*) FILTER (WHERE category = 'HEAVY') / NULLIF(COUNT(*), 0),
2
) AS heavy_event_pct
FROM parsed
GROUP BY warehouse
ORDER BY warehouse;
Common beginner mistakes
- Treating SPLIT_PART as zero-indexed—off-by-one bug; it's 1-indexed.
- Inlining SPLIT_PART in GROUP BY directly—works but unreadable; pin it in a CTE.
- Using LIKE '%HEAVY%' instead of an exact match on the parsed dimension—matches LIGHT-HEAVY-COMBO accidentally.
- Forgetting NULLIF(denom, 0) on percentage divides—divide-by-zero on empty groups.
- Aggregating before parsing—loses the dimension you wanted to group by.
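To make the substring-match pitfall concrete, a small sketch; LIGHT-HEAVY-COMBO is a hypothetical code used only to show the false positive:
-- Substring match: also counts rows whose other fields happen to contain 'HEAVY'
SELECT COUNT(*) FROM warehouse_events WHERE raw_code LIKE '%HEAVY%';

-- Exact match on the parsed dimension: only rows whose third field is HEAVY
SELECT COUNT(*) FROM warehouse_events WHERE SPLIT_PART(raw_code, '-', 3) = 'HEAVY';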
SQL interview question on warehouse volume metrics
Table warehouse_events(event_id, raw_code TEXT, units INT, ts TIMESTAMP) where raw_code has the format SKU-WAREHOUSE-CATEGORY (e.g. SKU01-WH02-HEAVY). Return per-warehouse: total events, total units, heavy-units, light-units, and percentage of heavy events. Sort by total_units descending.
Solution using SPLIT_PART parsing and conditional aggregation
WITH parsed AS (
SELECT
SPLIT_PART(raw_code, '-', 2) AS warehouse,
SPLIT_PART(raw_code, '-', 3) AS category,
units
FROM warehouse_events
)
SELECT
warehouse,
COUNT(*) AS total_events,
SUM(units) AS total_units,
SUM(units) FILTER (WHERE category = 'HEAVY') AS heavy_units,
SUM(units) FILTER (WHERE category = 'LIGHT') AS light_units,
ROUND(
100.0 * COUNT(*) FILTER (WHERE category = 'HEAVY') / NULLIF(COUNT(*), 0),
2
) AS heavy_event_pct
FROM parsed
GROUP BY warehouse
ORDER BY total_units DESC, warehouse ASC;
Step-by-step trace (input: 3 events):
| event_id | raw_code | units |
|---|---|---|
| 1 | SKU01-WH02-HEAVY | 100 |
| 2 | SKU07-WH01-LIGHT | 50 |
| 3 | SKU03-WH02-HEAVY | 120 |
- parsed CTE — for each row, SPLIT_PART(raw_code, '-', 2) produces WH02 / WH01 / WH02; SPLIT_PART(raw_code, '-', 3) produces HEAVY / LIGHT / HEAVY. Output: 3 rows of (warehouse, category, units).
- GROUP BY warehouse — collapses rows by warehouse: WH01 has 1 row; WH02 has 2 rows.
- WH02 aggregates — COUNT(*) = 2, SUM(units) = 220, SUM(units) FILTER (WHERE category = 'HEAVY') = 220 (both rows match), SUM(units) FILTER (WHERE category = 'LIGHT') = 0, 100.0 * 2 / NULLIF(2, 0) = 100.00.
- WH01 aggregates — COUNT(*) = 1, SUM(units) = 50, heavy = 0, light = 50, heavy_pct = 100.0 * 0 / 1 = 0.00.
- ORDER BY total_units DESC, warehouse ASC — WH02 (220) precedes WH01 (50).
- Final output — two warehouse rows.
Output:
| warehouse | total_events | total_units | heavy_units | light_units | heavy_event_pct |
|---|---|---|---|---|---|
| WH02 | 2 | 220 | 220 | 0 | 100.00 |
| WH01 | 1 | 50 | 0 | 50 | 0.00 |
Why this works — concept by concept:
- SPLIT_PART(raw_code, '-', N) — 1-indexed positional parsing; turns one composite string into three structured columns (sku, warehouse, category).
- parsed CTE — pins the parsing in one place so the outer query is readable; GROUP BY warehouse operates on a clean structured column.
- COUNT(*) — total event count per warehouse.
- SUM(units) — total units per warehouse, no filter.
- SUM(units) FILTER (WHERE category = 'HEAVY') — conditional sum: only HEAVY rows contribute. SQL-standard FILTER clause.
- COUNT(*) FILTER (WHERE category = 'HEAVY') — conditional count for the percentage numerator.
- 100.0 * num / NULLIF(denom, 0) — float-division percentage with empty-group guard.
- Tie-breaking ORDER BY total_units DESC, warehouse ASC — deterministic when two warehouses share the same total.
6. Python Hash Tables for Atomic Transaction State
Dict-as-state design for atomic transactions in Python for data engineering
Atomic transactions are the canonical state-management interview prompt for DE candidates. The mental model: a dict is a key-value state store with O(1) reads and writes; an "atomic transaction" wraps a sequence of mutations so they either all apply or none do. The implementation is plain Python—no databases, no locks—but the design discipline (snapshot, mutate, commit, rollback) is the data-engineering primitive Salesforce is grading.
Pro tip: The simplest atomic-transaction design uses dict.copy() to snapshot before mutating; on commit, the working copy replaces the canonical store; on rollback, the working copy is discarded. copy() is shallow—if your values are themselves containers, prefer copy.deepcopy or design the values to be immutable.
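A minimal illustration of that shallow-copy caveat; the nested-balance values here are hypothetical, chosen only to make the shared reference visible:
import copy

state = {"acct_a": {"balance": 100}}   # values are themselves dicts
shallow = state.copy()                 # new outer dict, same inner objects
shallow["acct_a"]["balance"] -= 30     # mutates the shared inner dict
print(state["acct_a"]["balance"])      # 70: the "snapshot" leaked the change

state = {"acct_a": {"balance": 100}}
deep = copy.deepcopy(state)            # independent copies all the way down
deep["acct_a"]["balance"] -= 30
print(state["acct_a"]["balance"])      # 100: canonical state untouched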
Dict as a key-value store with O(1) reads / writes
A Python dict is the canonical hash-table KV store. Reads (d[k]) and writes (d[k] = v) are O(1) average, O(N) worst-case (rare). The invariant: dict operations are O(1) amortized; never reach for a list-of-tuples KV store unless the keys aren't hashable.
- d[k] = v — set; O(1) average.
- d.get(k, default) — read with default; safer than d[k] for missing keys.
- del d[k] — delete; O(1) average.
- k in d — membership; O(1) average.
- Worked example: simple state store.
| operation | dict state |
|---|---|
| start | {} |
| d['acct_a'] = 100 | {'acct_a': 100} |
| d['acct_b'] = 50 | {'acct_a': 100, 'acct_b': 50} |
| d['acct_a'] = 90 | {'acct_a': 90, 'acct_b': 50} |
state: dict[str, int] = {}
state['acct_a'] = 100
state['acct_b'] = 50
state['acct_a'] = 90
print(state) # {'acct_a': 90, 'acct_b': 50}
Atomic update pattern: snapshot → mutate → commit
The atomic-update pattern has three phases. Snapshot copies the current state into a working buffer. Mutate applies all changes to the buffer. Commit swaps the buffer back into place; all mutations land together. The invariant: no caller observes a partial mutation—either zero changes are visible or all of them are.
- working = state.copy() — shallow snapshot.
- working[k] = new_value — apply each mutation to the copy, not the canonical store.
- state.clear(); state.update(working) — atomic commit (or assign self._state = working).
- raise mid-mutation — the canonical state is untouched; the buffer is garbage-collected.
- Worked example: transfer $20 from account A to account B atomically.
| step | canonical state | working buffer |
|---|---|---|
| start | {'a': 100, 'b': 50} | — |
| snapshot | {'a': 100, 'b': 50} | {'a': 100, 'b': 50} |
| mutate (A−20) | {'a': 100, 'b': 50} | {'a': 80, 'b': 50} |
| mutate (B+20) | {'a': 100, 'b': 50} | {'a': 80, 'b': 70} |
| commit | {'a': 80, 'b': 70} | discarded |
def transfer_atomic(state: dict[str, int], src: str, dst: str, amt: int) -> dict[str, int]:
    working = state.copy()
    working[src] -= amt
    if working[src] < 0:
        raise ValueError("insufficient funds")
    working[dst] = working.get(dst, 0) + amt
    return working
Rollback via dict.copy() or a backup buffer
The flip side of commit is rollback: if a mid-transaction check fails, discard the buffer and leave the canonical state alone. Two common designs: a per-transaction buffer (snapshot inside the operation; raise on failure; the buffer is GC'd) or a rollback log (record each mutation in a list; on failure, replay in reverse). The invariant: rollback never partially-mutates the canonical store.
- Per-transaction buffer — simplest; buffer is local; raise to abort.
- Rollback log — needed when mutations are large or the buffer doesn't fit in memory; record (key, old_value) pairs.
- Context-manager pattern — with txn(state): enters / exits cleanly; the __exit__ decides commit vs rollback based on exception status (a sketch follows the worked example below).
- Worked example: failing transfer on insufficient funds preserves the canonical state.
class AtomicDict:
    def __init__(self, initial: dict | None = None) -> None:
        self._state: dict = dict(initial or {})

    def transaction(self, mutations: list[tuple[str, int]]) -> None:
        working = self._state.copy()
        try:
            for key, delta in mutations:
                working[key] = working.get(key, 0) + delta
                if working[key] < 0:
                    raise ValueError(f"{key} would go negative")
        except Exception:
            return  # rollback: discard `working`, leave self._state alone
        self._state = working  # commit: swap in working buffer

    def get(self, key: str, default=None):
        return self._state.get(key, default)

    def snapshot(self) -> dict:
        return self._state.copy()
Worked-example solution. Driving the class with a successful and a failing transaction; the canonical state is preserved across the failure:
ad = AtomicDict({'a': 100, 'b': 50})
ad.transaction([('a', -30), ('b', 30)]) # commits → {'a': 70, 'b': 80}
print(ad.snapshot()) # {'a': 70, 'b': 80}
ad.transaction([('a', -200), ('b', 200)]) # raises mid-loop → rollback
print(ad.snapshot()) # still {'a': 70, 'b': 80} — untouched
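The context-manager variant mentioned earlier isn't required by the Salesforce prompt, but a hedged sketch of it looks like this; txn is a hypothetical helper that commits on clean exit and rolls back on any exception:
from contextlib import contextmanager

@contextmanager
def txn(store: dict):
    working = dict(store)      # snapshot
    yield working              # caller mutates the working copy; an exception skips the commit
    store.clear()
    store.update(working)      # commit: swap all mutations in together

state = {"a": 100, "b": 50}
with txn(state) as w:
    w["a"] -= 30
    w["b"] += 30
print(state)  # {'a': 70, 'b': 80}: committed

try:
    with txn(state) as w:
        w["a"] -= 200
        if w["a"] < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass
print(state)  # {'a': 70, 'b': 80}: rolled back, untouched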
Common beginner mistakes
- Mutating the canonical state in-place and trying to "undo on failure"—rollback logic is fragile and easy to get wrong.
- Using working = state (reference, not copy) and mutating working—mutates the canonical state in place, defeating the snapshot.
- Using state.copy() on a dict whose values are mutable (lists, nested dicts)—shallow copy shares the inner objects.
- Returning the buffer instead of swapping it into the canonical reference—commit silently fails.
- Not handling missing keys—indexing with working[key] instead of working.get(key, 0) raises KeyError mid-transaction with no rollback.
See more hash-table design problems →
Python interview question on atomic transactions
Implement AtomicDict with set(key, value), get(key, default=None), and transaction(updates: list[tuple[str, int]]). The transaction method applies all (key, delta) pairs atomically—if any update would make a value negative, the entire transaction rolls back. After a failed transaction, the state must be unchanged.
Solution using a snapshot-and-commit dict
class AtomicDict:
    def __init__(self, initial: dict[str, int] | None = None) -> None:
        self._state: dict[str, int] = dict(initial or {})

    def set(self, key: str, value: int) -> None:
        self._state[key] = value

    def get(self, key: str, default: int | None = None) -> int | None:
        return self._state.get(key, default)

    def transaction(self, updates: list[tuple[str, int]]) -> bool:
        working = self._state.copy()
        for key, delta in updates:
            working[key] = working.get(key, 0) + delta
            if working[key] < 0:
                return False  # rollback: discard `working`, leave self._state alone
        self._state = working  # commit: atomic swap
        return True
Step-by-step trace (initial state {'a': 100, 'b': 50}; call transaction([('a', -30), ('b', 30)]) then a failing transaction([('a', -200), ('b', 200)])):
- First transaction call — working = {'a': 100, 'b': 50}.copy() → independent dict.
- Apply ('a', -30) — working['a'] = 100 + (-30) = 70. Not negative → continue.
- Apply ('b', 30) — working['b'] = 50 + 30 = 80. Not negative → continue.
- Loop ends successfully — self._state = working swaps in {'a': 70, 'b': 80}. Returns True. Canonical state is now {'a': 70, 'b': 80}.
- Second transaction call — working = {'a': 70, 'b': 80}.copy().
- Apply ('a', -200) — working['a'] = 70 + (-200) = -130. Negative → return False immediately.
- Rollback — the function returns before reaching self._state = working. The local working dict is garbage-collected. self._state is still {'a': 70, 'b': 80} — completely untouched.
- Verify — get('a') == 70, get('b') == 80. The failed second transaction left zero side effects.
Output — observable state after each call:
| call | return value | get('a') | get('b') | _state after |
|---|---|---|---|---|
| transaction([('a', -30), ('b', 30)]) | True | 70 | 80 | {'a': 70, 'b': 80} (committed) |
| transaction([('a', -200), ('b', 200)]) | False | 70 | 80 | {'a': 70, 'b': 80} (rolled back, unchanged) |
Why this works — concept by concept:
- Snapshot — working = self._state.copy() creates an independent shallow copy; mutations to working cannot leak into self._state.
- Buffer mutation — the for loop applies each (key, delta) to working, never to the canonical store.
- Validation predicate — if working[key] < 0 rejects negative balances mid-loop; no further updates are applied.
- Rollback by early return — return False exits before the commit line; the local working dict is garbage-collected, leaving self._state untouched.
- Atomic commit — self._state = working is a single reference reassignment — all mutations land together, no partial-update window observable to other callers.
- Cost — one dict.copy() (O(N) where N is current keys) plus O(M) updates; the swap is O(1) reference reassignment.
7. Python Closures and Higher-Order Functions for Pipelines
Function composition for ETL pipelines in Python for data engineering
Function composition is the canonical functional-programming primitive for ETL pipelines: chain transformations so the output of one becomes the input of the next. The mental model: a closure is a function that captures values from its enclosing scope; composition is f(g(x)) packaged as a new function. Together they let you express pipeline = compose(load, transform, validate, persist) in one readable line.
Pro tip: Watch composition order. compose(f, g)(x) is conventionally f(g(x))—the rightmost runs first. Frameworks like Toolz and Funcy follow this convention; Spark / pandas method chaining flows the other direction (leftmost runs first). State the order out loud in interviews so the reviewer knows you noticed.
Closure basics: a function that captures its enclosing scope
A closure is a function that references variables from its enclosing scope after that scope has exited. The invariant: closures capture variables by reference, not by value—a late-binding subtlety that bites in lambda factories.
- Inner function referencing an outer-function variable.
- Captured by reference — if the outer variable is reassigned, all captured closures see the new value.
- Common factory pattern: def make_adder(n): return lambda x: x + n → reusable per-n adders.
- Worked example: add_5 = make_adder(5); add_5(10) == 15.
def make_adder(n: int):
    def add(x: int) -> int:
        return x + n
    return add

add_5 = make_adder(5)
add_10 = make_adder(10)
print(add_5(3))   # 8
print(add_10(3))  # 13
A subtle gotcha: the late-binding capture means the loop-variable trap is real:
fns = [lambda x: x + i for i in range(3)]
print([f(0) for f in fns]) # [2, 2, 2] — all bound to final i
fns = [lambda x, i=i: x + i for i in range(3)]
print([f(0) for f in fns]) # [0, 1, 2] — default-arg fix
compose(f, g)(x) == f(g(x)) — the composition contract
The composition contract is precise: compose(f, g) returns a new function that, when called with x, returns f(g(x)). The invariant: compose is right-to-left; g runs first, then f consumes its output. Two-function composition is the building block for n-way pipelines.
- def compose(f, g): return lambda x: f(g(x)) — minimal definition.
- Type signature: compose(B → C, A → B) → (A → C) — the inner function's output type must match the outer's input type.
- Edge case: compose(f) on a single function should equal f (identity behavior).
- Worked example: inc = lambda x: x + 1; double = lambda x: x * 2; inc_then_double = compose(double, inc); inc_then_double(3) == 8 ((3+1)*2).
| step | expression | value |
|---|---|---|
| 1 | g(3) = inc(3) | 4 |
| 2 | f(g(3)) = double(4) | 8 |
def compose(f, g):
    return lambda x: f(g(x))
inc = lambda x: x + 1
double = lambda x: x * 2
inc_then_double = compose(double, inc)
print(inc_then_double(3)) # 8 — (3+1)*2
functools.reduce for n-way pipelines
For an arbitrary number of functions, functools.reduce collapses a list with a binary operation. The invariant: reduce(compose, [f1, f2, ..., fN]) is the right-to-left composition of all N functions. Reverse the list if your pipeline reads left-to-right ("load → clean → validate → save").
- reduce(compose, fns) — right-to-left composition of N functions.
- reduce(lambda f, g: lambda x: g(f(x)), fns) — left-to-right composition.
- reduce(compose, fns, identity) — start with an identity function for empty pipelines.
- functools.partial(f, k=v) — bind a keyword arg, useful when one stage needs configuration (sketched after the pipeline example below).
- Worked example: a three-stage ETL pipeline—parse_csv → drop_nulls → normalize_email.
from functools import reduce

def compose(f, g):
    return lambda x: f(g(x))

def pipeline(*fns):
    """Returns a function that runs left-to-right: pipeline(f, g, h)(x) == h(g(f(x)))."""
    return reduce(lambda f, g: lambda x: g(f(x)), fns, lambda x: x)

# Stages
def parse_csv(raw: str) -> list[dict]:
    return [dict(zip(("name", "email"), line.split(","))) for line in raw.splitlines() if line]

def drop_nulls(rows: list[dict]) -> list[dict]:
    return [r for r in rows if r.get("email")]

def normalize_email(rows: list[dict]) -> list[dict]:
    return [{**r, "email": r["email"].strip().lower()} for r in rows]

run = pipeline(parse_csv, drop_nulls, normalize_email)
output = run("Alice,Alice@Example.com\nBob,\nCarol,Carol@example.com")
# [{'name': 'Alice', 'email': 'alice@example.com'}, {'name': 'Carol', 'email': 'carol@example.com'}]
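When a stage needs configuration, functools.partial binds it without changing the pipeline's one-argument shape. A hedged sketch reusing the stages above; filter_domain is a hypothetical configurable stage:
from functools import partial

def filter_domain(rows: list[dict], domain: str) -> list[dict]:
    # hypothetical configurable stage: keep only rows whose email ends with the given domain
    return [r for r in rows if r["email"].endswith("@" + domain)]

run = pipeline(parse_csv, drop_nulls, normalize_email, partial(filter_domain, domain="example.com"))
print(run("Alice,Alice@Example.com\nBob,bob@other.org"))
# [{'name': 'Alice', 'email': 'alice@example.com'}]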
Common beginner mistakes
- Using late-binding capture in a
lambdaloop—all closures share the final loop value (the classic[lambda: i for i in range(3)]bug). - Confusing right-to-left and left-to-right composition direction—pipeline reads backward.
- Mutating the input data inside a stage—breaks the functional contract; later replays produce different output.
- Forgetting an identity seed for an empty
reduce—TypeError: reduce() of empty sequence. - Reaching for class-based pipelines (
PipelineStepsubclasses) when a list of functions is enough—over-engineering.
Python interview question on function composition
Implement pipeline(*fns) that returns a single callable running the input functions left-to-right on a value: pipeline(f, g, h)(x) == h(g(f(x))). An empty pipeline() should return the identity function. Use functools.reduce.
Solution using closures and functools.reduce
from functools import reduce
from typing import Callable, Any
def pipeline(*fns: Callable[[Any], Any]) -> Callable[[Any], Any]:
    if not fns:
        return lambda x: x
    return reduce(lambda f, g: lambda x: g(f(x)), fns)
Step-by-step trace (input: pipeline(inc, double, square)(2) where inc = +1, double = *2, square = **2):
- fns = (inc, double, square) — non-empty, so skip the empty-pipeline guard.
- reduce iteration 1 — the combiner takes f = inc, g = double, returns closure c1 = lambda x: double(inc(x)). Now the accumulator is c1.
- reduce iteration 2 — the combiner takes f = c1, g = square, returns closure c2 = lambda x: square(c1(x)). The accumulator is now c2. Loop ends.
- pipeline(inc, double, square) returns c2 — a single callable that, when invoked, runs all three stages left-to-right.
- Call c2(2) — evaluates square(c1(2)).
- Inner c1(2) — evaluates double(inc(2)) = double(3) = 6.
- Outer square(6) — 6 ** 2 = 36.
- Final result — 36. The composition order matched left-to-right: 2 → inc → 3 → double → 6 → square → 36.
- Empty case — pipeline() returns lambda x: x; calling pipeline()(42) returns 42 unchanged.
Output — value flow stage by stage:
| stage | input | function | output |
|---|---|---|---|
| 1 | 2 | inc | 3 |
| 2 | 3 | double | 6 |
| 3 | 6 | square | 36 |
| (empty) | 42 | identity | 42 |
Why this works — concept by concept:
- functools.reduce — walks the function tuple from left to right, threading an accumulator through a binary combiner lambda.
- Combiner closure — lambda f, g: lambda x: g(f(x)) returns a new function each iteration; the new function captures f (the so-far composed pipeline) and g (the next stage).
- Left-to-right ordering — inside the inner closure, f(x) runs first and g consumes its output, so stages execute in the order they were passed in.
- Fresh closure bindings — every reduce iteration produces a brand-new closure, so the classic loop-variable late-binding bug ([lambda: i for i in range(N)]) doesn't apply.
- Empty-pipeline guard — if not fns: return lambda x: x short-circuits to the identity function, so pipeline()(42) == 42.
- Cost — O(N) function applications per call where N is the number of stages; build cost is also O(N) (one closure per stage).
Tips to crack Salesforce data engineering interviews
These are habits that move the needle in real Salesforce loops—not a re-statement of the topics above.
SQL preparation
Salesforce tilts SQL-heavy. Drill these patterns until they're muscle memory:
- Subquery + aggregation — Nth-largest values via scalar subquery in WHERE.
- Cohort + division — retention KPIs via a MIN(login_date) cohort + conditional aggregate over the cohort size.
- Triple self-join + date arithmetic — consecutive-day activity (t2.date = t1.date + 1).
- LAG over DATE_TRUNC partitions — month-over-month growth without a self-join.
- SPLIT_PART + conditional aggregation — volume reports parsed out of composite-key strings.
- Percentage shape — 100.0 * num / NULLIF(denom, 0) shows up in every section; memorize it.
Topic-page drills: subquery, cohort analysis, self-join, window functions, date functions.
Python preparation
Salesforce's two Python problems are data-engineering primitives, not algorithm puzzles:
- Hash-table design — atomic transactions with snapshot / commit / rollback. Drill state-management patterns, not LeetCode trees.
- Closures + functools.reduce — ETL pipeline composition. Drill functional Python over class hierarchies.
- dict.copy() discipline — shallow vs. deep copy; when shared references bite.
- Late-binding capture — the lambda loop-variable trap and the default-arg fix.
Entry points: hash table, closures, Salesforce Python practice.
Window-function fluency
Salesforce's only Hard-tier curated problem is the MoM growth window-function query. Get fluent with the family:
- LAG(metric, n) OVER (ORDER BY month) — previous-period lookup; n = 1 for MoM, n = 12 for YoY.
- LEAD(metric, n) — same idea, the other direction; useful for "next period" comparisons.
- Partition vs. global windows — PARTITION BY product_id for per-product MoM; no partition for global trend.
- MoM formula — 100.0 * (curr - prev) / NULLIF(prev, 0).
- ROW_NUMBER — top-K-per-group filtered via a CTE wrapper (sketched below).
- SUM() OVER (ORDER BY ts ROWS BETWEEN ...) — running totals and rolling windows (sketched below).
Window functions return the most ROI per hour for Salesforce-style interviews.
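The two family members not shown elsewhere in this guide look roughly like this; monthly_product_sales and its columns are hypothetical, and the rolling total reuses coupon_sales:
-- Top-3 products per month by revenue (hypothetical monthly_product_sales table)
WITH ranked AS (
  SELECT
    month,
    product_id,
    revenue,
    ROW_NUMBER() OVER (PARTITION BY month ORDER BY revenue DESC) AS rn
  FROM monthly_product_sales
)
SELECT month, product_id, revenue
FROM ranked
WHERE rn <= 3;

-- 7-row rolling total of units, ordered by sale date
SELECT
  sale_date,
  units,
  SUM(units) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_units
FROM coupon_sales;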
Enterprise SaaS-analytics framing
Salesforce's prompts are SaaS-analytics flavored: salaries (HR analytics), retention (product analytics), activity cohorts (engagement analytics), coupon revenue (revenue analytics), warehouse throughput (operations analytics), transaction state (data-platform internals), function pipelines (ETL platform). State the mapping out loud: "this is Nth-largest via subquery"; "this is cohort retention with the standard retained / cohort_size"; "this is MoM growth via LAG"; "this is dict-based transaction state."
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Salesforce practice set | /explore/practice/company/salesforce |
| Salesforce SQL practice | /explore/practice/company/salesforce/sql |
| Salesforce Python practice | /explore/practice/company/salesforce/python |
| Subquery | /explore/practice/topic/subquery |
| Cohort analysis | /explore/practice/topic/cohort-analysis |
| Self-join | /explore/practice/topic/self-join |
| Window functions | /explore/practice/topic/window-functions |
| Date functions | /explore/practice/topic/date-functions |
| Hash table | /explore/practice/topic/hash-table |
| Closures | /explore/practice/topic/closures |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
Three things to say out loud before, during, and after each query — in this order:
- Assumptions — before typing. "I'll assume logins(user_id, login_date) may have duplicates"; "I'll assume coupon_sales revenue is in a single currency"; "I'll assume the warehouse-code format is a fixed three fields."
- Invariants — after key code blocks. "LAG returns NULL on the first row by design"; "SPLIT_PART is 1-indexed"; "DISTINCT prevents duplicate-streak inflation."
- Complexity — at the end. "O(N) for the cohort scan + O(N + M) for the retention join"; "one window pass, O(N + K log K)."
Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Salesforce data engineering interview process like?
The Salesforce data engineering interview typically includes a phone screen (mostly SQL warm-up around aggregation and subqueries), one or two coding rounds focused on SQL window functions, cohort retention, and self-joins, plus a Python round on data-engineering primitives (state-tracking dictionaries and functional pipelines), a system-design conversation around Data Cloud / lakehouse architecture, and behavioral interviews. The curated 7-problem Salesforce practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Salesforce test for data engineers?
Salesforce emphasizes subqueries for Nth-largest queries, cohort aggregation for retention KPIs, self-joins with date arithmetic for consecutive-day activity, window functions (especially LAG over DATE_TRUNC) for MoM growth, and aggregation with string manipulation for volume reports. Drill these on the subquery, cohort analysis, self-join, window functions, and aggregation topic pages.
How important is Python for a Salesforce data engineering interview?
Python is roughly 30% of the technical interview at Salesforce—lighter than SQL but not skippable. The patterns are DE primitives, not algorithm puzzles: hash-table design for atomic transaction state, closures + functools.reduce for pipeline composition. The hash table and closures topic pages plus Salesforce Python practice are the right drilling targets.
How hard are Salesforce data engineering interview questions?
Salesforce's curated set has 1 easy + 5 medium + 1 hard = a medium-heavy hub. The Hard-tier problem is window-function MoM growth—get fluent with LAG over DATE_TRUNC and you've handled the ceiling. Most candidates underestimate window functions; that's where Salesforce loops separate "good SQL" from "production SQL."
Are window functions and MoM growth common in Salesforce interviews?
Yes—LAG over DATE_TRUNC for MoM growth is the Hard-tier curated problem and the single highest-ROI prep target for Salesforce. Get comfortable computing (curr - prev) * 100.0 / NULLIF(prev, 0) percentages over both unpartitioned and partitioned windows. The same pattern applies to YoY (LAG(metric, 12)), week-over-week, and rolling averages—drill the family, not just the one query.
How many Salesforce practice problems should I solve before the interview?
Aim for 20–30 problems spanning all seven topic clusters above—not 100 of the same kind. Solve every problem in the Salesforce-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide. Window functions and cohort retention are the topics worth over-investing in for Salesforce specifically.
Start practicing Salesforce data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Salesforce problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Salesforce asks—without the noise of generic algorithm prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.
This article was originally published by DEV Community and written by Gowtham Potureddi.
Read original article on DEV Community

