Technology May 01, 2026 · 6 min read

We stopped leaving GitHub to debug test failures. Here's how.


by Tomer Lihovetsky
CI is red. You open the PR. Now what?

You click the failing workflow. You read the logs. You open the trace viewer in a separate tab. You cross-reference the error with the code. You search Slack to see if this happened before. You go back to GitHub to leave a comment.

Every time. For every failure.

The problem isn't that debugging is hard. The problem is that you keep leaving GitHub to do it — even though GitHub is where you make the merge decision.

We built QAI Agent to fix that. This post is about two things that changed how we work: asking QAI questions directly from a PR comment, and getting the exact code fix inline on the PR.

The problem with CI feedback

Your tests fail. The PR comment tells you:

  • 8 failures
  • 3 unique clusters
  • Risk: High

Useful. But it doesn't answer the question every developer actually asks: "Is this worse than last week?"

That question requires history. It requires context across runs. It requires something that remembers.
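As a mental model (QAI's internals aren't public, and these names are mine), "something that remembers" can be as simple as storing each run's failure count and comparing the current PR against the recent baseline:

```typescript
// Illustrative sketch only -- not QAI code. The idea: persist per-PR
// failure counts, then flag when the current PR deviates sharply.
type RunRecord = { pr: number; failures: number };

function regressionSignal(history: RunRecord[], current: RunRecord): string {
  // Average failures across previous runs as the baseline.
  const baseline =
    history.reduce((sum, r) => sum + r.failures, 0) / Math.max(history.length, 1);
  const ratio = baseline > 0 ? current.failures / baseline : Infinity;
  if (ratio >= 2) return `~${ratio.toFixed(1)}x increase vs. recent PRs`;
  return "within the normal range for this repo";
}

// 7 failures on recent PRs, 20 on this one -> a clear signal.
console.log(regressionSignal([{ pr: 101, failures: 7 }], { pr: 102, failures: 20 }));
```

The real system clusters failures by root cause rather than counting raw failures, but the principle is the same: without stored history, the question is unanswerable.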

Ask QAI from the PR

Once QAI is connected to your repo, you can mention @qai-agent in any PR comment and ask anything about your test failures.

Real example — I typed this on a PR with 18 failing tests across 4 frameworks:

@qai-agent What are the key fixes that would resolve roughly 80% of the test failures across all suites?

The response came back instantly:

QAI Agent 🪄

Key Fixes for ~80% of Failures

Fix 1: Login flow missing load state wait (covers ~50% of failures)

All checkout and user-account beforeEach timeouts across all 4 suites trace to the same root cause: login completes but the page isn't fully loaded before navigation/assertions proceed.

Playwright (login() helper):

async function login(page: Page) {
  await page.goto('/login');
  await page.fill('[name="username"]', 'demo');
  await page.fill('[name="password"]', 'demo');
  await page.click('[type="submit"]');
  await page.waitForLoadState('networkidle'); // ADD THIS
}

Selenium Java (@BeforeEach setup):

new WebDriverWait(driver, Duration.ofSeconds(60))
  .until(ExpectedConditions.visibilityOfElementLocated(By.name("username")));
new WebDriverWait(driver, Duration.ofSeconds(60))
  .until(webDriver -> ((JavascriptExecutor) webDriver)
    .executeScript("return document.readyState").equals("complete"));

Selenium Python (setup_checkout fixture):

wait = WebDriverWait(driver, 60)  # increase from 30
wait.until(EC.visibility_of_element_located((By.NAME, "username")))

Fix 2: Inverted/stale assertions in search & cart tests (covers ~15% of failures)

const fullCount = await cards.count();
await page.fill('[data-testid="search"]', 'bamboo');
const filteredCount = await cards.count();
expect(filteredCount).toBeGreaterThan(0);
expect(fullCount).toBeGreaterThan(filteredCount); // was backwards

Fix 3: Empty cart text locator mismatch (covers ~2% of failures)

// Change from:
await $('//*[contains(text(),"Your cart is empty")]');
// To:
await $('//*[contains(text(),"Your cart is empty.")]');

| Fix | Tests resolved | Suites |
| --- | --- | --- |
| Login load state + timeout | ~13 tests | All 4 |
| Inverted/stale assertions | ~4 tests | Playwright, WebdriverIO |
| Cart empty text locator | 1 test | WebdriverIO |

Three fixes. One question. ~18 tests resolved.

It didn't just list what broke — it told us which fix to do first, how many tests each one resolves, and gave the exact code per framework. If you're running Playwright, Selenium Java, Selenium Python, and WebdriverIO on the same PR, the fix isn't the same in each one. QAI knows which frameworks are affected and generates the right code for each.

Ask about history — the question no test runner can answer

The most powerful use case isn't "why did this fail." It's "is this getting worse?"

@qai-agent is there an increase in failures compared to previous PRs?

Yes, there's a dramatic increase in failures compared to previous PRs.
Previous PRs: 7 occurrences of these patterns. This PR: 20 occurrences each — nearly 3x increase.

Root causes:
→ UI_CHANGED failures (4 tests) — 85% confidence
→ TIMING_FLAKE failures (4 tests) — 70% confidence
→ 0% flaky score — consistent, reproducible failures

Verdict: This PR introduced systematic failures. Block merge until UI locator issue and timing problems are resolved.

That's not a test runner. That's a senior QA engineer reviewing your PR.

A single failure is noise. A 3x increase in failures across PRs is a signal. QAI can answer that in seconds because it has the history. Your team doesn't.

Some other questions you can ask:

@qai-agent why is this test failing?
@qai-agent is this flaky or a real regression?
@qai-agent how long has this been broken?
@qai-agent what's the fastest fix for the cart failures?
@qai-agent is this the same failure we saw last week?

Each answer includes historical context, severity classification, confidence score, and a fix suggestion.
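As a rough sketch of what that bundle looks like (the field names here are hypothetical, not QAI's actual API):

```typescript
// Hypothetical shape of an @qai-agent answer -- field names are
// illustrative only, not taken from QAI's public API.
interface QaiAnswer {
  history: string;                        // e.g. "first seen 3 PRs ago"
  severity: "LOW" | "MEDIUM" | "HIGH";    // severity classification
  confidence: number;                     // 0..1 confidence score
  suggestedFix?: string;                  // inline fix when confidence is high
}

const example: QaiAnswer = {
  history: "first seen 3 PRs ago",
  severity: "HIGH",
  confidence: 0.85,
  suggestedFix: "await page.waitForLoadState('networkidle');",
};

console.log(`${example.severity} (${example.confidence * 100}% confidence)`);
```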

The code fix — already on the PR, without asking

The second feature shows up automatically. When QAI analyzes a PR, the comment includes an inline code fix for high-confidence failures. You don't need to ask. It's already there.

For a TEST BUG cluster at 70% confidence:

The test "search narrows results to matching products" has inverted logic on line 23. [View fix →]

test('search narrows results', async ({ page }) => {
  const cards = page.locator('a[href^="/products/"]');
  const initialCount = await cards.count();

  await page.getByPlaceholder(/search/i).fill('bamboo');
  // Wait for the filter to apply before counting.
  await expect(page.getByText(/bamboo/i).first()).toBeVisible({ timeout: 10_000 });

  const filteredCount = await page.locator('a[href^="/products/"]').count();
  expect(filteredCount).toBeGreaterThan(0);
  expect(filteredCount).toBeLessThan(initialCount);

  await page.getByPlaceholder(/search/i).fill('');
  expect(await page.locator('a[href^="/products/"]').count())
    .toBeGreaterThanOrEqual(initialCount);
});

Ready to copy and apply. No dashboard. No trace viewer. No tab switching.

The PR comment also breaks results down by suite:

| Suite | ✅ Pass | ❌ Fail | Total | Pass rate |
| --- | --- | --- | --- | --- |
| Selenium Python | 10 | 4 | 14 | 71% |
| Selenium Java | 9 | 4 | 13 | 69% |
| WebdriverIO | 4 | 1 | 5 | 80% |
| **Total** | 23 | 9 | 32 | 72% |
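The per-suite numbers roll up the way you'd expect from raw pass/fail counts; a quick sanity-check sketch (values copied from the table, the helper is mine, not QAI code):

```typescript
// Recompute the suite breakdown from pass/fail counts.
// Numbers come from the table above; the aggregation is illustrative.
type Suite = { name: string; pass: number; fail: number };

const suites: Suite[] = [
  { name: "Selenium Python", pass: 10, fail: 4 },
  { name: "Selenium Java", pass: 9, fail: 4 },
  { name: "WebdriverIO", pass: 4, fail: 1 },
];

function passRate(pass: number, fail: number): number {
  return Math.round((pass / (pass + fail)) * 100);
}

for (const s of suites) {
  console.log(`${s.name}: ${passRate(s.pass, s.fail)}%`);
}

const totalPass = suites.reduce((a, s) => a + s.pass, 0);
const totalFail = suites.reduce((a, s) => a + s.fail, 0);
console.log(`Total: ${passRate(totalPass, totalFail)}%`); // 23/32 -> 72%
```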

And at the bottom of every comment:

💬 Ask QAI anything about this PR:
Comment @qai-agent <your question> — examples:
• @qai-agent why is this failing?
• @qai-agent is this flaky or a real regression?
• @qai-agent what's the fastest fix?

Why this matters

Most test tools are read-only. You look at them. They don't talk back.

Ask QAI flips this. Instead of navigating to a dashboard, opening a report, filtering by date, comparing runs manually — you just ask. In the same place you're already working. The context stays in the PR. The team sees the answer.

The PR is where you decide whether to merge. That's where the analysis should live.

Setup — two steps

Step 1 — Add the Action to your workflow:

- name: QAI Agent
  uses: useqai/qai-agent@v1
  if: always()
  with:
    junit-path: 'test-results/results.xml'
    qai-url: https://ingest.useqai.dev
    qai-api-key: ${{ secrets.QAI_API_KEY }}
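For context, here's roughly where that step sits in a complete workflow file. Everything except the QAI Agent step is a generic placeholder — your checkout and test commands will differ:

```yaml
# Sketch of a full workflow around the QAI step. The test command and
# paths are placeholders; only the QAI Agent step comes from this post.
name: CI
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npx playwright test   # your test command here
      - name: QAI Agent
        uses: useqai/qai-agent@v1
        if: always()                         # run even when tests fail
        with:
          junit-path: 'test-results/results.xml'
          qai-url: https://ingest.useqai.dev
          qai-api-key: ${{ secrets.QAI_API_KEY }}
```

The `if: always()` matters: without it, GitHub skips the step whenever the test step fails — which is exactly when you want the analysis.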

Step 2 — Install the QAI GitHub App on your repo (required for @qai-agent replies)

Get your free API key at useqai.dev — 30 seconds, no credit card.

Try it before connecting anything

Zero setup: Paste your JUnit XML at useqai.dev/try — no account, no GitHub, no secrets. See exactly what QAI posts on a PR in 30 seconds.

Fork and see: Fork useqai/demo-shop — QAI is already wired up across 4 frameworks. Open a PR, comment @qai-agent, and see it respond.
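If you don't have a results file handy for the zero-setup option, a minimal hand-written JUnit XML like this (my own sample, not QAI output) is enough to see an analysis:

```xml
<!-- Minimal hand-written JUnit XML; any standard <testsuite> report works -->
<testsuite name="checkout" tests="2" failures="1">
  <testcase name="adds item to cart" time="0.42"/>
  <testcase name="completes checkout" time="3.10">
    <failure message="Timeout waiting for [name='username']">
      TimeoutError: locator.fill: Timeout 30000ms exceeded
    </failure>
  </testcase>
</testsuite>
```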

🔧 GitHub Action: useqai/qai-agent
📦 Source: github.com/useqai/qai-agent
📊 Dashboard + Ask QAI: useqai.dev

If you try it and hit any edge cases — unusual JUnit variants, frameworks not listed — open an issue or drop a comment here.

Source

This article was originally published by DEV Community and written by Tomer Lihovetsky.

Read original article on DEV Community