← Back to Work

Support Ticket Triage Classifier

An ML triage system with calibrated risk scoring and confidence-gated routing — early applied ML work from first-year training.

PythonFlaskscikit-learnTF-IDFLogistic Regression

Overview

A customer-support ticket triage system built during early ML training. Given an incoming ticket, the model outputs a category prediction, a calibrated confidence score, and a risk classification — then a separate policy layer decides whether the prediction is acted on automatically or routed to a human agent.

Architecture

The pipeline has two stages:

  1. Classifier — TF-IDF features feed a Logistic Regression model, producing a label plus a calibrated probability. Logistic Regression was chosen specifically because its probability outputs are close to calibrated out of the box, which matters for the next stage.
  2. Policy layer — independent of the classifier. Hard-coded rules check the predicted category and confidence against thresholds. Certain categories (billing disputes, legal/account-termination language) are always routed to a human regardless of model confidence — these rules live in code, not in model weights, so they’re auditable and don’t drift with retraining.
{
  "ticket_id": "...",
  "primary_label": "billing_query",
  "confidence": 0.54,
  "risk_score": 0.71,
  "auto_routed": false,
  "reviewer_note": "Confidence below threshold (0.54 < 0.75)."
}

Dataset

No off-the-shelf labelled dataset matched the requirements, so the training set was built from public customer-support data using conservative heuristic labelling, semantic-voting across multiple signals, and synthetic stress-testing of ambiguous edge cases.

Result

The system intentionally automates a minority of tickets — the ones where confidence is high and the category carries low downstream risk — and routes everything else for human review with full decision context attached (label, confidence, risk score, and the reason for routing).

Notes

Calibration mattered more than raw accuracy here: a confidence threshold is only meaningful if “0.75” actually corresponds to ~75% empirical accuracy. This was verified with reliability diagrams rather than assumed. The same prediction/policy separation pattern — model produces a signal, independent rules decide what happens with it — turned out to be useful again later when building booking-conflict detection for the wellness booking platform.