RLHF (Reinforcement Learning From Human Feedback)
- After the model is prompted with a CX or SDR scenario, we have it generate several output variations so humans can compare different reasoning paths, tones, and action steps.
- Reviewers rank the responses on correctness, alignment with domain-specific information, product/service knowledge, escalation rules, tone guidelines, and whether they avoid hallucinations or errors.
- The human-ranked outputs are then fed into a reward model that mathematically learns what “better” versus “worse” looks like for the domain-specific knowledge set (e.g., rewarding proper qualification and penalizing "creative" pricing or erroneous troubleshooting); a minimal sketch of this step follows the list.
- The base model is then optimized to produce responses that align with the reward model, which improves accuracy, reduces hallucinations, and ensures future outputs better match real-world CX/SDR needs; the second sketch below illustrates the KL-penalized reward commonly used in this step.
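Below is a minimal sketch of how reviewer rankings could feed the reward model, assuming a PyTorch setup: each ranking is reduced to (chosen, rejected) pairs, and the model is trained so the response ranked higher by reviewers scores higher. The embedding size, network shape, and random tensors are placeholders, not our production pipeline.

```python
# Minimal reward-model sketch (assumptions: PyTorch, pre-computed text
# embeddings, and toy data; not our production CX/SDR pipeline).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a (prompt, response) embedding; higher = better per reviewer rankings."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per example

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the response reviewers ranked higher
    # ("chosen") should receive a larger reward than the one ranked lower.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Preference pairs derived from reviewer rankings (random stand-ins here).
embed_dim = 768
chosen_emb = torch.randn(32, embed_dim)    # embeddings of higher-ranked responses
rejected_emb = torch.randn(32, embed_dim)  # embeddings of lower-ranked responses

reward_model = RewardModel(embed_dim)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

for _ in range(3):  # a few illustrative training steps
    loss = pairwise_loss(reward_model(chosen_emb), reward_model(rejected_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the scorer would typically sit on top of the same transformer backbone as the base model; the pairwise loss is the part that encodes “better” versus “worse”.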

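And a rough illustration of the optimization step. RLHF setups commonly shape the reward-model score with a KL penalty against the frozen base model so the tuned policy cannot chase reward into unnatural replies; the coefficient and log-probabilities below are illustrative stand-ins, and a real run would hand these shaped rewards to a PPO-style trainer rather than this toy function.

```python
# Sketch of the KL-penalized reward used when tuning the base model against
# the reward model (all numbers below are illustrative placeholders).
import torch

def shaped_reward(
    reward_score: torch.Tensor,       # reward-model score for each sampled response
    logprob_policy: torch.Tensor,     # log-prob of the response under the tuned policy
    logprob_reference: torch.Tensor,  # log-prob under the frozen base (reference) model
    kl_coef: float = 0.1,             # assumed penalty strength; tuned in practice
) -> torch.Tensor:
    # Penalize drifting far from the reference model so the policy does not
    # chase reward-model score at the cost of unnatural or off-brand replies.
    kl_penalty = logprob_policy - logprob_reference
    return reward_score - kl_coef * kl_penalty

# Example: three sampled SDR replies with toy reward scores and log-probs.
scores = torch.tensor([1.8, 0.4, -0.2])
lp_policy = torch.tensor([-12.0, -15.5, -9.0])
lp_reference = torch.tensor([-13.0, -15.0, -10.5])
print(shaped_reward(scores, lp_policy, lp_reference))
# A PPO-style trainer would use these shaped rewards as the optimization
# signal when updating the base model's weights.
```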