Using LLMs as critics for subjective tests against copy

Looking for advice here. I am in the middle of testing a build that creates, plans, and drafts social media posts. Where I am having trouble is with the critic agent. Its sole job is to evaluate the drafts against the standards and mark each post hard_fail, soft_fail, or pass (looping failures back to the drafter 2x). Everything that is deterministic has been split out and is handled in code only; the judgment calls that are left to the LLM are what's killing me. I feel like I'm chasing ghosts. When I edit either the prompt or the wording of the check descriptions to resolve a check that flips 50/50, a new check starts to flip 50/50.
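For context, the critic loop described above looks roughly like the sketch below. This is only an illustration: `Verdict`, `Review`, `run_pipeline`, and the `drafter` / `critic` callables are placeholder names rather than the actual build, and it assumes "2x" means at most two redrafts per post.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Verdict(Enum):
    PASS = "pass"
    SOFT_FAIL = "soft_fail"
    HARD_FAIL = "hard_fail"


@dataclass
class Review:
    """Result of one critic run (hypothetical shape)."""
    verdict: Verdict
    warnings: List[str] = field(default_factory=list)


# Assumption: a failed draft goes back to the drafter at most twice.
MAX_REDRAFTS = 2

Drafter = Callable[[str, Optional[Review]], str]
Critic = Callable[[str], Review]


def run_pipeline(brief: str, drafter: Drafter, critic: Critic) -> Tuple[str, Review, int]:
    """Draft, review, and redraft until the critic passes or retries run out."""
    draft = drafter(brief, None)  # first draft has no critic feedback
    review = critic(draft)
    redrafts = 0
    while review.verdict is not Verdict.PASS and redrafts < MAX_REDRAFTS:
        draft = drafter(brief, review)  # feed the failed review back in
        review = critic(draft)
        redrafts += 1
    # The last review is returned even if it is still a failure,
    # so the caller decides what to do with a draft that never passed.
    return draft, review, redrafts
```

The deterministic checks sit outside this loop; only the subjective LLM checks go through `critic`.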

Has anyone dealt with this? How were you able to move beyond a coin toss for results?

I have already moved some checks from soft_fail to warnings, but if I downgrade everything to a warning, what's the point of the critic? Truly bad output will pass through the pipeline as drafts with a list of warnings.

Advice would be greatly appreciated.
