How to Detect and Remove Logical Fallacies from LLM Output
MLconf Online 2023
Jonathan Bennion
AI Engineer, Consultant
ex-Disney, ex-Facebook, ex-Google; most recently @ Fox

Session Summary

Logical fallacies are patterns of flawed reasoning or invalid argument that can undermine the validity of a model's outputs; examples include circular reasoning, false dichotomies, and ad hominem attacks. Machine learning models are optimized to perform well on specific metrics such as cosine similarity, safety, or helpfulness, but optimizing for metrics alone does not guarantee logically sound reasoning: language models can learn to exploit flaws in reasoning to generate plausible-sounding but logically invalid arguments.

When models rely on fallacies, their outputs become unreliable and untrustworthy even if they achieve high metric scores, and users cannot depend on them. Propagating logical fallacies can spread misinformation, confuse users, and lead to harmful real-world consequences when models are deployed in products or services. It is therefore crucial that model developers proactively address logical fallacies in addition to optimizing metrics. Eliminating fallacies keeps model outputs logically valid and aligned with human reasoning, which maintains user trust and mitigates risk.
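One common way to operationalize detect-and-remove is a critique-and-revise loop: ask a model (or a second model) to inspect a draft answer for named fallacies, then ask it to rewrite the answer until no fallacy is flagged. Below is a minimal sketch of that loop, not the talk's actual method; the `call_llm(prompt)` helper, the prompt wording, and the fallacy list are all assumptions standing in for whatever chat-completion API and fallacy taxonomy are used in practice.

```python
# Minimal sketch of a detect-and-revise loop for logical fallacies.
# Assumes a hypothetical call_llm(prompt) -> str helper that wraps
# whatever chat-completion API is available; everything else is plain Python.

FALLACIES = ["circular reasoning", "false dichotomy", "ad hominem"]

DETECT_PROMPT = """You are a critic checking an argument for logical fallacies.
Fallacies to check: {fallacies}.
Argument:
{answer}
Reply with the name of one fallacy you find, or NONE."""

REVISE_PROMPT = """Rewrite the argument below to remove the {fallacy} fallacy
while preserving its factual content.
Argument:
{answer}"""

def remove_fallacies(answer: str, call_llm, max_rounds: int = 3) -> str:
    """Iteratively detect a fallacy in `answer` and ask the model to revise."""
    for _ in range(max_rounds):
        verdict = call_llm(
            DETECT_PROMPT.format(fallacies=", ".join(FALLACIES), answer=answer)
        ).strip()
        if verdict.upper().startswith("NONE"):
            return answer  # no fallacy flagged; keep the current draft
        # A fallacy was named, so request a revised argument and re-check it.
        answer = call_llm(REVISE_PROMPT.format(fallacy=verdict, answer=answer))
    return answer  # best effort after max_rounds revisions

# Example usage with any callable:
#   final = remove_fallacies(draft_answer, call_llm=my_llm_fn)
```

In practice the detector can be made stricter (structured output, multiple fallacies per pass), and bounding the loop keeps latency predictable when no clean revision is found.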