This helps us build groups of examples that evaluate the same pieces of knowledge, allowing us to measure under what *contexts* an LLM can correctly draw a particular inference ("inferential consistency"). We find that LLMs still exhibit room for improvement on this front. (5/n)
April 29, 2025 at 8:41 PM
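To make "inferential consistency" concrete, here is a minimal sketch of one way it could be scored. It assumes each example has already been tagged with its critical atom, and the function names, record layout, and "correct in every context" criterion are illustrative assumptions, not the exact metric from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (critical_atom, gold_label, predicted_label).
Record = Tuple[str, str, str]


def per_atom_accuracy(records: Iterable[Record]) -> Dict[str, float]:
    """Group examples by critical atom and return the accuracy within each group."""
    outcomes = defaultdict(list)
    for atom, gold, pred in records:
        outcomes[atom].append(gold == pred)
    return {atom: sum(hits) / len(hits) for atom, hits in outcomes.items()}


def fraction_fully_consistent(records: Iterable[Record]) -> float:
    """Share of atom groups where the prediction is correct in every context."""
    accuracies = per_atom_accuracy(records)
    if not accuracies:
        raise ValueError("need at least one record")
    return sum(1 for acc in accuracies.values() if acc == 1.0) / len(accuracies)


# Toy data: two contexts per atom, labels in the defeasible NLI style.
records = [
    ("the person is outdoors", "strengthener", "strengthener"),
    ("the person is outdoors", "weakener", "weakener"),
    ("the dog is asleep", "strengthener", "strengthener"),
    ("the dog is asleep", "weakener", "strengthener"),
]
```

Under this sketch, a model that handles an inference in one context but misses it in another lowers the score for that atom's group, even if its overall accuracy looks reasonable.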
We propose a method to pinpoint the particular pieces of knowledge a defeasible reasoning example aims to evaluate by identifying the atom(s) that are most critical in determining the overall label of a defeasible NLI example. (4/n)
April 29, 2025 at 8:41 PM
We also explore how atomic hypothesis decomposition can help us better understand the complexities of defeasible reasoning, a softer inference task that requires models to weigh the effects of multiple, sometimes competing, pieces of evidence on a hypothesis. (3/n)
April 29, 2025 at 8:41 PM
For example, after decomposing the hypothesis of an NLI premise-hypothesis pair into atoms, we can measure whether a model's judgment on the overall pair is logically consistent with its judgments on each premise-atom sub-problem. (2/n)
April 29, 2025 at 8:40 PM
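A rough sketch of what such a consistency check could look like. It assumes the hypothesis is treated as a conjunction of its atoms, so one contradicted atom contradicts the whole and the whole is entailed only if every atom is. That aggregation rule and the function names are assumptions for illustration, not necessarily the rule used in the paper.

```python
from typing import List

ENTAILMENT = "entailment"
NEUTRAL = "neutral"
CONTRADICTION = "contradiction"


def expected_label(atom_labels: List[str]) -> str:
    """Label implied by the atom judgments, treating the hypothesis as a conjunction."""
    if not atom_labels:
        raise ValueError("need at least one atom label")
    if CONTRADICTION in atom_labels:
        return CONTRADICTION  # one contradicted atom contradicts the whole
    if all(label == ENTAILMENT for label in atom_labels):
        return ENTAILMENT  # every atom must be entailed
    return NEUTRAL  # some atom is undetermined, none contradicted


def is_consistent(overall_label: str, atom_labels: List[str]) -> bool:
    """True if the judgment on the full pair matches what the atom judgments imply."""
    return overall_label == expected_label(atom_labels)
```

For instance, a model that labels the full pair as entailment while judging one of its atoms neutral would be flagged as inconsistent under this rule.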