Refining Factual Verification via Curriculum-Inspired and Reinforcement-Guided Optimization
Note: I kept this explanation high-level, intended to illustrate how the model can be improved in challenging scenarios. If you’re interested in the detailed implementation or research behind it, reach out to us at AutoAlign AI.
Factual verification plays a critical role in combating fake news circulating on the internet and mitigating hallucinations generated by language models. However, using proprietary large models for this task introduces privacy concerns and incurs significant operational costs. Running a 600B parameter model locally is impractical due to its resource intensity.
Small Language Models (SLMs), in contrast, are lightweight enough to run on local infrastructure. Yet they typically lack the reasoning capabilities and contextual understanding required for complex or long-form inputs. To bridge this gap, we fine-tuned SLMs. Although smaller models struggle with long-context understanding, we tackled this limitation by first training the model on long-context datasets and then fine-tuning it specifically for factual verification tasks. This two-stage approach significantly improved the model’s performance, beating both open-source and closed-source LLMs across multiple benchmark datasets.
Despite the importance of fact-checking, high-quality datasets in this domain are limited. Existing datasets are often biased towards short to medium-length contexts, typically ranging from a few hundred to a thousand tokens. However, real-world scenarios—particularly in enterprise applications—require the model to handle significantly longer contexts (up to 32k tokens). Training on short-context data alone doesn’t prepare models to reason over lengthy documents, rendering them ineffective for production use.
Moreover, accurate reasoning over large and noisy document collections—such as enterprise knowledge bases or internet-scale corpora—is inherently difficult. The structure of these documents is often complex, and extracting relevant information for verification can be non-trivial.
Off-the-shelf SLMs struggle in such environments, frequently producing hallucinated answers or incorrect figures. Enterprises attempt to mitigate these failures through prompt engineering, but this is a reactive and unsustainable approach. Constantly tweaking prompts is labor-intensive, brittle, and fails to scale in production environments.
A robust solution requires both architectural and training improvements. Given the scarcity of fact-checking data, we adopted a curriculum learning strategy, where the model is progressively trained on tasks of increasing difficulty and complexity. We first trained the model on large-scale reasoning and comprehension datasets to strengthen its contextual understanding, then fine-tuned it on domain-specific factual verification data.
A key challenge in combining curriculum learning with domain-specific fine-tuning is ensuring that the learning signals from both stages are aligned; misalignment can lead to forgetting or degraded performance. To address this, we structure training to first build general reasoning abilities through broad tasks, then gradually shift to targeted fact-checking. Throughout, we apply high-level strategies to ensure knowledge is transferred smoothly and reinforced across stages, enabling the model to improve without losing previously learned capabilities.
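To make the two-stage schedule concrete, here is a minimal sketch using the Hugging Face trl SFTTrainer. The model name, dataset files, the "difficulty" field, and all hyperparameters are placeholders chosen for illustration, not the actual training configuration.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Stage 1: broad reasoning/comprehension tasks, sorted easy -> hard.
# Each JSONL record is assumed to contain a "text" field plus a "difficulty" score.
stage1 = load_dataset("json", data_files="curriculum_reasoning.jsonl", split="train")
stage1 = stage1.sort("difficulty")

# Stage 2: fact-checking examples with curated reasoning traces.
stage2 = load_dataset("json", data_files="fact_check_sft.jsonl", split="train")

model = "Qwen/Qwen2.5-3B-Instruct"  # placeholder small model
for stage_data, lr in [(stage1, 2e-5), (stage2, 1e-5)]:
    trainer = SFTTrainer(
        model=model,
        train_dataset=stage_data,
        args=SFTConfig(output_dir="checkpoints", learning_rate=lr, num_train_epochs=1),
    )
    trainer.train()
    model = trainer.model  # carry the updated weights into the next stage
```

The lower learning rate in the second stage is one simple way to shift toward the target domain without overwriting what the first stage taught; the actual mitigation strategies used are not detailed here.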
To operationalize this strategy, we created a pipeline of three curated datasets:
1. A diverse collection of reasoning and comprehension tasks sorted by difficulty. This dataset helps the model acquire foundational skills for long-context understanding in a stable and generalizable way.
2. A supervised dataset containing fact-checking problems with reasoning traces. We used a mix of easy and moderately difficult examples here. Samples were manually inspected for quality, structure, and reasoning depth.
3. A reinforcement learning dataset composed exclusively of hard cases that require multi-step reasoning or retrieval from large documents. This stage pushes the model beyond typical LLM capabilities and drives generalization. (Illustrative record formats for all three datasets are sketched below.)
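To make the pipeline concrete, here are hypothetical record formats for the three datasets. The field names, labels, and values are assumptions made for clarity, not the actual schema.

```python
# Illustrative record formats for the three curated datasets (hypothetical).

curriculum_record = {
    "task": "multi_hop_qa",
    "difficulty": 2,                 # used to sort the curriculum easy -> hard
    "context": "<long document, up to 32k tokens>",
    "question": "...",
    "answer": "...",
}

sft_record = {
    "claim": "The company reported a 12% revenue increase in Q3.",
    "evidence": "<retrieved passages from the knowledge base>",
    "reasoning_trace": "Step 1: locate the Q3 figures ...",  # manually curated
    "label": "SUPPORTED",            # e.g. SUPPORTED / REFUTED / NOT ENOUGH INFO
}

rl_record = {
    "claim": "...",
    "documents": ["<doc 1>", "<doc 2>", "..."],  # large, noisy collection
    "label": "REFUTED",              # only the gold verdict; no trace is needed for RL
}
```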
Ensuring data quality was a challenge in itself.
Creating reasoning traces for difficult problems is especially challenging. In our evaluation, even closed-source models failed to generate correct traces for over 80% of hard examples. Therefore, we manually curated traces for easier and moderately difficult problems to support SFT. For RL training, we used only the most challenging examples, skipping the need for predefined reasoning traces and letting the model learn through exploration and reward optimization. This not only improves performance but also reduces the overhead of manual trace generation. 😉
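Because the RL stage relies on reward optimization rather than curated traces, a verifiable reward over the final verdict is enough to provide a learning signal. The exact reward design is not covered in this post; the sketch below shows one minimal formulation, where the `<verdict>` tag format and the bonus values are assumptions.

```python
import re

def fact_check_reward(completion: str, gold_label: str) -> float:
    """Hypothetical verifiable reward: score only the final verdict plus a format bonus."""
    match = re.search(r"<verdict>(.*?)</verdict>", completion, re.DOTALL)
    if match is None:
        return -1.0                      # no parsable verdict at all
    predicted = match.group(1).strip().upper()
    reward = 0.1                         # small bonus for following the output format
    if predicted == gold_label.upper():
        reward += 1.0                    # main signal: the verdict matches the gold label
    return reward
```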
The training pipeline follows a multi-stage approach to progressively align the model with task objectives and human preferences: (1) supervised training on the curriculum of reasoning and comprehension tasks, (2) supervised fine-tuning on fact-checking examples with curated reasoning traces, and (3) reinforcement learning on the hardest cases.
The main bottleneck in our pipeline is the reinforcement learning (RL) stage, especially when using reasoning-heavy tasks and GRPO (Group Relative Policy Optimization). On standard hardware like L4 GPUs, fine-tuning can take several hours to days. To accelerate this process, we introduced key optimizations in how GRPO training is handled.
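For context, GRPO dispenses with a learned critic: it samples a group of completions per prompt and normalizes each completion's reward against its own group's mean and standard deviation. Below is a minimal sketch of that advantage computation only, not the full loss or our training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute GRPO-style advantages.

    rewards: shape (num_prompts, group_size), one reward per sampled completion.
    Each advantage is the reward normalized by its group's mean and std,
    so no separate value/critic model is required.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.1, 1.1, 1.1, 1.1]])
print(group_relative_advantages(rewards))
```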
Typically, RL training is split across different GPU roles: one set of GPUs handles rollout (sampling completions with an inference engine), while a separate set handles training (computing gradients and updating the policy).
However, this setup is inefficient: the rollout GPUs sit idle while the policy is being updated, the training GPUs sit idle during generation, and partitioning memory across roles limits the batch sizes each phase can use.
To overcome these issues, we used a different setup to maximize GPU utilization, inspired by ms-swift's colocate mode:
By dynamically offloading models and sharing all GPUs across both phases, we reduce memory pressure, enable larger batch sizes, and significantly increase parallelism. This optimization cuts training time from 20 hours down to just 9 hours, making RL fine-tuning far more practical on limited hardware.
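For intuition, the colocated alternation looks roughly like the sketch below. Every object and method here is a hypothetical placeholder; this is not the veRL or ms-swift API, just the shape of the rollout-offload-train loop under those assumptions.

```python
def colocated_grpo_step(policy, inference_engine, prompts, optimizer, grpo_loss_fn):
    """One colocated GRPO step: rollout and training share the same GPUs,
    with whichever model is not in use offloaded to free memory."""
    # Phase 1: rollout. Offload the training copy of the policy so the
    # inference engine can use all GPU memory for large-batch generation.
    policy.to("cpu")
    inference_engine.load_weights(policy.state_dict())
    completions = inference_engine.generate(prompts)   # every GPU is generating
    inference_engine.release()                          # free KV cache and weights

    # Phase 2: training. Bring the policy back onto the same GPUs and update it.
    policy.to("cuda")
    loss = grpo_loss_fn(policy, prompts, completions)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```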
This work is inspired by DeepSeek-R1, Llama, and S1. Our implementation is built upon veRL and ms-swift. We sincerely appreciate these teams for their contributions to open-source research and development.
I’d also like to thank the incredible team at AutoAlign AI for making this project possible.
For any further details or collaboration inquiries, feel free to reach out to AutoAlign AI.