Journal Club
Welcome to our Journal Club! Here, we discuss papers that we find interesting and relevant to our research interests and use-cases. The papers are selected by the team and are presented by members each week! Additionally, a different team member records notes on what was discussed.
📊 Weekly Tracking
📶 Presentation Stats
| Team Member | Presentations | Notes | Sum |
|---|---|---|---|
| Hunter | 17 | 3 | 20 |
| Nikhil | 3 | 0 | 3 |
| Yosh | 5 | 4 | 9 |
| Olivia | 1 | 0 | 1 |
| Ben | 0 | 3 | 3 |
beyond-the-last-answer
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Paper: Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Presenter: Hunter
Note Taker:
Date: May 23, 2025
Discussion Notes
[Notes to be added]
byte-latent-transformer
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper: Byte Latent Transformer: Patches Scale Better Than Tokens
Presenter: Hunter
Note Taker: Yosh
Date: April 25, 2025
Overview
Replaces fixed-vocabulary tokenization with entropy-based grouping of bytes into variable-length patches. Patch boundaries are chosen by a small byte-level autoregressive model using either a global entropy constraint or an approximate monotonic constraint; an incremental patcher ensures that patching on streaming input matches patching on the full sequence. The model stack uses two small transformers (local encoder/decoder) plus a larger latent transformer operating over patches. Reported performance is comparable to Llama 3 on language benchmarks and stronger in settings like character permutation. A wrapper path is described: initialize from an existing model (e.g., Llama 3) and add a BLT wrapper.
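A minimal sketch of the entropy-threshold patching rule described above, assuming a stand-in for the small byte-level model (the `next_byte_probs` stub and the threshold value are illustrative, not from the paper):

```python
import numpy as np

def next_byte_probs(prefix: bytes) -> np.ndarray:
    """Hypothetical stand-in for the small byte-level LM: returns a
    distribution over the 256 possible next bytes given the prefix."""
    logits = np.random.default_rng(len(prefix)).normal(size=256)
    return np.exp(logits) / np.exp(logits).sum()

def entropy_patches(data: bytes, threshold: float = 4.0) -> list[bytes]:
    """Group bytes into variable-length patches: start a new patch whenever
    the predicted next-byte entropy exceeds a global threshold (the paper
    also describes an approximate monotonic variant)."""
    patches, start = [], 0
    for i in range(1, len(data)):
        p = next_byte_probs(data[:i])
        h = -(p * np.log2(p + 1e-12)).sum()   # next-byte entropy in bits
        if h > threshold:                      # high uncertainty -> patch boundary
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

print(entropy_patches(b"Byte Latent Transformer", threshold=7.9))
```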
Novelty
- Dynamic, vocabulary-free byte patching driven by predicted next-byte entropy (vs. static tokenizers, BPE, or space-based rules).
- Incremental patcher property for streaming/online encoding consistency.
- Architecture combining local transformers for patch formation with a latent transformer consuming patch sequences.
- Weight-initialization strategy that reuses pretrained weights and layers them under a BLT wrapper.
Learnings
- Tokenizers’ brittleness (noise sensitivity, orthographic gaps, multilingual inequity) is a motivating factor for patching.
- BLT matches or exceeds Llama 3 on several evaluations; particularly robust to character-level perturbations.
- Edge behavior observed: patch lengths can expand on repeated or similarly structured phrases (e.g., multiple-choice options).
- The wrapper approach suggests compatibility with existing model ecosystems while moving toward tokenizer-free inputs.
Notes for Application
- Patch-based byte encodings could be feasible for noisy or multilingual document text where tokenizers struggle.
- The incremental patcher may support streaming OCR ingestion while keeping encodings consistent across chunks.
- A wrapper initialized from existing Llama-based checkpoints offers a migration path for current document models; worth testing alongside standard tokenizers.
- Monitor repeated, formulaic sections in forms (e.g., option lists): similar structure may trigger longer patches; evaluate effects on latency and memory.
dapo
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Paper: DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Presenter: Yosh
Note Taker: Hunter
Date: April 4, 2025
Overview
System-scale RL recipe that modifies GRPO-style training with changes to clipping, sampling, token-level credit assignment, and length handling. Also removes the KL term and uses an answer-equality reward.
Novelty
- Clip-Higher: decouples upper and lower clip bounds; relaxes the upper bound to avoid restricting generations while keeping lower-bound clipping (see the sketch after this list).
- Dynamic sampling: builds batches that are not all-correct or all-incorrect to keep advantages informative.
- Token-level policy-gradient loss: replaces “mean of means” with a single averaging step so token contributions are credited directly.
- Overlong reward shaping: masks truncated regions instead of penalizing them; optional soft ramp as outputs approach max length.
- Additional choices: KL term removed; reward defined by abstract equality of answers; revised gradient averaging.
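A minimal sketch combining the decoupled clipping, token-level averaging, and overlong masking ideas, assuming per-token log-probs and advantages are already computed; the function name and default bounds are our choices, not DAPO's reference implementation:

```python
import torch

def dapo_token_loss(logp_new, logp_old, advantages, loss_mask,
                    eps_low=0.2, eps_high=0.28):
    """Token-level policy-gradient loss with decoupled (Clip-Higher) bounds.

    logp_new, logp_old : (batch, seq) per-token log-probs under the new/old policy
    advantages         : (batch, seq) per-token advantages (broadcast per sequence in GRPO)
    loss_mask          : (batch, seq) 1 for tokens that count; truncated/overlong
                         regions are simply masked out rather than penalized
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)   # asymmetric clip
    per_token = -torch.minimum(ratio * advantages, clipped * advantages)
    # Single averaging step over all unmasked tokens ("token-level" credit),
    # rather than a mean of per-sequence means.
    return (per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```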
Learnings
- Decoupled clipping provides control over how much updates constrain outputs.
- Ensuring batch outcome diversity avoids zero-advantage collapses.
- Token-level credit sharpens feedback to helpful and unhelpful tokens.
- Treating truncation separately from verbosity avoids punishing context-limit artifacts.
- KL-free training shifts reliance to the reward and clipping scheme.
Notes for Application
- Dynamic sampling could prevent degenerate batches in extraction RL (e.g., all fields correct or all incorrect).
- Overlong shaping can help when context limits cause truncation in long documents.
dft
On the Generalization of SFT
Paper: On the Generalization of SFT
Presenter: Hunter
Note Taker: Ben
Date: August 15, 2025
Overview
This paper analyzes why supervised fine-tuning (SFT) often underperforms reinforcement learning (RL) in terms of generalization. The authors show mathematically that SFT gradients can be seen as a special case of policy gradients, but with an implicit, ill-posed reward structure that creates instability and overfitting. They propose Dynamic Fine-Tuning (DFT), a simple reweighting of the SFT loss by token probability. DFT is presented as a one-line code change that stabilizes training and substantially improves generalization across tasks and models.
Novelty
- Provides a formal equivalence between SFT and policy gradient methods, identifying an implicit inverse-probability reward that destabilizes SFT.
- Introduces Dynamic Fine-Tuning (DFT), which rescales the objective with token probabilities to eliminate this instability. The modified loss function is
  \[ L_{\text{DFT}}(\theta) = \mathbb{E}_{(x,y^\star)\sim D}\; -\sum_{t=1}^{|y^\star|} \text{sg}\big(\pi_\theta(y^\star_t \mid y^\star_{<t}, x)\big)\, \log \pi_\theta(y^\star_t \mid y^\star_{<t}, x) \]
  where sg(·) is the stop-gradient operator (a code sketch of this loss follows this list).
- Shows that DFT achieves stronger gains than both standard SFT and importance-weighted SFT (iw-SFT).
- Extends evaluation beyond SFT settings into offline RL, where DFT outperforms both offline (RFT, DPO) and online methods (PPO, GRPO) despite its simplicity.
- Observes that DFT changes token probability distributions in a polarized way—boosting some tokens while downweighting others—unlike SFT’s uniform probability increase.
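A minimal PyTorch sketch of the reweighted loss, assuming teacher-forced logits and target token ids; the function name and padding handling are our choices:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, targets, pad_mask):
    """Dynamic Fine-Tuning loss: token-level cross-entropy reweighted by the
    (stop-gradient) probability the model assigns to each target token.

    logits   : (batch, seq, vocab) model outputs under teacher forcing
    targets  : (batch, seq) gold next-token ids
    pad_mask : (batch, seq) 1 for real tokens, 0 for padding
    """
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log pi(y*_t | ...)
    weight = tok_logp.detach().exp()        # sg(pi(y*_t | ...)): no gradient flows here
    return -(weight * tok_logp * pad_mask).sum() / pad_mask.sum().clamp(min=1)
```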
Learnings
- The poor generalization of SFT comes from its implicit reward weighting, which overemphasizes low-probability expert tokens and increases gradient variance.
- DFT corrects this by giving uniform rewards to expert tokens, leading to more stable updates.
- Empirically, DFT consistently outperforms SFT across multiple benchmarks (Math500, Minerva, Olympiad Bench, AIME, AMC) and models (Qwen2.5, LLaMA, DeepSeek). Gains are often several times larger than those achieved by standard SFT.
- DFT converges faster, achieves better early-stage performance, and avoids the plateau behavior seen in SFT.
- In offline RL experiments, DFT not only surpasses SFT-based baselines but also outperforms online methods like PPO and GRPO, highlighting its efficiency.
- Token distribution analysis suggests that not all tokens should be fit equally; deprioritizing connective or low-value tokens may improve robustness.
- Ablation studies confirm that DFT’s advantage is not due to hyperparameter choices—improvements hold across learning rates and batch sizes.
Notes for Application
- DFT offers a practical improvement to SFT with minimal implementation cost—a single modification to the loss.
- Particularly useful in settings where RL is too resource-intensive or reward signals are unavailable.
- Highlights the importance of examining implicit reward structures in fine-tuning objectives, which may influence broader post-training strategies.
genrm
Generative Verifiers: Reward Modeling as Next-Token Prediction
Paper: Generative Verifiers: Reward Modeling as Next-Token Prediction
Presenter: Hunter
Note Taker: Ben
Date: August 8, 2025
Overview
Recasts verification/reward modeling as next-token prediction: given a problem and a candidate solution, the verifier answers “Is the answer correct (Yes/No)?” and uses the probability of Yes as the score. A CoT variant first generates a verification rationale, then emits Yes/No; multiple rationales can be sampled and averaged at inference. Reported gains over discriminative reward models and DPO-style verifiers on GSM8K and algorithmic tasks, with transfer to MATH.
Novelty
- Verification as token probability (direct and CoT modes; a scoring sketch follows this list):
  ( r_{\text{Direct}}(x,y) = p_\theta(\text{Yes}\mid x,y,I) )
  ( r_{\text{CoT}}(x,y) = p_\theta(\text{Yes}\mid x,y,I_{\text{CoT}},v_{\text{CoT}},I) ), with ( v_{\text{CoT}}\sim p_\theta(\cdot\mid x,y,I_{\text{CoT}}) )
  Majority vote over (K) rationales: ( r_{\text{MajV@K}}=\frac{1}{K}\sum_{i=1}^K p_\theta(\text{Yes}\mid\cdot,v^{(i)}_{\text{CoT}}) )
- Unified objective for verification + solution generation:
  ( L_{\text{GenRM}} = L_{\text{SFT}}(D_{\text{verify}}) + \lambda\,L_{\text{SFT}}(D_{\text{correct}}) )
- Synthetic verification rationales for CoT training (reference-guided variants and multiple rationales per solution explored).
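A minimal sketch of direct-mode scoring with a Hugging Face causal LM; the prompt template, placeholder model name, and Yes/No token handling are assumptions rather than the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice a causal LM fine-tuned as a verifier
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def yes_probability(question: str, candidate: str) -> float:
    """Score a candidate solution by the probability of 'Yes' to the verification
    instruction (direct mode; the CoT mode would first sample a verification
    rationale and append it to the prompt before asking Yes/No)."""
    prompt = f"{question}\nProposed answer: {candidate}\nIs the answer correct (Yes/No)? "
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # next-token logits
    probs = torch.softmax(logits, dim=-1)
    yes_id = tok.encode(" Yes", add_special_tokens=False)[0]  # tokenizer-specific
    no_id = tok.encode(" No", add_special_tokens=False)[0]
    # Normalize over the Yes/No pair so the score behaves like a binary confidence.
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()

# Best-of-N re-ranking: score each candidate and keep the highest Yes-probability.
```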
Learnings
- Best-of-N re-ranking improves when scoring candidates with direct/CoT verifiers; CoT with multiple verification votes increases accuracy.
- Joint training on verification plus correct-solution SFT improves verification; overly large generation mix can reduce verifier quality.
- Performance scales with verifier size and with the number of verification votes (K).
- Quality and quantity of CoT rationales both matter; reference-guided rationales and multiple rationales per solution tend to help.
Notes for Application
- Generative verification appears compatible with loss-run extraction: a verifier can score candidate parses by emitting Yes/No and using (p(\text{Yes})) as the selection signal.
- The Yes-token probability can serve as a confidence score for downstream API calls where logprobs are not returned.
glm
GLM Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Paper: GLM-4.5V and GLM-4.1V-Thinking
Presenter: Yosh
Note Taker: Ben
Date: August 22, 2025
Overview
Presents a multimodal reinforcement learning approach aimed at broad coverage across coding, math, GUI agents, visual grounding, and OCR. Reports that GLM-4.1V-9B-Thinking outperforms larger models such as Qwen2.5-VL-72B. The training recipe combines curriculum-based RL with a vision–language stack updated for long sequences and varied image aspect ratios.
Novelty
- Curriculum RL: training data organized by difficulty, estimated via pretrained models’ pass@k; the intent is to keep examples neither trivial nor infeasible as the model strengthens (a sampling sketch follows this list).
- Architecture:
- 3D convolutions to enable temporal downsampling.
- 2D-RoPE integrated into ViT self-attention.
- Aspect-ratio handling by normalizing images to the ViT’s absolute-position grid via bicubic interpolation.
- Data packing: concatenates variable-length samples to approach maximum context length.
- Training pipeline:
- Cold start with long chain-of-thought SFT before RL.
- RL phase includes RLVR and RLHF.
- KL term removed from the loss.
- Rewards provided either by extracting answers from structured responses or by using an LLM evaluator.
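A minimal sketch of pass@k-based difficulty binning with a sampling schedule that shifts toward harder items over training; the bin thresholds and schedule are illustrative choices, not the reported recipe:

```python
import random

def difficulty_bins(examples, pass_at_k, easy=0.8, hard=0.1):
    """Bucket training examples by a pretrained model's pass@k: items the model
    almost always solves give little signal, items it never solves are (for now)
    infeasible, and the middle band is where RL learns most."""
    bins = {"easy": [], "medium": [], "hard": []}
    for ex in examples:
        p = pass_at_k[ex["id"]]
        if p >= easy:
            bins["easy"].append(ex)
        elif p <= hard:
            bins["hard"].append(ex)
        else:
            bins["medium"].append(ex)
    return bins

def curriculum_batch(bins, step, total_steps, batch_size=32):
    """Shift sampling weight from medium toward hard examples as training
    progresses (one possible schedule, not the paper's exact one)."""
    frac = step / max(total_steps, 1)
    weights = {"easy": 0.1, "medium": 0.9 - 0.6 * frac, "hard": 0.6 * frac}
    pool, w = zip(*[(b, weights[name]) for name, b in bins.items() if b])
    chosen = random.choices(pool, weights=w, k=batch_size)
    return [random.choice(b) for b in chosen]
```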
Learnings
- Curriculum sampling is used to address the observation that easy items late in training give little signal, while hard items too early impede learning.
- Chain-of-thought SFT is used for initialization, then RL refines behavior.
- Removing the KL term is part of the reported setup; alignment proceeds without an explicit reference-model regularizer.
- Vision-side changes (3D conv, 2D-RoPE, aspect-ratio normalization) are presented as enabling more stable multimodal processing.
- Pretraining covers captioning, interleaved image–text, OCR, visual grounding, and instruction-tuning sources; long-sequence packing is used to maximize utilization.
illusion
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Paper: The Illusion of Thinking
Presenter: Hunter
Note Taker: Ben
Date: June 13, 2025
Overview
Uses controllable puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) to probe Large Reasoning Models (LRMs) versus standard LLMs under matched inference compute. Finds three regimes: (i) at low complexity, non-thinking models are more accurate and token-efficient; (ii) at medium complexity, LRMs gain an advantage; (iii) at high complexity, both collapse to near-zero accuracy. Near the collapse point, LRMs allocate fewer thinking tokens despite ample budget, suggesting an inference-time scaling limit. The setup enables analysis of final answers and internal traces, revealing complexity-dependent failure patterns.
Novelty
- Controllable complexity via puzzle families enables clean comparisons and avoids benchmark contamination; results reported across matched thinking/no-thinking model pairs (e.g., Claude 3.7 Sonnet, DeepSeek R1/V3).
- Three-regime characterization of performance with complexity (low → non-thinking better; medium → thinking advantage; high → collapse), with visualizations across multiple puzzles.
- Reasoning-effort dynamics: thinking tokens initially rise with complexity, then decline near the collapse threshold.
- Thought-position analysis: correctness vs. position within the reasoning trace changes with complexity—early correct then overthinking at low N; later-arriving correct solutions at medium N; fixation on early errors at high N.
- Algorithm-execution check: even when given explicit solution algorithms to execute, models still fail at similar complexity points.
Learnings
- Low/medium/high regimes: non-thinking > thinking at low complexity; thinking helps at medium; both collapse at high.
- Collapse behavior: accuracy falls to near zero beyond a model-specific threshold; simultaneously, thinking tokens decrease instead of scaling with difficulty.
- Overthinking vs. fixation: at low complexity, models often find a correct plan early but continue exploring incorrect alternatives; at medium, correct solutions appear later; at high, models fixate on early wrong paths and waste budget.
- Exact algorithm execution remains brittle: providing step-by-step algorithms does not prevent failure, indicating limits in following and verifying logical procedures.
- Failure location: first failure moves occur far earlier than solution length and vary non-monotonically with complexity; error-free spans differ widely across puzzles.
- Controls: context limits and sampling do not explain collapse.
Notes for Application
- For document-reasoning pipelines, consider controlled difficulty sweeps (synthetic tasks with adjustable compositional depth) and track both final accuracy and where failures occur within multi-step outputs.
- Monitor inference-time effort (e.g., reasoning tokens or steps) versus task difficulty; a decline in effort near hard cases may indicate impending failure modes rather than efficiency.
- When adding “thinking” to models, compare against non-thinking baselines under equal compute; improvements may concentrate only in a medium-difficulty band.
inftythink
InftyThink
Paper: InftyThink
Presenter: Yosh
Note Taker: Ben
Date: March 14, 2025
Overview
Addresses “overthinking” and circular reasoning by constraining early-stage reasoning and restructuring the generation process. Proposes multi-step, chunked reasoning: produce a short reasoning segment, summarize it, append the summary to the prompt, and iterate until an answer or an iteration cap is reached. This creates a sawtooth context pattern (vs. steadily growing contexts), reduces memory/compute pressure from long-context attention, and aims to avoid truncation issues.
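A minimal sketch of the think → summarize → continue loop, assuming `generate` and `summarize` are stand-ins for model calls and the answer marker is our convention:

```python
def inftythink(question: str, generate, summarize, max_iters: int = 4) -> str:
    """Iterative reasoning in the InftyThink style: produce a short reasoning
    segment, summarize it, carry only the summary forward, and stop early if
    an answer appears. `generate` and `summarize` are placeholder model calls."""
    summary = ""
    for _ in range(max_iters):
        prompt = (f"Question: {question}\n"
                  f"Summary of reasoning so far: {summary or '(none)'}\n"
                  "Continue reasoning. If you are confident, end with 'ANSWER: ...'.")
        segment = generate(prompt)
        if "ANSWER:" in segment:
            return segment.split("ANSWER:", 1)[1].strip()
        # Summaries cover the reasoning only; the question (or source document)
        # stays in the prompt at every iteration, giving the sawtooth context pattern.
        summary = summarize(summary + "\n" + segment)
    return "no answer within iteration cap"
```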
Novelty
- Early-stage capacity restraint (“Occam’s Razor”) to counter the overthinking optimization problem.
- Iterative “think → summarize → continue” procedure with an iteration hyperparameter (n) as a ceiling (model may answer earlier).
- Incremental summaries are over reasoning steps (not document content); the source document remains present as the query at every step.
- SFT data construction: chunk r1 traces on paragraph/sentence boundaries, then summarize segments (e.g., with a larger model) to create supervision for the iterative procedure.
Learnings
- Reported AIME24 gains of ~5–10% after SFT; larger base models see smaller gains within that range.
- Shorter-context models benefit more, consistent with the generation algorithm’s memory profile.
- Even with a single iteration, accuracy at fixed completion length exceeds that of a standard reasoning model of the same length.
- Practical note: useful when required “thinking” exceeds available GPU memory.
Notes for Application
- Multi-step chunked reasoning may be feasible for long or multi-extraction document tasks: set an iteration cap (n), keep the full document in context, and summarize only the reasoning.
- Training data can mirror the paper’s approach: split existing CoT traces into segments and attach short summaries for SFT.
- Evaluate at equal completion lengths and on shorter-context deployments where the method is expected to help most.
ladder
LADDER
Paper: LADDER
Presenter: Hunter
Note Taker: Yosh
Date: March 21, 2025
Discussion Notes
Recursive generation and solving of progressively simpler variants of the original problem drives improved problem-solving capabilities
First takeaway: for RL to be effective, there must be a gradient of difficulty in the dataset samples; otherwise the model will catastrophically collapse (especially when the gap between simple and hard samples is large)
- We can do this manually, e.g. the chunking that’s already implemented in the Form reader project
LADDER consists of 3 components:
- Variant Generation
- Solution Verification
- Reinforcement Learning
Used the base model to generate mathematical transformations of integrals, categorized by their impact on problem difficulty. Then, for each integral, randomly sample 3–5 transformations and provide them as explicit suggestions to the variant generation model.
- Also utilized temperature cycling and persona-based prompting (e.g. “think like Euler or Gauss”), which improved performance
Quality control is essential in the variant generation process: small perturbations can make integrals intractable, or variants can end up harder than the original
Test-Time Reinforcement Learning (TTRL): perform LADDER at test time, retaining the post-trained model but discarding the test-time-tuned weights
Train on 10 problems and test on 100, generating 500 variants per problem
A best-of-N approach with N=1 and N=10 yielded 1% and 2% accuracy on the test set, respectively. RL without variants yielded 3%, while LADDER RL with variants reached 82%.
LADDER + TTRL attained 90% on the MIT Integration Bee
leaderboard-illusion
The Leaderboard Illusion
Paper: The Leaderboard Illusion
Presenter: Hunter
Note Taker: Ben
Date: May 30, 2025
Overview
Study of Chatbot Arena’s evaluation mechanics and policies. Finds that private variant testing by companies such as Meta, combined with selective disclosure, unequal sampling, and uneven deprecations, creates data-access asymmetries and biases Arena rankings. Shows that training on Arena-style data substantially improves performance on an Arena-derived test set while not improving (and sometimes reducing) performance on out-of-distribution benchmarks.
Novelty
- Documents an undisclosed practice where select providers privately test many model variants and publish only the best-scoring one; argues this violates unbiased sampling assumptions behind Bradley–Terry scoring.
- Quantifies data-access disparities: proprietary models receive a larger share of Arena traffic and feedback; estimates include ~20.4% of all data for OpenAI and ~19.2% for Google, while open-weight/open-source models collectively receive far less.
- Shows that uneven model deprecations and sampling can fragment the comparison graph and undermine Bradley–Terry reliability.
- Provides controlled simulations and real-world ablations (including identical checkpoints) showing that best-of-N private testing can inflate public Arena scores (a toy simulation follows this list).
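A toy simulation (not the paper's setup) of how publishing only the best of N privately tested variants inflates the apparent score relative to honest reporting:

```python
import random, statistics

random.seed(0)

def arena_score(true_skill: float, noise: float = 30.0) -> float:
    """One noisy leaderboard measurement of a variant with a given true skill."""
    return random.gauss(true_skill, noise)

true_skill = 1200.0
honest = [arena_score(true_skill) for _ in range(1000)]             # publish every run
best_of_10 = [max(arena_score(true_skill) for _ in range(10))       # privately test 10,
              for _ in range(1000)]                                 # publish only the best

print(f"honest mean     : {statistics.mean(honest):.1f}")
print(f"best-of-10 mean : {statistics.mean(best_of_10):.1f}")       # noticeably inflated
```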
Learnings
- Private best-of-N testing plus the option to retract or hide scores can raise a provider’s public Arena rating above its average variant quality; even identical checkpoints can land at meaningfully different scores due to sampling variance.
- Sampling and deprecation policies create persistent exposure gaps; over time, proprietary providers accumulate substantially more prompts/battles than open-weight/open-source entrants.
- Fine-tuning with increasing proportions of Arena-style data yields large relative win-rate gains on an Arena-derived set (e.g., moving from 0% to 70% Arena-mix roughly doubles win-rate), but generalization to external benchmarks (e.g., MMLU) does not improve in tandem.
- Leaderboard ranks can become unstable or unreliable when task distributions shift over time and previously compared models are removed from evaluation.
Notes for Application
- For internal evaluations, avoid best-of-N selection bias: publish scores for all tested variants or fix a pre-registered selection procedure.
- Track and cap model exposure in live A/Bs; ensure balanced sampling across providers/variants to preserve comparability.
- Maintain a connected and stable comparison graph over time; when deprecating models, do so proportionally across categories and preserve overlap for transitivity.
- When using live feedback data to fine-tune, validate gains on independent, out-of-distribution test suites to detect leaderboard-specific overfitting.
modernbert
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Presenter: Hunter
Note Taker:
Date: January 3, 2025
Discussion Notes
[Notes to be added]
paperbench
PaperBench: Evaluating AI’s Ability to Replicate AI Research
Paper: PaperBench: Evaluating AI’s Ability to Replicate AI Research
Presenter: Hunter
Note Taker:
Date: April 11, 2025
Discussion Notes
[Notes to be added]
reflect-retry-reward
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Paper: Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Presenter: Hunter
Note Taker: Ben
Date: June 20, 2025
Overview
Two-stage procedure: on failure, the model writes a brief self-reflection; the self-reflection is then added to context for a second attempt. If the second attempt succeeds, only the self-reflection tokens receive reward via GRPO. Requires a binary validator (success/fail). Reported gains include up to +34.7% on equation writing and +18.1% on function calling, with trained 1.5–7B models surpassing much larger baselines in some cases.
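A minimal sketch of restricting credit to the reflection tokens, assuming the reflection span's token indices are tracked during generation; GRPO normalization and the rest of the update are omitted:

```python
import torch

def reflection_only_advantages(seq_len, reflection_span, reward):
    """Spread a scalar reward over a generated sequence so that only the
    self-reflection tokens carry credit (the masked multi-step GRPO idea);
    the first and second attempts receive zero advantage.

    reflection_span : (start, end) token indices of the reflection segment
    reward          : 1.0 if the second attempt passed the validator, else 0.0
    """
    mask = torch.zeros(seq_len)
    start, end = reflection_span
    mask[start:end] = 1.0
    return mask * reward   # per-token advantages before GRPO normalization

# Example: 120 generated tokens, reflection occupies tokens 40..70,
# and the retry passed the binary validator.
adv = reflection_only_advantages(120, (40, 70), reward=1.0)
```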
Novelty
- Reward assigned only to reflection tokens when a retry fixes a failure (masked GRPO over multi-step generations).
- Multi-step GRPO implementation: first attempt → reflection → second attempt, with masking to isolate reflection tokens.
- “Dataset of failures” constructed by sampling multiple completions and keeping only verified failures for efficiency and clearer learning signals.
- Task-agnostic setup: relies on validators (function-call exact match; equation evaluates to target) rather than teacher models or synthetic data.
- Empirical comparisons include first/second-try performance pre/post training and an analysis of catastrophic forgetting (minimal changes on MMLU-Pro, GSM8K, HellaSwag, MATH).
Learnings
- Self-reflection becomes shorter and clearer after training; models improve even on first-try accuracy (without generating explicit reflections).
- On APIGen and Countdown, trained small models can outperform untrained much larger ones under two-attempt evaluation.
- Average improvements after training are substantial on both tasks; error profiles shift (e.g., fewer tool-choice errors in function calling; better number usage in equations).
- Minimal catastrophic forgetting relative to vanilla checkpoints under standard eval suites.
Notes for Application
- Feasible where binary validators exist, such as the loss runs verifier.
- Reflection-only reward offers a way to improve retry behavior without altering task-specific outputs directly.
- A “failures-only” training set could reduce compute and target the most informative examples in extraction or function-calling pipelines.
sft-memorizes-rl-generalizes
SFT Memorizes, RL Generalizes
Paper: SFT Memorizes, RL Generalizes
Presenter: Hunter
Note Taker: Yosh
Date: February 28, 2025
Discussion Notes
- Comparative study of SFT vs RL performance on a toy arithmetic reasoning task and a visual navigation task (textual and visual modalities).
- SFT quickly achieves good performance in-distribution due to memorization but generalizes poorly out-of-distribution. RL, on the other hand, generalizes well out-of-distribution while following a smoother learning curve in-distribution, indicating true progressive learning.
- Out-of-distribution performance of RL is very applicable to our use cases
- Demand forms classification and/or extractions, and loss runs extractions
- E.g., a form extraction model trained on a subset of form types and then tested on unseen form types (ACORDs)
- RL is more sample efficient than SFT, which is compatible with some of our lower sample count real datasets.
- We can use RL to develop other in-house models to complement our production models
- Verifier model to confirm annotations/catch potential errors
simplerlzoo
SimpleRL-Zoo
Paper: SimpleRL-Zoo
Presenter: Hunter
Note Taker: Ben
Date: March 28, 2025
Overview
The paper explores r1-style RL training and how base-model choice, reward design, and training curriculum affect downstream performance. It also examines the impact of supervised fine-tuning (SFT) versus RL-only training and proposes evaluation metrics that better capture reasoning behaviors.
Novelty
- Reports that pre-RL SFT limits exploration and caps performance.
- Emphasizes progressive difficulty scaling of training data rather than rigid supervision.
- Uses metrics beyond accuracy and response length:
- reasoning behavior ratio (backtracking, verification, subgoal setting, enumeration)
- clip ratio and average stopped length
- pass@k accuracy as an indicator of exploration (a minimal estimator appears after this list)
- Notes that strict formatting rewards and complex prompts can suppress exploration and induce “overthinking.”
- Shows strong effects from exploration-related hyperparameters (temperature, num_generations).
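Since pass@k is one of the listed metrics, here is a minimal implementation of the standard unbiased estimator from the code-generation literature, assuming n sampled completions per problem with c verified correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples with c correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 16 samples per problem, 3 correct, report pass@8
print(pass_at_k(16, 3, 8))
```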
Learnings
- Rigid formatting disincentivizes exploration; flexibility supports richer reasoning.
- Response length is not a proxy for reasoning quality.
- Base-model choice matters; some models (e.g., Qwen) already exhibit strong math reasoning, reducing RL gains.
- SFT before RL can reduce accuracy and reasoning sophistication, leading to early plateaus.
- RL-only training adapts more freely and can produce the reported “aha” behaviors.
- If SFT is used, it should be framed to encourage exploration, not constrain it.
Notes for Application
- Reward design: simple correctness-based rewards are viable; extensions could be explored.
- Training strategy: construct datasets where difficulty ramps with model capability.
- Hyperparameters: increasing num_generations and using higher temperatures (>1) promote exploration.
singlora
SingLoRA: Low Rank Adaptation Using a Single Matrix
Paper: SingLoRA
Presenter: Hunter
Note Taker: Ben
Date: July 11, 2025
Overview
Reformulates LoRA by replacing the two-matrix update (W_0 + BA) with a single low-rank, symmetric update (W_0 + \tfrac{\alpha}{r}\,u(t)\,AA^\top). The goal is to remove inter-matrix scale conflicts, improve training stability in the infinite-width analysis, and cut adapter parameters by about half. The work extends to non-square weight matrices, argues that attention expressiveness is preserved despite symmetric updates, and reports gains on GLUE (RoBERTa, GPT-2) and MNLI with LLaMA-7B, plus diffusion fine-tuning on DreamBooth (Tables 1–3; Figure 2).
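A minimal PyTorch sketch of the single-matrix adapter for a square layer; the step bookkeeping and initialization scale are our simplifications, and the rectangular extension is omitted:

```python
import torch
import torch.nn as nn

class SingLoRALinear(nn.Module):
    """Square linear layer with a single-matrix low-rank adapter:
    W = W0 + (alpha / r) * u(t) * A @ A.T, with ramp-up u(t) = min(t / T, 1).
    Since u(0) = 0, the base model is unchanged at step 0."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0, ramp_steps: int = 1000):
        super().__init__()
        assert base.in_features == base.out_features, "sketch covers the square case only"
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.alpha, self.r, self.ramp_steps = alpha, r, ramp_steps
        self.register_buffer("step", torch.zeros(()))  # incremented per training forward pass

    def forward(self, x):
        u = torch.clamp(self.step / self.ramp_steps, max=1.0)   # ramp-up scalar u(t)
        delta = (self.alpha / self.r) * u * (self.A @ self.A.T)
        if self.training:
            self.step += 1
        return self.base(x) + x @ delta                # delta is symmetric
```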
Novelty
- Single-matrix low-rank adapter with ramp-up scalar (u(t)=\min(t/T,1)); initialization keeps the base model unchanged at step 0.
- Stability analysis: shows stable feature learning with a learning-rate scale of (O(n^{-1/2})) in a toy model; avoids the scale-mismatch seen in LoRA’s (A,B) updates.
- Transformation-invariance for the parameterization is proved under standard first-order optimizers.
- Extension to rectangular layers via truncated (A^*) so (W_0 + A^*A^\top) applies to common transformer blocks.
- Attention expressiveness: although updates are symmetric per matrix, (QK^\top) interactions are not constrained to be symmetric, so general patterns are learnable (Figure 1).
Learnings
- Reported GLUE results: mean accuracy improves over LoRA and matches/exceeds LoRA+ and DoRA with roughly half the adapter parameters (Table 1).
- LLaMA-7B on MNLI: 91.3% vs. LoRA 89.1, LoRA+ 90.2, DoRA 90.6, with ~40% fewer trainable params (Table 2).
- Learning-rate robustness: accuracy varies ~1% across LR sweeps vs. ~4.8% for LoRA (Figure 2).
- DreamBooth: higher DINO similarity at the same or lower parameter budgets; qualitative examples retain subject details (Table 3; Figure 3).
- Practical read: fewer hyperparameter sensitivities (no split LR for (A,B)), reduced parameter budgets, and stable convergence with standard optimizers.
Notes for Application
- For document-processing finetunes, a single-matrix adapter could reduce trainable parameters and lower sensitivity to learning-rate choice while retaining model capacity in attention blocks.
- When PEFT stability is a concern (e.g., smaller batches or limited LR sweeps), the symmetric (AA^\top) update may help avoid scale-mismatch issues seen in two-matrix LoRA.
- If adapters are deployed across multiple transformer layers, the rectangular extension suggests the approach is compatible with common projection shapes in extraction models.
tina
Tina: Tiny Reasoning Models via LoRA
Paper: Tina: Tiny Reasoning Models via LoRA
Presenter: Hunter
Note Taker:
Date: May 2, 2025
Discussion Notes
[Notes to be added]
unintentional-unalignment
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Paper: Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Presenter: Nikhil
Note Taker: Yosh
Date: April 18, 2025
Discussion Notes
- During training, it can be observed that models unintentionally align themselves toward undesirable responses when using DPO
- During DPO, the algorithm attempts to increase the gap between desired and undesired responses, which was ostensibly believed to occur by increasing and decreasing their respective probabilities
- It was found that both probabilities actually decrease; one just decreases more sharply than the other, still expanding the delta (a logging sketch for tracking this appears after these notes)
- This can yield ‘catastrophic responses’, where the model then shifts probability onto other, potentially dispreferred tokens
- This phenomenon emerges in very simple settings
- At large scales this can cause the entire model to become unintentionally misaligned
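A minimal sketch for observing the displacement during training: the standard DPO loss computed from summed sequence log-probs, returning the raw chosen/rejected values so both can be logged over time (tensor shapes and names are assumptions):

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_logging(chosen_logps, rejected_logps,
                          ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed sequence log-probs, plus the raw
    chosen/rejected log-probs so likelihood displacement can be tracked:
    during training both may decrease even as their gap widens."""
    margin = beta * ((chosen_logps - ref_chosen_logps)
                     - (rejected_logps - ref_rejected_logps))
    loss = -F.logsigmoid(margin).mean()
    stats = {
        "chosen_logp": chosen_logps.mean().item(),      # watch whether this trends down
        "rejected_logp": rejected_logps.mean().item(),  # ...and whether this falls faster
        "margin": margin.mean().item(),
    }
    return loss, stats
```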
visual-rft
Visual-RFT
Paper: Visual-RFT
Presenter: Yosh
Note Taker: Hunter
Date: March 7, 2025
Discussion Notes
- DeepSeek GRPO style RL applied to VLMs (open vocab classification, object detection).
- Long tail / rare object detections are learned more easily with RFT. This should apply to our use cases with generalization to new forms (ACORDs and more generally). Also, this approach would help with our checkbox detection use cases, even when checkboxes are funky looking.
- They have the model produce confidence scores as numbers in its generations and then parse those numbers, incorporating them into the reward. A cool idea to try, beyond just visual applications (a parsing/reward sketch appears after these notes).
- We talked about having a smaller discretization than in this paper. Seems to be a bit extreme to do 100 bins. However, the idea makes practical and intuitive sense. You can’t do this with SFT, but you can with RFT.
- No need for differentiable rewards, so can easily incorporate confidence. In fact, one can have an “episode” where many predictions are done and ECE is minimized. Would require a batch based reward possibly? But GRPO is already configured in groups, so may be straightforward.
- Seems a straightforward way to get better document-level and table-level confidence scores. For tables, we could also analyze things row-wise and column-wise.
- Stark differences in performance between RFT and SFT. This is a good sign for us. The 2B parameter model makes massive gains with RFT. Smaller models can become quite capable with RFT. When scaling up model size, the gap between SFT and RFT shrinks but is still significant.
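A minimal sketch of parsing a generated confidence value and folding it into a verifiable reward; the output format and Brier-style penalty are our choices, not the paper's exact reward:

```python
import re
from typing import Optional

def parse_confidence(generation: str) -> Optional[float]:
    """Pull a 'confidence: 0.87'-style number out of the model's output."""
    m = re.search(r"confidence\s*[:=]\s*([01](?:\.\d+)?)", generation, re.IGNORECASE)
    return float(m.group(1)) if m else None

def confidence_reward(generation: str, is_correct: bool) -> float:
    """Correctness reward plus a calibration term: confident-and-right is best,
    confident-and-wrong is penalized (a Brier-style penalty, our choice)."""
    conf = parse_confidence(generation)
    base = 1.0 if is_correct else 0.0
    if conf is None:
        return base - 0.1            # small format penalty when confidence is missing
    return base - (conf - base) ** 2

print(confidence_reward("answer: B, confidence: 0.9", is_correct=True))   # 0.99
print(confidence_reward("answer: B, confidence: 0.9", is_correct=False))  # -0.81
```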
xrag
xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token
Paper: xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token
Presenter: Olivia
Note Taker: Hunter
Date: May 16, 2025
Overview
Compresses each retrieved document into a single token and conditions the LLM on that token instead of the full text. Training is staged: (1) paraphrase-style pretraining to associate an embedding with its source document, then (2) context-aware instruction tuning using only the embedding, including a self-distillation step against a standard RAG setup. Reported results are near or above regular RAG on several QA datasets; authors argue full-text RAG can be distracted by extra snippet text.
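A minimal sketch of the one-token injection idea: project a retriever embedding into the LLM's token-embedding space and prepend it to the question embeddings; the projector shape and dimensions are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class XragProjector(nn.Module):
    """Map a retriever embedding into the LLM's token-embedding space so the
    retrieved document is represented by a single soft token."""

    def __init__(self, retriever_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(retriever_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, doc_embedding: torch.Tensor) -> torch.Tensor:
        return self.proj(doc_embedding)                 # (batch, llm_dim)

def build_inputs(question_embeds: torch.Tensor, doc_token: torch.Tensor) -> torch.Tensor:
    """Prepend the one-token document representation to the question's token
    embeddings; the LLM is then run on inputs_embeds instead of full passages."""
    return torch.cat([doc_token.unsqueeze(1), question_embeds], dim=1)
```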
Novelty
- One-token document representation injected into the LLM in place of full passages.
- Paraphrase pretraining to align vectors with documents.
- Context-only instruction tuning plus self-distillation from a full-text RAG teacher.
Learnings
- Can roughly match or surpass standard RAG on key QA benchmarks despite using only a single token per document.
- Claimed robustness when retrieved text contains distracting or irrelevant content.
- Open questions remain for multi-hop reasoning and detail-critical, novel documents.
Notes for Application
- Could be relevant for very long documents or table-heavy pages that often exceed context budgets.
- Likely unnecessary for small documents and short outputs.
- Worth comparing against other RAG-compression approaches if fine-grained details must be preserved.