LLM systems | evaluation | post-training | retrieval

Yin Li

I build reproducible infrastructure for model behavior: benchmark runners, verifier-guided inference, retrieval diagnostics, traceable agent workflows, and upstream model-system fixes.

01 Post-training pipelines

SFT/DPO-style flows, verifier checks, and executable benchmark exports.

02 Retrieval evaluation

Query planning, source trust, citation grounding, and regression tests.

03 Agent observability

Tool traces, validators, retry paths, reward labels, and escalation boundaries.

04 Efficient model systems

Single-GPU training, quantization, runtime/cache correctness, and reproducible runs.

Selected systems

Repositories organized around evidence, not screenshots.

The strongest projects expose the full path from setup to artifact: commands, benchmark inputs, result files, known boundaries, and the exact claim each run supports.

Upstream OSS | compiler/runtime

Merged PR

Triton PR #10411

Merged upstream fix for runtime cache-group integrity: incomplete cache groups are treated as misses. Covered by unit tests and the full Triton integration CI on NVIDIA, AMD, and macOS runners.

Proof
Merged into triton-lang/triton, commit 83adb20
Axis
Runtime cache correctness and regression coverage
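
The idea behind "incomplete cache groups are misses" can be sketched in a few lines. This is an illustration only, not the code from the Triton PR: the manifest file name, its JSON layout, and the function name `load_cache_group` are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Optional


def load_cache_group(cache_dir: Path, key: str) -> Optional[dict[str, bytes]]:
    """Return every file in a cache group, or None (a miss) if any part is absent.

    Hypothetical layout: `<key>.group.json` lists the member files of the group.
    """
    manifest_path = cache_dir / f"{key}.group.json"
    if not manifest_path.is_file():
        return None  # no manifest at all: ordinary miss

    try:
        expected = json.loads(manifest_path.read_text())["files"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # unreadable manifest: treat as a miss, never as a partial hit

    group: dict[str, bytes] = {}
    for name in expected:
        member = cache_dir / name
        if not member.is_file():
            return None  # one missing member invalidates the whole group
        group[name] = member.read_bytes()
    return group
```

Returning `None` for any partial state means the caller rebuilds the group instead of loading a half-written one, which is the failure mode a regression test would pin down.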

Post-training | verifier-guided inference

Repository

L20-CodeForge

Single-L20 post-training and verifier-guided inference stack for executable code benchmarks, with reproduction scripts, artifact hashes, and explicit claim boundaries.

Proof
Reproduction scripts and benchmark artifacts
Axis
Code generation, RLVR, executable validation

Fine-tuning | evaluation

Repository

nl2sql-benchmark

Text-to-SQL fine-tuning and multi-path inference with Qwen2.5-Coder-7B.

Proof
Spider/BIRD-style evaluation, cost curves, export paths

Reranking | FinanceMTEB

Repository

finmteb-zh-reranker-sota

Chinese finance reranking snapshot with Qwen3-Reranker-8B and public comparison context.

Proof
Reported MAP run, CI checks, leaderboard snapshot

Retrieval | agent systems

Repository

signal-rag

Search and retrieval workbench with query planning, source-trust tiers, and citation checks.

Proof
Recall evaluation, extractive fallback, benchmark examples

AI4S infra | scientific agents

Repository

scitrace-rl

Trace, validation, and reward infrastructure for scientific agents.

Proof
Adversarial cases, semantic judge, deterministic validators

Retrieval benchmark | provenance

Repository

coreb-retrieval-sota

CoREB retrieval benchmark snapshot with CI-backed artifacts and result provenance.

Proof
Versioned result files and reproducible evaluation path

Evidence model

A good repository should behave like a lab instrument.

The page is optimized for reviewers who scan quickly: what problem is solved, what can be reproduced, what was measured, and where the claim stops.

Input contract

Datasets, prompts, traces, and configs are named.

No invisible benchmark setup. Inputs should be inspectable before running anything.

Execution contract

Commands produce stable artifacts.

Scripts, Make targets, and CI paths make evaluation repeatable after code changes.

Claim contract

Metrics include provenance and limitations.

Reported numbers should point to hardware notes, data versions, and known failure modes.
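
One way to enforce the claim contract is to make provenance part of the result record itself. The sketch below is a minimal example under assumed names: `MetricClaim`, its fields, and `missing_provenance` are hypothetical and not taken from any of the repositories above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricClaim:
    """A reported number together with the context needed to interpret it."""
    metric: str
    value: float
    dataset_version: str
    hardware: str
    command: str
    known_limitations: tuple[str, ...] = ()


def missing_provenance(claim: MetricClaim) -> list[str]:
    """List the provenance fields that are empty, in a fixed order."""
    missing = [
        name
        for name in ("dataset_version", "hardware", "command")
        if not getattr(claim, name).strip()
    ]
    if not claim.known_limitations:
        missing.append("known_limitations")
    return missing


# Placeholder values for illustration only, not a reported result.
draft = MetricClaim(metric="MAP", value=0.0, dataset_version="v1",
                    hardware="", command="make eval")
print(missing_provenance(draft))  # ['hardware', 'known_limitations']
```

A CI step could fail the build whenever `missing_provenance` returns a non-empty list for a published result file.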

Agent contract

Actions leave traces that can be audited.

Tool calls, citations, validators, retries, and escalation paths are stored as system data.
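
Storing traces as system data can be as simple as an append-only table. The following sketch uses the standard-library `sqlite3` module; the table name, the set of event kinds, and the helper names are assumptions chosen for the example, not the schema of any repository listed here.

```python
import json
import sqlite3

# Hypothetical event kinds, mirroring the list in the text above.
KINDS = {"tool_call", "citation", "validator", "retry", "escalation"}


def open_trace_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a trace store with one row per agent event."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS trace_events (
               run_id  TEXT    NOT NULL,
               step    INTEGER NOT NULL,
               kind    TEXT    NOT NULL,
               payload TEXT    NOT NULL,
               PRIMARY KEY (run_id, step)
           )"""
    )
    return conn


def record_event(conn: sqlite3.Connection, run_id: str, kind: str, payload: dict) -> int:
    """Append one event to a run and return its step index."""
    if kind not in KINDS:
        raise ValueError(f"unknown event kind: {kind!r}")
    with conn:  # commits on success, rolls back on error
        (step,) = conn.execute(
            "SELECT COALESCE(MAX(step), -1) + 1 FROM trace_events WHERE run_id = ?",
            (run_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO trace_events VALUES (?, ?, ?, ?)",
            (run_id, step, kind, json.dumps(payload, sort_keys=True)),
        )
    return step


def audit(conn: sqlite3.Connection, run_id: str) -> list[tuple[int, str, dict]]:
    """Return the ordered event history for one run."""
    rows = conn.execute(
        "SELECT step, kind, payload FROM trace_events WHERE run_id = ? ORDER BY step",
        (run_id,),
    )
    return [(step, kind, json.loads(payload)) for step, kind, payload in rows]
```

Because every action is a row keyed by run and step, an audit is a query rather than a log-parsing exercise.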

Stack map

Pragmatic tools for measurable model systems.

Model systems

PyTorch, Transformers, LoRA/QLoRA, vLLM, lm-eval, Triton runtime, C++/CUDA

Evaluation

Golden sets, regression tests, semantic judges, citation checks, benchmark reports

Applications

RAG, text-to-SQL, tool use, structured generation, scientific workflows

Infrastructure

FastAPI, TypeScript, Docker, GitHub Actions, SQLite, PostGIS, Redis, Kafka

Repository matrix

Work surfaces grouped by capability.

Contact

Open to research engineering, LLM systems, evaluation infrastructure, and AI4S opportunities.