LLM systems | evaluation | post-training | retrieval

Yin Li

I build reproducible infrastructure for model behavior: benchmark runners, verifier-guided inference, retrieval diagnostics, traceable agent workflows, and upstream model-system fixes.


01 Post-training pipelines

SFT/DPO-style flows, verifier checks, and executable benchmark exports.

02 Retrieval evaluation

Query planning, source trust, citation grounding, and regression tests.

03 Agent observability

Tool traces, validators, retry paths, reward labels, and escalation boundaries.

04 Efficient model systems

Single-GPU training, quantization, runtime/cache correctness, and reproducible runs.

Selected systems

Repositories organized around evidence, not screenshots.

The strongest projects expose the full path from setup to artifact: commands, benchmark inputs, result files, known boundaries, and the exact claim each run supports.

Upstream OSS | compiler/runtime

Merged PRs

Upstream model-system fixes

Merged fixes across CARLA, FlashAttention, scikit-learn, Triton, Ray, PyTorch TorchTitan, Apache DataFusion, Apache TVM, Microsoft ONNX Runtime, and xgrammar. The fixes cover runtime correctness, distributed autoscaling metrics, LoRA fine-tuning behavior, spill execution configuration, CuTe/SM120 compile handling, benchmarking reliability, Array API/DLPack interop safety, ONNX frontend behavior, CUDA/FMHA initialization, and structured-generation parser edge cases. This work also includes high-signal upstream bug reports, including reports on AI tooling security and correctness boundaries.

scikit-learn: PR #34380 merged
Triton: PR #10411 and PR #10413 merged
TVM: PR #19818 merged
TorchTitan: PR #3456 merged
DataFusion: PR #23066 and PR #23226 merged
CARLA: PR #9791 merged into ue5-dev
FlashAttention: PR #2671 merged
Ray: PR #64184 merged
ONNX Runtime: PR #29140 merged
xgrammar: PR #667 merged
Gradio: Issue #13556 confirmed and fixed upstream via PR #13580
PyTorch: Issue #188023 triaged as a DLPack crash bug

Pretraining | single-GPU scaling

Repository

l20-edu-135m-pretrain

From-scratch 135M Transformer pretraining on 10B FineWeb-Edu tokens using one NVIDIA L20, with a public checkpoint, lm-eval comparisons, and reproducible training notes.

Proof: Public checkpoint, lm-eval runs, single-GPU training recipe
Axis: Pretraining, scaling discipline, reproducibility

Single-GPU LLM systems | kernels and serving

Repository

Single-GPU Inference Lab

Single-L20 reference stack with Triton RMSNorm, fused residual paths, RoPE + KV-cache write kernels, measured dispatch policies, and handwritten NF4 QLoRA smoke runs.

Proof: L20 benchmark reports, CUDA telemetry, Qwen2.5-Coder 0.5B and 14B smoke runs
Axis: GPU kernels, serving/training constraints, single-card systems evidence
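
For orientation, RMSNorm itself is a small operation. The sketch below is a plain-Python reference of the math that a fused kernel would be checked against; it is not the repository's Triton kernel, and the function name and `eps` default are assumptions made for illustration.

```python
import math

def rms_norm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

# With unit weights, the output has a root-mean-square close to 1.
out = rms_norm_reference([3.0, 4.0], [1.0, 1.0])
```

A kernel test would typically compare the GPU output against a reference like this within a tolerance chosen for the dtype.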

Post-training | verifier-guided inference

Repository

L20-CodeForge

Single-L20 post-training and verifier-guided inference stack for executable code benchmarks, with reproduction scripts, artifact hashes, and explicit claim boundaries.

Proof: Reproduction scripts and benchmark artifacts
Axis: Code generation, RLVR, executable validation
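
One way to picture verifier-guided inference on executable benchmarks is best-of-n selection: generate several candidates, run each against tests, and keep the first that passes. The sketch below assumes candidates are Python source strings defining a function named `solve`; the names `verify` and `select_verified` are hypothetical and are not taken from the repository. It calls `exec` directly, which is unsafe for untrusted model output, so a real stack would run candidates in a sandbox.

```python
def verify(candidate_src, tests, func_name="solve"):
    """Return True if the candidate defines func_name and passes every test."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # illustration only: no sandboxing here
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

def select_verified(candidates, tests):
    """Return the first candidate that passes the verifier, or None."""
    for src in candidates:
        if verify(src, tests):
            return src
    return None

candidates = [
    "def solve(a, b):\n    return a - b",  # wrong
    "def solve(a, b):\n    return a + b",  # correct
]
tests = [((1, 2), 3), ((0, 0), 0)]
chosen = select_verified(candidates, tests)
```

Returning `None` when nothing passes keeps the failure explicit instead of silently falling back to an unverified answer.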

Fine-tuning | evaluation

Repository

nl2sql-benchmark

Text-to-SQL fine-tuning and multi-path inference with Qwen2.5-Coder-7B.

Proof: Spider/BIRD-style evaluation, cost curves, export paths

Evaluation | supervised classifiers

Repository

goemotions-roberta-large-focal

RoBERTa-large emotion classifier with focal loss, tuned thresholds, and test macro-F1 0.5330.

Proof: Reported test macro-F1, threshold tuning, reproducible evaluation path
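
As a rough illustration of the two ideas named above, the sketch below shows a per-example binary focal loss and a macro-F1 computed with per-label thresholds. It is not the repository's training code: the `gamma` and `alpha` defaults are commonly cited values, not the settings behind the reported score, and the toy probabilities are made up.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one predicted probability p and binary label y."""
    p = min(max(p, 1e-7), 1 - 1e-7)           # avoid log(0)
    p_t = p if y == 1 else 1 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1 - alpha   # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

def macro_f1(probs, labels, thresholds):
    """Unweighted mean of per-label F1, each label with its own threshold."""
    scores = []
    for j, threshold in enumerate(thresholds):
        tp = fp = fn = 0
        for p_row, y_row in zip(probs, labels):
            pred = p_row[j] >= threshold
            if pred and y_row[j]:
                tp += 1
            elif pred:
                fp += 1
            elif y_row[j]:
                fn += 1
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(thresholds)

# Toy data: three examples, two labels.
probs = [[0.9, 0.2], [0.4, 0.7], [0.1, 0.6]]
labels = [[1, 0], [1, 1], [0, 1]]
default_score = macro_f1(probs, labels, [0.5, 0.5])
tuned_score = macro_f1(probs, labels, [0.3, 0.5])
```

On this toy data, lowering the first label's threshold recovers a missed positive, which is the effect threshold tuning is after. Thresholds should be tuned on a validation split so that the test number stays untouched.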

Reranking | FinanceMTEB

Repository

finmteb-zh-reranker-sota

Chinese finance reranking snapshot with Qwen3-Reranker-8B and public comparison context.

Proof: Reported MAP run, CI checks, leaderboard snapshot
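
For readers checking what a MAP number means in a reranking setting, here is a minimal sketch of mean average precision over binary relevance labels. It assumes every relevant candidate appears somewhere in the ranked list, which is the usual case when reranking a fixed pool; the function names are illustrative and this is not the benchmark's official scorer.

```python
def average_precision(ranked_relevance):
    """AP for one query, given 0/1 relevance in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries):
    """Mean of per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Two toy queries: relevant items at ranks 1 and 3, then at rank 2.
toy_map = mean_average_precision([[1, 0, 1], [0, 1]])
```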

Retrieval | agent systems

Repository

signal-rag

Search and retrieval workbench with query planning, source-trust tiers, and citation checks.

Proof: Recall evaluation, extractive fallback, benchmark examples
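
A citation check with source-trust tiers can be sketched as follows. The tier names, the numeric ranks, the `[n]` citation syntax, and the `check_citations` function are all assumptions made for this example; the repository may model these differently.

```python
import re

# Hypothetical trust tiers: higher number means more trusted.
TRUST_TIERS = {"official": 3, "peer_reviewed": 2, "forum": 1}

def check_citations(answer, sources, min_tier=2):
    """Return (citation_id, problem) pairs for citations that fail the check.

    `sources` maps a citation id to a dict with a "tier" key.
    """
    cited = {int(c) for c in re.findall(r"\[(\d+)\]", answer)}
    problems = []
    for cid in sorted(cited):
        source = sources.get(cid)
        if source is None:
            problems.append((cid, "missing source"))
        elif TRUST_TIERS.get(source["tier"], 0) < min_tier:
            problems.append((cid, "low trust"))
    return problems

sources = {1: {"tier": "official"}, 2: {"tier": "forum"}}
problems = check_citations("Claim A [1]. Claim B [2]. Claim C [4].", sources)
```

An empty result means every citation resolves to a source at or above the minimum tier; it does not show that the cited text actually supports the claim, which needs a separate grounding check.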

AI4S infra | scientific agents

Repository

scitrace-rl

Trace, validation, and reward infrastructure for scientific agents.

Proof: Adversarial cases, semantic judge, deterministic validators

Retrieval benchmark | provenance

Repository

coreb-retrieval-sota

CoREB retrieval benchmark snapshot with CI-backed artifacts and result provenance.

Proof: Versioned result files, reproducible evaluation path, and upstream submission issue

UK AI ecosystem

Open-source fixes for UK public-sector AI and safety tooling.

Current UK-focused work targets government AI infrastructure directly: AISI evaluation tooling and i.AI public-service AI systems. Open pull requests are listed as open until maintainer review or merge confirms the contribution.

AISI | evaluation infrastructure

Open PR #4371

UK AI Security Institute Inspect AI

Ready-for-review fix for `bash_session` output handling so unbounded command output is capped before JSON-RPC serialization, reducing sandbox failure risk in model evaluations.

Status: Open, ready for review, mergeable
Validation: Focused pytest, RPC integration test, ruff, and py_compile
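
The general idea of capping output before serialization can be sketched as below. This is not the code from the pull request: `MAX_OUTPUT_BYTES`, `cap_output`, and `to_jsonrpc_result` are hypothetical names, and the limit value is an arbitrary placeholder. Only the JSON-RPC 2.0 response envelope (`jsonrpc`, `id`, `result`) follows the published specification.

```python
import json

MAX_OUTPUT_BYTES = 64 * 1024  # hypothetical limit, not the value used upstream

def cap_output(text, limit=MAX_OUTPUT_BYTES):
    """Truncate text to at most `limit` UTF-8 bytes and append a marker."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    kept = data[:limit].decode("utf-8", errors="ignore")
    return kept + f"\n[output truncated at {limit} bytes]"

def to_jsonrpc_result(request_id, output):
    """Build a JSON-RPC 2.0 response whose payload size is bounded."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "result": {"output": cap_output(output)}}
    )

capped = cap_output("a" * 5000, limit=100)
```

Capping on encoded bytes rather than characters keeps the bound meaningful for non-ASCII output, and the marker tells the model that the output was cut rather than complete.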

i.AI | public-service AI

Open PR #87

i-dot-ai Lex

Ready-for-review fix for empty-text document uploads in the UK legal API stack, avoiding unnecessary embedding and vector-store writes when a batch has no uploadable content.

Status: Open, ready for review, mergeable
Validation: Focused pytest, ruff, py_compile, and passing security check
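
The shape of this kind of guard can be sketched as follows. The function and its `embed` and `store` parameters are hypothetical stand-ins for an embedding client and a vector-store writer; they are not Lex's actual interfaces.

```python
def upload_batch(documents, embed, store):
    """Embed and store only documents that carry non-empty text.

    Returns the number of documents written.
    """
    uploadable = [d for d in documents if (d.get("text") or "").strip()]
    if not uploadable:
        return 0  # nothing to upload: skip both the embedding call and the write
    vectors = embed([d["text"] for d in uploadable])
    store(uploadable, vectors)
    return len(uploadable)
```

Returning early on an empty batch avoids paying for an embedding request and a vector-store round trip that would write nothing.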

i.AI | public-sector productivity

Open PR #60

i-dot-ai Minute

Ready-for-review fix for Azure OpenAI structured output handling in public-sector meeting transcription/minuting infrastructure, replacing an undefined method call with the adapter's existing response handling path.

Status: Open, ready for review, mergeable
Validation: Focused pytest, ruff, and py_compile with a fake Azure client

Evidence model

A good repository should behave like a lab instrument.

This page is organized for reviewers who scan quickly. Each entry states what problem is solved, what can be reproduced, what was measured, and where the claim stops.

Input contract

Datasets, prompts, traces, and configs are named.

No invisible benchmark setup. Inputs should be inspectable before running anything.

Execution contract

Commands produce stable artifacts.

Scripts, Make targets, and CI paths make evaluation repeatable after code changes.
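
One simple way to make "stable artifacts" checkable is a hash manifest written next to the results. The sketch below is an assumption about how that could look, not a description of any repository's tooling; the file name and contents are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(paths):
    """Map each artifact's file name to its SHA-256 hex digest."""
    return {Path(p).name: sha256_file(p) for p in sorted(str(p) for p in paths)}

# Demo with a placeholder result file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    result_file = Path(tmp) / "results.json"
    result_file.write_text('{"metric": "placeholder"}')
    manifest = build_manifest([result_file])
```

If a rerun after a code change produces a different digest for the same inputs, the difference is visible in review instead of hiding inside a regenerated file.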

Claim contract

Metrics include provenance and limitations.

Reported numbers should point to hardware notes, data versions, and known failure modes.
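
A claim contract can be enforced with a small record type that refuses to exist without provenance. The field names and the `validate` helper below are hypothetical, and every value in the example is a placeholder rather than a reported result.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class MetricRecord:
    name: str
    value: float
    hardware: str            # e.g. a GPU model and driver note
    data_version: str        # dataset split or snapshot identifier
    limitations: tuple = ()  # known failure modes, in plain words

def validate(record):
    """Reject records that omit the provenance fields."""
    missing = [f for f in ("hardware", "data_version") if not getattr(record, f)]
    if missing:
        raise ValueError(f"metric '{record.name}' is missing provenance: {missing}")
    return record

record = validate(
    MetricRecord(
        name="placeholder_metric",
        value=0.0,
        hardware="placeholder-gpu",
        data_version="placeholder-v1",
        limitations=("placeholder limitation",),
    )
)
as_row = asdict(record)  # ready to write into a results file
```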

Agent contract

Actions leave traces that can be audited.

Tool calls, citations, validators, retries, and escalation paths are stored as system data.
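
Storing agent actions as system data can be as simple as an append-only list of typed events serialized to JSON Lines. The `TraceLog` class, the event kinds, and the payload fields below are assumptions for illustration and do not describe any specific repository's schema.

```python
import json
import time

class TraceLog:
    """Append-only log of agent events, serializable as JSON Lines."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so tests can use a fixed time
        self.events = []

    def record(self, kind, **payload):
        event = {"ts": self._clock(), "kind": kind, "payload": payload}
        self.events.append(event)
        return event

    def to_jsonl(self):
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)

log = TraceLog(clock=lambda: 0.0)
log.record("tool_call", tool="search", query="placeholder query")
log.record("validator", name="schema_check", passed=False)
log.record("retry", attempt=2, reason="validator failed")
log.record("escalation", to="human_review")
lines = log.to_jsonl().splitlines()
```

Because each line is a self-contained JSON object, the trace can be filtered or replayed with ordinary text tools during an audit.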

Stack map

Pragmatic tools for measurable model systems.

Model systems

PyTorch, Transformers, LoRA/QLoRA, vLLM, lm-eval, Triton runtime, C++/CUDA

Evaluation

Golden sets, regression tests, semantic judges, citation checks, benchmark reports

Applications

RAG, text-to-SQL, tool use, structured generation, scientific workflows

Infrastructure

FastAPI, TypeScript, Docker, GitHub Actions, SQLite, PostGIS, Redis, Kafka

Repository matrix

Work surfaces grouped by capability.

Contact

Open to research engineering, LLM systems, evaluation infrastructure, and AI4S opportunities.

GitHub | Triton PR #10411 | ONNX Runtime PR #29140 | TVM PR #19818 | TorchTitan PR #3456 | DataFusion PR #23066 | DataFusion PR #23226 | CARLA PR #9791 | FlashAttention PR #2671 | Ray PR #64184 | xgrammar PR #667 | Gradio issue #13556 | PyTorch issue #188023 | AISI Inspect AI PR #4371 | i.AI Lex PR #87 | i.AI Minute PR #60 | L20-CodeForge | nl2sql-benchmark

Work surfaces grouped by capability.

Training and inference

Retrieval and ranking

Agents and traces

Structured systems
