RAG / LLM App / Full-stack

2025

Role: RAG solution design, data-pipeline design, evaluation framing, hallucination analysis

MedRAFT Medical RAG QA System

A medical RAG system that connects knowledge-base ingestion, evidence retrieval, teacher-supervised data generation, RAFT-style training, and multi-dimensional evaluation for safer Chinese medical QA.

Impact: Turned a course project into a full product-style RAG case study by clarifying the evidence flow, answer structure, confidence design, and evaluation dimensions required in a high-risk domain.

MedRAFT workflow diagram covering knowledge ingestion, retrieval, teacher supervision, RAFT training, and evaluation.

MedRAFT evaluation design diagram showing evidence support, answer structure, and hallucination checks. The evaluation layer separates retrieval quality, answer completeness, and hallucination risk instead of collapsing everything into one score.

Overview

MedRAFT is a Chinese medical QA project that I present as a product-style RAG workflow: build the knowledge base, retrieve evidence, generate teacher-supervised answers, fine-tune the student model, and evaluate whether the final output is supported by the retrieved evidence and safe enough to pass on for review.

Problem

Medical QA is a high-risk generation setting: a fluent answer may still be unsupported, incomplete, or misleading. Pure LLM generation is not enough; the system needs retrieval grounding, distractor-aware training, and evaluation that checks whether answers are actually supported by retrieved evidence.

Solution

I designed the project as an evidence-first workflow: retrieved passages come before answer generation, teacher outputs define the target structure, and evaluation explicitly checks support quality, answer quality, and hallucination patterns. This makes the system more defensible than a pure chat-style medical demo.
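As a minimal sketch of what "retrieved passages come before answer generation" can look like in practice, the helper below places numbered evidence ahead of the question and asks for citations. The function name `format_evidence_prompt` and the prompt wording are assumptions for illustration, not the project's actual template.

```python
from typing import List


def format_evidence_prompt(question: str, passages: List[str]) -> str:
    """Place numbered evidence before the question so the answer can cite it."""
    if not passages:
        # Make the absence of evidence explicit instead of silently omitting it.
        evidence_block = "(no evidence retrieved)"
    else:
        evidence_block = "\n".join(
            f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)
        )
    return (
        "Evidence:\n"
        f"{evidence_block}\n\n"
        f"Question: {question.strip()}\n\n"
        "Answer using only the evidence above, cite passages as [n], "
        "and state clearly when the evidence does not cover the question."
    )


if __name__ == "__main__":
    print(format_evidence_prompt("What is X?", ["A is B.", "C is D."]))
```

Numbering the passages gives the evaluation step something concrete to check: every claim in the answer should trace back to a cited index.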

Architecture

The pipeline runs in four stages: data preparation (document ingestion, text normalization, chunking), retrieval (embedding-based evidence retrieval), training (teacher-model answer generation, distractor construction, prompt formatting, parameter-efficient fine-tuning), and evaluation (comparative inference, retrieval evaluation, hallucination analysis).
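The normalization and chunking steps might be sketched as follows. This assumes character-based windows, which suit Chinese text because it has no whitespace word boundaries; the function names, the `size` and `overlap` defaults, and the choice of NFKC normalization are all illustrative assumptions rather than the project's actual settings. Note that NFKC also folds full-width punctuation and digits to their half-width forms.

```python
import re
import unicodedata
from typing import List


def normalize_text(text: str) -> str:
    """Fold full-width forms with NFKC and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", size=4, overlap=2))
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.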

Core Features

Chinese medical document ingestion and normalization
Vector retrieval for evidence-grounded medical QA
Teacher-supervised sample construction
Distractor-augmented RAFT-style training instances
LoRA/QLoRA fine-tuning on Qwen2.5-7B
Evaluation workflow for retrieval quality, answer quality, and hallucination analysis
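To illustrate the distractor-augmented instances listed above, here is a sketch of one common RAFT-style recipe: mix the supporting passage with sampled distractors, and drop the supporting passage in a fraction of instances. The function name, the field names, and the `num_distractors` and `keep_oracle_prob` values are hypothetical; the project's exact recipe is not specified here.

```python
import random
from typing import Dict, List, Optional


def build_raft_instance(
    question: str,
    oracle: str,
    answer: str,
    pool: List[str],
    num_distractors: int = 3,        # assumed value, for illustration only
    keep_oracle_prob: float = 0.8,   # assumed value, for illustration only
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Combine the supporting passage with distractors sampled from `pool`."""
    if rng is None:
        rng = random.Random()
    candidates = [p for p in pool if p != oracle]
    distractors = rng.sample(candidates, min(num_distractors, len(candidates)))
    # In some instances the supporting passage is withheld on purpose.
    has_oracle = rng.random() < keep_oracle_prob
    context = distractors + ([oracle] if has_oracle else [])
    rng.shuffle(context)  # avoid teaching the model a fixed oracle position
    return {
        "question": question,
        "context": context,
        "answer": answer,
        "has_oracle": has_oracle,
    }


if __name__ == "__main__":
    pool = [f"passage {i}" for i in range(6)]
    print(build_raft_instance("q", "passage 0", "a", pool, rng=random.Random(0)))
```

Recording `has_oracle` on each instance makes it possible to compare model behavior later on supported versus unsupported contexts.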

Tech Stack

Python
Hugging Face
PEFT/LoRA
Vector Store

Implementation Details

Constructed 1,199 teacher-supervised samples and 1,195 distractor-augmented instances for training and comparison.
Fine-tuned Qwen2.5-7B with LoRA/QLoRA to study parameter-efficient domain adaptation for Chinese medical QA.
Built reusable scripts for data normalization, prompt formatting, comparative inference, retrieval-quality analysis, and hallucination inspection.
Designed the project around evidence fidelity instead of only optimizing for fluent medical-style answers.
Separated retrieval quality from generation quality so that weak evidence and weak synthesis could be diagnosed independently.
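Separating retrieval quality from generation quality can be sketched with two standard retrieval metrics plus a simple triage rule. The metric definitions (recall@k, reciprocal rank) are standard; the `diagnose` helper and its labels are hypothetical names for illustration, not the project's scripts.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant passage IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def diagnose(evidence_found: bool, answer_supported: bool) -> str:
    """Attribute a failure to retrieval first, then to generation."""
    if not evidence_found:
        return "weak_evidence"
    if not answer_supported:
        return "weak_synthesis"
    return "ok"


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7"]
    gold = {"d1", "d2"}
    print(recall_at_k(ranked, gold, k=2), reciprocal_rank(ranked, gold))
```

Checking retrieval first matters because a generation score is hard to interpret when the right passage never reached the model.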

Challenges

Chinese medical terminology creates passages that are semantically similar but clinically different, making naive retrieval risky.
Distractor passages are useful for training robustness, but they also require careful prompt and label design to avoid confusing the model.
Fine-tuning can make outputs more polished without making them more faithful, so evidence-level hallucination checks remain necessary.
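An evidence-level check can start from something as simple as the sketch below, which flags answer sentences whose character bigrams rarely appear in the retrieved passages. This is a crude lexical proxy under stated assumptions (the `threshold` value and function names are hypothetical): it cannot catch negation or the semantically similar but clinically different passages described above, so it is a first filter for manual inspection, not a verdict.

```python
import re
from typing import List, Set, Tuple


def char_bigrams(text: str) -> Set[str]:
    """Character bigrams with whitespace removed; works without a tokenizer."""
    chars = re.sub(r"\s+", "", text)
    return {chars[i:i + 2] for i in range(len(chars) - 1)}


def flag_unsupported(
    answer: str, passages: List[str], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (sentence, support) pairs whose bigram overlap is below threshold."""
    evidence: Set[str] = set()
    for passage in passages:
        evidence |= char_bigrams(passage)
    flagged = []
    # Split on Chinese and Western sentence-ending punctuation and newlines.
    for sentence in re.split(r"[。！？!?；;\n]+", answer):
        sentence = sentence.strip()
        grams = char_bigrams(sentence)
        if not grams:
            continue
        support = len(grams & evidence) / len(grams)
        if support < threshold:
            flagged.append((sentence, support))
    return flagged


if __name__ == "__main__":
    passages = ["阿司匹林可用于缓解轻度疼痛"]
    answer = "阿司匹林可用于缓解轻度疼痛。它还能治愈糖尿病。"
    print(flag_unsupported(answer, passages))
```

In this toy input the second sentence shares no bigrams with the passage, so it is the only one flagged for review.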

What I Learned

In medical RAG, dataset construction and evaluation design matter as much as model selection.
RAFT-style learning helped me understand how retrieved context, distractors, and supervised answers interact in domain QA.
A responsible medical QA interface should communicate uncertainty, evidence scope, and unsupported areas rather than only returning confident answers.
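One way to make that last point concrete is to return a structured result instead of a bare string, so the interface always shows evidence scope and uncovered points. The sketch below is an assumption about how such a structure could look; the class name, the field names, and the three confidence levels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GroundedAnswer:
    answer: str
    cited_passages: List[int] = field(default_factory=list)
    confidence: str = "low"  # hypothetical levels: "low", "medium", "high"
    unsupported_points: List[str] = field(default_factory=list)


def render(result: GroundedAnswer) -> str:
    """Render the answer together with its evidence scope and known gaps."""
    lines = [result.answer]
    if result.cited_passages:
        refs = ", ".join(f"[{i}]" for i in result.cited_passages)
        lines.append(f"Evidence: {refs}")
    else:
        lines.append("Evidence: none retrieved; treat this answer as unverified.")
    lines.append(f"Confidence: {result.confidence}")
    for point in result.unsupported_points:
        lines.append(f"Not covered by the evidence: {point}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = GroundedAnswer(
        answer="Example answer text.",
        cited_passages=[1, 3],
        confidence="medium",
        unsupported_points=["dosage for children"],
    )
    print(render(demo))
```

Defaulting `confidence` to "low" means a caller has to justify a higher level explicitly rather than inherit it by accident.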