RAG / LLM App / Full-stack
2025
Role: RAG solution design, data-pipeline design, evaluation framing, hallucination analysis
MedRAFT Medical RAG QA System
A medical RAG system that connects knowledge-base ingestion, evidence retrieval, teacher-supervised data generation, RAFT-style training, and multi-dimensional evaluation for safer Chinese medical QA.
Impact: Turned a course project into a full product-style RAG case study by clarifying the evidence flow, answer structure, confidence design, and evaluation dimensions required in a high-risk domain.
Overview
MedRAFT is a Chinese medical QA project structured as a product-style RAG workflow: build the knowledge base, retrieve evidence, generate teacher-supervised answers, fine-tune the student model, and evaluate whether the final output is supported by the retrieved evidence and fit for human review.
Problem
Medical QA is a high-risk generation setting: a fluent answer may still be unsupported, incomplete, or misleading. Pure LLM generation is not enough; the system needs retrieval grounding, distractor-aware training, and evaluation that checks whether answers are actually supported by retrieved evidence.
Solution
I designed the project as an evidence-first workflow: retrieved passages come before answer generation, teacher outputs define the target structure, and evaluation explicitly checks support quality, answer quality, and hallucination patterns. This makes the system more defensible than a pure chat-style medical demo.
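One way to picture the evidence-first ordering is a prompt builder that always places numbered passages ahead of the question. This is a minimal sketch, not the project's actual template: the `Passage` type, the `build_prompt` name, and the instruction wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    """A retrieved evidence passage (hypothetical structure)."""
    doc_id: str
    text: str


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Format a prompt with numbered evidence placed before the question."""
    evidence = "\n".join(
        f"[{i}] ({p.doc_id}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    # Assumed instruction text: ask for citations and an explicit fallback.
    return (
        "Evidence:\n"
        f"{evidence}\n\n"
        f"Question: {question}\n"
        "Answer using only the evidence above and cite passage numbers. "
        "If the evidence is insufficient, say so."
    )
```

Numbering the passages gives the answer something concrete to cite, which later makes evidence-level checks easier to run.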
Architecture
The pipeline includes document ingestion, text normalization, chunking, embedding-based retrieval, teacher-model answer generation, distractor construction, prompt formatting, parameter-efficient fine-tuning, comparative inference, retrieval evaluation, and hallucination analysis.
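The chunking and retrieval stages can be sketched in a few lines of standard-library Python. This is an illustration only: the character-bigram `embed` function stands in for a real embedding model, and the window size, overlap, and function names are assumptions rather than the project's settings.

```python
import math
from collections import Counter


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: character-bigram counts (stand-in for a real encoder)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

Character bigrams need no word segmentation, which keeps the toy usable on Chinese text, but they capture surface overlap only and are no substitute for a trained encoder.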
Core Features
- Chinese medical document ingestion and normalization
- Vector retrieval for evidence-grounded medical QA
- Teacher-supervised sample construction
- Distractor-augmented RAFT-style training instances
- LoRA/QLoRA fine-tuning on Qwen2.5-7B
- Evaluation workflow for retrieval quality, answer quality, and hallucination analysis
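A distractor-augmented RAFT-style instance can be sketched as follows. In the RAFT recipe, some training instances keep the golden passage among distractors while the rest contain distractors only; whether this project dropped the golden passage, and at what rate, is not stated here, so `p_keep_golden`, `n_distractors`, and the field names are assumptions.

```python
import random


def build_raft_instance(question, golden, pool, teacher_answer,
                        n_distractors=3, p_keep_golden=0.8, rng=None):
    """Assemble one training instance: question, mixed context, teacher answer.

    golden: the passage that supports the teacher answer.
    pool: candidate passages from which distractors are sampled.
    p_keep_golden: assumed probability of keeping the golden passage in context.
    """
    if rng is None:
        rng = random.Random()
    candidates = [p for p in pool if p != golden]
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))
    has_golden = rng.random() < p_keep_golden
    context = distractors + ([golden] if has_golden else [])
    rng.shuffle(context)  # avoid teaching the model a fixed golden position
    return {
        "question": question,
        "context": context,
        "answer": teacher_answer,
        "has_golden": has_golden,
    }
```

Passing a seeded `random.Random` makes instance construction reproducible, which helps when comparing runs trained on the plain and distractor-augmented sets.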
Tech Stack
- Python
- Hugging Face
- PEFT/LoRA
- Vector Store
Implementation Details
- Constructed 1,199 teacher-supervised samples and 1,195 distractor-augmented instances for training and comparison.
- Fine-tuned Qwen2.5-7B with LoRA/QLoRA to study parameter-efficient domain adaptation for Chinese medical QA.
- Built reusable scripts for data normalization, prompt formatting, comparative inference, retrieval-quality analysis, and hallucination inspection.
- Designed the project around evidence fidelity instead of only optimizing for fluent medical-style answers.
- Separated retrieval quality from generation quality so that weak evidence and weak synthesis could be diagnosed independently.
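Separating retrieval quality from generation quality can be sketched as two independent measurements plus a simple triage rule. The metric choice (recall@k and reciprocal rank), the function names, and the failure labels are assumptions for illustration; the project's actual metrics are not listed here.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant passages found in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def diagnose(retrieval_hit: bool, answer_supported: bool) -> str:
    """Attribute a failure to the retrieval stage or the generation stage."""
    if not retrieval_hit:
        return "weak_evidence"   # generator never saw the right passage
    if not answer_supported:
        return "weak_synthesis"  # right passage retrieved, answer still off
    return "ok"
```

Checking retrieval first means a bad answer built on missing evidence is counted against the retriever rather than the fine-tuned model.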
Challenges
- Chinese medical terminology yields passages that are semantically similar but clinically different, so naive similarity-based retrieval can surface the wrong evidence.
- Distractor passages are useful for training robustness, but they also require careful prompt and label design to avoid confusing the model.
- Fine-tuning can make outputs more polished without making them better supported, so evidence-level hallucination checks are still required.
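An evidence-level check can be approximated by scoring each answer sentence against the retrieved passages and flagging the ones with little overlap. The sketch below uses character-bigram overlap as a crude lexical proxy; it is not the method used in this project, and the threshold and function names are assumptions. A real check would more likely rely on an entailment model or human review.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on Chinese or Western sentence-ending punctuation (crude)."""
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def bigrams(text: str) -> set[str]:
    """Character bigrams of the text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}


def support_score(sentence: str, evidence: list[str]) -> float:
    """Share of the sentence's bigrams that also appear in any evidence passage."""
    sent = bigrams(sentence)
    if not sent:
        return 0.0
    seen = set().union(*(bigrams(p) for p in evidence))
    return len(sent & seen) / len(sent)


def flag_unsupported(answer: str, evidence: list[str], threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose support score falls below the threshold."""
    return [
        s for s in split_sentences(answer)
        if support_score(s, evidence) < threshold
    ]
```

Lexical overlap misses paraphrases and cannot detect negation, so flagged sentences are candidates for inspection rather than confirmed hallucinations.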
What I Learned
- In medical RAG, dataset construction and evaluation design matter as much as model selection.
- RAFT-style learning helped me understand how retrieved context, distractors, and supervised answers interact in domain QA.
- A responsible medical QA interface should communicate uncertainty, evidence scope, and unsupported areas rather than only returning confident answers.
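The last point can be made concrete with an answer structure that carries its evidence scope and unsupported claims alongside the text. This is a hypothetical sketch: the `GroundedAnswer` type, its fields, and the three-level confidence rule are assumptions, not the project's interface.

```python
from dataclasses import dataclass, field


@dataclass
class GroundedAnswer:
    """An answer bundled with its evidence scope (hypothetical structure)."""
    answer: str
    cited_passages: list[int]
    unsupported_claims: list[str] = field(default_factory=list)

    @property
    def confidence(self) -> str:
        """Assumed rule: no evidence -> low; any unsupported claim -> medium."""
        if not self.cited_passages:
            return "low"
        if self.unsupported_claims:
            return "medium"
        return "high"

    def render(self) -> str:
        """Render the answer with confidence, citations, and unsupported areas."""
        cites = ", ".join(f"[{i}]" for i in self.cited_passages) or "none retrieved"
        lines = [self.answer, f"Confidence: {self.confidence}", f"Evidence: {cites}"]
        if self.unsupported_claims:
            lines.append("Not supported by retrieved evidence:")
            lines.extend(f"- {c}" for c in self.unsupported_claims)
        return "\n".join(lines)
```

Deriving the confidence label from evidence coverage, rather than from the model's own phrasing, keeps a fluent but unsupported answer from being presented as certain.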