
How to Automate Exam Grading with RAG and CLIP

2025/12/16 03:00

Grading is the bottleneck of education. Teachers spend hundreds of hours manually reviewing descriptive answers, checking diagrams, and deciphering handwriting. It’s subjective, exhausting, and prone to inconsistency.

While Multiple Choice Questions (MCQs) are easy to automate, Descriptive and Diagrammatic answers have always been the "final boss" for EdTech.

Most existing solutions rely on simple keyword matching (TF-IDF) or basic BERT models, which fail to understand context or evaluate visual diagrams. In this guide, we are going to build a system that solves this using Retrieval-Augmented Generation (RAG) and Multimodal AI.

We will architect a solution that:

  1. Ingests textbooks to create a "Ground Truth" knowledge base.
  2. Uses Local LLMs (Mistral via Ollama) to generate model answers.
  3. Uses Semantic Search to grade text.
  4. Uses CLIP to grade student diagrams.

Let’s build.

The Architecture: A Dual-Pipeline System

We are building a pipeline that handles two distinct data types: Text and Images. We cannot rely on the LLM's internal knowledge alone (hallucination risk), so we ground it in a Vector Database created from the course textbooks.

Here is the high-level data flow:

  • Text pipeline: PDF textbook → text extraction → chunking → FAISS vector store → RAG-generated model answer → semantic similarity score against the student's answer.
  • Image pipeline: diagram extraction → OCR labels → CLIP embeddings → visual similarity score against the student's diagram.
  • Both scores feed into a weighted final grade.

The Tech Stack

  • LLM Runtime: Ollama (running Mistral 7B)
  • Orchestration: LangChain
  • Vector DB: FAISS (CPU optimized)
  • Embeddings (Text): thenlper/gte-base or all-MiniLM-L6-v2
  • Embeddings (Image): OpenAI CLIP (ViT-B-32)
  • OCR: PaddleOCR (for extracting labels from diagrams)

Phase 1: The Knowledge Base (Ingestion)

First, we need to turn a static PDF textbook into a queryable knowledge base. We don't just want the text; we also need to extract diagrams and their captions so we can grade visual questions later.

The Extraction Logic

We use pdfplumber for text and PaddleOCR to find diagram labels.

import pdfplumber
from paddleocr import PaddleOCR

def ingest_textbook(pdf_path):
    ocr = PaddleOCR(use_angle_cls=True, lang='en')
    documents = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # 1. Extract text (extract_text() can return None on image-only pages)
            text = page.extract_text() or ""
            # 2. Extract images (pseudo-code for brevity)
            # In production, use fitz (PyMuPDF) to extract binary image data
            images = extract_images_from_page(page)
            # 3. Run OCR on each image to recover captions/labels
            for img in images:
                result = ocr.ocr(img, cls=True)
                caption = " ".join(line[1][0] for line in result[0])
                # Associate the diagram with its surrounding text context
                documents.append({
                    "content": text + "\n[DIAGRAM: " + caption + "]",
                    "type": "mixed"
                })
    return documents
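The extract_images_from_page helper above is left as pseudo-code. Here is a minimal sketch of how it might look with PyMuPDF (fitz), as the comment suggests. The (pdf_path, page_index) signature, the OpenCV decoding step, and the hand-off as NumPy arrays (which PaddleOCR accepts) are illustrative assumptions, not the author's implementation:

import fitz  # PyMuPDF
import numpy as np
import cv2

def extract_images_from_page(pdf_path, page_index):
    # Assumed signature: the PDF path plus a 0-based page index
    doc = fitz.open(pdf_path)
    images = []
    for img_info in doc[page_index].get_images(full=True):
        xref = img_info[0]
        # Pull the raw image bytes and decode them into a BGR array for PaddleOCR
        img_bytes = doc.extract_image(xref)["image"]
        arr = np.frombuffer(img_bytes, dtype=np.uint8)
        decoded = cv2.imdecode(arr, cv2.IMREAD_COLOR)
        if decoded is not None:
            images.append(decoded)
    doc.close()
    return images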

Once extracted, we chunk the text (500 characters with overlap) and store it in FAISS.
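A minimal sketch of that chunk-and-index step, using the stock LangChain splitter; the 50-character overlap and the science_class10.pdf path are illustrative assumptions (the article only specifies 500-character chunks "with overlap"):

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

docs = ingest_textbook("science_class10.pdf")  # hypothetical textbook path

# 500-character chunks; the 50-character overlap is an assumed value
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text("\n\n".join(d["content"] for d in docs))

# Embed the chunks and persist the index loaded in Phase 2
embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
vectorstore = FAISS.from_texts(chunks, embeddings)
vectorstore.save_local("textbook_index")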

Phase 2: Generating the "Perfect" Answer (RAG)

To grade a student, we first need to know what the correct answer looks like. We don't rely on a teacher's answer key alone; we generate a dynamic model answer from the textbook to ensure it matches the curriculum exactly.

We use LangChain to retrieve the relevant context and Mistral to synthesize the answer.

from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# 1. Set up embeddings & vector store
# (newer LangChain versions may require allow_dangerous_deserialization=True)
embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
vectorstore = FAISS.load_local("textbook_index", embeddings)

# 2. Set up the local LLM via Ollama
llm = Ollama(model="mistral")

# 3. Create the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

def generate_model_answer(question):
    # Prompt tuned for academic precision
    prompt = f"""
    You are a science teacher. Answer the following question
    based ONLY on the context provided.
    Question: {question}
    Answer within 50-80 words.
    """
    result = qa_chain.invoke({"query": prompt})
    return result['result']
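A quick usage check, with a hypothetical question:

question = "Explain the process of photosynthesis."  # hypothetical example
print(generate_model_answer(question))  # a 50-80 word answer grounded in the textbook chunks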

Phase 3: Grading the Text (Semantic Similarity)

Now we compare the student's answer against the generated model answer.

We avoid exact keyword matching because students phrase the same idea differently. Instead, we use cosine similarity on sentence embeddings.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def grade_text_response(student_ans, model_ans):
    # Encode both answers into dense sentence embeddings
    embedding_1 = model.encode(student_ans, convert_to_tensor=True)
    embedding_2 = model.encode(model_ans, convert_to_tensor=True)
    # Calculate cosine similarity
    score = util.pytorch_cos_sim(embedding_1, embedding_2)
    return score.item()  # Between -1 and 1; natural-language pairs typically land in 0 to 1
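A quick sanity check of why this beats keyword matching: a paraphrase with little keyword overlap should still score high. Both answers here are hypothetical examples:

# Hypothetical paraphrase pair with almost no shared keywords
model_ans = "Photosynthesis converts light energy into chemical energy stored as glucose."
student_ans = "Plants capture sunlight and turn it into sugar, storing the energy chemically."
print(grade_text_response(student_ans, model_ans))  # typically scores well above 0.6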

Note: In our experiments, a raw similarity score of 0.85+ usually correlates with full marks. We scale the scores: anything above 0.85 earns 100%, and anything below 0.25 earns 0%, matching the thresholds in Phase 5.

Phase 4: Grading the Diagrams (CLIP)

This is the hardest part. How do you grade a hand-drawn diagram of a "Neuron" or "Flower"?

We use CLIP (Contrastive Language-Image Pre-Training). CLIP maps images and text into a shared embedding space, so semantically similar images produce nearby embeddings. We compare the embedding of the student's drawing (or uploaded photo) against the embedding of the "Gold Standard" diagram from the textbook.

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def grade_diagram(student_img_path, textbook_img_path):
    image1 = Image.open(student_img_path)
    image2 = Image.open(textbook_img_path)
    # Preprocess both images into model inputs
    inputs = processor(images=[image1, image2], return_tensors="pt", padding=True)
    # Get image embeddings (no gradients needed at inference)
    with torch.no_grad():
        outputs = model.get_image_features(**inputs)
    # L2-normalize so the dot product equals cosine similarity
    outputs = outputs / outputs.norm(p=2, dim=-1, keepdim=True)
    similarity = (outputs[0] @ outputs[1]).item()
    return similarity

Phase 5: The Final Grading Algorithm

Finally, we aggregate the scores based on the question type. If a question requires both text and a diagram, we apply weights.

The Logic:

  1. Length Check: If the student's answer is too short (<30% of the expected length), apply a penalty (a sketch of this check follows the grading function below).
  2. Weighted Scoring: Final Score = (TextScore * 0.7) + (DiagramScore * 0.3)
  3. Thresholding:

| Similarity Score | Grade Percentage |
|----|----|
| > 0.85 | 100% (Full Marks) |
| 0.6 - 0.85 | 50% (Half Marks) |
| 0.25 - 0.6 | 25% |
| < 0.25 | 0% |

def calculate_final_grade(text_sim, img_sim, max_marks, has_diagram=False):
    if has_diagram:
        # 70% weight to text, 30% to the diagram
        combined_score = (text_sim * 0.7) + (img_sim * 0.3)
    else:
        combined_score = text_sim
    # Apply grade thresholds
    if combined_score > 0.85:
        marks = max_marks
    elif combined_score > 0.6:
        marks = max_marks * 0.5
    elif combined_score > 0.25:
        marks = max_marks * 0.25
    else:
        marks = 0
    return round(marks, 1)
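The length check from step 1 is not part of the function above; a minimal sketch follows. The 30% cutoff comes from the article, while measuring length in words and the 0.5 penalty factor are assumptions:

def apply_length_penalty(combined_score, student_ans, model_ans):
    # Penalize answers shorter than 30% of the expected (model answer) length
    expected_len = len(model_ans.split())
    if len(student_ans.split()) < 0.3 * expected_len:
        return combined_score * 0.5  # assumed penalty factor
    return combined_score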

Results and Reality Check

We tested this on CBSE Class 10 Science papers.

  • Time Saved: Manual grading took ~20 minutes per paper. The AI took 5-6 minutes.
  • Accuracy: The system achieved high alignment with human graders on descriptive answers.
  • Challenge: CLIP struggles if the student's diagram is rotated or poorly lit. The text grader can sometimes be too lenient if the student uses the right keywords but in the wrong order.

Conclusion

We have moved beyond simple multiple-choice scanners. By combining RAG for factual grounding and CLIP for visual understanding, we can build automated grading systems that are fair, consistent, and tireless.

This architecture isn't just for schools; it applies to technical interviews, certification exams, and automated compliance checking.

Ready to build? Start by installing Ollama and getting your vector store running. The future of education is automated.
