CV Screening Pipeline

From job description to ranked shortlist — fully automated, evidence-backed, and explainable at every step.

Python 3.11+ · Gemini 2.5 Flash · ChromaDB + RAG · Docling PDF Parsing · Flask Dashboard

Overview

The CV Screening Pipeline is an end-to-end AI-powered recruitment tool built for a Capital Markets law firm. It takes a Job Description and a folder of candidate CVs as input and produces a tiered ranked shortlist with per-candidate scorecards — with no human intervention at any stage.

The scoring rubric is derived from the JD automatically. Criterion weights, domain map, and keyword reference list are all extracted from the JD itself — not hardcoded. Swap the JD and the rubric adapts.

Every score is backed by a sentence quoted directly from the CV. The model is structurally forced to find evidence before it is allowed to assign a number. This makes every decision explainable and auditable.

Who this is for
Built for a Capital Markets law firm hiring junior associates, but the architecture is role-agnostic. The knowledge bases and JD are the only role-specific components.

Key Features

JD-grounded rubric
The rubric is not hardcoded. Gemini reads the JD and extracts criterion weights, a semantic domain map, and a keyword reference list. Every scoring decision traces back to what the JD actually asked for.
Anonymised before scoring
Names, emails, addresses, gender markers, age, and community indicators are stripped before any CV reaches the scoring model. Names are re-attached only at the final output stage, after all scoring is complete.
Evidence-first scoring
The scoring prompt enforces a strict sequence: quote the relevant sentence, analyse what it demonstrates, cross-check against the knowledge base, then assign a score. A score without evidence is structurally impossible.
Isolated scoring
Each candidate is scored in a completely separate API call. No candidate's data is ever in the same context as another candidate's data. This prevents in-context comparison bias — where the model compares candidates against each other rather than against the rubric.
RAG-grounded lookups
Institution tiers, organisation prestige, and Capital Markets terminology are verified against local knowledge bases via ChromaDB. The model reads verified facts you provided — it does not guess from training memory.
Interactive dashboard
Results are served locally at http://127.0.0.1:5000. The tiered ranked list is clickable — every candidate row opens a full evidence-backed scorecard showing JD keywords found and missing, sub-scores, and a plain-English summary.

Tech Stack

| Layer | Technology | Why this was chosen |
|---|---|---|
| Language | Python 3.11+ | Dominant ecosystem for AI pipelines — all required libraries are Python-native |
| LLM | Gemini 2.5 Flash | Free tier available, 1M token context window, strong instruction following |
| PDF parsing | Docling (IBM) | Layout-aware — understands multi-column CVs that standard parsers scramble |
| Vector DB | ChromaDB | Local, no server required, persists to disk, right-sized for this scale |
| Embeddings | all-MiniLM-L6-v2 | Free, runs locally, sufficient for section-level semantic retrieval |
| Orchestration | Plain Python | No framework — every step is visible and debuggable |
| Dashboard | Flask | Lightweight Python web server, no frontend framework needed |

Pipeline Stages

The pipeline runs sequentially. Each stage reads from the previous stage's output and writes to a known file location. main.py orchestrates the sequence.
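In outline, main.py is just a linear sequence of stage calls. A minimal sketch, assuming each stage module exposes a run() entry point (the actual function names may differ):

# Sketch only -- stage entry points are assumed to be named run()
from pipeline import (step0a_jd_processor, step0b_cv_parser,
                      step1_extractor, step2_scorer, step3_output)

def run_pipeline():
    step0a_jd_processor.run()   # JD -> jd_rubric.json (1 API call)
    step0b_cv_parser.run()      # PDFs -> anonymised Markdown (no API calls)
    step1_extractor.run()       # structured profiles, indexed into ChromaDB
    step2_scorer.run()          # one isolated scoring call per candidate
    step3_output.run()          # aggregate, tier, re-attach names, launch dashboard

if __name__ == "__main__":
    run_pipeline()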

Stage 0A · JD Processing
Gemini reads the Job Description and extracts criterion weights, a semantic domain map, and a JD keyword reference list. Saved to jd_rubric.json. Called once per run — never repeated per CV.
1 API call total · step0a_jd_processor.py
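For illustration, the saved rubric might look like the excerpt below. The field names are assumptions, not the actual schema; the weights match the default rubric described later in this document.

{
  "criterion_weights": {
    "institution_prestige": 15,
    "internship_quality": 25,
    "technical_knowledge": 45,
    "academic_performance": 10
  },
  "domain_map": {
    "capital_markets": ["IPO", "prospectus", "due diligence"]
  },
  "jd_keywords": ["SEBI", "listing", "securities", "disclosure"]
}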
Stage 0B · CV Parsing and Anonymisation
Docling converts each PDF to clean Markdown using layout analysis. Each candidate is assigned an anonymous ID. PII is stripped. The original filename mapping is stored separately and never reaches the scoring model.
No API calls · step0b_cv_parser.py
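The conversion itself is a small piece of code. A minimal sketch using Docling's DocumentConverter (the file path is illustrative):

# Layout-aware PDF -> Markdown conversion with Docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()                             # downloads models on first run
result = converter.convert("data/input/cvs/candidate.pdf")  # illustrative path
markdown = result.document.export_to_markdown()             # clean Markdown output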
Stage 1 · Extraction and Indexing
CVs are batched in groups of 5 and sent to Gemini. For each CV the model extracts structured JSON — degree, institution, CGPA, internships with domain terms found, publications, moot courts. Missing fields return null, never inferred. Profiles and knowledge bases are indexed into ChromaDB.
1 call per 5 CVs · step1_extractor.py
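The batching is plain list slicing. A minimal sketch:

# Group parsed CVs into batches of EXTRACTION_BATCH_SIZE for one Gemini call each
def batches(items, size=5):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 100 CVs -> 20 extraction calls at the default batch size of 5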
Stage 2 · Scoring
Each candidate is scored in a completely isolated Gemini call. Before calling the model, relevant CV sections and knowledge base entries are retrieved from ChromaDB via RAG and injected into the prompt. The prompt enforces evidence-first scoring: quote, analyse, cross-check, score.
1 call per candidate — never batched · step2_scorer.py
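A sketch of the scoring loop; the retrieval and Gemini helpers shown here are hypothetical names, not the actual functions in step2_scorer.py:

# One isolated Gemini call per candidate -- no shared context between candidates
for candidate_id in candidate_ids:
    cv_sections = retrieve_cv_sections(candidate_id)        # hypothetical helper: this candidate's CV only
    kb_entries = retrieve_kb_entries(cv_sections, top_k=3)  # hypothetical helper: RAG_TOP_K verified facts
    prompt = build_scoring_prompt(rubric, cv_sections, kb_entries)  # quote, analyse, cross-check, score
    scorecard = call_gemini(prompt)                         # fresh context on every call
    save_scorecard(candidate_id, scorecard)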
Stage 3 · Output Generation
All scorecards are aggregated. Candidates are tiered and sorted. Names are re-attached from the mapping file for the first time since parsing. The output files are generated and the Flask dashboard is launched.
No API calls · step3_output.py

Scoring Rubric

The rubric has three passes. The weights shown are defaults — actual weights for each run are derived from the JD by the LLM.

Pass 1 — Hard Gate

Pass or fail. No partial credit. Candidates who fail either gate exit the pipeline. No scoring API calls are made for failed candidates.

| Gate | Requirement | Fail condition |
|---|---|---|
| Degree type | LLB, BA LLB, or LLB+LLM from a recognised Indian institution | No qualifying law degree found in CV |
| Degree validity | Currently pursuing or completed a qualifying degree | No law education found at all |
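In code, the gate reduces to a cheap boolean check that runs before any scoring call. A sketch with assumed profile field names:

QUALIFYING_DEGREES = {"LLB", "BA LLB", "LLB+LLM"}

def passes_hard_gate(profile):
    # Gate 1: a qualifying law degree must appear in the extracted profile
    if profile.get("degree") not in QUALIFYING_DEGREES:
        return False
    # Gate 2: the degree must be completed or currently being pursued
    return profile.get("degree_status") in {"completed", "pursuing"}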

Pass 2 — Scored out of 100

| Criterion | Weight | Notes |
|---|---|---|
| Institution prestige | 15 pts | NLU = full marks. Recognised private law school = partial. Unknown = minimal. RAG-verified against the institution tier database. |
| Internship quality | 25 pts | Scored on two conditions: organisation reputation AND depth of work described. Full marks require both. RAG-verified against the organisation prestige database. |
| Technical knowledge (highest weight) | 45 pts | Split into two sub-scores: JD alignment 36 pts (80%) and semantic depth 9 pts (20%). Rewards both exact regulatory knowledge and genuine domain understanding. |
| Academic performance | 10 pts | CGPA bands: 8.5+ = full, 7.5–8.4 = competitive, 7.0–7.4 = baseline, below 7.0 = minimal. Boosted by prizes, class rank, merit list mentions. |

Pass 3 — Bonus Points (capped at +15)

| Criterion | Max | Notes |
|---|---|---|
| Moot court | +5 | Corporate or CapM-themed competitions score higher |
| Publications | +5 | Recognised law journal; CapM or securities topic preferred |
| International exposure | +4 | Foreign internship or exchange semester |
| Leadership roles | +3 | Secretary or president of a society, etc. |
| Languages | +3 | Working proficiency beyond English; +1 per language |

Tier Assignment

Final score = Pass 2 (max 100) + Pass 3 (max +15), giving a 115-point scale.

| Tier | Score | Action |
|---|---|---|
| Strong | 90–115 | Recommend for interview |
| Competitive | 70–89 | Consider for interview |
| Weak | 50–69 | Hold — interview if the Strong pool is thin |
| Below Threshold | 0–49 | Do not proceed |
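The tier boundaries translate directly into a threshold function. A sketch:

def assign_tier(pass2_score, bonus_points):
    total = min(pass2_score, 100) + min(bonus_points, 15)   # 115-point scale
    if total >= 90:
        return "Strong"
    if total >= 70:
        return "Competitive"
    if total >= 50:
        return "Weak"
    return "Below Threshold"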

RAG and Knowledge Bases

RAG (Retrieval-Augmented Generation) grounds the scoring model in verified facts. Instead of relying on the model's training memory, the pipeline retrieves specific entries from local knowledge bases and injects them into the scoring prompt.

Three knowledge bases are stored in knowledge_base/ and indexed into ChromaDB at the start of each run.

| File | Contents | Used for |
|---|---|---|
| institutions.json | All major Indian law schools with tier, NLU status, city, placement notes | Criterion 2.1 — institution prestige scoring |
| organisations.json | Law firms (Tier 1/2/3), regulators, exchanges, investment banks with CapM relevance tags | Criterion 2.2 — internship organisation quality |
| capm_terms.json | 26 Capital Markets terms with definitions, correct usage examples, and shallow usage examples | Criterion 2.3 — technical knowledge depth scoring |
Keeping knowledge bases current
Law firm tiers, NIRF rankings, and SEBI regulations change. Review and update the knowledge base JSON files at the start of each hiring cycle.
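Lookups use ChromaDB's standard query API. A sketch (the storage path and collection name are assumptions):

import chromadb

client = chromadb.PersistentClient(path="chroma_db")            # assumed on-disk location
institutions = client.get_or_create_collection("institutions")

# Retrieve the RAG_TOP_K=3 closest knowledge base entries for an institution in a CV
hits = institutions.query(query_texts=["National Law School of India University"],
                          n_results=3)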

API Call Strategy

The pipeline makes the minimum number of API calls consistent with quality and reliability.

| Stage | Calls | Reasoning |
|---|---|---|
| JD parsing | 1 total | JD does not change between candidates |
| Extraction | 1 per 5 CVs | Batching is safe — extraction is mechanical, no quality loss |
| Scoring | 1 per candidate | Never batched — isolation prevents model drift and comparison bias |
| Total for 100 CVs | ~121 calls | Down from 700+ with a naive single-call-per-step approach |
Why scoring is never batched
When multiple candidates share a context window, the model compares them against each other rather than against the rubric. Candidate 3's strong profile implicitly becomes the new baseline, making Candidate 7's identical profile score lower. This is called in-context comparison bias. Isolation prevents it.
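The ~121 total for 100 CVs is simple arithmetic: one JD call, one extraction call per batch of five, and one scoring call per candidate.

import math

n_cvs = 100
total_calls = 1 + math.ceil(n_cvs / 5) + n_cvs   # JD + extraction batches + scoring
print(total_calls)                               # 121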

Installation

Prerequisites

  • Python 3.11 or higher
  • A Google AI Studio API key — aistudio.google.com
  • Gemini 2.5 Flash enabled on your project — verify at ai.dev/rate-limit
  • 4GB free disk space (Docling downloads ML models on first run)

Steps

# 1. Clone the repository
git clone https://github.com/yourusername/cv-screening-pipeline.git
cd cv-screening-pipeline

# 2. Create virtual environment
python -m venv venv

# 3. Activate it
source venv/bin/activate        # Mac / Linux
venv\Scripts\activate           # Windows

# 4. Install dependencies
pip install -r requirements.txt

# 5. Set up environment file
cp .env.example .env
# Open .env and add your GEMINI_API_KEY

# 6. Verify setup
python -c "import config; print('Setup successful')"

Configuration

Environment variables (.env)

GEMINI_API_KEY=your_gemini_api_key_here

This is the only secret the project requires. Never commit this file. The .env.example template is safe to commit.

Application settings (config.py)

| Setting | Default | When to change |
|---|---|---|
| GEMINI_MODEL | gemini-2.5-flash | If the model is unavailable on your API key — check your rate limits page |
| EXTRACTION_BATCH_SIZE | 5 | Reduce to 3 for long CVs, increase to 10 for short CVs |
| DASHBOARD_PORT | 5000 | If port 5000 is already in use on your machine |
| RAG_TOP_K | 3 | Number of knowledge base results to retrieve per lookup |
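Taken together, a config.py matching the table above could be as small as the sketch below; reading the key via python-dotenv is an assumption.

# Hypothetical shape of config.py -- setting names match the table above
import os
from dotenv import load_dotenv

load_dotenv()                                    # pulls GEMINI_API_KEY from .env
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]

GEMINI_MODEL = "gemini-2.5-flash"
EXTRACTION_BATCH_SIZE = 5
DASHBOARD_PORT = 5000
RAG_TOP_K = 3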

Usage

Running the full pipeline

# Add your JD
# Replace data/input/job_description.txt with your actual JD

# Add candidate CVs
# Place PDF files in data/input/cvs/

# Run
python main.py

# Browser opens automatically at http://127.0.0.1:5000

Restarting the dashboard without re-running

python dashboard/app.py

Running a single stage

python -m pipeline.step0a_jd_processor   # Re-parse JD only
python -m pipeline.step0b_cv_parser      # Re-parse CVs only
python -m pipeline.step1_extractor       # Re-extract profiles
python -m pipeline.step2_scorer          # Re-score all candidates
python -m pipeline.step3_output          # Re-generate output files

Output Files

| File | Format | Purpose |
|---|---|---|
| data/output/ranked_list.json | JSON | Full tiered ranking — feeds the dashboard |
| data/output/report.md | Markdown | Human-readable report — suitable for emailing to a hiring partner |
| data/output/scored/C-XXX_scorecard.json | JSON | Full evidence trail per candidate — all scores with quoted sentences |
| data/output/decision_receipts/C-XXX_receipt.json | JSON | Lightweight audit record — minimum data to defend any decision |

Troubleshooting

Model not found (404)

Check ai.dev/rate-limit and use whichever model shows a non-zero RPM. Update GEMINI_MODEL in config.py.

Quota exhausted (429)

Free tier daily limit reached. Wait 24 hours. Alternatively, enable billing on your Google Cloud project for higher quotas.

Pipeline appears frozen after Docling shows 100%

It is not frozen. After downloading its models, Docling silently loads roughly 560MB of neural network weights into RAM, which takes 5–15 minutes on the first run with no progress indicator. Do not press Ctrl+C; wait. Every subsequent run takes under 30 seconds because the models are cached.

ChromaDB telemetry messages

Harmless. To suppress them, add the following lines at the very top of main.py, before all other imports:

import os
os.environ["ANONYMIZED_TELEMETRY"] = "False"
os.environ["CHROMA_TELEMETRY"] = "False"

Port already in use

Change DASHBOARD_PORT in config.py to any free port such as 5001.

Scoring inconsistency between runs

The same candidate may score 5–10 points differently between runs due to LLM non-determinism even at low temperature. This is expected. Strong and weak candidates score consistently. Borderline candidates may shift between adjacent tiers.

Project Structure

cv_screening_pipeline/
|-- main.py                    Entry point and orchestrator
|-- config.py                  All configuration
|-- requirements.txt
|-- .env.example
|
|-- data/
|   |-- input/
|   |   |-- job_description.txt
|   |   |-- cvs/               Drop PDFs here
|   |-- processed/             Auto-generated during run
|   |-- output/                Auto-generated during run
|
|-- knowledge_base/
|   |-- institutions.json
|   |-- organisations.json
|   |-- capm_terms.json
|
|-- pipeline/
|   |-- step0a_jd_processor.py
|   |-- step0b_cv_parser.py
|   |-- step1_extractor.py
|   |-- step2_scorer.py
|   |-- step3_output.py
|
|-- rag/
|   |-- embedder.py
|   |-- indexer.py
|   |-- retriever.py
|
|-- prompts/
|   |-- jd_parse_prompt.txt
|   |-- extract_and_anonymise_prompt.txt
|   |-- score_all_criteria.txt
|
|-- dashboard/
|   |-- app.py
|   |-- templates/
|   |-- static/
|
|-- docs/
    |-- index.html             This documentation site

Roadmap

Short term

  • Role-based rubric switching — --role flag selects Junior, Senior, or Partner-track weight profiles
  • Scoring consistency check — run borderline candidates twice and flag when scores differ by more than 8 points
  • Proactive scorecard flags — highlight missing fields, weak performance on high-weight criteria, and narrow gate passes

Medium term

  • Hosted document parsing — replace local Docling with AWS Textract or Azure Document Intelligence for production deployments
  • REST API wrapper — accept job IDs asynchronously, expose a status endpoint, enable ATS integration
  • Persona matching score — evaluate whether the candidate's overall profile matches the target role persona, once calibration data is available

Long term

  • Multi-language CV support — extend to English-Tamil, English-Malayalam, and other South Indian mixed-language CVs
  • Bias audit module — statistical check for correlations between scores and protected characteristics after each hiring cycle
  • Calibration loop — compare rankings against actual hiring outcomes over time and adjust rubric weights accordingly