CV Screening Pipeline

From job description to ranked shortlist — fully automated, evidence-backed, and explainable at every step.

Python 3.11+ · Gemini 2.5 Flash · ChromaDB + RAG · Docling PDF Parsing · Flask Dashboard

Overview

The CV Screening Pipeline is an end-to-end AI-powered recruitment tool built for a Capital Markets law firm. It takes a Job Description and a folder of candidate CVs as input and produces a tiered ranked shortlist with per-candidate scorecards — with no human intervention at any stage.

The scoring rubric is derived from the JD automatically. Criterion weights, domain map, and keyword reference list are all extracted from the JD itself — not hardcoded. Swap the JD and the rubric adapts.

Every score is backed by a sentence quoted directly from the CV. The model is structurally forced to find evidence before it is allowed to assign a number. This makes every decision explainable and auditable.

Who this is for
Built for a Capital Markets law firm hiring junior associates, but the architecture is role-agnostic. The knowledge bases and JD are the only role-specific components.

Key Features

JD-grounded rubric
The rubric is not hardcoded. Gemini reads the JD and extracts criterion weights, a semantic domain map, and a keyword reference list. Every scoring decision traces back to what the JD actually asked for.
Anonymised before scoring
Names, emails, addresses, gender markers, age, and community indicators are stripped before any CV reaches the scoring model. Names are re-attached only at the final output stage, after all scoring is complete.
Evidence-first scoring
The scoring prompt enforces a strict sequence: quote the relevant sentence, analyse what it demonstrates, cross-check against the knowledge base, then assign a score. A score without evidence is structurally impossible.
Isolated scoring
Each candidate is scored in a completely separate API call. No candidate's data is ever in the same context as another candidate's data. This prevents in-context comparison bias — where the model compares candidates against each other rather than against the rubric.
RAG-grounded lookups
Institution tiers, organisation prestige, and Capital Markets terminology are verified against local knowledge bases via ChromaDB. The model reads verified facts you provided — it does not guess from training memory.
Interactive dashboard
Results are served locally at http://127.0.0.1:5000. The tiered ranked list is clickable — every candidate row opens a full evidence-backed scorecard showing JD keywords found and missing, sub-scores, and a plain-English summary.

Tech Stack

| Layer | Technology | Why this was chosen |
|---|---|---|
| Language | Python 3.11+ | Dominant ecosystem for AI pipelines — all required libraries are Python-native |
| LLM | Gemini 2.5 Flash | Free tier available, 1M token context window, strong instruction following |
| PDF parsing | Docling (IBM) | Layout-aware — understands multi-column CVs that standard parsers scramble |
| Vector DB | ChromaDB | Local, no server required, persists to disk, right-sized for this scale |
| Embeddings | all-MiniLM-L6-v2 | Free, runs locally, sufficient for section-level semantic retrieval |
| Orchestration | Plain Python | No framework — every step is visible and debuggable |
| Dashboard | Flask | Lightweight Python web server, no frontend framework needed |

Pipeline Stages

The pipeline runs sequentially. Each stage reads from the previous stage's output and writes to a known file location. main.py orchestrates the sequence.
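In outline, main.py is just a linear sequence of stage calls. A minimal sketch, assuming each stage module exposes a run() entry point (the actual function names may differ):

# Sketch only -- stage entry points are assumed to be named run()
from pipeline import (step0a_jd_processor, step0b_cv_parser,
                      step1_extractor, step2_scorer, step3_output)

def run_pipeline():
    step0a_jd_processor.run()   # JD -> jd_rubric.json (1 API call)
    step0b_cv_parser.run()      # PDFs -> anonymised Markdown (no API calls)
    step1_extractor.run()       # structured profiles, indexed into ChromaDB
    step2_scorer.run()          # one isolated scoring call per candidate
    step3_output.run()          # aggregate, tier, re-attach names, launch dashboard

if __name__ == "__main__":
    run_pipeline()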

Stage 0A · JD Processing
Gemini reads the Job Description and extracts criterion weights, a semantic domain map, and a JD keyword reference list. Saved to jd_rubric.json. Called once per run — never repeated per CV.
1 API call total · step0a_jd_processor.py
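For illustration, the saved rubric might look like the excerpt below. The field names are assumptions, not the actual schema; the weights match the default rubric described later in this document.

{
  "criterion_weights": {
    "institution_prestige": 15,
    "internship_quality": 25,
    "technical_knowledge": 45,
    "academic_performance": 10
  },
  "domain_map": {
    "capital_markets": ["IPO", "prospectus", "due diligence"]
  },
  "jd_keywords": ["SEBI", "listing", "securities", "disclosure"]
}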
Stage 0B · CV Parsing and Anonymisation
Docling converts each PDF to clean Markdown using layout analysis. Each candidate is assigned an anonymous ID. PII is stripped. The original filename mapping is stored separately and never reaches the scoring model.
No API calls · step0b_cv_parser.py
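The conversion itself is a small piece of code. A minimal sketch using Docling's DocumentConverter (the file path is illustrative):

# Layout-aware PDF -> Markdown conversion with Docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()                             # downloads models on first run
result = converter.convert("data/input/cvs/candidate.pdf")  # illustrative path
markdown = result.document.export_to_markdown()             # clean Markdown output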
Stage 1 · Extraction and Indexing
CVs are batched in groups of 5 and sent to Gemini. For each CV the model extracts structured JSON — degree, institution, CGPA, internships with domain terms found, publications, moot courts. Missing fields return null, never inferred. Profiles and knowledge bases are indexed into ChromaDB.
1 call per 5 CVs · step1_extractor.py
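The batching is plain list slicing. A minimal sketch:

# Group parsed CVs into batches of EXTRACTION_BATCH_SIZE for one Gemini call each
def batches(items, size=5):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 100 CVs -> 20 extraction calls at the default batch size of 5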
Stage 2 · Scoring
Each candidate is scored in a completely isolated Gemini call. Before calling the model, relevant CV sections and knowledge base entries are retrieved from ChromaDB via RAG and injected into the prompt. The prompt enforces evidence-first scoring: quote, analyse, cross-check, score.
1 call per candidate — never batched · step2_scorer.py
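A sketch of the scoring loop; the retrieval and Gemini helpers shown here are hypothetical names, not the actual functions in step2_scorer.py:

# One isolated Gemini call per candidate -- no shared context between candidates
for candidate_id in candidate_ids:
    cv_sections = retrieve_cv_sections(candidate_id)        # hypothetical helper: this candidate's CV only
    kb_entries = retrieve_kb_entries(cv_sections, top_k=3)  # hypothetical helper: RAG_TOP_K verified facts
    prompt = build_scoring_prompt(rubric, cv_sections, kb_entries)  # quote, analyse, cross-check, score
    scorecard = call_gemini(prompt)                         # fresh context on every call
    save_scorecard(candidate_id, scorecard)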
Stage 3 · Output Generation
All scorecards are aggregated. Candidates are tiered and sorted. Names are re-attached from the mapping file for the first time since parsing. The output files are generated and the Flask dashboard is launched.
No API calls · step3_output.py

Scoring Rubric

The rubric has three passes. The weights shown are defaults — actual weights for each run are derived from the JD by the LLM.

Pass 1 — Hard Gate

Pass or fail. No partial credit. Candidates who fail either gate exit the pipeline. No scoring API calls are made for failed candidates.

| Gate | Requirement | Fail condition |
|---|---|---|
| Degree type | LLB, BA LLB, or LLB+LLM from a recognised Indian institution | No qualifying law degree found in CV |
| Degree validity | Currently pursuing or completed a qualifying degree | No law education found at all |
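In code, the gate reduces to a cheap boolean check that runs before any scoring call. A sketch with assumed profile field names:

QUALIFYING_DEGREES = {"LLB", "BA LLB", "LLB+LLM"}

def passes_hard_gate(profile):
    # Gate 1: a qualifying law degree must appear in the extracted profile
    if profile.get("degree") not in QUALIFYING_DEGREES:
        return False
    # Gate 2: the degree must be completed or currently being pursued
    return profile.get("degree_status") in {"completed", "pursuing"}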

Pass 2 — Scored out of 100

| Criterion | Weight | Notes |
|---|---|---|
| Institution prestige | 15 pts | NLU = full marks. Recognised private law school = partial. Unknown = minimal. RAG-verified against the institution tier database. |
| Internship quality | 25 pts | Scored on two conditions: organisation reputation AND depth of work described. Full marks require both. RAG-verified against the organisation prestige database. |
| Technical knowledge (highest weight) | 45 pts | Split into two sub-scores: JD alignment 36 pts (80%) and semantic depth 9 pts (20%). Rewards both exact regulatory knowledge and genuine domain understanding. |
| Academic performance | 10 pts | CGPA bands: 8.5+ = full, 7.5–8.4 = competitive, 7.0–7.4 = baseline, below 7.0 = minimal. Boosted by prizes, class rank, merit list mentions. |

Pass 3 — Bonus Points (capped at +15)

| Criterion | Max | Notes |
|---|---|---|
| Moot court | +5 | Corporate or CapM-themed competitions score higher |
| Publications | +5 | Recognised law journal; CapM or securities topic preferred |
| International exposure | +4 | Foreign internship or exchange semester |
| Leadership roles | +3 | Secretary or president of a society, etc. |
| Languages | +3 | Working proficiency beyond English; +1 per language |

Tier Assignment

Final score = Pass 2 (max 100) + Pass 3 (max +15), giving a 115-point scale.

| Tier | Score | Action |
|---|---|---|
| Strong | 90–115 | Recommend for interview |
| Competitive | 70–89 | Consider for interview |
| Weak | 50–69 | Hold — interview if the Strong pool is thin |
| Below Threshold | 0–49 | Do not proceed |
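The tier boundaries translate directly into a threshold function. A sketch:

def assign_tier(pass2_score, bonus_points):
    total = min(pass2_score, 100) + min(bonus_points, 15)   # 115-point scale
    if total >= 90:
        return "Strong"
    if total >= 70:
        return "Competitive"
    if total >= 50:
        return "Weak"
    return "Below Threshold"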

RAG and Knowledge Bases

RAG (Retrieval-Augmented Generation) grounds the scoring model in verified facts. Instead of relying on the model's training memory, the pipeline retrieves specific entries from local knowledge bases and injects them into the scoring prompt.

Three knowledge bases are stored in knowledge_base/ and indexed into ChromaDB at the start of each run.

| File | Contents | Used for |
|---|---|---|
| institutions.json | All major Indian law schools with tier, NLU status, city, placement notes | Criterion 2.1 — institution prestige scoring |
| organisations.json | Law firms (Tier 1/2/3), regulators, exchanges, investment banks with CapM relevance tags | Criterion 2.2 — internship organisation quality |
| capm_terms.json | 26 Capital Markets terms with definitions, correct usage examples, and shallow usage examples | Criterion 2.3 — technical knowledge depth scoring |
Keeping knowledge bases current
Law firm tiers, NIRF rankings, and SEBI regulations change. Review and update the knowledge base JSON files at the start of each hiring cycle.
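Lookups use ChromaDB's standard query API. A sketch (the storage path and collection name are assumptions):

import chromadb

client = chromadb.PersistentClient(path="chroma_db")            # assumed on-disk location
institutions = client.get_or_create_collection("institutions")

# Retrieve the RAG_TOP_K=3 closest knowledge base entries for an institution in a CV
hits = institutions.query(query_texts=["National Law School of India University"],
                          n_results=3)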

API Call Strategy

The pipeline makes the minimum number of API calls consistent with quality and reliability.

| Stage | Calls | Reasoning |
|---|---|---|
| JD parsing | 1 total | JD does not change between candidates |
| Extraction | 1 per 5 CVs | Batching is safe — extraction is mechanical, no quality loss |
| Scoring | 1 per candidate | Never batched — isolation prevents model drift and comparison bias |
| Total for 100 CVs | ~121 calls | Down from 700+ with a naive single-call-per-step approach |
Why scoring is never batched
When multiple candidates share a context window, the model compares them against each other rather than against the rubric. Candidate 3's strong profile implicitly becomes the new baseline, making Candidate 7's identical profile score lower. This is called in-context comparison bias. Isolation prevents it.
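The ~121 total for 100 CVs is simple arithmetic: one JD call, one extraction call per batch of five, and one scoring call per candidate.

import math

n_cvs = 100
total_calls = 1 + math.ceil(n_cvs / 5) + n_cvs   # JD + extraction batches + scoring
print(total_calls)                               # 121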

Installation

Prerequisites

  • Python 3.11 or higher
  • A Google AI Studio API key — aistudio.google.com
  • Gemini 2.5 Flash enabled on your project — verify at ai.dev/rate-limit
  • 4GB free disk space (Docling downloads ML models on first run)

Steps

# 1. Clone the repository
git clone https://github.com/yourusername/cv-screening-pipeline.git
cd cv-screening-pipeline

# 2. Create virtual environment
python -m venv venv

# 3. Activate it
source venv/bin/activate        # Mac / Linux
venv\Scripts\activate           # Windows

# 4. Install dependencies
pip install -r requirements.txt

# 5. Set up environment file
cp .env.example .env
# Open .env and add your GEMINI_API_KEY

# 6. Verify setup
python -c "import config; print('Setup successful')"

Configuration

Environment variables (.env)

GEMINI_API_KEY=your_gemini_api_key_here

This is the only secret the project requires. Never commit this file. The .env.example template is safe to commit.

Application settings (config.py)

| Setting | Default | When to change |
|---|---|---|
| GEMINI_MODEL | gemini-2.5-flash | If the model is unavailable on your API key — check your rate limits page |
| EXTRACTION_BATCH_SIZE | 5 | Reduce to 3 for long CVs, increase to 10 for short CVs |
| DASHBOARD_PORT | 5000 | If port 5000 is already in use on your machine |
| RAG_TOP_K | 3 | Number of knowledge base results to retrieve per lookup |
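Taken together, a config.py matching the table above could be as small as the sketch below; reading the key via python-dotenv is an assumption.

# Hypothetical shape of config.py -- setting names match the table above
import os
from dotenv import load_dotenv

load_dotenv()                                    # pulls GEMINI_API_KEY from .env
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]

GEMINI_MODEL = "gemini-2.5-flash"
EXTRACTION_BATCH_SIZE = 5
DASHBOARD_PORT = 5000
RAG_TOP_K = 3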

Usage

Running the full pipeline

# Add your JD
# Replace data/input/job_description.txt with your actual JD

# Add candidate CVs
# Place PDF files in data/input/cvs/

# Run
python main.py

# Browser opens automatically at http://127.0.0.1:5000

Restarting the dashboard without re-running

python dashboard/app.py

Running a single stage

python -m pipeline.step0a_jd_processor   # Re-parse JD only
python -m pipeline.step0b_cv_parser      # Re-parse CVs only
python -m pipeline.step1_extractor       # Re-extract profiles
python -m pipeline.step2_scorer          # Re-score all candidates
python -m pipeline.step3_output          # Re-generate output files

Output Files

| File | Format | Purpose |
|---|---|---|
| data/output/ranked_list.json | JSON | Full tiered ranking — feeds the dashboard |
| data/output/report.md | Markdown | Human-readable report — suitable for emailing to a hiring partner |
| data/output/scored/C-XXX_scorecard.json | JSON | Full evidence trail per candidate — all scores with quoted sentences |
| data/output/decision_receipts/C-XXX_receipt.json | JSON | Lightweight audit record — minimum data to defend any decision |

Troubleshooting

Model not found (404)

Check ai.dev/rate-limit and use whichever model shows a non-zero RPM. Update GEMINI_MODEL in config.py.

Quota exhausted (429)

Free tier daily limit reached. Wait 24 hours. Alternatively, enable billing on your Google Cloud project for higher quotas.

Pipeline appears frozen after Docling shows 100%

It is not frozen. After downloading its models, Docling silently loads roughly 560MB of neural network weights into RAM, which takes 5–15 minutes on the first run with no progress indicator. Do not press Ctrl+C; wait. Every subsequent run takes under 30 seconds because the models are cached.

ChromaDB telemetry messages

Harmless. To suppress them, add the following lines at the very top of main.py, before all other imports:

import os
os.environ["ANONYMIZED_TELEMETRY"] = "False"
os.environ["CHROMA_TELEMETRY"] = "False"

Port already in use

Change DASHBOARD_PORT in config.py to any free port such as 5001.

Scoring inconsistency between runs

The same candidate may score 5–10 points differently between runs due to LLM non-determinism even at low temperature. This is expected. Strong and weak candidates score consistently. Borderline candidates may shift between adjacent tiers.

Project Structure

cv_screening_pipeline/
|-- main.py                    Entry point and orchestrator
|-- config.py                  All configuration
|-- requirements.txt
|-- .env.example
|
|-- data/
|   |-- input/
|   |   |-- job_description.txt
|   |   |-- cvs/               Drop PDFs here
|   |-- processed/             Auto-generated during run
|   |-- output/                Auto-generated during run
|
|-- knowledge_base/
|   |-- institutions.json
|   |-- organisations.json
|   |-- capm_terms.json
|
|-- pipeline/
|   |-- step0a_jd_processor.py
|   |-- step0b_cv_parser.py
|   |-- step1_extractor.py
|   |-- step2_scorer.py
|   |-- step3_output.py
|
|-- rag/
|   |-- embedder.py
|   |-- indexer.py
|   |-- retriever.py
|
|-- prompts/
|   |-- jd_parse_prompt.txt
|   |-- extract_and_anonymise_prompt.txt
|   |-- score_all_criteria.txt
|
|-- dashboard/
|   |-- app.py
|   |-- templates/
|   |-- static/
|
|-- docs/
    |-- index.html             This documentation site

Roadmap

Short term

  • Role-based rubric switching — --role flag selects Junior, Senior, or Partner-track weight profiles
  • Scoring consistency check — run borderline candidates twice and flag when scores differ by more than 8 points
  • Proactive scorecard flags — highlight missing fields, weak performance on high-weight criteria, and narrow gate passes

Medium term

  • Hosted document parsing — replace local Docling with AWS Textract or Azure Document Intelligence for production deployments
  • REST API wrapper — accept job IDs asynchronously, expose a status endpoint, enable ATS integration
  • Persona matching score — evaluate whether the candidate's overall profile matches the target role persona, once calibration data is available

Long term

  • Multi-language CV support — extend to English-Tamil, English-Malayalam, and other South Indian mixed-language CVs
  • Bias audit module — statistical check for correlations between scores and protected characteristics after each hiring cycle
  • Calibration loop — compare rankings against actual hiring outcomes over time and adjust rubric weights accordingly