Overview
The CV Screening Pipeline is an end-to-end AI-powered recruitment tool built for a Capital Markets law firm. It takes a Job Description and a folder of candidate CVs as input and produces a tiered ranked shortlist with per-candidate scorecards — with no human intervention at any stage.
The scoring rubric is derived from the JD automatically. Criterion weights, domain map, and keyword reference list are all extracted from the JD itself — not hardcoded. Swap the JD and the rubric adapts.
Every score is backed by a sentence quoted directly from the CV. The model is structurally forced to find evidence before it is allowed to assign a number. This makes every decision explainable and auditable.
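One way such a constraint is typically enforced is a response schema in which the evidence field comes before the score, so the model cannot emit a number without first quoting the CV. A minimal sketch of the validation side (field names are illustrative; prompts/score_all_criteria.txt defines the real structure):

```python
# Illustrative only: the real schema in prompts/score_all_criteria.txt may differ.
def validate_criterion(result: dict) -> bool:
    """Reject any score that is not backed by a verbatim CV quote."""
    quote = result.get("evidence_quote", "").strip()
    return bool(quote) and result.get("score") is not None

validate_criterion({"criterion_id": "2.3", "evidence_quote": "", "score": 7})   # False
validate_criterion({"criterion_id": "2.3",
                    "evidence_quote": "Assisted on a QIP for a listed NBFC.",
                    "score": 7})                                                # True
```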
Key Features
Tech Stack
| Layer | Technology | Why this was chosen |
|---|---|---|
| Language | Python 3.11+ | Dominant ecosystem for AI pipelines — all required libraries are Python-native |
| LLM | Gemini 2.5 Flash | Free tier available, 1M token context window, strong instruction following |
| PDF parsing | Docling (IBM) | Layout-aware — understands multi-column CVs that standard parsers scramble |
| Vector DB | ChromaDB | Local, no server required, persists to disk, right-sized for this scale |
| Embeddings | all-MiniLM-L6-v2 | Free, runs locally, sufficient for section-level semantic retrieval |
| Orchestration | Plain Python | No framework — every step is visible and debuggable |
| Dashboard | Flask | Lightweight Python web server, no frontend framework needed |
Pipeline Stages
The pipeline runs sequentially. Each stage reads from the previous stage's output and writes to a known file location. main.py orchestrates the sequence.
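A minimal sketch of what that orchestration could look like, using the module names from the Project Structure section (the run() entry points are assumptions, not the real interface):

```python
# Hypothetical orchestrator; stage module names match the pipeline/ directory,
# but the run() entry points are assumed for illustration.
from pipeline import (
    step0a_jd_processor,  # JD -> jd_rubric.json (once per run)
    step0b_cv_parser,     # PDFs -> parsed text
    step1_extractor,      # parsed text -> structured profiles
    step2_scorer,         # profiles -> scorecards
    step3_output,         # scorecards -> ranked_list.json, report.md
)

def run_pipeline() -> None:
    for stage in (step0a_jd_processor, step0b_cv_parser,
                  step1_extractor, step2_scorer, step3_output):
        stage.run()

if __name__ == "__main__":
    run_pipeline()
```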
Stage 0a parses the JD and writes the rubric to jd_rubric.json. Called once per run — never repeated per CV.
Scoring Rubric
The rubric has three passes. The weights shown are defaults — actual weights for each run are derived from the JD by the LLM.
Pass 1 — Hard Gate
Pass or fail. No partial credit. Candidates who fail either gate exit the pipeline. No scoring API calls are made for failed candidates.
| Gate | Requirement | Fail condition |
|---|---|---|
| Degree type | LLB, BA LLB, or LLB+LLM from a recognised Indian institution | No qualifying law degree found in CV |
| Degree validity | Currently pursuing or completed a qualifying degree | No law education found at all |
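In code, the gate reduces to a boolean check performed before any scoring call is spent on the candidate. A minimal sketch, assuming the extractor emits a profile dict with a degrees list (field names are illustrative):

```python
QUALIFYING_DEGREES = {"LLB", "BA LLB", "LLB+LLM"}  # from the gate table above

def passes_hard_gate(profile: dict) -> bool:
    """Return False if the candidate should exit the pipeline unscored."""
    degrees = profile.get("degrees", [])
    if not degrees:
        return False  # no law education found at all
    return any(d.get("type") in QUALIFYING_DEGREES for d in degrees)
```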
Pass 2 — Scored out of 100
The criteria include 2.1 institution prestige, 2.2 internship organisation quality, and 2.3 technical knowledge depth (the criteria the knowledge bases below support); per-criterion weights are derived from the JD at run time.
Pass 3 — Bonus Points (capped at +15)
| Criterion | Max | Notes |
|---|---|---|
| Moot court | +5 | Corporate or CapM themed competitions score higher |
| Publications | +5 | Recognised law journal, CapM or securities topic preferred |
| International exposure | +4 | Foreign internship or exchange semester |
| Leadership roles | +3 | Secretary or president of a society, etc. |
| Languages | +3 | Working proficiency beyond English, +1 per language |
Tier Assignment
Final score = Pass 2 (max 100) + Pass 3 (max +15), giving a 115-point scale.
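In code, with the bonus cap applied before tiering (the tier cutoffs below are placeholders for illustration, not the pipeline's real thresholds):

```python
def final_score(pass2: float, bonus: float) -> float:
    """Pass 2 (max 100) plus Pass 3 bonus, capped at +15."""
    return pass2 + min(bonus, 15)

def assign_tier(score: float) -> str:
    # Placeholder cutoffs for illustration only.
    if score >= 85:
        return "Tier 1"
    if score >= 65:
        return "Tier 2"
    return "Tier 3"

print(assign_tier(final_score(78, 12)))  # 90 -> "Tier 1" under these cutoffs
```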
RAG and Knowledge Bases
RAG (Retrieval Augmented Generation) grounds the scoring model in verified facts. Instead of relying on the model's training memory, the pipeline retrieves specific entries from local knowledge bases and injects them into the scoring prompt.
Three knowledge bases are stored in knowledge_base/ and indexed into ChromaDB at the start of each run.
| File | Contents | Used for |
|---|---|---|
| institutions.json | All major Indian law schools with tier, NLU status, city, placement notes | Criterion 2.1 — institution prestige scoring |
| organisations.json | Law firms (Tier 1/2/3), regulators, exchanges, investment banks with CapM relevance tags | Criterion 2.2 — internship organisation quality |
| capm_terms.json | 26 Capital Markets terms with definitions, correct usage examples, and shallow usage examples | Criterion 2.3 — technical knowledge depth scoring |
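A minimal sketch of the index-then-retrieve flow with ChromaDB, whose default local embedding function is all-MiniLM-L6-v2 (the per-entry "text" field and the function names here are assumptions; the real code lives in rag/):

```python
import json
import chromadb

client = chromadb.PersistentClient(path="chroma_store")  # local, persists to disk

def index_kb(name: str, path: str) -> None:
    """Index one knowledge base file into its own collection at run start."""
    entries = json.load(open(path))
    coll = client.get_or_create_collection(name)
    coll.add(
        ids=[str(i) for i in range(len(entries))],
        documents=[e["text"] for e in entries],  # assumes a "text" field per entry
    )

def lookup(name: str, query: str, top_k: int = 3) -> list[str]:
    """Fetch the top_k nearest entries to inject into the scoring prompt (RAG_TOP_K)."""
    coll = client.get_or_create_collection(name)
    return coll.query(query_texts=[query], n_results=top_k)["documents"][0]

# e.g. ground criterion 2.2 before scoring an internship:
# lookup("organisations", "capital markets internship at a Tier 1 firm")
```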
API Call Strategy
The pipeline makes the minimum number of API calls consistent with quality and reliability.
| Stage | Calls | Reasoning |
|---|---|---|
| JD parsing | 1 total | JD does not change between candidates |
| Extraction | 1 per 5 CVs | Batching is safe — extraction is mechanical, no quality loss |
| Scoring | 1 per candidate | Never batched — isolation prevents model drift and comparison bias |
| Total for 100 CVs | ~121 calls | Down from 700+ with a naive single-call-per-step approach |
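The ~121 figure for 100 CVs is straightforward arithmetic:

```python
import math

n_cvs = 100
calls = 1 + math.ceil(n_cvs / 5) + n_cvs  # 1 JD parse + 20 extraction batches + 100 scoring calls
print(calls)  # 121
```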
Installation
Prerequisites
- Python 3.11 or higher
- A Google AI Studio API key — aistudio.google.com
- Gemini 2.5 Flash enabled on your project — verify at ai.dev/rate-limit
- 4GB free disk space (Docling downloads ML models on first run)
Steps
```bash
# 1. Clone the repository
git clone https://github.com/yourusername/cv-screening-pipeline.git
cd cv-screening-pipeline

# 2. Create virtual environment
python -m venv venv

# 3. Activate it
source venv/bin/activate   # Mac / Linux
venv\Scripts\activate      # Windows

# 4. Install dependencies
pip install -r requirements.txt

# 5. Set up environment file
cp .env.example .env
# Open .env and add your GEMINI_API_KEY

# 6. Verify setup
python -c "import config; print('Setup successful')"
```
Configuration
Environment variables (.env)
```
GEMINI_API_KEY=your_gemini_api_key_here
```
This is the only secret the project requires. Never commit this file. The .env.example template is safe to commit.
Application settings (config.py)
| Setting | Default | When to change |
|---|---|---|
| GEMINI_MODEL | gemini-2.5-flash | If the model is unavailable on your API key — check your rate limits page |
| EXTRACTION_BATCH_SIZE | 5 | Reduce to 3 for long CVs, increase to 10 for short CVs |
| DASHBOARD_PORT | 5000 | If port 5000 is already in use on your machine |
| RAG_TOP_K | 3 | Number of knowledge base results to retrieve per lookup |
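For orientation, a minimal sketch of what config.py could contain, assuming python-dotenv loads the .env file (the real file may differ):

```python
# Hypothetical config.py for illustration; the real file may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]  # the only secret the project needs
GEMINI_MODEL = "gemini-2.5-flash"
EXTRACTION_BATCH_SIZE = 5    # lower for long CVs, raise for short ones
DASHBOARD_PORT = 5000
RAG_TOP_K = 3                # knowledge base results per lookup
```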
Usage
Running the full pipeline
```bash
# Add your JD
# Replace data/input/job_description.txt with your actual JD

# Add candidate CVs
# Place PDF files in data/input/cvs/

# Run
python main.py
# Browser opens automatically at http://127.0.0.1:5000
```
Restarting the dashboard without re-running
```bash
python dashboard/app.py
```
Running a single stage
```bash
python -m pipeline.step0a_jd_processor   # Re-parse JD only
python -m pipeline.step0b_cv_parser      # Re-parse CVs only
python -m pipeline.step1_extractor       # Re-extract profiles
python -m pipeline.step2_scorer          # Re-score all candidates
python -m pipeline.step3_output          # Re-generate output files
```
Output Files
| File | Format | Purpose |
|---|---|---|
| data/output/ranked_list.json | JSON | Full tiered ranking — feeds the dashboard |
| data/output/report.md | Markdown | Human-readable report — suitable for emailing to a hiring partner |
| data/output/scored/C-XXX_scorecard.json | JSON | Full evidence trail per candidate — all scores with quoted sentences |
| data/output/decision_receipts/C-XXX_receipt.json | JSON | Lightweight audit record — minimum data to defend any decision |
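For illustration, a decision receipt could carry something like the following (every field name and value here is hypothetical; check an actual C-XXX_receipt.json for the real schema):

```python
# Hypothetical receipt contents; field names and values are illustrative only.
receipt = {
    "candidate_id": "C-042",
    "hard_gate": "pass",
    "pass2_score": 78,
    "bonus_points": 12,
    "final_score": 90,  # 78 + min(12, 15)
    "tier": "Tier 1",
    "evidence": [
        {"criterion": "2.3", "quote": "Assisted on the DRHP for a mainboard IPO."},
    ],
}
```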
Troubleshooting
Model not found (404)
Check ai.dev/rate-limit and use whichever model shows a non-zero RPM. Update GEMINI_MODEL in config.py.
Quota exhausted (429)
Free tier daily limit reached. Wait 24 hours. Alternatively, enable billing on your Google Cloud project for higher quotas.
Pipeline appears frozen after Docling shows 100%
Not frozen. After downloading its models, Docling silently loads 560MB of neural network weights into RAM; on first run this takes 5–15 minutes with no progress indicator. Do not press Ctrl+C. Wait. Every subsequent run takes under 30 seconds because the models are cached.
ChromaDB telemetry messages
Harmless. To suppress them, add these two lines at the very top of main.py before all other imports:
```python
import os
os.environ["ANONYMIZED_TELEMETRY"] = "False"
os.environ["CHROMA_TELEMETRY"] = "False"
```
Port already in use
Change DASHBOARD_PORT in config.py to any free port such as 5001.
Scoring inconsistency between runs
The same candidate may score 5–10 points differently between runs due to LLM non-determinism even at low temperature. This is expected. Strong and weak candidates score consistently. Borderline candidates may shift between adjacent tiers.
Project Structure
```
cv_screening_pipeline/
|-- main.py                Entry point and orchestrator
|-- config.py              All configuration
|-- requirements.txt
|-- .env.example
|
|-- data/
|   |-- input/
|   |   |-- job_description.txt
|   |   |-- cvs/           Drop PDFs here
|   |-- processed/         Auto-generated during run
|   |-- output/            Auto-generated during run
|
|-- knowledge_base/
|   |-- institutions.json
|   |-- organisations.json
|   |-- capm_terms.json
|
|-- pipeline/
|   |-- step0a_jd_processor.py
|   |-- step0b_cv_parser.py
|   |-- step1_extractor.py
|   |-- step2_scorer.py
|   |-- step3_output.py
|
|-- rag/
|   |-- embedder.py
|   |-- indexer.py
|   |-- retriever.py
|
|-- prompts/
|   |-- jd_parse_prompt.txt
|   |-- extract_and_anonymise_prompt.txt
|   |-- score_all_criteria.txt
|
|-- dashboard/
|   |-- app.py
|   |-- templates/
|   |-- static/
|
|-- docs/
|   |-- index.html         This documentation site
```
Roadmap
Short term
- Role-based rubric switching — a --role flag selects Junior, Senior, or Partner-track weight profiles
- Scoring consistency check — run borderline candidates twice and flag when scores differ by more than 8 points
- Proactive scorecard flags — highlight missing fields, weak performance on high-weight criteria, and narrow gate passes
Medium term
- Hosted document parsing — replace local Docling with AWS Textract or Azure Document Intelligence for production deployments
- REST API wrapper — accept job IDs asynchronously, expose a status endpoint, enable ATS integration
- Persona matching score — evaluate whether the candidate's overall profile matches the target role persona, once calibration data is available
Long term
- Multi-language CV support — extend to English-Tamil, English-Malayalam, and other South Indian mixed-language CVs
- Bias audit module — statistical check for correlations between scores and protected characteristics after each hiring cycle
- Calibration loop — compare rankings against actual hiring outcomes over time and adjust rubric weights accordingly