Data Pipeline
The library uses a single, high-performance LMDB database (msdrg.mdb) to store all reference data. This provides zero-copy binary access and sub-microsecond lookups while keeping the memory footprint minimal.
Regenerate from raw CMS data
The regeneration pipeline extracts raw data from CMS CSV files, imports it into a temporary SQLite database for normalization, compiles it into optimized binary blobs, and packages everything into the single msdrg.mdb file.
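As a rough illustration of that flow, the following sketch runs the same three transformations on a tiny in-memory sample using only the Python standard library. The column names (`code`, `mdc`), the table name `dx`, and the fixed-width record layout are assumptions made for the example, not the project's actual schema, and the final LMDB packaging step is left out.

```python
import csv
import io
import sqlite3
import struct

# Hypothetical sample standing in for a raw CMS CSV export.
RAW_CSV = "code,mdc\n a000 ,6\nA001,6\nI10,5\n"

# Assumed record layout: 8-byte space-padded code + unsigned 16-bit MDC.
RECORD = struct.Struct("<8sH")


def load_rows(text):
    """Parse CSV text and normalize codes (trim whitespace, upper-case)."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["code"].strip().upper(), int(row["mdc"])) for row in reader]


def compile_blob(rows):
    """Stage rows in a temporary SQLite table, then pack them sorted by code."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE dx (code TEXT PRIMARY KEY, mdc INTEGER)")
        con.executemany("INSERT OR REPLACE INTO dx VALUES (?, ?)", rows)
        ordered = con.execute("SELECT code, mdc FROM dx ORDER BY code")
        return b"".join(
            RECORD.pack(code.encode("ascii").ljust(8), mdc)
            for code, mdc in ordered
        )
    finally:
        con.close()


blob = compile_blob(load_rows(RAW_CSV))
```

Sorted, fixed-width records are one common way to make a blob searchable by offset without parsing it first; whether the project's blobs use such a layout is not stated here.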
Individual steps
# 1. Extract and normalize raw data
python scripts/extract_data.py
# 2. Import to SQLite for processing
python scripts/import_to_sqlite.py
# 3. Compile optimized binary blobs
for s in scripts/compile*; do bash "$s"; done
# 4. Consolidate into LMDB database
python scripts/package_lmdb.py
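To run the four steps in sequence and stop at the first failure, a small driver along these lines could be used. It is only a sketch: `build_commands` and `run_pipeline` are hypothetical helpers, not part of the project, and the script paths are taken from the commands above. Unlike the shell loop in step 3, this version aborts as soon as any script exits with a non-zero status.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(scripts_dir="scripts"):
    """Return the argv list for each pipeline step, in execution order."""
    scripts = Path(scripts_dir)
    commands = [
        [sys.executable, str(scripts / "extract_data.py")],
        [sys.executable, str(scripts / "import_to_sqlite.py")],
    ]
    # Step 3: every script whose name starts with "compile",
    # sorted by name so the order is deterministic.
    for script in sorted(scripts.glob("compile*")):
        commands.append(["bash", str(script)])
    commands.append([sys.executable, str(scripts / "package_lmdb.py")])
    return commands


def run_pipeline(scripts_dir="scripts"):
    """Run each step; check=True raises CalledProcessError on a non-zero exit."""
    for argv in build_commands(scripts_dir):
        print("running:", " ".join(argv))
        subprocess.run(argv, check=True)
```

Calling `run_pipeline()` from the repository root would then execute the steps in order.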
Database Contents
The msdrg.mdb file contains the data structures used by the grouper and the Medicare Code Editor (MCE):
| Category | Description |
|---|---|
| Core | Diagnosis definitions, DRG formula rules, MDC mappings |
| Grouping | Diagnosis clusters, exclusion groups, gender/MDC rules |
| MCE | ICD-10 DX/SG master tables, age ranges, discharge status |
| Conversion | ICD-10-CM/PCS version-to-version conversion tables |
Tip
The database is opened in read-only mode with locking disabled (MDB_NOLOCK), which removes lock overhead for concurrent analytical workloads across multiple threads. This is safe only as long as nothing writes to the file while it is in use.
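For reference, opening an LMDB file this way from Python might look like the sketch below. It assumes the third-party `lmdb` package (py-lmdb), that msdrg.mdb is a single file rather than a directory (hence `subdir=False`), and that records live in the default unnamed database. None of this is confirmed by the text above, and `lookup` and any key passed to it are hypothetical. In py-lmdb, `lock=False` corresponds to MDB_NOLOCK, and `buffers=True` makes `get` return a view into the memory map instead of a copy.

```python
# Keyword arguments for lmdb.open(); lock=False maps to MDB_NOLOCK.
READ_ONLY_OPTIONS = {
    "subdir": False,   # assumption: the database path is a single file
    "readonly": True,  # never write to the database at runtime
    "lock": False,     # no lock table; safe only without a concurrent writer
}


def lookup(path, key):
    """Return the value stored under key (bytes) as bytes, or None if absent."""
    import lmdb  # third-party dependency: pip install lmdb

    env = lmdb.open(path, **READ_ONLY_OPTIONS)
    try:
        with env.begin(buffers=True) as txn:
            view = txn.get(key)  # buffer backed by the memory map
            # Copy before the transaction ends; the view is invalid afterwards.
            return None if view is None else bytes(view)
    finally:
        env.close()
```

A long-running application would normally open the environment once and share it across threads, starting a separate read transaction in each thread, rather than reopening the file for every lookup as this example does.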