Documentation
For Scientists
Use the application to quickly surface rare Calabi-Yau candidates while keeping a reproducible record.
- Run the demo or CLI to rank candidates by predicted likelihood
- Verify top-k hits and track precision/recall for experimental comparison (a sketch of this computation follows this list)
- Export CSV/JSON artifacts for downstream analysis and sharing
- Record random seeds and dataset metadata for reproducibility
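As a hedged illustration of the precision/recall step, the sketch below computes precision at k from the exported results_topk.csv. The "verified" column name is an assumption made for illustration; check the actual export schema before using it.

# A minimal sketch of tracking precision at k from the exported top-k CSV.
# The "verified" column (1 = confirmed hit, 0 = miss) is a hypothetical
# name for illustration; check the actual results_topk.csv schema.
import pandas as pd

results = pd.read_csv("output/results_topk.csv")

def precision_at_k(df, k):
    """Fraction of the top-k ranked rows flagged as verified hits."""
    return df.head(k)["verified"].mean()

for k in (10, 50, 100):
    print(f"precision@{k}: {precision_at_k(results, k):.3f}")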
Installation Prerequisites
- Python 3.9 or higher
- pip package manager
- 4GB RAM minimum (8GB recommended)
- 2GB free disk space for datasets and outputs
Quick Install
git clone https://github.com/upggr/compute.upg.gr.git
cd compute.upg.gr
pip install -r requirements.txt
How to Run
Basic Demo
python run_cy_search.py --config default.yml --verify --export-artifacts
Expected Runtime
5-15 minutes on standard hardware, depending on dataset size and CPU performance.
Output Location
All artifacts are saved to the ./output/ directory:
- output/results_topk.csv - Top-k ranked results
- output/metrics.json - Performance metrics
- output/repro.md - Reproducibility report
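A minimal sketch of inspecting the exported metrics after a run; the exact keys inside metrics.json are not documented here, so the loop simply prints whatever the file contains.

# Load and print the exported metrics; the exact keys in metrics.json are
# not documented here, so this just dumps whatever the file contains.
import json

with open("output/metrics.json") as f:
    metrics = json.load(f)

for key, value in metrics.items():
    print(f"{key}: {value}")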
Demo Exports
The Interactive Demo provides download buttons for JSON and CSV artifacts after each run.
Bring Your Own Data
Paste custom candidates as CSV rows in the demo. Each row must include all feature columns for the selected dataset, in the following order (a sketch for assembling such rows follows this list):
- Kreuzer-Skarke: h11, h21, euler_abs, hodge_ratio, c2_h11
- CY5-Folds: h11, h21, h31, euler, euler_abs, hodge_sum
- Heterotic: h11, h21, euler, euler_abs, hodge_ratio, hodge_balance, n_gen
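As referenced above, here is a minimal sketch of assembling custom rows in the Kreuzer-Skarke column order; the numeric values are placeholders, not real geometries.

# Build CSV rows in the Kreuzer-Skarke column order
# (h11, h21, euler_abs, hodge_ratio, c2_h11).
# The numeric values are placeholders, not real geometries.
import csv
import io

columns = ["h11", "h21", "euler_abs", "hodge_ratio", "c2_h11"]
candidates = [
    {"h11": 3, "h21": 243, "euler_abs": 480, "hodge_ratio": 81.0, "c2_h11": 36},
    {"h11": 5, "h21": 101, "euler_abs": 192, "hodge_ratio": 20.2, "c2_h11": 50},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
for row in candidates:
    writer.writerow(row)      # one CSV row per candidate, columns in order
print(buffer.getvalue())      # paste this output into the demo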
Command-Line Options
--config CONFIG_FILE Path to configuration YAML (default: default.yml)
--verify Verify top results against ground truth
--export-artifacts Generate CSV/JSON output files
--top-k K Number of top results to export (default: 100)
--seed SEED Random seed for reproducibility (default: 42)
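For notebook or scripted use, the same documented flags can be passed programmatically via subprocess; the sketch below uses only the options listed above.

# Invoke the pipeline programmatically with the documented options.
import subprocess

subprocess.run(
    [
        "python", "run_cy_search.py",
        "--config", "default.yml",
        "--verify",
        "--export-artifacts",
        "--top-k", "50",
        "--seed", "123",
    ],
    check=True,  # raise CalledProcessError if the pipeline exits non-zero
)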
Reproducibility Guarantees
Fixed Random Seeds
All stochastic operations (model training, data shuffling) use deterministic seeds specified in the configuration file. Default seed is 42.
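As a rough sketch of what deterministic seeding typically looks like in Python; the pipeline's own seeding code may differ in detail.

# Illustrative seeding across common libraries; not the project's exact code.
import random
import numpy as np

def set_seed(seed=42):
    random.seed(seed)                    # Python's built-in RNG
    np.random.seed(seed)                 # legacy NumPy global RNG
    return np.random.default_rng(seed)   # preferred per-use generator

rng = set_seed(42)
print(rng.integers(0, 100, size=3))      # identical output on every run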
Pinned Dependencies
The requirements.txt file pins exact versions of all Python packages to ensure identical runtime environments.
pip freeze > requirements.lock # Generate locked dependencies
Dataset Checksum Verification
Dataset downloads are verified against SHA-256 checksums before processing begins. If verification fails, the pipeline halts with an error.
# Expected checksum stored in config
dataset_checksum: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
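A minimal sketch of how such a check can be implemented with hashlib; the function name and cached file path are illustrative assumptions, not the pipeline's actual API.

# Illustrative SHA-256 verification that halts on mismatch; the function
# name and file path are assumptions, not the pipeline's real API.
import hashlib
from pathlib import Path

def verify_checksum(path, expected, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected:
        raise RuntimeError(f"Checksum mismatch for {path}; halting pipeline")

verify_checksum(
    "cache/cy_dataset.csv",   # assumed cache location
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
)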
Run Metadata
Every run generates a repro.md file containing the following (a sketch of how this metadata could be collected appears after this list):
- Timestamp and hostname
- Git commit hash
- Python version and environment info
- Configuration parameters used
- Dataset checksum verified
- Random seeds employed
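The sketch below shows one way the listed fields could be collected; the pipeline's actual repro.md generator may be structured differently.

# One way to collect the metadata fields listed above; illustrative only.
import datetime
import platform
import socket
import subprocess
import sys

def collect_run_metadata(seed, checksum, config):
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "git_commit": commit,
        "python_version": sys.version,
        "platform": platform.platform(),
        "config": config,
        "dataset_checksum": checksum,
        "seed": seed,
    }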
Configuration
Edit default.yml to customize pipeline behavior:
dataset:
  url: "https://example.com/cy_dataset.csv"
  checksum: "sha256:e3b0c44..."
model:
  type: "random_forest"
  n_estimators: 100
  max_depth: 10
search:
  top_k: 100
  verification: true
reproducibility:
  seed: 42
  export_metadata: true
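A minimal sketch of loading and sanity-checking this configuration from Python; it assumes PyYAML is available, which is not confirmed by the excerpt above.

# Load default.yml and check the keys shown above; assumes PyYAML is installed.
import yaml

with open("default.yml") as f:
    config = yaml.safe_load(f)

assert config["reproducibility"]["seed"] == 42
assert config["search"]["top_k"] > 0
print(config["model"]["type"], config["model"]["n_estimators"])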
Troubleshooting
Checksum Verification Failed
If the dataset checksum doesn't match, the file may be corrupted. Delete cached data and re-download:
rm -rf ./cache/
python run_cy_search.py --config default.yml
Out of Memory
For large datasets, reduce the batch size in the configuration file or run on a machine with more RAM.
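If adjusting the configuration is not enough, chunked loading is a generic fallback for ad-hoc analysis of large CSVs; the sketch below is a general pandas technique, not a documented option of run_cy_search.py.

# Generic chunked CSV processing with pandas; not a documented pipeline option.
import pandas as pd

total_rows = 0
for chunk in pd.read_csv("cache/cy_dataset.csv", chunksize=100_000):
    total_rows += len(chunk)      # replace with per-chunk processing
print(f"processed {total_rows} rows")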
Module Not Found
Ensure all dependencies are installed:
pip install -r requirements.txt