split_dataset
Split a NIfTI dataset folder into train, validation, and test subsets with reproducible randomization.
split_dataset(
nii_folder: str,
output_path: str,
ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
seed: int = 42,
copy_files: bool = False,
debug: bool = False
) -> Dict[str, List[str]]
Overview
This function randomly assigns NIfTI files from a dataset folder into train, validation, and test subsets. It creates a JSON manifest (split.json) listing the files in each subset, and optionally copies files into separate subdirectories.
This is useful for:
- Preparing datasets for machine learning experiments
- Ensuring reproducible train/val/test splits
- Maintaining consistent splits across team members
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
nii_folder | str | required | Folder containing .nii.gz files to split. |
output_path | str | required | Directory where split artifacts are saved. |
ratios | Tuple[float, ...] | (0.7, 0.15, 0.15) | Train, val, test fractions. Must sum to 1.0. |
seed | int | 42 | Random seed for reproducibility. |
copy_files | bool | False | If True, copies files into train/, val/, test/ subdirectories. |
debug | bool | False | If True, logs split counts. |
Returns
Dict[str, List[str]] – Dictionary with keys "train", "val", "test" mapping to lists of filenames.
Output Files
JSON Manifest
Always creates split.json:
{
"train": ["case_001.nii.gz", "case_003.nii.gz", ...],
"val": ["case_007.nii.gz", ...],
"test": ["case_012.nii.gz", ...]
}
File Copies (Optional)
When copy_files=True, creates:
output_path/
├── split.json
├── train/
│ ├── case_001.nii.gz
│ └── ...
├── val/
│ ├── case_007.nii.gz
│ └── ...
└── test/
├── case_012.nii.gz
└── ...
Exceptions
| Exception | Condition |
|---|---|
FileNotFoundError | Folder does not exist or contains no .nii.gz files |
ValueError | Ratios do not sum to approximately 1.0 |
ValueError | Ratios tuple does not contain exactly 3 values |
Usage Notes
- Reproducibility: Using the same
seedalways produces the same split - Manifest Only: By default, only the JSON manifest is created (no file copying)
- Rounding: Due to integer rounding, the test set may receive slightly more or fewer files
Examples
Basic Usage
Create a 70/15/15 split:
from nidataset.analysis import split_dataset
splits = split_dataset(
nii_folder="dataset/scans/",
output_path="dataset/splits/"
)
print(f"Train: {len(splits['train'])} files")
print(f"Val: {len(splits['val'])} files")
print(f"Test: {len(splits['test'])} files")
Copy Files into Subdirectories
Physically organize files by split:
splits = split_dataset(
nii_folder="dataset/scans/",
output_path="dataset/organized/",
ratios=(0.8, 0.1, 0.1),
copy_files=True,
seed=123
)
Custom Ratios
Use an 80/10/10 split:
splits = split_dataset(
nii_folder="data/all_scans/",
output_path="data/",
ratios=(0.8, 0.1, 0.1),
seed=42,
debug=True
)
Load Existing Split
Reuse a previously created split:
import json
with open("dataset/splits/split.json") as f:
splits = json.load(f)
print(f"Training files: {len(splits['train'])}")
for fname in splits['train'][:5]:
print(f" {fname}")
Typical Workflow
from nidataset.analysis import split_dataset
import json
# 1. Create reproducible split
splits = split_dataset(
nii_folder="data/preprocessed/",
output_path="data/experiment_01/",
ratios=(0.7, 0.15, 0.15),
seed=42,
copy_files=True,
debug=True
)
# 2. Verify distribution
total = sum(len(v) for v in splits.values())
for subset, files in splits.items():
print(f"{subset}: {len(files)} files ({len(files)/total:.0%})")
# 3. Share the manifest with collaborators
# The split.json file ensures everyone uses the same split