Tasks
Tasks are the fundamental building blocks of ExtremeXP workflows. Each task represents a single computational unit with defined inputs, outputs, and parameters, together with an implementation file, its dependencies, and an optional virtual environment.
The task DSL provides a structured way to:
- Define input and output data ports
- Configure task parameters with types, defaults, and ranges
- Specify implementation files and dependencies
- Set up virtual environments and Python versions
- Define task types for categorization
- Configure execution requirements
Basic Task Structure
task read_data {
    define input data FileToRead;
    define output data dataset;
    implementation "UC2/read_data/read_data.py";
    dependency "UC2_src/**";
    venv "UC2/read_data/requirements.txt";
    python_version "3.9";
}
Advanced Task Features
Task with Multiple Inputs and Outputs
task split_dataset {
    define input data dataset;
    define output data train_data;
    define output data test_data;
    implementation "UC2/split_dataset/split_dataset.py";
    dependency "UC2_src/**";
    venv "UC2/split_dataset/requirements.txt";
    python_version "3.9";
}
Task with Parameters
task train_model {
    define input data train_data;
    define input data test_data;
    define output data model;
    define output data train_data;
    define output data test_data;
    define param max_depth {
        type Integer;
        default 3;
        range (3, 30);
    }
    define param n_estimators {
        type Integer;
        default 5;
        range (5, 50);
    }
    define param min_child_weight {
        type Integer;
        default 1;
        range (1, 10);
    }
    define param gamma {
        type Integer;
        default 0;
        range (1, 5);
    }
    implementation "UC2/train_model/train_model.py";
    dependency "UC2_src/**";
    venv "UC2/train_model/requirements.txt";
    python_version "3.9";
}
Task with Type Classification
task Explainability {
    type explainability;
    define input data train_data;
    define input data test_data;
    define input data trained_model;
    define output data mlAnalysis;
    implementation "UC2/Explainability/task.py";
    dependency "UC2_src/**";
    venv "UC2/Explainability/requirements.txt";
    python_version "3.9";
}
Minimal Task Structure
task benchmarking {
    define input data ExternalDataFile;
    implementation "I2CAT/benchmarking/benchmarking.py";
    dependency "I2CAT/**";
}
Task Structure Breakdown
1. Task Declaration
task read_data {
The task keyword is followed by the task name, here read_data. All task components are enclosed within the curly braces.
2. Task Type (Optional)
type explainability;
3. Data Port Definitions
Input Data Ports
define input data FileToRead;
define input data train_data;
define input data test_data;
- FileToRead: a single input file
- train_data: training dataset
- test_data: testing dataset
Output Data Ports
define output data dataset;
define output data train_data;
define output data test_data;
define output data model;
- dataset: processed dataset
- train_data: training data subset
- test_data: testing data subset
- model: trained machine learning model
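As an illustration of how the port declarations above can be read programmatically, here is a minimal Python sketch that extracts port names from a task definition with a regular expression. The function name parse_ports and the regular expression are assumptions made for this example; they are not part of ExtremeXP, whose own parser may behave differently.

```python
import re

# Matches statements such as: define input data FileToRead;
PORT_PATTERN = re.compile(r"define\s+(input|output)\s+data\s+(\w+)\s*;")


def parse_ports(task_source: str) -> dict[str, list[str]]:
    """Return the input and output port names declared in a task definition."""
    ports = {"input": [], "output": []}
    for direction, name in PORT_PATTERN.findall(task_source):
        ports[direction].append(name)
    return ports


SPLIT_DATASET = """
task split_dataset {
    define input data dataset;
    define output data train_data;
    define output data test_data;
}
"""

ports = parse_ports(SPLIT_DATASET)
print(ports["input"])   # ['dataset']
print(ports["output"])  # ['train_data', 'test_data']
```

A listing like this makes it easy to check by hand that the outputs of one task line up with the inputs of the next, for example that split_dataset produces the train_data and test_data ports that train_model consumes.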
4. Parameter Definitions
define param max_depth {
    type Integer;
    default 3;
    range (3, 30);
}
define param n_estimators {
    type Integer;
    default 5;
    range (5, 50);
}
- type: parameter data type (Integer, Float, String, Boolean)
- default: default value if not specified
- range: valid range for parameter values (used in parameter exploration)
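The interplay of type, default, and range can be sketched in Python as follows. The IntegerParam class is hypothetical and exists only for this example. It assumes that range bounds are inclusive and that the range is checked only for explicitly supplied values; this page does not state how ExtremeXP itself enforces either point.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IntegerParam:
    """Hypothetical in-memory mirror of a `define param` block of type Integer."""
    name: str
    default: int
    value_range: Optional[Tuple[int, int]] = None  # (low, high), assumed inclusive

    def resolve(self, value: Optional[int] = None) -> int:
        """Return the supplied value, or the declared default when none is given."""
        if value is None:
            return self.default  # the declared default is returned unchanged
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{self.name} must be an Integer, got {value!r}")
        if self.value_range is not None:
            low, high = self.value_range
            if not low <= value <= high:
                raise ValueError(f"{self.name}={value} is outside ({low}, {high})")
        return value


max_depth = IntegerParam("max_depth", default=3, value_range=(3, 30))
print(max_depth.resolve())    # 3
print(max_depth.resolve(12))  # 12
```

Calling max_depth.resolve(31) raises ValueError under these assumptions, and max_depth.resolve(2.5) raises TypeError.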
5. Implementation Configuration
implementation "UC2/read_data/read_data.py";
6. Dependencies
dependency "UC2_src/**";
dependency "I2CAT/**";
- UC2_src/**: all files in the UC2_src directory and subdirectories
- I2CAT/**: all files in the I2CAT directory and subdirectories
Usage Example
- Glob patterns (demoHelper/**, helpers/*.py)
- Specific files (config/settings.json)
Task Dependencies
The dependency keyword allows tasks to specify external files or resources they need during execution. Dependencies use glob patterns to match files and directories.
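To see which files a dependency pattern would cover, the following Python sketch expands patterns with the standard glob module. The helper resolve_dependencies is an assumption made for this example; ExtremeXP's own matching rules may differ in details such as hidden files or symbolic links.

```python
import glob
import os


def resolve_dependencies(root: str, patterns: list[str]) -> list[str]:
    """Expand dependency patterns under `root` into a sorted list of file paths.

    Paths are returned relative to `root`, with forward slashes.
    """
    matched = set()
    for pattern in patterns:
        # recursive=True lets "**" descend into subdirectories.
        for path in glob.glob(os.path.join(root, pattern), recursive=True):
            if os.path.isfile(path):  # keep files, skip the directories themselves
                relative = os.path.relpath(path, root)
                matched.add(relative.replace(os.sep, "/"))
    return sorted(matched)
```

For example, resolve_dependencies(".", ["UC2_src/**", "helpers/*.py", "config/settings.json"]) lists every file under UC2_src, the Python files directly inside helpers, and the single settings file, provided they exist. Note that Python's glob skips names starting with a dot unless the pattern names them explicitly.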
7. Virtual Environment Configuration
venv "UC2/read_data/requirements.txt";
python_version "3.9";
- venv: path to the requirements.txt file used to set up the virtual environment
- python_version: specific Python version to use
Warning
If you want to use venv in your task DSL file, then the python_version field is mandatory. python_version should also be specified as an option in the configuration file.
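The rule that venv requires python_version can be checked before running a workflow. The Python sketch below is one way to do it, assuming one task definition per string; the function venv_is_consistent and its regular expressions are hypothetical and not part of ExtremeXP.

```python
import re

# Match the two statements as they appear in the task examples on this page.
VENV_RE = re.compile(r'\bvenv\s+"[^"]*"\s*;')
PYTHON_VERSION_RE = re.compile(r'\bpython_version\s+"[^"]*"\s*;')


def venv_is_consistent(task_source: str) -> bool:
    """Return False when a task declares `venv` without `python_version`."""
    uses_venv = VENV_RE.search(task_source) is not None
    has_version = PYTHON_VERSION_RE.search(task_source) is not None
    return has_version or not uses_venv


WITH_VERSION = '''
task read_data {
    venv "UC2/read_data/requirements.txt";
    python_version "3.9";
}
'''
MISSING_VERSION = '''
task read_data {
    venv "UC2/read_data/requirements.txt";
}
'''
print(venv_is_consistent(WITH_VERSION))     # True
print(venv_is_consistent(MISSING_VERSION))  # False
```

A task that declares neither statement, such as the minimal benchmarking task, also passes this check.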
Task Examples by Use Case
Data Processing Task
task read_data {
    define input data FileToRead;
    define output data dataset;
    implementation "UC2/read_data/read_data.py";
    dependency "UC2_src/**";
    venv "UC2/read_data/requirements.txt";
    python_version "3.9";
}
Data Transformation Task
task split_dataset {
    define input data dataset;
    define output data train_data;
    define output data test_data;
    implementation "UC2/split_dataset/split_dataset.py";
    dependency "UC2_src/**";
    venv "UC2/split_dataset/requirements.txt";
    python_version "3.9";
}
Machine Learning Task
task train_model {
    define input data train_data;
    define input data test_data;
    define output data model;
    define output data train_data;
    define output data test_data;
    define param max_depth {
        type Integer;
        default 3;
        range (3, 30);
    }
    define param n_estimators {
        type Integer;
        default 5;
        range (5, 50);
    }
    implementation "UC2/train_model/train_model.py";
    dependency "UC2_src/**";
    venv "UC2/train_model/requirements.txt";
    python_version "3.9";
}
Analysis Task
task Explainability {
    type explainability;
    define input data train_data;
    define input data test_data;
    define input data trained_model;
    define output data mlAnalysis;
    implementation "UC2/Explainability/task.py";
    dependency "UC2_src/**";
    venv "UC2/Explainability/requirements.txt";
    python_version "3.9";
}
Benchmarking Task
task benchmarking {
    define input data ExternalDataFile;
    implementation "I2CAT/benchmarking/benchmarking.py";
    dependency "I2CAT/**";
}
Parameter Types and Configuration
Integer Parameters
define param max_depth {
    type Integer;
    default 3;
    range (3, 30);
}
Parameter Configuration Options
- type: Specifies the parameter data type
  - Integer: whole numbers
  - Float: decimal numbers
  - String: text values
  - Boolean: true/false values
- default: Default value used when parameter is not specified
- range: Valid range for parameter values, used in parameter exploration experiments
Key Concepts
- Data Ports: Input and output data connections that enable data flow between tasks
- Parameters: Configurable values that modify task behavior and can be explored in experiments
- Implementation: Python file containing the actual task execution logic
- Dependencies: External files and directories required for task execution
- Virtual Environment: Isolated Python environment with specific package requirements
- Task Types: Optional categorization for organizing tasks by functionality
Task Parameters in Experiments
Task parameters defined with range specifications can be automatically explored in experiments using different parameter exploration strategies like grid search or random search.
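To make the two strategies concrete, here is a minimal Python sketch of grid search and random search over the ranges from the train_model example. The functions grid_search and random_search are illustrative names, not ExtremeXP APIs, and the sketch assumes inclusive integer bounds; how experiments are actually configured is outside the scope of this page.

```python
import itertools
import random

# Ranges copied from the train_model example; bounds are assumed inclusive.
PARAM_RANGES = {
    "max_depth": (3, 30),
    "n_estimators": (5, 50),
}


def grid_search(ranges: dict, step: int = 1) -> list[dict]:
    """Enumerate every combination of integer values, `step` apart."""
    names = list(ranges)
    axes = [range(low, high + 1, step) for low, high in ranges.values()]
    return [dict(zip(names, combo)) for combo in itertools.product(*axes)]


def random_search(ranges: dict, n_samples: int, seed: int = 0) -> list[dict]:
    """Draw `n_samples` configurations uniformly at random from the ranges."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n_samples)
    ]


print(len(grid_search(PARAM_RANGES)))        # 28 * 46 = 1288 combinations
print(len(random_search(PARAM_RANGES, 10)))  # 10
```

The grid grows multiplicatively with every added parameter, which is why random search with a fixed sample budget is often preferred once more than a few ranges are explored.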
Virtual Environment Best Practices
Each task can have its own virtual environment with specific package requirements, ensuring isolation and reproducibility across different computational environments.
Dependency Management
Use glob patterns for dependencies to automatically include all necessary files and subdirectories, ensuring tasks have access to all required resources.