adamops.data package
Submodules
adamops.data.feature_engineering module
AdamOps Feature Engineering Module
Provides encoding, scaling, feature selection, and auto feature generation.
- adamops.data.feature_engineering.auto_feature_engineering(df: DataFrame, target: str | None = None, polynomial: bool = False, interactions: bool = False, datetime_cols: List[str] | None = None) DataFrame[source]
Automatic feature engineering pipeline.
- adamops.data.feature_engineering.encode(df: DataFrame, columns: List[str], method: str = 'onehot', **kwargs) DataFrame[source]
Encode categorical columns with specified method.
- adamops.data.feature_engineering.encode_label(df: DataFrame, columns: List[str]) Tuple[DataFrame, Dict][source]
Label encode categorical columns. Returns df and encoders dict.
- adamops.data.feature_engineering.encode_onehot(df: DataFrame, columns: List[str], drop_first: bool = False, handle_unknown: str = 'ignore') DataFrame[source]
One-hot encode categorical columns.
- adamops.data.feature_engineering.encode_ordinal(df: DataFrame, columns: List[str], categories: Dict[str, List] | None = None) DataFrame[source]
Ordinal encode columns with optional category order.
- adamops.data.feature_engineering.encode_target(df: DataFrame, columns: List[str], target: str, smoothing: float = 1.0) DataFrame[source]
Target encode categorical columns.
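The reference does not state how `smoothing` is applied. A common choice is additive smoothing, which pulls each category's mean toward the global mean in proportion to how few rows the category has. The sketch below shows that technique in plain Python; the helper name `target_encode` and the formula are assumptions for illustration, not AdamOps's confirmed implementation.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=1.0):
    """Return one smoothed target mean per row (additive smoothing)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # (category sum + smoothing * global mean) / (category count + smoothing)
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return [encoding[cat] for cat in categories]


cities = ["a", "a", "b", "b", "b", "c"]
clicked = [1, 0, 1, 1, 1, 0]
encoded = target_encode(cities, clicked, smoothing=1.0)
unsmoothed = target_encode(cities, clicked, smoothing=0.0)
```

With `smoothing=0.0` each category receives its raw mean; larger values move rare categories such as `"c"` closer to the global mean.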
- adamops.data.feature_engineering.generate_datetime_features(df: DataFrame, column: str) DataFrame[source]
Extract datetime features from a column.
- adamops.data.feature_engineering.generate_interactions(df: DataFrame, columns: List[str], operations: List[str] = ['multiply']) DataFrame[source]
Generate interaction features between columns.
- adamops.data.feature_engineering.generate_polynomial(df: DataFrame, columns: List[str], degree: int = 2, include_bias: bool = False) DataFrame[source]
Generate polynomial features.
- adamops.data.feature_engineering.scale(df: DataFrame, method: str = 'standard', columns: List[str] | None = None) DataFrame[source]
Scale numeric columns with specified method.
- adamops.data.feature_engineering.scale_minmax(df: DataFrame, columns: List[str] | None = None) DataFrame[source]
Scale features to [0, 1] range.
- adamops.data.feature_engineering.scale_robust(df: DataFrame, columns: List[str] | None = None) DataFrame[source]
Scale with median and IQR (robust to outliers).
- adamops.data.feature_engineering.scale_standard(df: DataFrame, columns: List[str] | None = None) DataFrame[source]
Standardize features (zero mean, unit variance).
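The three scalers documented above differ only in which statistics they subtract and divide by. The following plain-Python sketch illustrates the usual formulas on a single column; the helper names are hypothetical, and details such as the quantile method or the use of the population standard deviation are assumptions rather than documented AdamOps behavior.

```python
import statistics


def minmax(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def robust(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr if iqr else 0.0 for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]
minmax_scaled = minmax(data)
standard_scaled = standardize(data)
robust_scaled = robust(data)
```

On this sample the outlier `100.0` compresses the other min-max values toward 0, while the robust version keeps them evenly spaced around the median.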
- adamops.data.feature_engineering.select_by_correlation(df: DataFrame, threshold: float = 0.9, target: str | None = None) DataFrame[source]
Remove highly correlated features.
- adamops.data.feature_engineering.select_by_importance(df: DataFrame, target: str, n_features: int = 10, task: str = 'classification') DataFrame[source]
Select features by tree-based importance.
adamops.data.loaders module
AdamOps Data Loaders Module
Provides comprehensive data loading capabilities from various sources:
- CSV files with auto-encoding detection
- Excel files (.xlsx, .xls)
- JSON files
- SQL databases (SQLite, PostgreSQL, MySQL)
- API/URL endpoints
- Compressed files (.zip, .gz)
- adamops.data.loaders.detect_encoding(filepath: str | Path, sample_size: int = 10000) str[source]
Detect the encoding of a file.
- Parameters:
filepath – Path to the file.
sample_size – Number of bytes to sample for detection.
- Returns:
Detected encoding (e.g., ‘utf-8’, ‘latin-1’).
- Return type:
str
Example
>>> encoding = detect_encoding("data.csv")
>>> print(encoding)
utf-8
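The reference does not say which detection strategy is used. As a rough illustration of what sampling the first `sample_size` bytes can look like, the sketch below tries a short list of candidate encodings in order. The helper `guess_encoding` and its candidate list are hypothetical; a real detector may rely on a statistical library instead.

```python
import codecs
import os
import tempfile


def guess_encoding(filepath, sample_size=10000, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes the sampled bytes."""
    with open(filepath, "rb") as fh:
        sample = fh.read(sample_size)
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)()
        try:
            # final=False tolerates a multi-byte character cut off by the sample boundary.
            decoder.decode(sample, final=False)
            return name
        except UnicodeDecodeError:
            continue
    return None


def _write_temp(raw_bytes):
    """Write bytes to a temporary .csv file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as fh:
        fh.write(raw_bytes)
    return path


utf8_path = _write_temp("name\ncafé\n".encode("utf-8"))
latin1_path = _write_temp("name\ncafé\n".encode("latin-1"))
utf8_guess = guess_encoding(utf8_path)
latin1_guess = guess_encoding(latin1_path)
os.remove(utf8_path)
os.remove(latin1_path)
```

Because `latin-1` can decode any byte sequence, it only makes sense as the last candidate in such a list.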
- adamops.data.loaders.get_excel_sheet_names(filepath: str | Path) List[str][source]
Get sheet names from an Excel file.
- Parameters:
filepath – Path to the Excel file.
- Returns:
List of sheet names.
- Return type:
List[str]
- adamops.data.loaders.load_api(url: str, method: str = 'GET', params: Dict | None = None, data: Dict | None = None, json_data: Dict | None = None, headers: Dict | None = None, auth: tuple | None = None, timeout: int = 30, data_key: str | None = None, paginate: bool = False, page_key: str = 'page', limit_key: str = 'limit', limit: int = 100, max_pages: int = 100) DataFrame[source]
Load data from a REST API with pagination support.
- Parameters:
url – API endpoint URL.
method – HTTP method.
params – Query parameters.
data – Form data.
json_data – JSON body data.
headers – HTTP headers.
auth – Authentication tuple.
timeout – Request timeout.
data_key – Key in response containing the data array.
paginate – Whether to paginate through results.
page_key – Parameter name for page number.
limit_key – Parameter name for page size.
limit – Number of items per page.
max_pages – Maximum number of pages to fetch.
- Returns:
Loaded data.
- Return type:
pd.DataFrame
Example
>>> df = load_api(
...     "https://api.example.com/users",
...     headers={"Authorization": "Bearer token"},
...     data_key="users",
...     paginate=True
... )
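To make the interplay of `page_key`, `limit_key`, `limit`, `max_pages`, and `data_key` concrete, here is a minimal sketch of a page-number pagination loop. It takes a `fetch_page` callable in place of a real HTTP request so it runs offline. The function `fetch_all`, the 1-based page numbering, and the stop conditions (empty or short page) are assumptions, not documented behavior of `load_api`.

```python
def fetch_all(fetch_page, data_key=None, page_key="page",
              limit_key="limit", limit=100, max_pages=100):
    """Collect records page by page until a page is empty or short."""
    records = []
    for page in range(1, max_pages + 1):
        payload = fetch_page({page_key: page, limit_key: limit})
        items = payload[data_key] if data_key else payload
        if not items:
            break
        records.extend(items)
        if len(items) < limit:
            # A short page is taken to mean there is nothing left to fetch.
            break
    return records


# Stand-in for an API endpoint that serves 250 users.
USERS = [{"id": i} for i in range(250)]


def fake_api(params):
    start = (params["page"] - 1) * params["limit"]
    return {"users": USERS[start:start + params["limit"]]}


rows = fetch_all(fake_api, data_key="users")
capped = fetch_all(fake_api, data_key="users", max_pages=2)
```

Here `max_pages` acts as a safety cap: the second call stops after two pages even though more data is available.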
- adamops.data.loaders.load_auto(source: str | Path, **kwargs) DataFrame[source]
Automatically detect and load data from various sources.
Supports CSV, Excel, JSON, SQL, and compressed files. Automatically detects the format based on file extension or URL.
- Parameters:
source – Path to file, URL, or SQL connection string.
**kwargs – Additional arguments passed to the appropriate loader.
- Returns:
Loaded data.
- Return type:
pd.DataFrame
Example
>>> df = load_auto("data.csv")
>>> df = load_auto("https://example.com/data.json")
>>> df = load_auto("data.xlsx")
- adamops.data.loaders.load_compressed(filepath: str | Path, format: str = 'csv', compression: str | None = None, **kwargs) DataFrame[source]
Load data from a compressed file (.zip, .gz, .bz2, .xz).
- Parameters:
filepath – Path to the compressed file.
format – Data format inside the archive (‘csv’, ‘json’, ‘excel’).
compression – Compression type. Auto-detected if None.
**kwargs – Additional arguments for the format loader.
- Returns:
Loaded data.
- Return type:
pd.DataFrame
Example
>>> df = load_compressed("data.csv.gz")
>>> df = load_compressed("archive.zip", format="csv")
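When `compression` is None the type is auto-detected. One plausible way to do that is to dispatch on the file extension, as in the standard-library sketch below. The helper `read_compressed_csv` and the `OPENERS` table are hypothetical, it covers only single-file formats (.gz, .bz2, .xz), and it returns a list of dicts rather than a DataFrame.

```python
import bz2
import csv
import gzip
import lzma
import os
import tempfile

# Map each supported extension to the stdlib function that opens it.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def read_compressed_csv(filepath):
    """Pick a decompressor from the file extension and parse the CSV inside."""
    extension = os.path.splitext(filepath)[1].lower()
    opener = OPENERS.get(extension)
    if opener is None:
        raise ValueError(f"unsupported compression: {extension!r}")
    with opener(filepath, "rt", encoding="utf-8", newline="") as fh:
        return list(csv.DictReader(fh))


# Round-trip a small gzip-compressed CSV through a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.csv.gz")
with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
    fh.write("id,name\n1,a\n2,b\n")
rows = read_compressed_csv(path)
os.remove(path)
os.rmdir(tmp_dir)
```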
- adamops.data.loaders.load_csv(filepath: str | Path, encoding: str | None = None, auto_detect_encoding: bool = True, sep: str = ',', header: int | List[int] | str = 'infer', index_col: int | str | List | None = None, usecols: List | None = None, dtype: Dict | None = None, parse_dates: bool | List | None = None, na_values: List | None = None, nrows: int | None = None, skiprows: int | List | None = None, low_memory: bool = True, **kwargs) DataFrame[source]
Load data from a CSV file with auto-encoding detection.
- Parameters:
filepath – Path to the CSV file.
encoding – File encoding. If None and auto_detect_encoding is True, encoding will be detected automatically.
auto_detect_encoding – Whether to auto-detect encoding.
sep – Column separator.
header – Row number(s) to use as column names.
index_col – Column(s) to use as index.
usecols – Columns to load.
dtype – Data types for columns.
parse_dates – Columns to parse as dates.
na_values – Additional values to treat as NA.
nrows – Number of rows to read.
skiprows – Rows to skip.
low_memory – Use low memory mode.
**kwargs – Additional arguments passed to pd.read_csv.
- Returns:
Loaded data.
- Return type:
pd.DataFrame
Example
>>> df = load_csv("data.csv")
>>> df = load_csv("data.csv", usecols=["id", "name", "value"])
>>> df = load_csv("data.csv", parse_dates=["date_column"])
- adamops.data.loaders.load_excel(filepath: str | Path, sheet_name: str | int | List | None = 0, header: int | List[int] | None = 0, index_col: int | str | List | None = None, usecols: str | List | None = None, dtype: Dict | None = None, parse_dates: bool | List | None = None, na_values: List | None = None, nrows: int | None = None, skiprows: int | List | None = None, **kwargs) DataFrame | Dict[str, DataFrame][source]
Load data from an Excel file (.xlsx, .xls).
- Parameters:
filepath – Path to the Excel file.
sheet_name – Sheet name or index, or list for multiple sheets. Use None to read all sheets.
header – Row number(s) to use as column names.
index_col – Column(s) to use as index.
usecols – Columns to load.
dtype – Data types for columns.
parse_dates – Columns to parse as dates.
na_values – Additional values to treat as NA.
nrows – Number of rows to read.
skiprows – Rows to skip.
**kwargs – Additional arguments passed to pd.read_excel.
- Returns:
Loaded data.
- Return type:
pd.DataFrame or Dict[str, pd.DataFrame]
Example
>>> df = load_excel("data.xlsx")
>>> df = load_excel("data.xlsx", sheet_name="Sheet1")
>>> sheets = load_excel("data.xlsx", sheet_name=None)  # All sheets
- adamops.data.loaders.load_json(filepath: str | Path, orient: str | None = None, lines: bool = False, encoding: str = 'utf-8', **kwargs) DataFrame[source]
Load data from a JSON file.
- Parameters:
filepath – Path to the JSON file.
orient – JSON structure orientation. Options: ‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’
lines – Read file as line-delimited JSON.
encoding – File encoding.
**kwargs – Additional arguments passed to pd.read_json.
- Returns:
Loaded data.
- Return type:
pd.DataFrame
Example
>>> df = load_json("data.json")
>>> df = load_json("data.jsonl", lines=True)
- adamops.data.loaders.load_json_nested(filepath: str | Path, record_path: str | List[str] | None = None, meta: List[str] | None = None, max_level: int | None = None, encoding: str = 'utf-8') DataFrame[source]
Load nested JSON data and normalize it to a flat DataFrame.
- Parameters:
filepath – Path to the JSON file.
record_path – Path to the records in the JSON structure.
meta – Fields to include from higher level.
max_level – Maximum normalization depth.
encoding – File encoding.
- Returns:
Normalized data.
- Return type:
pd.DataFrame
Example
>>> # For JSON like: {"data": [{"id": 1, "info": {"name": "A"}}]}
>>> df = load_json_nested("data.json", record_path="data")
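Normalizing nested JSON means turning nested objects into flat columns. The sketch below shows the idea on the payload from the example, using a small recursive helper. The `flatten` function and the `.` separator are illustrative assumptions (the separator matches the default used by `pandas.json_normalize`); the column names AdamOps produces are not stated in this reference.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


payload = {"data": [{"id": 1, "info": {"name": "A"}}]}
# record_path="data" corresponds to selecting the list under the "data" key.
rows = [flatten(record) for record in payload["data"]]
```

Each resulting dict corresponds to one row, with `info.name` as a flat column.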
- adamops.data.loaders.load_sql(query: str, connection_string: str, params: Dict | None = None, index_col: str | List[str] | None = None, parse_dates: List[str] | Dict | None = None, chunksize: int | None = None, **kwargs) DataFrame | SQLiteDatabase[source]
Load data from a SQL database.
Supports SQLite, PostgreSQL, MySQL, and other SQLAlchemy-compatible databases.
- Parameters:
query – SQL query to execute.
connection_string – Database connection string. Examples:
- SQLite: "sqlite:///database.db"
- PostgreSQL: "postgresql://user:pass@host:port/db"
- MySQL: "mysql+pymysql://user:pass@host:port/db"
params – Query parameters.
index_col – Column(s) to use as index.
parse_dates – Columns to parse as dates.
chunksize – Number of rows per chunk (for large datasets).
**kwargs – Additional arguments passed to pd.read_sql.
- Returns:
Loaded data.
- Return type:
pd.DataFrame or Iterator
Example
>>> df = load_sql("SELECT * FROM users", "sqlite:///app.db")
>>> df = load_sql(
...     "SELECT * FROM orders WHERE date > :date",
...     "postgresql://user:pass@localhost:5432/shop",
...     params={"date": "2023-01-01"}
... )
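The `:date` placeholder in the example is a named bound parameter: the value is passed separately from the SQL text instead of being interpolated into it. The sketch below demonstrates the same pattern with the standard-library `sqlite3` module and an in-memory table. The table contents are made up for illustration, and how `load_sql` forwards `params` to each database driver is not specified here.

```python
import sqlite3

# Build a throwaway in-memory database with a few orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2022-12-31"), (2, "2023-02-01"), (3, "2023-03-15")],
)

# The dict supplies the value for the :date placeholder.
rows = conn.execute(
    "SELECT id FROM orders WHERE date > :date ORDER BY id",
    {"date": "2023-01-01"},
).fetchall()
conn.close()
```

Binding values this way avoids building SQL strings by hand and guards against injection.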
- adamops.data.loaders.load_sql_table(table_name: str, connection_string: str, schema: str | None = None, columns: List[str] | None = None, index_col: str | List[str] | None = None, chunksize: int | None = None, **kwargs) DataFrame[source]
Load an entire table from a SQL database.
- Parameters:
table_name – Name of the table to load.
connection_string – Database connection string.
schema – Database schema.
columns – Columns to load (None for all).
index_col – Column(s) to use as index.
chunksize – Number of rows per chunk.
**kwargs – Additional arguments.
- Returns:
Loaded data.
- Return type:
pd.DataFrame
- adamops.data.loaders.load_url(url: str, format: str = 'csv', params: Dict | None = None, headers: Dict | None = None, auth: tuple | None = None, timeout: int = 30, **kwargs) DataFrame[source]
Load data from a URL.
- Parameters:
url – URL to load data from.
format – Data format (‘csv’, ‘json’, ‘excel’).
params – Query parameters.
headers – HTTP headers.
auth – Authentication tuple (username, password).
timeout – Request timeout in seconds.
**kwargs – Additional arguments for the format loader.
- Returns:
Loaded data.
- Return type:
pd.DataFrame
Example
>>> df = load_url("https://example.com/data.csv")
>>> df = load_url(
...     "https://api.example.com/data",
...     format="json",
...     headers={"Authorization": "Bearer token"}
... )
- adamops.data.loaders.save_csv(df: DataFrame, filepath: str | Path, index: bool = False, encoding: str = 'utf-8', **kwargs) None[source]
Save DataFrame to CSV file.
- Parameters:
df – DataFrame to save.
filepath – Output file path.
index – Whether to include index.
encoding – File encoding.
**kwargs – Additional arguments passed to df.to_csv.
- adamops.data.loaders.save_excel(df: DataFrame, filepath: str | Path, sheet_name: str = 'Sheet1', index: bool = False, **kwargs) None[source]
Save DataFrame to Excel file.
- Parameters:
df – DataFrame to save.
filepath – Output file path.
sheet_name – Name of the sheet.
index – Whether to include index.
**kwargs – Additional arguments.
- adamops.data.loaders.save_json(df: DataFrame, filepath: str | Path, orient: str = 'records', indent: int = 2, **kwargs) None[source]
Save DataFrame to JSON file.
- Parameters:
df – DataFrame to save.
filepath – Output file path.
orient – JSON structure orientation.
indent – Indentation level.
**kwargs – Additional arguments.
adamops.data.preprocessors module
AdamOps Data Preprocessors Module
Provides data cleaning capabilities: missing values, outliers, duplicates, type conversion.
- adamops.data.preprocessors.clean_text(df: DataFrame, columns: List[str] | None = None, lowercase: bool = True, strip: bool = True, remove_special: bool = False) DataFrame[source]
Clean text columns.
- adamops.data.preprocessors.convert_types(df: DataFrame, type_mapping: Dict[str, str] | None = None, auto_convert: bool = True, datetime_columns: List[str] | None = None) DataFrame[source]
Convert column types.
- Parameters:
df – DataFrame to process.
type_mapping – {column: target_type}
auto_convert – Auto-detect and convert types.
datetime_columns – Columns to parse as datetime.
- adamops.data.preprocessors.handle_duplicates(df: DataFrame, subset: List[str] | None = None, keep: str = 'first') DataFrame[source]
Remove duplicate rows.
- adamops.data.preprocessors.handle_missing(df: DataFrame, strategy: str = 'mean', columns: List[str] | None = None, fill_value: any | None = None, n_neighbors: int = 5) DataFrame[source]
Handle missing values.
- Parameters:
df – DataFrame to process.
strategy – ‘drop’, ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘ffill’, ‘bfill’, ‘knn’, ‘iterative’
columns – Columns to process (None for all).
fill_value – Value for ‘constant’ strategy.
n_neighbors – Neighbors for KNN.
- Returns:
Processed DataFrame.
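The simpler strategies listed for `handle_missing` can be sketched on a single column in plain Python. The helper `impute` below is hypothetical, represents missing values as `None` rather than NaN, and omits the model-based strategies ('knn', 'iterative'); it is meant to illustrate the idea, not to mirror the library's code.

```python
import statistics


def impute(values, strategy="mean", fill_value=None):
    """Fill or drop None entries in one column according to the strategy."""
    present = [v for v in values if v is not None]
    if strategy == "drop":
        return present
    if strategy == "ffill":
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)  # leading gaps stay None
        return filled
    if strategy == "mean":
        replacement = statistics.fmean(present)
    elif strategy == "median":
        replacement = statistics.median(present)
    elif strategy == "mode":
        replacement = statistics.mode(present)
    elif strategy == "constant":
        replacement = fill_value
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [replacement if v is None else v for v in values]


ages = [20, None, 30, 40, None]
by_mean = impute(ages, "mean")
by_ffill = impute(ages, "ffill")
by_constant = impute(ages, "constant", fill_value=0)
```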
- adamops.data.preprocessors.handle_outliers(df: DataFrame, method: str = 'iqr', columns: List[str] | None = None, threshold: float = 1.5, action: str = 'clip', contamination: float = 0.1) DataFrame[source]
Handle outliers.
- Parameters:
df – DataFrame to process.
method – ‘iqr’, ‘zscore’, ‘isolation_forest’
columns – Columns to process (None for numeric).
threshold – IQR multiplier or Z-score threshold.
action – ‘clip’, ‘drop’, ‘nan’
contamination – For isolation forest.
- Returns:
Processed DataFrame.
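For `method='iqr'` with `action='clip'`, the conventional rule computes fences at `Q1 - threshold * IQR` and `Q3 + threshold * IQR` and pulls values outside them back to the nearest fence. The sketch below applies that rule to one column; the helper `clip_iqr` and the quantile interpolation method are assumptions, since the reference does not spell them out.

```python
import statistics


def clip_iqr(values, threshold=1.5):
    """Clip values to [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return [min(max(v, lower), upper) for v in values]


readings = [1.0, 2.0, 3.0, 4.0, 100.0]
clipped = clip_iqr(readings)
```

In this sample Q1 is 2 and Q3 is 4, so the upper fence is 7 and the reading of 100 is clipped to it.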
adamops.data.splitters module
AdamOps Data Splitters Module
Provides data splitting: train/test, train/val/test, time-series, K-Fold, stratified.
- adamops.data.splitters.create_cv_splits(X: DataFrame | ndarray, y: Series | ndarray | None = None, method: str = 'kfold', n_splits: int = 5, **kwargs) List[Tuple][source]
Create cross-validation splits.
- Parameters:
X – Features.
y – Target.
method – ‘kfold’, ‘stratified’, ‘timeseries’, ‘group’
n_splits – Number of folds.
- Returns:
List of (train_idx, test_idx) tuples.
- adamops.data.splitters.get_fold_data(X: DataFrame | ndarray, y: Series | ndarray | None, train_idx: ndarray, test_idx: ndarray) Tuple[source]
Get train/test data for a fold.
- adamops.data.splitters.split_group_kfold(X: DataFrame | ndarray, y: Series | ndarray, groups: Series | ndarray, n_splits: int = 5) Iterator[Tuple][source]
Group K-Fold split. Ensures groups are not split across train/test.
- Yields:
(train_idx, test_idx) tuples.
- adamops.data.splitters.split_kfold(X: DataFrame | ndarray, y: Series | ndarray | None = None, n_splits: int = 5, shuffle: bool = True, random_state: int = 42) Iterator[Tuple][source]
K-Fold cross-validation split.
- Yields:
(train_idx, test_idx) tuples.
- adamops.data.splitters.split_stratified_kfold(X: DataFrame | ndarray, y: Series | ndarray, n_splits: int = 5, shuffle: bool = True, random_state: int = 42) Iterator[Tuple][source]
Stratified K-Fold cross-validation split.
Preserves class distribution in each fold.
- Yields:
(train_idx, test_idx) tuples.
- adamops.data.splitters.split_timeseries(X: DataFrame | ndarray, y: Series | ndarray | None = None, n_splits: int = 5, test_size: int | None = None, gap: int = 0) Iterator[Tuple][source]
Time series split for temporal data.
- Parameters:
X – Features.
y – Target.
n_splits – Number of splits.
test_size – Test set size per split.
gap – Gap between train and test.
- Yields:
(train_idx, test_idx) tuples.
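A time-series split keeps every training index earlier than every test index, and `gap` leaves unused samples between the two to limit leakage. The generator below is a sketch of an expanding-window scheme modeled on scikit-learn's `TimeSeriesSplit` semantics; the name `timeseries_splits` is hypothetical, and whether AdamOps follows exactly these rules is an assumption.

```python
def timeseries_splits(n_samples, n_splits=5, test_size=None, gap=0):
    """Yield (train_idx, test_idx) lists with train strictly before test."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("not enough samples for the requested splits and gap")
    for test_start in range(first_test_start, n_samples, test_size):
        # Training data ends `gap` samples before the test window starts.
        train_idx = list(range(0, test_start - gap))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(timeseries_splits(10, n_splits=3, test_size=2, gap=1))
```

With 10 samples, 3 splits, a test size of 2, and a gap of 1, the first split trains on indices 0 to 2 and tests on 4 and 5, skipping index 3.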
- adamops.data.splitters.split_train_test(X: DataFrame | ndarray, y: Series | ndarray | None = None, test_size: float = 0.2, random_state: int = 42, stratify: bool = False, shuffle: bool = True) Tuple[source]
Split data into train and test sets.
- Parameters:
X – Features.
y – Target (optional).
test_size – Test set proportion.
random_state – Random seed.
stratify – Stratify by target.
shuffle – Shuffle before splitting.
- Returns:
(X_train, X_test) or (X_train, X_test, y_train, y_test)
- adamops.data.splitters.split_train_val_test(X: DataFrame | ndarray, y: Series | ndarray | None = None, train_size: float = 0.7, val_size: float = 0.15, test_size: float = 0.15, random_state: int = 42, stratify: bool = False) Tuple[source]
Split data into train, validation, and test sets.
- Returns:
(X_train, X_val, X_test) or (X_train, X_val, X_test, y_train, y_val, y_test)
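The three size arguments are proportions that are expected to cover the whole dataset. A minimal sketch of a seeded three-way split, without stratification, is shown below; the helper `split_three_ways`, the rounding rule, and the requirement that the sizes sum to 1 are assumptions made for illustration.

```python
import random


def split_three_ways(items, train_size=0.7, val_size=0.15, test_size=0.15,
                     random_state=42):
    """Shuffle with a fixed seed, then cut into train, validation, and test."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, val_size and test_size must sum to 1")
    shuffled = list(items)
    random.Random(random_state).shuffle(shuffled)
    n_train = round(len(shuffled) * train_size)
    n_val = round(len(shuffled) * val_size)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_three_ways(range(100))
```

Reusing the same `random_state` reproduces the same partition, which keeps experiments comparable.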
adamops.data.validators module
AdamOps Data Validators Module
Provides data validation: type validation, missing value checks, duplicate detection, shape validation, and statistical checks.
- class adamops.data.validators.ColumnStats(name: str, dtype: str, count: int, missing_count: int, missing_pct: float, unique_count: int, unique_pct: float, mean: float | None = None, std: float | None = None, min: float | None = None, max: float | None = None)[source]
Bases: object
Statistics for a column.
- class adamops.data.validators.DataValidator(missing_threshold: float = 0.5, unique_threshold: float = 0.95)[source]
Bases: object
Data validator for DataFrames.
- class adamops.data.validators.ValidationIssue(severity: str, category: str, column: str | None, message: str, details: Dict | None = None)[source]
Bases: object
Represents a validation issue.
- class adamops.data.validators.ValidationReport(timestamp: str, shape: Tuple[int, int], memory_usage: float, issues: List[ValidationIssue] = <factory>, column_stats: Dict[str, ColumnStats] = <factory>, duplicate_rows: int = 0, passed: bool = True)[source]
Bases: object
Complete validation report.
- column_stats: Dict[str, ColumnStats]
- issues: List[ValidationIssue]
- adamops.data.validators.check_duplicates(df: DataFrame, subset: List[str] | None = None) DataFrame[source]
Get duplicate rows.
- adamops.data.validators.validate(df: DataFrame, **kwargs) ValidationReport[source]
Validate a DataFrame.
Module contents
AdamOps Data Module
Provides comprehensive data handling capabilities:
- loaders: Load data from various sources (CSV, Excel, JSON, SQL, API, compressed files)
- validators: Validate data types, missing values, duplicates, shapes, and statistics
- preprocessors: Clean data (handle missing values, outliers, duplicates, type conversion)
- feature_engineering: Encode, scale, and generate features
- splitters: Split data for training and evaluation