adamops.data package

Submodules

adamops.data.feature_engineering module

AdamOps Feature Engineering Module

Provides encoding, scaling, feature selection, and auto feature generation.

adamops.data.feature_engineering.auto_feature_engineering(df: DataFrame, target: str | None = None, polynomial: bool = False, interactions: bool = False, datetime_cols: List[str] | None = None) → DataFrame

Automatic feature engineering pipeline.

adamops.data.feature_engineering.encode(df: DataFrame, columns: List[str], method: str = 'onehot', **kwargs) → DataFrame

Encode categorical columns with specified method.

adamops.data.feature_engineering.encode_label(df: DataFrame, columns: List[str]) → Tuple[DataFrame, Dict]

Label encode categorical columns. Returns df and encoders dict.

adamops.data.feature_engineering.encode_onehot(df: DataFrame, columns: List[str], drop_first: bool = False, handle_unknown: str = 'ignore') → DataFrame

One-hot encode categorical columns.

adamops.data.feature_engineering.encode_ordinal(df: DataFrame, columns: List[str], categories: Dict[str, List] | None = None) → DataFrame

Ordinal encode columns with optional category order.

adamops.data.feature_engineering.encode_target(df: DataFrame, columns: List[str], target: str, smoothing: float = 1.0) → DataFrame

Target encode categorical columns.
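The reference does not state which smoothing formula encode_target uses. As a minimal sketch, the following assumes the common additive form, where each category mean is shrunk toward the global target mean and `smoothing` acts as a pseudo-count. The helper name `target_encode` and the sample data are hypothetical and not part of adamops.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=1.0):
    """Return one smoothed target mean per input row (assumed formula)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean; categories with
    # few rows are pulled harder because `smoothing` weighs more against them.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return [encoding[cat] for cat in categories]


cities = ["a", "a", "b", "b", "b", "c"]
bought = [1, 0, 1, 1, 1, 0]
encoded = target_encode(cities, bought, smoothing=1.0)
```

With `smoothing=0` the result is the plain per-category mean; larger values move every category closer to the global mean.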

adamops.data.feature_engineering.generate_datetime_features(df: DataFrame, column: str) → DataFrame

Extract datetime features from a column.

adamops.data.feature_engineering.generate_interactions(df: DataFrame, columns: List[str], operations: List[str] = ['multiply']) → DataFrame

Generate interaction features between columns.

adamops.data.feature_engineering.generate_polynomial(df: DataFrame, columns: List[str], degree: int = 2, include_bias: bool = False) → DataFrame

Generate polynomial features.

adamops.data.feature_engineering.scale(df: DataFrame, method: str = 'standard', columns: List[str] | None = None) → DataFrame

Scale numeric columns with specified method.

adamops.data.feature_engineering.scale_minmax(df: DataFrame, columns: List[str] | None = None) → DataFrame

Scale features to [0, 1] range.

adamops.data.feature_engineering.scale_robust(df: DataFrame, columns: List[str] | None = None) → DataFrame

Scale with median and IQR (robust to outliers).

adamops.data.feature_engineering.scale_standard(df: DataFrame, columns: List[str] | None = None) → DataFrame

Standardize features (zero mean, unit variance).
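The three scalers above differ only in which statistics they subtract and divide by. The sketch below shows that arithmetic on a single numeric column using the standard library. It is an illustration under stated assumptions, not the adamops implementation: the helper names `minmax`, `standardize`, and `robust` are hypothetical, the population standard deviation is assumed, and quartiles use linear interpolation.

```python
import statistics


def minmax(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# One outlier (100.0) dominates min-max and standard scaling,
# while the median/IQR version keeps the other values well spread.
column = [1.0, 2.0, 3.0, 4.0, 100.0]
minmax_scaled = minmax(column)
standardized = standardize(column)
robust_scaled = robust(column)
```

All three helpers divide by a spread measure, so a constant column would raise ZeroDivisionError here; a production scaler needs to handle that case explicitly.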

adamops.data.feature_engineering.select_by_correlation(df: DataFrame, threshold: float = 0.9, target: str | None = None) → DataFrame

Remove highly correlated features.

adamops.data.feature_engineering.select_by_importance(df: DataFrame, target: str, n_features: int = 10, task: str = 'classification') → DataFrame

Select features by tree-based importance.

adamops.data.feature_engineering.select_by_variance(df: DataFrame, threshold: float = 0.0, columns: List[str] | None = None) → DataFrame

Remove low variance features.

adamops.data.feature_engineering.select_features(df: DataFrame, target: str, method: str = 'importance', n_features: int = 10, **kwargs) → DataFrame

Select features using specified method.

adamops.data.loaders module

AdamOps Data Loaders Module

Provides comprehensive data loading capabilities from various sources:

  • CSV files with auto-encoding detection

  • Excel files (.xlsx, .xls)

  • JSON files

  • SQL databases (SQLite, PostgreSQL, MySQL)

  • API/URL endpoints

  • Compressed files (.zip, .gz)

adamops.data.loaders.detect_encoding(filepath: str | Path, sample_size: int = 10000) → str

Detect the encoding of a file.

Parameters:
  • filepath – Path to the file.

  • sample_size – Number of bytes to sample for detection.

Returns:

Detected encoding (e.g., ‘utf-8’, ‘latin-1’).

Return type:

str

Example

>>> encoding = detect_encoding("data.csv")
>>> print(encoding)
utf-8
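The reference does not say how detection is performed. One simple approach, sketched here as an assumption rather than as the library's method, is to read the first `sample_size` bytes and return the first candidate encoding that decodes them without error. The helper `guess_encoding` and its candidate list are hypothetical.

```python
import codecs
import os
import tempfile


def guess_encoding(filepath, sample_size=10000,
                   candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the sampled bytes."""
    with open(filepath, "rb") as fh:
        sample = fh.read(sample_size)
    # If the sample is shorter than requested we read the whole file, so an
    # incomplete trailing byte sequence is a real error, not a cut-off sample.
    whole_file = len(sample) < sample_size
    for encoding in candidates:
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            decoder.decode(sample, final=whole_file)
        except UnicodeDecodeError:
            continue
        return encoding
    return "latin-1"  # latin-1 maps every byte, so it is a safe fallback


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as fh:
        fh.write("caf\u00e9".encode("latin-1"))
    guessed = guess_encoding(path)
```

Trial decoding can only rule encodings out, so the order of `candidates` decides ties; statistical detectors trade that simplicity for better guesses on ambiguous input.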

adamops.data.loaders.get_excel_sheet_names(filepath: str | Path) → List[str]

Get sheet names from an Excel file.

Parameters:

filepath – Path to the Excel file.

Returns:

List of sheet names.

Return type:

List[str]

adamops.data.loaders.load_api(url: str, method: str = 'GET', params: Dict | None = None, data: Dict | None = None, json_data: Dict | None = None, headers: Dict | None = None, auth: tuple | None = None, timeout: int = 30, data_key: str | None = None, paginate: bool = False, page_key: str = 'page', limit_key: str = 'limit', limit: int = 100, max_pages: int = 100) → DataFrame

Load data from a REST API with pagination support.

Parameters:
  • url – API endpoint URL.

  • method – HTTP method.

  • params – Query parameters.

  • data – Form data.

  • json_data – JSON body data.

  • headers – HTTP headers.

  • auth – Authentication tuple.

  • timeout – Request timeout.

  • data_key – Key in response containing the data array.

  • paginate – Whether to paginate through results.

  • page_key – Parameter name for page number.

  • limit_key – Parameter name for page size.

  • limit – Number of items per page.

  • max_pages – Maximum number of pages to fetch.

Returns:

Loaded data.

Return type:

pd.DataFrame

Example

>>> df = load_api(
...     "https://api.example.com/users",
...     headers={"Authorization": "Bearer token"},
...     data_key="users",
...     paginate=True
... )
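To make the roles of data_key, page_key, limit_key, limit, and max_pages concrete, here is a minimal pagination loop that runs without a network. It is a sketch under assumptions the reference does not confirm: pages are numbered from 1, and a page shorter than `limit` marks the end of the data. `fetch_all_pages` and `fake_fetch` are hypothetical helpers; `fake_fetch` stands in for the HTTP request.

```python
def fetch_all_pages(fetch_page, data_key=None, page_key="page",
                    limit_key="limit", limit=100, max_pages=100):
    """Collect records page by page until a short page or max_pages."""
    records = []
    for page in range(1, max_pages + 1):  # assumption: 1-based page numbers
        payload = fetch_page({page_key: page, limit_key: limit})
        rows = payload[data_key] if data_key else payload
        records.extend(rows)
        if len(rows) < limit:  # assumption: a short page means no more data
            break
    return records


# Stand-in for an HTTP call: serves seven users in pages of `limit` rows.
USERS = [{"id": i} for i in range(1, 8)]


def fake_fetch(params):
    start = (params["page"] - 1) * params["limit"]
    return {"users": USERS[start:start + params["limit"]]}


rows = fetch_all_pages(fake_fetch, data_key="users", limit=3)
first_page_only = fetch_all_pages(fake_fetch, data_key="users",
                                  limit=3, max_pages=1)
```

The `max_pages` cap bounds the loop even when an endpoint never returns a short page.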

adamops.data.loaders.load_auto(source: str | Path, **kwargs) → DataFrame

Automatically detect and load data from various sources.

Supports CSV, Excel, JSON, SQL, and compressed files. Automatically detects the format based on file extension or URL.

Parameters:
  • source – Path to file, URL, or SQL connection string.

  • **kwargs – Additional arguments passed to the appropriate loader.

Returns:

Loaded data.

Return type:

pd.DataFrame

Example

>>> df = load_auto("data.csv")
>>> df = load_auto("https://example.com/data.json")
>>> df = load_auto("data.xlsx")

adamops.data.loaders.load_compressed(filepath: str | Path, format: str = 'csv', compression: str | None = None, **kwargs) → DataFrame

Load data from a compressed file (.zip, .gz, .bz2, .xz).

Parameters:
  • filepath – Path to the compressed file.

  • format – Data format inside the archive (‘csv’, ‘json’, ‘excel’).

  • compression – Compression type. Auto-detected if None.

  • **kwargs – Additional arguments for the format loader.

Returns:

Loaded data.

Return type:

pd.DataFrame

Example

>>> df = load_compressed("data.csv.gz")
>>> df = load_compressed("archive.zip", format="csv")
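As an illustration of what auto-detecting the compression type can look like, the sketch below infers it from the file suffix and reads a gzip-compressed CSV with the standard library. The suffix table and the helpers `infer_compression` and `read_gzip_csv` are assumptions made for this example, not adamops internals.

```python
import csv
import gzip
import os
import tempfile
from pathlib import Path

# Assumed mapping from file suffix to compression name.
SUFFIX_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz", ".zip": "zip"}


def infer_compression(filepath):
    """Return the compression implied by the last suffix, or None."""
    return SUFFIX_TO_COMPRESSION.get(Path(filepath).suffix.lower())


def read_gzip_csv(filepath, encoding="utf-8"):
    """Decompress on the fly and parse rows as dictionaries."""
    with gzip.open(filepath, mode="rt", encoding=encoding, newline="") as fh:
        return list(csv.DictReader(fh))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")
    # Write a small compressed CSV so the example is self-contained.
    with gzip.open(path, mode="wt", encoding="utf-8", newline="") as fh:
        csv.writer(fh).writerows([["id", "name"], [1, "a"], [2, "b"]])
    compression = infer_compression(path)
    rows = read_gzip_csv(path)
```

The csv module returns every field as a string; type conversion is a separate step.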

adamops.data.loaders.load_csv(filepath: str | Path, encoding: str | None = None, auto_detect_encoding: bool = True, sep: str = ',', header: int | List[int] | str = 'infer', index_col: int | str | List | None = None, usecols: List | None = None, dtype: Dict | None = None, parse_dates: bool | List | None = None, na_values: List | None = None, nrows: int | None = None, skiprows: int | List | None = None, low_memory: bool = True, **kwargs) → DataFrame

Load data from a CSV file with auto-encoding detection.

Parameters:
  • filepath – Path to the CSV file.

  • encoding – File encoding. If None and auto_detect_encoding is True, encoding will be detected automatically.

  • auto_detect_encoding – Whether to auto-detect encoding.

  • sep – Column separator.

  • header – Row number(s) to use as column names.

  • index_col – Column(s) to use as index.

  • usecols – Columns to load.

  • dtype – Data types for columns.

  • parse_dates – Columns to parse as dates.

  • na_values – Additional values to treat as NA.

  • nrows – Number of rows to read.

  • skiprows – Rows to skip.

  • low_memory – Use low memory mode.

  • **kwargs – Additional arguments passed to pd.read_csv.

Returns:

Loaded data.

Return type:

pd.DataFrame

Example

>>> df = load_csv("data.csv")
>>> df = load_csv("data.csv", usecols=["id", "name", "value"])
>>> df = load_csv("data.csv", parse_dates=["date_column"])

adamops.data.loaders.load_excel(filepath: str | Path, sheet_name: str | int | List | None = 0, header: int | List[int] | None = 0, index_col: int | str | List | None = None, usecols: str | List | None = None, dtype: Dict | None = None, parse_dates: bool | List | None = None, na_values: List | None = None, nrows: int | None = None, skiprows: int | List | None = None, **kwargs) → DataFrame | Dict[str, DataFrame]

Load data from an Excel file (.xlsx, .xls).

Parameters:
  • filepath – Path to the Excel file.

  • sheet_name – Sheet name or index, or list for multiple sheets. Use None to read all sheets.

  • header – Row number(s) to use as column names.

  • index_col – Column(s) to use as index.

  • usecols – Columns to load.

  • dtype – Data types for columns.

  • parse_dates – Columns to parse as dates.

  • na_values – Additional values to treat as NA.

  • nrows – Number of rows to read.

  • skiprows – Rows to skip.

  • **kwargs – Additional arguments passed to pd.read_excel.

Returns:

Loaded data.

Return type:

pd.DataFrame or Dict[str, pd.DataFrame]

Example

>>> df = load_excel("data.xlsx")
>>> df = load_excel("data.xlsx", sheet_name="Sheet1")
>>> sheets = load_excel("data.xlsx", sheet_name=None)  # All sheets

adamops.data.loaders.load_json(filepath: str | Path, orient: str | None = None, lines: bool = False, encoding: str = 'utf-8', **kwargs) → DataFrame

Load data from a JSON file.

Parameters:
  • filepath – Path to the JSON file.

  • orient – JSON structure orientation. Options: ‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’.

  • lines – Read file as line-delimited JSON.

  • encoding – File encoding.

  • **kwargs – Additional arguments passed to pd.read_json.

Returns:

Loaded data.

Return type:

pd.DataFrame

Example

>>> df = load_json("data.json")
>>> df = load_json("data.jsonl", lines=True)

adamops.data.loaders.load_json_nested(filepath: str | Path, record_path: str | List[str] | None = None, meta: List[str] | None = None, max_level: int | None = None, encoding: str = 'utf-8') → DataFrame

Load nested JSON data and normalize it to a flat DataFrame.

Parameters:
  • filepath – Path to the JSON file.

  • record_path – Path to the records in the JSON structure.

  • meta – Fields to include from higher level.

  • max_level – Maximum normalization depth.

  • encoding – File encoding.

Returns:

Normalized data.

Return type:

pd.DataFrame

Example

>>> # For JSON like: {"data": [{"id": 1, "info": {"name": "A"}}]}
>>> df = load_json_nested("data.json", record_path="data")
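Normalizing nested JSON means turning each nested object into dotted column names. The sketch below shows the idea on the JSON shape from the example above, using only the standard library. The helper `flatten` is hypothetical, and the "." separator is chosen to match the default of pandas.json_normalize; whether adamops delegates to that function is an assumption.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level of dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that {"info": {"name": "A"}} becomes "info.name".
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


payload = json.loads(
    '{"data": [{"id": 1, "info": {"name": "A"}},'
    ' {"id": 2, "info": {"name": "B"}}]}'
)
# record_path="data" corresponds to selecting the list under the "data" key.
rows = [flatten(record) for record in payload["data"]]
```

Each flattened dictionary corresponds to one row, with columns `id` and `info.name`.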

adamops.data.loaders.load_sql(query: str, connection_string: str, params: Dict | None = None, index_col: str | List[str] | None = None, parse_dates: List[str] | Dict | None = None, chunksize: int | None = None, **kwargs) → DataFrame | SQLiteDatabase

Load data from a SQL database.

Supports SQLite, PostgreSQL, MySQL, and other SQLAlchemy-compatible databases.

Parameters:
  • query – SQL query to execute.

  • connection_string – Database connection string. Examples:

      - SQLite: “sqlite:///database.db”
      - PostgreSQL: “postgresql://user:pass@host:port/db”
      - MySQL: “mysql+pymysql://user:pass@host:port/db”

  • params – Query parameters.

  • index_col – Column(s) to use as index.

  • parse_dates – Columns to parse as dates.

  • chunksize – Number of rows per chunk (for large datasets).

  • **kwargs – Additional arguments passed to pd.read_sql.

Returns:

Loaded data. When chunksize is set, an iterator that yields one DataFrame per chunk.

Return type:

pd.DataFrame or Iterator

Example

>>> df = load_sql("SELECT * FROM users", "sqlite:///app.db")
>>> df = load_sql(
...     "SELECT * FROM orders WHERE date > :date",
...     "postgresql://user:pass@localhost:5432/shop",
...     params={"date": "2023-01-01"}
... )
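The `:date` placeholder in the example above is a named bind parameter: the value travels separately from the SQL text instead of being formatted into it. The sketch below reproduces that pattern, plus chunked reading, with the standard-library sqlite3 module and an in-memory database. The table contents and the helper `iter_chunks` are hypothetical, and the sketch does not show how load_sql itself is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be converted to dicts
conn.execute("CREATE TABLE orders (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2022-12-31"), (2, "2023-02-01"), (3, "2023-03-15")],
)

# Named parameter: the driver binds the value, so no string formatting
# (and no SQL injection risk) is involved.
cursor = conn.execute(
    "SELECT id, date FROM orders WHERE date > :date ORDER BY id",
    {"date": "2023-01-01"},
)
rows = [dict(row) for row in cursor]


def iter_chunks(cursor, chunksize):
    """Yield lists of at most `chunksize` rows until the cursor is empty."""
    while True:
        chunk = cursor.fetchmany(chunksize)
        if not chunk:
            break
        yield [dict(row) for row in chunk]


chunks = list(iter_chunks(conn.execute("SELECT id FROM orders ORDER BY id"), 2))
conn.close()
```

Chunked iteration keeps memory bounded for large result sets, at the cost of handling several partial results instead of one.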

adamops.data.loaders.load_sql_table(table_name: str, connection_string: str, schema: str | None = None, columns: List[str] | None = None, index_col: str | List[str] | None = None, chunksize: int | None = None, **kwargs) → DataFrame

Load an entire table from a SQL database.

Parameters:
  • table_name – Name of the table to load.

  • connection_string – Database connection string.

  • schema – Database schema.

  • columns – Columns to load (None for all).

  • index_col – Column(s) to use as index.

  • chunksize – Number of rows per chunk.

  • **kwargs – Additional arguments.

Returns:

Loaded data.

Return type:

pd.DataFrame

adamops.data.loaders.load_url(url: str, format: str = 'csv', params: Dict | None = None, headers: Dict | None = None, auth: tuple | None = None, timeout: int = 30, **kwargs) → DataFrame

Load data from a URL.

Parameters:
  • url – URL to load data from.

  • format – Data format (‘csv’, ‘json’, ‘excel’).

  • params – Query parameters.

  • headers – HTTP headers.

  • auth – Authentication tuple (username, password).

  • timeout – Request timeout in seconds.

  • **kwargs – Additional arguments for the format loader.

Returns:

Loaded data.

Return type:

pd.DataFrame

Example

>>> df = load_url("https://example.com/data.csv")
>>> df = load_url(
...     "https://api.example.com/data",
...     format="json",
...     headers={"Authorization": "Bearer token"}
... )

adamops.data.loaders.save_csv(df: DataFrame, filepath: str | Path, index: bool = False, encoding: str = 'utf-8', **kwargs) → None

Save DataFrame to CSV file.

Parameters:
  • df – DataFrame to save.

  • filepath – Output file path.

  • index – Whether to include index.

  • encoding – File encoding.

  • **kwargs – Additional arguments passed to df.to_csv.

adamops.data.loaders.save_excel(df: DataFrame, filepath: str | Path, sheet_name: str = 'Sheet1', index: bool = False, **kwargs) → None

Save DataFrame to Excel file.

Parameters:
  • df – DataFrame to save.

  • filepath – Output file path.

  • sheet_name – Name of the sheet.

  • index – Whether to include index.

  • **kwargs – Additional arguments.

adamops.data.loaders.save_json(df: DataFrame, filepath: str | Path, orient: str = 'records', indent: int = 2, **kwargs) → None

Save DataFrame to JSON file.

Parameters:
  • df – DataFrame to save.

  • filepath – Output file path.

  • orient – JSON structure orientation.

  • indent – Indentation level.

  • **kwargs – Additional arguments.

adamops.data.preprocessors module

AdamOps Data Preprocessors Module

Provides data cleaning capabilities: missing values, outliers, duplicates, type conversion.

adamops.data.preprocessors.clean_text(df: DataFrame, columns: List[str] | None = None, lowercase: bool = True, strip: bool = True, remove_special: bool = False) → DataFrame

Clean text columns.

adamops.data.preprocessors.convert_types(df: DataFrame, type_mapping: Dict[str, str] | None = None, auto_convert: bool = True, datetime_columns: List[str] | None = None) → DataFrame

Convert column types.

Parameters:
  • df – DataFrame to process.

  • type_mapping – Mapping of column name to target type, {column: target_type}.

  • auto_convert – Auto-detect and convert types.

  • datetime_columns – Columns to parse as datetime.

adamops.data.preprocessors.handle_duplicates(df: DataFrame, subset: List[str] | None = None, keep: str = 'first') → DataFrame

Remove duplicate rows.

adamops.data.preprocessors.handle_missing(df: DataFrame, strategy: str = 'mean', columns: List[str] | None = None, fill_value: any | None = None, n_neighbors: int = 5) → DataFrame

Handle missing values.

Parameters:
  • df – DataFrame to process.

  • strategy – Imputation strategy. One of ‘drop’, ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘ffill’, ‘bfill’, ‘knn’, ‘iterative’.

  • columns – Columns to process (None for all).

  • fill_value – Value for ‘constant’ strategy.

  • n_neighbors – Number of neighbors for the ‘knn’ strategy.

Returns:

Processed DataFrame.
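The simpler strategies listed above can be illustrated on a single column represented as a Python list, with None standing in for a missing value. This is a sketch of the general techniques, not of the adamops code path; the helper `fill_missing` is hypothetical and covers only ‘drop’, ‘mean’, ‘median’, ‘mode’, ‘constant’, and ‘ffill’.

```python
import statistics


def fill_missing(values, strategy="mean", fill_value=None):
    """Return a copy of `values` with None entries handled by `strategy`."""
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        return observed
    if strategy == "ffill":
        # Carry the last observed value forward; leading gaps stay None.
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)
        return filled
    if strategy == "mean":
        replacement = statistics.fmean(observed)
    elif strategy == "median":
        replacement = statistics.median(observed)
    elif strategy == "mode":
        replacement = statistics.mode(observed)
    elif strategy == "constant":
        replacement = fill_value
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [replacement if v is None else v for v in values]


column = [1.0, None, 3.0, None, 8.0]
by_mean = fill_missing(column, "mean")
by_median = fill_missing(column, "median")
by_ffill = fill_missing(column, "ffill")
by_constant = fill_missing(column, "constant", fill_value=0.0)
```

Statistics are computed from the observed values only, so the fill value does not depend on how many entries are missing.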

adamops.data.preprocessors.handle_outliers(df: DataFrame, method: str = 'iqr', columns: List[str] | None = None, threshold: float = 1.5, action: str = 'clip', contamination: float = 0.1) → DataFrame

Handle outliers.

Parameters:
  • df – DataFrame to process.

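  • method – Detection method: ‘iqr’, ‘zscore’, or ‘isolation_forest’.

  • columns – Columns to process (None for all numeric columns).

  • threshold – IQR multiplier or Z-score threshold.

  • action – What to do with detected outliers: ‘clip’, ‘drop’, or ‘nan’.

  • contamination – Expected proportion of outliers; used only by the ‘isolation_forest’ method.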

Returns:

Processed DataFrame.
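For the default combination of method ‘iqr’ and action ‘clip’, the usual rule is to compute the quartiles, widen the interquartile range by `threshold`, and pull values outside those fences back to the nearest fence. The sketch below applies that rule to one column. It assumes linearly interpolated quartiles, and the helper `clip_iqr` is hypothetical rather than part of adamops.

```python
import statistics


def clip_iqr(values, threshold=1.5):
    """Clip values to [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return [min(max(v, lower), upper) for v in values]


# Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7.
clipped = clip_iqr([1.0, 2.0, 3.0, 4.0, 100.0])
```

Clipping keeps the row count unchanged, which matters when the column must stay aligned with a target; the ‘drop’ action removes rows instead.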

adamops.data.preprocessors.preprocess(df: DataFrame, missing_strategy: str = 'mean', outlier_method: str | None = None, remove_duplicates: bool = True, convert_types_auto: bool = True) → DataFrame

Full preprocessing pipeline.

adamops.data.splitters module

AdamOps Data Splitters Module

Provides data splitting: train/test, train/val/test, time-series, K-Fold, stratified.

adamops.data.splitters.create_cv_splits(X: DataFrame | ndarray, y: Series | ndarray | None = None, method: str = 'kfold', n_splits: int = 5, **kwargs) → List[Tuple]

Create cross-validation splits.

Parameters:
  • X – Features.

  • y – Target.

  • method – Split method: ‘kfold’, ‘stratified’, ‘timeseries’, or ‘group’.

  • n_splits – Number of folds.

Returns:

List of (train_idx, test_idx) tuples.

adamops.data.splitters.get_fold_data(X: DataFrame | ndarray, y: Series | ndarray | None, train_idx: ndarray, test_idx: ndarray) → Tuple

Get train/test data for a fold.

adamops.data.splitters.split_group_kfold(X: DataFrame | ndarray, y: Series | ndarray, groups: Series | ndarray, n_splits: int = 5) → Iterator[Tuple]

Group K-Fold split. Ensures groups are not split across train/test.

Yields:

(train_idx, test_idx) tuples.

adamops.data.splitters.split_kfold(X: DataFrame | ndarray, y: Series | ndarray | None = None, n_splits: int = 5, shuffle: bool = True, random_state: int = 42) → Iterator[Tuple]

K-Fold cross-validation split.

Yields:

(train_idx, test_idx) tuples.
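The (train_idx, test_idx) tuples partition the row indices so that every row lands in exactly one test fold. A dependency-free sketch of that partitioning follows. The helper `kfold_indices` is hypothetical, and its shuffling uses Python's random module, so the index order will differ from what adamops or scikit-learn would produce for the same seed.

```python
import random


def kfold_indices(n_samples, n_splits=5, shuffle=True, random_state=42):
    """Yield (train_idx, test_idx) lists for each of n_splits folds."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(random_state).shuffle(indices)
    # Spread the remainder so the first folds get one extra sample each.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(10, n_splits=3))
```

With 10 samples and 3 splits the test folds have sizes 4, 3, and 3, and each training set is simply the complement of its test fold.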

adamops.data.splitters.split_stratified_kfold(X: DataFrame | ndarray, y: Series | ndarray, n_splits: int = 5, shuffle: bool = True, random_state: int = 42) → Iterator[Tuple]

Stratified K-Fold cross-validation split.

Preserves class distribution in each fold.

Yields:

(train_idx, test_idx) tuples.

adamops.data.splitters.split_timeseries(X: DataFrame | ndarray, y: Series | ndarray | None = None, n_splits: int = 5, test_size: int | None = None, gap: int = 0) → Iterator[Tuple]

Time series split for temporal data.

Parameters:
  • X – Features.

  • y – Target.

  • n_splits – Number of splits.

  • test_size – Test set size per split.

  • gap – Gap between train and test.

Yields:

(train_idx, test_idx) tuples.
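The parameters n_splits, test_size, and gap match those of scikit-learn's TimeSeriesSplit, so the sketch below mirrors that class's expanding-window arithmetic. Treating split_timeseries as equivalent is an assumption, and `timeseries_indices` is a hypothetical helper written for illustration.

```python
def timeseries_indices(n_samples, n_splits=5, test_size=None, gap=0):
    """Yield expanding-window (train_idx, test_idx) lists in time order."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    if test_size < 1:
        raise ValueError("not enough samples for the requested splits")
    # Test windows are laid out back to back at the end of the series.
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Training data ends `gap` rows before the test window begins,
        # so the model never sees rows adjacent to the test period.
        train_end = test_start - gap
        train_idx = list(range(train_end))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(timeseries_indices(10, n_splits=3, test_size=2, gap=1))
```

Unlike K-Fold, no shuffling takes place and every training index precedes every test index in the same split.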

adamops.data.splitters.split_train_test(X: DataFrame | ndarray, y: Series | ndarray | None = None, test_size: float = 0.2, random_state: int = 42, stratify: bool = False, shuffle: bool = True) → Tuple

Split data into train and test sets.

Parameters:
  • X – Features.

  • y – Target (optional).

  • test_size – Test set proportion.

  • random_state – Random seed.

  • stratify – Stratify by target.

  • shuffle – Shuffle before splitting.

Returns:

(X_train, X_test) or (X_train, X_test, y_train, y_test)

adamops.data.splitters.split_train_val_test(X: DataFrame | ndarray, y: Series | ndarray | None = None, train_size: float = 0.7, val_size: float = 0.15, test_size: float = 0.15, random_state: int = 42, stratify: bool = False) → Tuple

Split data into train, validation, and test sets.

Returns:

(X_train, X_val, X_test) or (X_train, X_val, X_test, y_train, y_val, y_test)
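A three-way split reduces to shuffling the row indices once with a fixed seed and cutting the shuffled list at two points. The sketch below shows this without stratification. The helper `train_val_test_indices` is hypothetical, and the checks that the three proportions sum to 1 and are rounded to whole rows are assumptions about reasonable behavior, not documented adamops behavior.

```python
import random


def train_val_test_indices(n_samples, train_size=0.7, val_size=0.15,
                           test_size=0.15, random_state=42):
    """Return three disjoint index lists covering range(n_samples)."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, val_size and test_size must sum to 1")
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)  # reproducible shuffle
    n_train = round(train_size * n_samples)
    n_val = round(val_size * n_samples)
    # The test set takes whatever remains, so no row is lost to rounding.
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = train_val_test_indices(100)
```

Calling the helper again with the same `random_state` returns the same three lists, which is what makes experiments repeatable.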

adamops.data.validators module

AdamOps Data Validators Module

Provides data validation: type validation, missing value checks, duplicate detection, shape validation, and statistical checks.

class adamops.data.validators.ColumnStats(name: str, dtype: str, count: int, missing_count: int, missing_pct: float, unique_count: int, unique_pct: float, mean: float | None = None, std: float | None = None, min: float | None = None, max: float | None = None)

Bases: object

Statistics for a column.

count: int
dtype: str
max: float | None = None
mean: float | None = None
min: float | None = None
missing_count: int
missing_pct: float
name: str
std: float | None = None
unique_count: int
unique_pct: float

class adamops.data.validators.DataValidator(missing_threshold: float = 0.5, unique_threshold: float = 0.95)

Bases: object

Data validator for DataFrames.

validate(df: DataFrame, schema: Dict | None = None, required_columns: List[str] | None = None) → ValidationReport

Validate a DataFrame.


class adamops.data.validators.ValidationIssue(severity: str, category: str, column: str | None, message: str, details: Dict | None = None)

Bases: object

Represents a validation issue.

category: str
column: str | None
details: Dict | None = None
message: str
severity: str

class adamops.data.validators.ValidationReport(timestamp: str, shape: Tuple[int, int], memory_usage: float, issues: List[ValidationIssue] = <factory>, column_stats: Dict[str, ColumnStats] = <factory>, duplicate_rows: int = 0, passed: bool = True)

Bases: object

Complete validation report.

column_stats: Dict[str, ColumnStats]
duplicate_rows: int = 0
issues: List[ValidationIssue]
memory_usage: float
passed: bool = True
shape: Tuple[int, int]

summary() → str

Generate text summary.

timestamp: str

adamops.data.validators.check_duplicates(df: DataFrame, subset: List[str] | None = None) → DataFrame

Get duplicate rows.

adamops.data.validators.check_missing(df: DataFrame) → Dict[str, Dict]

Check missing values.
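The reference does not document the keys of the returned dictionaries. As a rough sketch of a per-column missing-value check, the example below counts None entries and flags columns whose missing fraction exceeds a threshold, in the spirit of DataValidator's missing_threshold. The helper `missing_report`, the keys "count", "pct", and "flagged", and the use of a fraction rather than a percentage are all assumptions.

```python
def missing_report(columns, missing_threshold=0.5):
    """Map each column name to its missing count, fraction, and a flag."""
    report = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values)
        pct = missing / len(values) if values else 0.0
        report[name] = {
            "count": missing,
            "pct": pct,
            # Flag the column when more than the allowed fraction is missing.
            "flagged": pct > missing_threshold,
        }
    return report


table = {
    "age": [31, None, None, None],
    "city": ["Oslo", "Rome", "Lima", None],
}
report = missing_report(table, missing_threshold=0.5)
```

Here `age` is flagged because three of four values are missing, while `city` stays below the threshold.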

adamops.data.validators.check_types(df: DataFrame) → Dict[str, str]

Get column types.

adamops.data.validators.describe_data(df: DataFrame) → DataFrame

Generate data description.

adamops.data.validators.validate(df: DataFrame, **kwargs) → ValidationReport

Validate a DataFrame.

Module contents

AdamOps Data Module

Provides comprehensive data handling capabilities:

  • loaders: Load data from various sources (CSV, Excel, JSON, SQL, API, compressed files)

  • validators: Validate data types, missing values, duplicates, shapes, and statistics

  • preprocessors: Clean data (handle missing values, outliers, duplicates, type conversion)

  • feature_engineering: Encode, scale, and generate features

  • splitters: Split data for training and evaluation