slik_wrangler package

Submodules

slik_wrangler.dqa

Module for assessing data quality.

slik_wrangler.dqa.consistent_structure_assessement(dataframe, display_findings=True)

Checks that each feature column has a consistent structure.

It checks whether the dtype within each feature column is consistent, i.e. whether a column mixes, for example, integer and string values.

Parameters

dataframe (pandas Dataframe) – Data set to perform assessment on.
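
The kind of check described above can be sketched with plain pandas. This is an illustration only, not slik_wrangler's implementation: the helper name mixed_type_columns and the sample data are hypothetical, and the sketch simply reports columns whose non-missing values have more than one Python type.

```python
import pandas as pd


def mixed_type_columns(dataframe):
    """Hypothetical helper: map each inconsistent column to the type names it holds."""
    findings = {}
    for column in dataframe.columns:
        # Collect the Python type of every non-missing value in the column.
        type_names = {type(value).__name__ for value in dataframe[column].dropna()}
        if len(type_names) > 1:
            findings[column] = sorted(type_names)
    return findings


df = pd.DataFrame({"age": [21, "thirty", 45], "city": ["Lagos", "Abuja", "Ibadan"]})
report = mixed_type_columns(df)
print(report)
```

Here only the age column is reported, because it mixes integers with a string.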

slik_wrangler.dqa.data_cleanness_assessment(dataframe, display_findings=True)

Checks the overall cleanness of the dataframe (missing values in the dataset, duplicates, and any inconsistent feature columns).

Parameters
  • dataframe (pandas Dataframe) – Data set to perform assessment on.

  • display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.

Gives a report.

slik_wrangler.dqa.duplicate_assessment(dataframe, display_findings=True)

Assesses the duplicate values in the given dataset and generates a report of its findings. It does this assessment for both rows and feature columns.

Parameters
  • dataframe (pandas Dataframe) – Data set to perform assessment on.

  • display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.
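
A rough sketch of a row and column duplicate check using pandas alone. The helper name duplicate_summary and the returned dictionary layout are assumptions made for illustration; the library's report format may differ.

```python
import pandas as pd


def duplicate_summary(dataframe):
    """Hypothetical helper: count duplicate rows and list duplicate columns."""
    duplicate_rows = int(dataframe.duplicated().sum())
    # Transposing turns columns into rows, so identical columns show up as duplicated rows.
    column_mask = dataframe.T.duplicated().to_numpy()
    duplicate_columns = list(dataframe.columns[column_mask])
    return {"duplicate_rows": duplicate_rows, "duplicate_columns": duplicate_columns}


df = pd.DataFrame({"a": [1, 2, 2], "b": ["x", "y", "y"], "a_copy": [1, 2, 2]})
summary = duplicate_summary(df)
print(summary)
```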

slik_wrangler.dqa.missing_value_assessment(dataframe, display_findings=True)

Assesses the missing values in the given dataset and generates a report of its findings.

Parameters
  • dataframe (pandas Dataframe) – Data set to perform assessment on.

  • display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.
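
The "missing values count and percentage" table mentioned above could be built roughly as follows. This is a sketch under assumptions: the helper name missing_value_table and the column labels missing_count and missing_percent are hypothetical.

```python
import pandas as pd


def missing_value_table(dataframe):
    """Hypothetical helper: per-column missing count and percentage."""
    counts = dataframe.isna().sum()
    percent = counts / len(dataframe) * 100
    table = pd.DataFrame({"missing_count": counts, "missing_percent": percent})
    # Keep only the columns that actually contain missing values.
    return table[table["missing_count"] > 0]


df = pd.DataFrame({"a": [1, None, 3, None], "b": ["x", "y", None, "z"], "c": [1, 2, 3, 4]})
table = missing_value_table(df)
print(table)
```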

slik_wrangler.loadfile

High-level support for loading files.

slik_wrangler.loadfile.read_file(file_path, input_col=None, **kwargs)

Load a file path into a dataframe.

This function takes a file path (CSV, Excel or Parquet) and reads the data based on the input columns specified. It can only load one file at a time.

Parameters
  • file_path (str/file path) – path to where data is stored.

  • input_col (list) – select columns to be loaded as a pandas dataframe

  • **kwargs – keyword arguments passed to the pandas read method

Returns

Return type

pandas Dataframe
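
One plausible way to dispatch on file type is to look at the extension and call the matching pandas reader. The sketch below is an assumption about the approach, not the library's code; read_tabular is a hypothetical name, and the Excel and Parquet branches need the optional pandas engines for those formats.

```python
import os
import tempfile

import pandas as pd


def read_tabular(file_path, input_col=None, **kwargs):
    """Hypothetical helper: pick a pandas reader from the file extension."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension == ".csv":
        return pd.read_csv(file_path, usecols=input_col, **kwargs)
    if extension in (".xls", ".xlsx"):
        return pd.read_excel(file_path, usecols=input_col, **kwargs)
    if extension == ".parquet":
        return pd.read_parquet(file_path, columns=input_col, **kwargs)
    raise ValueError(f"Unsupported file type: {extension}")


# Write a small CSV to a temporary directory and load two of its columns.
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.7]}).to_csv(csv_path, index=False)
loaded = read_tabular(csv_path, input_col=["id", "score"])
print(loaded)
```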

slik_wrangler.loadfile.split_csv_file(file_path=None, delimiter=',', row_limit=1000000, output_path='.', keep_headers=True)

Split large csv files to small csv files.

Function splits large csv files into smaller files based on the row_limit specified. The files are stored in present working dir by default.

Parameters
  • file_path (str/file path) – path to where data is stored.

  • delimiter (str. Default is ',') – separator in each row and column,

  • row_limit (int) – split each file by row count

  • output_path (str) – output path to store the split files

  • keep_headers (Bool. Default is True) – make use of headers for all csv files

Returns

Return type

Split files are stored in output_path
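
The row-limit splitting described above can be sketched with the standard csv module. The names split_csv and write_part and the part_N.csv file naming are hypothetical; the sketch also assumes that keep_headers means the first row is a header that is repeated in every output file.

```python
import csv
import os
import tempfile


def write_part(rows, header, delimiter, output_path, part_number):
    """Write one chunk of rows to part_<n>.csv (hypothetical naming scheme)."""
    part_path = os.path.join(output_path, f"part_{part_number}.csv")
    with open(part_path, "w", newline="") as target:
        writer = csv.writer(target, delimiter=delimiter)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return part_path


def split_csv(file_path, delimiter=",", row_limit=1000000, output_path=".", keep_headers=True):
    """Split a CSV into files of at most row_limit data rows each."""
    written = []
    with open(file_path, newline="") as source:
        reader = csv.reader(source, delimiter=delimiter)
        header = next(reader) if keep_headers else None
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == row_limit:
                written.append(write_part(chunk, header, delimiter, output_path, len(written) + 1))
                chunk = []
        if chunk:  # Flush the final, possibly shorter, chunk.
            written.append(write_part(chunk, header, delimiter, output_path, len(written) + 1))
    return written


# Demo: a header plus five data rows, split into files of two rows each.
work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, "big.csv")
with open(source_path, "w", newline="") as handle:
    demo_writer = csv.writer(handle)
    demo_writer.writerow(["id", "value"])
    demo_writer.writerows([[i, i * 10] for i in range(5)])

parts = split_csv(source_path, row_limit=2, output_path=work_dir)
with open(parts[-1], newline="") as handle:
    last_part_rows = list(csv.reader(handle))
```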

slik_wrangler.messages

Creates functionality that colors log messages

slik_wrangler.messages.log(*messages, code='normal', sep=' ', end='\n', file=None)

Distinguishes log messages from print statements. Works like a normal print statement, but with colors.

Parameters
  • messages – Message to be logged

  • code – Log significance

Returns

Distinguished log message
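
A colored, print-like logger can be sketched with ANSI escape codes. Only the 'normal' code appears in the signature above; the other code names in this sketch ('warning', 'danger', 'success') and the helper name color_log are assumptions made for illustration.

```python
import io

# Hypothetical mapping of log significance to ANSI color codes.
ANSI_CODES = {"normal": "", "warning": "\033[93m", "danger": "\033[91m", "success": "\033[92m"}
RESET = "\033[0m"


def color_log(*messages, code="normal", sep=" ", end="\n", file=None):
    """Print messages like print(), wrapped in the color for the given code."""
    prefix = ANSI_CODES[code]
    suffix = RESET if prefix else ""
    text = sep.join(str(message) for message in messages)
    print(f"{prefix}{text}{suffix}", end=end, file=file)


# Capture the output in a buffer so the escape codes can be inspected.
buffer = io.StringIO()
color_log("disk", "almost full", code="warning", file=buffer)
color_log("plain message", file=buffer)
```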

slik_wrangler.pipeline

Build Data and Model pipelines efficiently.

class slik_wrangler.pipeline.DenseTransformer(*args: Any, **kwargs: Any)

Bases: sklearn.base.

Transform sparse matrix to a dense matrix.

fit(X, y=None, **fit_params)

Fit a sparse matrix.

Fit a sparse matrix with the DenseTransformer class

Parameters
  • X (numpy array) – Sparse matrix to be fitted

  • y (numpy array) – Target array

transform(X, y=None, **fit_params)

Transform a fitted sparse matrix to a dense matrix.

DenseTransformer transforms a sparse matrix to a dense matrix. Some transformer classes do not work with sparse matrices, hence the transformation.

Parameters
  • X (numpy array) – Sparse matrix to be fitted

  • y (numpy array) – Target array

Returns

Output – Dense matrix

Return type

numpy array

slik_wrangler.pipeline.build_data_pipeline(data=None, target_column=None, id_column=None, clean_data=True, project_path=None, numerical_transformer=None, categorical_transformer=None, select_columns=None, pca=True, algorithm=None, grid_search=False, display_inline=False, hashing=False, params=None, hash_size=500, balance_data=False, **kwargs)

Build data and model pipeline.

Build production ready pipelines efficiently. Specify numerical and categorical transformer. Function also helps to clean your data, reduce dimensionality and handle sparse categorical features.

Parameters
  • data (str/ pandas dataframe) – Data path or Pandas dataframe.

  • target_column (str) – target column name

  • id_column (str) – id column name

  • clean_data (Bool, default is True) – handle missing value, outlier treatment, feature engineering

  • project_path (str/file path) – file path to processed data

  • numerical_transformer (sklearn pipeline) – numerical transformer to transform numerical attributes

  • categorical_transformer (sklearn pipeline) – categorical transformer to transform categorical attributes

  • select_columns (list) – columns to be passed/loaded as a dataframe

  • pca (Bool, default is True) – reduce feature dimensionality

  • algorithm (Default is None) – sklearn estimator

  • grid_search (Bool. default is False) – select best parameter after hyperparameter tuning

  • hashing (Bool. default is False) – handle sparse categorical features

  • params (dict.) – dictionary of keyword arguments.

  • display_inline (Bool, default is False) – display dataframe print statement

  • hash_size (int, default is 500) – size for hashing

Returns

sklearn pipeline estimator

Return type

Output

slik_wrangler.pipeline.evaluate_model(model_path=None, eval_data=None, select_columns=None, project_path=None, **kwargs)

Check model strength by validating model with an evaluation data.

Evaluate model based on slik build data pipeline function. Invoke model on transformed data and return evaluation plots in a file path.

Parameters
  • model_path (str/file path) – file path to model object

  • eval_data (str/ pandas dataframe) – Data path or Pandas dataframe.

  • select_columns (list) – columns to be passed/loaded as a dataframe

  • project_path (str/file path) – path to project

slik_wrangler.pipeline.get_feature_names(column_transformer)

Get feature names after using column transformer object.

Get feature names after transformations from each transformer object in the column transformer class.

Parameters

column_transformer (sklearn column transformer) –

Returns

feature_names – Names of the features produced by transform.

Return type

list of strings

slik_wrangler.pipeline.pipeline_transform_predict(data=None, select_columns=None, project_path=None, model_path=None)

Transform pipeline object and return Predictions.

Transform dataframe based on slik build data pipeline function. Invoke model on transformed data and return predictions

Parameters
  • data (str/ pandas dataframe) – Data path or Pandas dataframe.

  • select_columns (list) – columns to be passed/loaded as a dataframe

  • project_path (str/file path) – path to project

  • model_path (str/file path) – file path to model object

Returns

results – list of numpy array predictions

Return type

numpy array

slik_wrangler.plot_funcs

slik_wrangler.plot_funcs.confusion_matrix(cm, fp, norm_axis=1)

[TN, FP]
[FN, TP]

The confusion matrix after validating the model on a test set.

Parameters
  • cm (confusion matrix) –

  • fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the confusion matrix
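
The [TN, FP] / [FN, TP] layout above can be reproduced for binary labels in a few lines. The helpers below are hypothetical, and reading norm_axis=1 as "normalize each row by its total" is an assumption, not something the source states.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels."""
    counts = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row index is the actual class, column index is the predicted class.
        counts[actual][predicted] += 1
    return counts


def normalize_rows(matrix):
    """Divide each row by its sum (assumed meaning of norm_axis=1)."""
    return [[value / sum(row) for value in row] for row in matrix]


y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]
cm = binary_confusion_matrix(y_true, y_pred)
cm_normalized = normalize_rows(cm)
print(cm)
```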

slik_wrangler.plot_funcs.corr_matrix(corr, fp)

slik_wrangler.plot_funcs.feature_importance(features, feature_importances, title, fp)

The feature importance indicating features that contribute the most to the predictive power of the model.

Parameters
  • features (features of the data set) –

  • feature_importances – the importances of the features that contribute the most to the predictive power of the model

  • title (title of the feature importance chart) –

  • fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the feature importances

slik_wrangler.plot_funcs.label_share(share, fp)

The distribution of label in a data set.

Parameters
  • share (label distribution) –

  • fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the label distribution

slik_wrangler.plot_funcs.metric(metrics, fp)

slik_wrangler.plot_funcs.plot_nan(data)

Plot the top values from a value count in a dataframe.

Parameters
  • data (DataFrame or name Series.) – Data set to perform plot operation on.

Returns

A bar plot of the top n values.

slik_wrangler.plot_funcs.pr_curve(pre, rec, auc, fp)

The precision-recall curve for model validation.

Parameters
  • pre (precision) –

  • rec (recall) –

  • auc (area under the curve) –

  • fp (the file path to save the figure.) –

Returns

Return type

The precision-recall curve

slik_wrangler.plot_funcs.roc_curve(fpr, tpr, auc, fp)

The roc_curve for model validation.

Parameters
  • fpr (false positive rate) –

  • tpr (true positive rate) –

  • auc (area under the curve) –

  • fp (the file path to save the figure.) –

Returns

Return type

The ROC AUC curve

slik_wrangler.plot_funcs.scores(scores, fp)

The average classification score for model validation.

Parameters
  • scores (test data scores) –

  • fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the average classification scores

slik_wrangler.preprocessing

slik_wrangler.preprocessing.bin_age(dataframe=None, age_col=None, add_prefix=True)

The age attribute in a DataFrame is binned into 5 categories: (baby/toddler, child, young adult, mid age and elderly).

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • age_col (str.) – The column to perform the operation on.

  • add_prefix (Bool. Default is set to True) – add prefix to the column name.

Returns

Return type

Dataframe with binned age attribute
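
Binning into the five categories named above can be sketched with pandas.cut. The bin edges and the binned_ column prefix below are hypothetical choices; the source does not state the library's actual age boundaries or naming.

```python
import pandas as pd

# Hypothetical age boundaries; the library's real edges are not documented here.
AGE_EDGES = [0, 4, 12, 35, 60, 150]
AGE_LABELS = ["baby/toddler", "child", "young adult", "mid age", "elderly"]


def bin_age_sketch(dataframe, age_col, add_prefix=True):
    """Return a copy of the dataframe with the age column binned into categories."""
    result = dataframe.copy()
    new_col = f"binned_{age_col}" if add_prefix else age_col
    result[new_col] = pd.cut(
        result[age_col], bins=AGE_EDGES, labels=AGE_LABELS, include_lowest=True
    )
    return result


df = pd.DataFrame({"age": [2, 10, 25, 50, 80]})
binned = bin_age_sketch(df, "age")
print(binned)
```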

slik_wrangler.preprocessing.change_case(dataframe, columns=None, case='lower', inplace=False)

Change the case of a pandas series to either upper or lower case

Parameters
  • dataframe (Dataframe or named Series) –

  • columns (str, list) – The column or list of columns to perform the operation on

  • case (str. Default is set to lower) – Indicates the type of operation to perform

  • inplace (bool. Default is set to False) – Indicates if changes should be made within the dataframe or not.

Returns

Return type

Pandas Dataframe

slik_wrangler.preprocessing.check_datefield(dataframe=None, column=None)

Check if a column is a datefield and return a Boolean.

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • column (str) – The column name to perform the operation on.

Returns

Returns True if the data point is a datefield.

Return type

Boolean
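
One way such a check might work is to try parsing the column with pandas.to_datetime and report whether it succeeds. This is a sketch under that assumption; looks_like_datefield is a hypothetical name, and treating numeric columns as non-dates is a design choice of the sketch.

```python
import pandas as pd


def looks_like_datefield(dataframe, column):
    """Hypothetical helper: True if the column is, or parses cleanly as, datetime."""
    series = dataframe[column]
    if pd.api.types.is_datetime64_any_dtype(series):
        return True
    if pd.api.types.is_numeric_dtype(series):
        # Plain numbers would parse as epoch offsets, so they are rejected here.
        return False
    try:
        pd.to_datetime(series)
        return True
    except (ValueError, TypeError):
        return False


df = pd.DataFrame({
    "signup": ["2021-01-05", "2021-02-10"],
    "city": ["Lagos", "Abuja"],
    "visits": [3, 7],
})
signup_is_date = looks_like_datefield(df, "signup")
city_is_date = looks_like_datefield(df, "city")
visits_is_date = looks_like_datefield(df, "visits")
```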

slik_wrangler.preprocessing.check_nan(dataframe=None, plot=False, display_inline=True)

Display missing values as a pandas dataframe and give a proportion in terms of percentages.

Parameters
  • dataframe (pandas DataFrame or named Series) –

  • plot (bool, Default False) – Plots missing values in dataset as a heatmap

  • display_inline (bool, Default True) – shows missing values in the dataset as a dataframe

Returns

Bar plot of missing values

Return type

Matplotlib Figure

slik_wrangler.preprocessing.create_schema_file(dataframe, target_column, id_column, project_path='.', save=True, display_inline=True)

A data schema of column names and types is automatically inferred and saved in a YAML file.

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • target_column (str) – The name of the target column in the dataset.

  • id_column (str) – Unique Identifier column.

  • project_path (str.) – The path of the schema file you want to create.

  • save (Bool. Default is set to True) – save schema file to file path.

  • display_inline (Bool. Default is set to True) – display dataframe print statements.

Returns

A schema file is created in the data directory

Return type

file path

slik_wrangler.preprocessing.detect_fix_outliers(dataframe=None, target_column=None, n=1, num_features=None, fix_method='mean', display_inline=True)

Detect outliers present in the numerical features and fix the outliers present.

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • num_features (List, Series, Array.) – Numerical features to perform operation on. If not provided, we automatically infer from the dataset.

  • target_column (string) – The target attribute name.

  • fix_method ('mean' or 'log_transformation'. Default is 'mean') – Method of fixing outliers present in the data.

  • n (integer) – A value to determine whether there are multiple outliers in a record, which is highly dependent on the number of features that are being checked.

  • display_inline (Bool. Default is True.) – Display the outliers present in the data in form of a dataframe.

Returns

dataframe after removing outliers.

Return type

Dataframe
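
The source does not say which detection rule is used. As an illustration, the sketch below uses the common interquartile-range (Tukey) rule and flags rows that are outliers in at least n features; the rule, the helper name flag_outlier_rows, and the "at least n" reading of the n parameter are all assumptions.

```python
import pandas as pd


def flag_outlier_rows(dataframe, num_features, n=1):
    """Return index labels of rows that are IQR outliers in at least n features."""
    outlier_counts = pd.Series(0, index=dataframe.index)
    for feature in num_features:
        q1 = dataframe[feature].quantile(0.25)
        q3 = dataframe[feature].quantile(0.75)
        step = 1.5 * (q3 - q1)
        is_outlier = (dataframe[feature] < q1 - step) | (dataframe[feature] > q3 + step)
        outlier_counts += is_outlier.astype(int)
    return list(outlier_counts[outlier_counts >= n].index)


df = pd.DataFrame({
    "income": [10, 12, 11, 13, 12, 500],
    "age": [30, 31, 29, 32, 30, 31],
})
outlier_rows = flag_outlier_rows(df, ["income", "age"], n=1)
print(outlier_rows)
```

In this sample only the last row is flagged, because its income of 500 lies far outside the interquartile fences of the column.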

slik_wrangler.preprocessing.drop_duplicate(dataframe=None, columns=None, method='rows', display_inline=True)

Drop duplicate values across rows, columns in the dataframe.

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • columns (List/String.) – list of column names

  • method ('rows' or 'columns', default is 'rows') – Drop duplicate values across rows, columns.

  • display_inline (Bool. Default is True.) – Display print statements.

Returns

dataframe after dropping duplicates.

Return type

Dataframe

slik_wrangler.preprocessing.drop_uninformative_fields(dataframe=None, exclude=None, display_inline=True)

Drop fields that have only a single unique value or are all NaN, meaning that they are entirely uninformative.

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • exclude (string/list.) – A column or list of columns you want to exclude from being dropped.

  • display_inline (Bool. Default is True.) – Display print statements.

Returns

dataframe after dropping uninformative fields.

Return type

Dataframe
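
A sketch of the idea using pandas alone: a column is treated as uninformative when it has at most one distinct non-missing value. The helper name drop_constant_columns is hypothetical, and counting a column with one value plus some NaN as uninformative is an assumption of this sketch.

```python
import pandas as pd


def drop_constant_columns(dataframe, exclude=None):
    """Drop columns that are all NaN or hold a single distinct value."""
    excluded = [exclude] if isinstance(exclude, str) else list(exclude or [])
    uninformative = [
        column
        for column in dataframe.columns
        # nunique() ignores NaN, so an all-NaN column has 0 distinct values.
        if column not in excluded and dataframe[column].nunique() <= 1
    ]
    return dataframe.drop(columns=uninformative)


df = pd.DataFrame({
    "id": [1, 2, 3],
    "country": ["NG", "NG", "NG"],
    "empty": [None, None, None],
    "score": [0.1, 0.5, 0.9],
})
cleaned = drop_constant_columns(df)
kept_country = drop_constant_columns(df, exclude="country")
```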

slik_wrangler.preprocessing.featurize_datetime(dataframe=None, column_name=None, date_features=None, drop=True)

Featurize datetime in the dataset to create new fields such as the Year, Month, Day, Day of the week, Day of the year, Week, end of the month, start of the month, end of the quarter, start of a quarter, end of the year, start of the year

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • column_name (String) – The column to perform the operation on.

  • date_features (List.) – A list of new datetime features to include in the dataset. Expected list should contain either of the elements in this list [‘Year’, ‘Month’, ‘Day’, ‘Dayofweek’, ‘Dayofyear’, ‘Week’,’Is_month_end’, ‘Is_month_start’, ‘Is_quarter_end’, ‘Hour’,’Minute’,’Is_quarter_start’, ‘Is_year_end’, ‘Is_year_start’, ‘Date’]

  • drop (Bool. Default is set to True) – drop original datetime column.

Returns

Dataframe with new datetime fields

Return type

Dataframe
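
Most of the fields listed above map directly onto attributes of the pandas .dt accessor. The sketch below covers a subset of them; the helper name add_date_parts and the <column>_<feature> naming of the new columns are hypothetical.

```python
import pandas as pd


def add_date_parts(dataframe, column_name,
                   date_features=("Year", "Month", "Day", "Dayofweek", "Is_month_end"),
                   drop=True):
    """Add one column per requested datetime feature (subset of the documented list)."""
    result = dataframe.copy()
    dates = pd.to_datetime(result[column_name])
    for feature in date_features:
        # "Year" -> dates.dt.year, "Is_month_end" -> dates.dt.is_month_end, and so on.
        result[f"{column_name}_{feature}"] = getattr(dates.dt, feature.lower())
    if drop:
        result = result.drop(columns=[column_name])
    return result


df = pd.DataFrame({"order_date": ["2021-01-31", "2021-02-15"], "amount": [100, 250]})
featurized = add_date_parts(df, "order_date")
print(featurized)
```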

slik_wrangler.preprocessing.get_attributes(data=None, target_column=None)

Returns the categorical features and numerical features of a pandas dataframe as lists.

Parameters
  • data (DataFrame or named Series) – Data set to perform operation on.

  • target_column (str) – Label or Target column

Returns

A list of all the categorical features and numerical features in a dataset.

Return type

List
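
Splitting columns by kind can be sketched with DataFrame.select_dtypes. The helper name split_attributes and the return order (categorical first, then numerical) are assumptions; note that this sketch treats every non-numeric column, including datetimes, as categorical.

```python
import pandas as pd


def split_attributes(data, target_column=None):
    """Return (categorical column names, numerical column names), excluding the target."""
    features = data.drop(columns=[target_column]) if target_column else data
    numerical = list(features.select_dtypes(include="number").columns)
    categorical = list(features.select_dtypes(exclude="number").columns)
    return categorical, numerical


df = pd.DataFrame({
    "age": [25, 40],
    "city": ["Lagos", "Abuja"],
    "income": [1000.0, 2500.0],
    "label": [0, 1],
})
categorical, numerical = split_attributes(df, target_column="label")
```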

slik_wrangler.preprocessing.handle_nan(dataframe=None, target_name=None, strategy='mean', fillna='mode', drop_outliers=True, thresh_y=75, thresh_x=75, display_inline=True, **kwargs)

Handle missing values present in a pandas dataframe.

Takes care of missing values in both categorical and numerical features by dropping or filling them. Using the threshold parameters you can also drop rows and columns with missing values. Outliers are treated before handling missing values by default.

Parameters
  • dataframe (DataFrame or name Series.) – Data set to perform operation on.

  • target_name (str) – Name of the target column

  • strategy (str. Default is 'mean') – Method of filling numerical features

  • fillna (str. Default is 'mode') – Method of filling categorical features

  • drop_outliers (bool, Default True) – Drops outliers present in the data.

  • thresh_x (Int, Default is 75.) – Threshold for dropping rows with missing values.

  • thresh_y (Int, Default is 75.) – Threshold for dropping columns with missing values.

  • display_inline (Bool. default is True.) – display pandas dataframe print statements

Returns

Dataframe without missing values

Return type

Pandas Dataframe
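
The fill-or-drop behavior described above can be approximated as follows. This sketch assumes that thresh_y is a percentage of missing values at or above which a column is dropped, that numerical columns are filled with the mean (or median), and that categorical columns are filled with the mode; the helper name fill_missing is hypothetical and row thresholds and outlier handling are left out.

```python
import pandas as pd


def fill_missing(dataframe, strategy="mean", thresh_y=75):
    """Drop mostly-empty columns, then fill the remaining missing values."""
    result = dataframe.copy()
    missing_percent = result.isna().mean() * 100
    # Keep only columns whose missing percentage is below the threshold.
    result = result.loc[:, missing_percent < thresh_y]
    for column in result.columns:
        if pd.api.types.is_numeric_dtype(result[column]):
            fill_value = result[column].mean() if strategy == "mean" else result[column].median()
        else:
            fill_value = result[column].mode().iloc[0]
        result[column] = result[column].fillna(fill_value)
    return result


df = pd.DataFrame({
    "price": [10.0, None, 30.0, 20.0],
    "city": ["Lagos", None, "Lagos", "Abuja"],
    "notes": [None, None, None, "x"],
})
filled = fill_missing(df)
print(filled)
```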

slik_wrangler.preprocessing.identify_columns(dataframe=None, target_column=None, id_column=None, high_dim=100, display_inline=True, project_path=None)

Identifies numerical attributes, categorical attributes with sparse features, and categorical attributes with lower features present in the data, and saves the output in a YAML file.

Parameters
  • dataframe (DataFrame or named Series) –

  • target_column (str) – Label or Target column.

  • id_column (str) – unique identifier column.

  • high_dim (int, default 100) – Integer to identify categorical attributes greater than 100 observations

  • display_inline (Bool, default=True) – display print statement

  • project_path (str) – path to where the yaml file is saved.

slik_wrangler.preprocessing.manage_columns(dataframe=None, columns=None, select_columns=False, drop_columns=False, drop_duplicates=None)

Manage operations on pandas dataframe based on columns. Operations include selecting of columns, dropping column and dropping duplicates.

Parameters
  • dataframe (DataFrame or named Series) –

  • columns (used to specify columns to be selected, dropped or used in dropping duplicates.) –

  • select_columns (Boolean True or False, default is False) – The columns you want to select from your dataframe. Requires a list to be passed into the columns param

  • drop_columns (Boolean True or False, default is False) – The columns you want to drop from your dataset. Requires a list to be passed into the columns param

  • drop_duplicates ('rows' or 'columns', default is None) – Drop duplicate values across rows, columns. If columns, a list is required to be passed into the columns param

Returns

A new dataframe after dropping/selecting/removing duplicate columns or the original dataframe if params are left as default

Return type

Pandas Dataframe

slik_wrangler.preprocessing.map_column(dataframe=None, column_name=None, items=None, add_prefix=True)

Map values in a pandas dataframe column with a dict.

Parameters
  • dataframe (DataFrame or named Series) –

  • column_name (str.) – Name of pandas dataframe column to be mapped

  • items (Dict, default is None) – A dict with key and value to be mapped

  • add_prefix (Bool, default is True) – Include a prefix of the target column in the dataset

Returns

A new dataframe with mapped features.

Return type

Pandas Dataframe

slik_wrangler.preprocessing.map_target(dataframe=None, target_column=None, add_prefix=True, drop=False, display_inline=True)

Map target column in a pandas dataframe column with a dict. This can be applied to both binary and multi-class target

Parameters
  • dataframe (DataFrame or named Series) –

  • target_column (str) – Name of the target column

  • add_prefix (Bool. Default is True) – Include a prefix of the target column in the dataset

  • drop (Bool. Default is False) – drop original target column name

  • display_inline (Bool. Default is True) –

Returns

A new dataframe with mapped target column

Return type

Pandas Dataframe

slik_wrangler.preprocessing.preprocess(data=None, target_column=None, train=False, select_columns=None, display_inline=True, project_path=None, **kwargs)

Automatically preprocess a dataframe or file path. Handles missing values, outlier treatment and feature engineering.

Parameters
  • data (DataFrame or named Series) – Dataframe or data path to the data

  • target_column (String) – Name of pandas dataframe target column

  • train (Bool, default is False) –

  • select_columns (List) – List of columns to be used

  • project_path (Str) – Path to where the preprocessed data will be stored

  • display_inline (Bool. Default is True) –

Returns

Returns a clean dataframe in the filepath

Return type

Pandas Dataframe

slik_wrangler.preprocessing.rename_similar_values(dataframe, column_name, cut_off=0.75, n=None)

Use Sequence Matcher to check for best “good enough” matches.

Rename values based on similar matches.

Parameters
  • dataframe (Pandas Series) –

  • column_name (str.) – Name of pandas column to perform operation on

  • cut_off (float) – Possibilities that don’t score at least that similar to the word are ignored

  • n(optional) (int. default 2.) – The maximum number of close matches to return. n must be > 0.

Returns

Return type

Pandas Dataframe.
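
The "good enough" matching can be sketched with difflib.get_close_matches from the standard library, which is built on SequenceMatcher. This is an illustration only: the helper name rename_to_canonical and the explicit canonical list are hypothetical, and the library function works on a dataframe column instead of a plain list.

```python
from difflib import get_close_matches


def rename_to_canonical(values, canonical, cut_off=0.75):
    """Replace each value with its closest canonical spelling, if one is similar enough."""
    renamed = []
    for value in values:
        matches = get_close_matches(value, canonical, n=1, cutoff=cut_off)
        # Keep the original value when nothing scores at least cut_off.
        renamed.append(matches[0] if matches else value)
    return renamed


cities = ["Lagos", "Lag", "Abuja", "Abuja FCT", "Ibadan"]
cleaned = rename_to_canonical(cities, canonical=["Lagos", "Abuja", "Ibadan"], cut_off=0.7)
print(cleaned)
```

The sketch lowers the cutoff to 0.7 because "Abuja FCT" scores roughly 0.71 against "Abuja" under SequenceMatcher's ratio, which is below 0.75.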

Example

>>> import pandas as pd
>>> df = pd.DataFrame(["Lagos", "Lag", "Abuja", "Abuja FCT", "Ibadan"], columns=["column_name"])

Applying the function to this column yields

["Lagos", "Lagos", "Abuja", "Abuja", "Ibadan"]

slik_wrangler.preprocessing.trim_all_columns(dataframe)

Trim whitespace from ends of each value across all series in dataframe

Parameters

dataframe (Pandas dataframe) –

Returns

Return type

Pandas Dataframe
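
Trimming every string value while leaving other types untouched can be sketched like this. The helper name trim_strings is hypothetical and the code is an illustration, not the library's implementation.

```python
import pandas as pd


def trim_strings(dataframe):
    """Strip leading and trailing whitespace from every string value."""
    def strip_if_string(value):
        return value.strip() if isinstance(value, str) else value

    # Apply the helper element by element, one column at a time.
    return dataframe.apply(lambda column: column.map(strip_if_string))


df = pd.DataFrame({"name": ["  Ada ", "Bob  "], "age": [36, 41]})
trimmed = trim_strings(df)
print(trimmed)
```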

slik_wrangler.utils

class slik_wrangler.utils.HiddenPrints

Bases: object

Hide prints of a function

Parameters

None

Returns

None
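
A class that hides prints is often written as a context manager that temporarily points sys.stdout at the null device. Whether HiddenPrints is used this way is an assumption; the class below has a different, hypothetical name to make clear it is only a sketch.

```python
import os
import sys


class QuietPrints:
    """Context manager that silences print() inside its block."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        self._devnull = open(os.devnull, "w")
        sys.stdout = self._devnull
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always restore stdout, even if the block raised an exception.
        sys.stdout = self._original_stdout
        self._devnull.close()
        return False


def noisy_add(a, b):
    print("adding", a, b)
    return a + b


stdout_before = sys.stdout
with QuietPrints():
    total = noisy_add(2, 3)  # The print inside noisy_add is suppressed.
stdout_after = sys.stdout
```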

slik_wrangler.utils.get_scores(y_true, y_pred)

Get metrics of model performance such as accuracy, precision, recall and f1.

Parameters
  • y_true – the target value of test/validation data.

  • y_pred – the predicted value

Returns

Accuracy, precision, recall and f1
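
For binary 0/1 labels the four metrics can be computed by hand, which makes their definitions explicit. This sketch is not the library's code (which may rely on scikit-learn and may support multi-class targets); binary_scores is a hypothetical name.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
    tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
    fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
    fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = binary_scores([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```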

slik_wrangler.utils.load_pickle(fp)

Load pickle file(data, model or pipeline object).

Parameters

fp – the file path of the pickle files.

Returns

Loaded pickle file

slik_wrangler.utils.log_plot(args, plot_func, fp)

Log the plots of your metrics and save output in a specified file path.

Parameters
  • args – A tuple. Arguments required to plot the required metrics

  • plot_func – A function Contains different method for plotting metrics such as ROC-AUC, PR-Curve

  • fp – File Path The path to write the output logs of the plot

Returns

None

slik_wrangler.utils.print_divider(title)

Extends the print function with a clear divider made of - characters.

Parameters

title – the title of the print statement

Returns

None

slik_wrangler.utils.store_attribute(dict_file, output_path)

Store attributes of a dataframe as a dict.

Parameters
  • dict_file – the dictionary.

  • output_path – the path where the file is saved

Returns

None

slik_wrangler.utils.store_pipeline(pipeline_object, pipeline_path)

Store the column transformer pipeline object.

Parameters
  • pipeline_object – the pipeline object.

  • pipeline_path – the path where the pipeline is saved

Returns

None
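
Storing an object and loading it back, as store_pipeline and load_pickle describe, can be sketched with the standard pickle module. The helper names and the dictionary standing in for a pipeline are hypothetical, and the library may use a different serializer.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize any picklable object to the given path."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Load a pickled object; only do this with files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dict stands in for a fitted pipeline object in this sketch.
pipeline_stub = {"steps": ["impute", "scale"], "version": 1}
pipeline_path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
save_object(pipeline_stub, pipeline_path)
restored = load_object(pipeline_path)
```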

Module contents