slik_wrangler package

Submodules

slik_wrangler.dqa

Module for assessing data quality.

slik_wrangler.dqa.consistent_structure_assessement(dataframe, display_findings=True)

Checks the consistent nature of each feature column.

It checks whether the dtype within each feature column is consistent, i.e. whether a column mixes, for example, integer and string values.

Parameters: dataframe (pandas Dataframe) – Data set to perform assessment on.
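
As a rough illustration of the check described above, the sketch below flags columns whose values mix more than one Python type. It works on a plain dict of lists rather than a pandas DataFrame, and `find_mixed_type_columns` is a hypothetical helper written for this example, not part of slik_wrangler.

```python
def find_mixed_type_columns(columns):
    """Return {column name: sorted type names} for columns holding more than one type.

    `columns` maps a column name to its list of values (an assumed stand-in
    for a DataFrame). None is treated as a missing value and ignored.
    """
    mixed = {}
    for name, values in columns.items():
        type_names = {type(v).__name__ for v in values if v is not None}
        if len(type_names) > 1:
            mixed[name] = sorted(type_names)
    return mixed


data = {
    "age": [23, 41, "unknown", 35],            # integers mixed with a string
    "city": ["Lagos", "Abuja", "Ibadan", None],
}
mixed_columns = find_mixed_type_columns(data)
print(mixed_columns)
```

Only the `age` column is reported here, because `city` holds strings plus a missing value.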

slik_wrangler.dqa.data_cleanness_assessment(dataframe, display_findings=True)

Checks the overall cleanness of the dataframe (missing values in the dataset, duplicates, and any inconsistent feature columns) and gives a report.

Parameters

dataframe (pandas Dataframe) – Data set to perform assessment on.
display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.

slik_wrangler.dqa.duplicate_assessment(dataframe, display_findings=True)

Assesses the duplicate values in the given dataset and generates a report of its findings. It does this assessment for both rows and feature columns.

Parameters

dataframe (pandas Dataframe) – Data set to perform assessment on.
display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.
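
One way to picture a duplicate assessment over both rows and feature columns is the sketch below. It uses a dict of equal-length lists in place of a DataFrame, and `duplicate_report` is a hypothetical name chosen for this example; the report format slik_wrangler produces is not shown in this reference.

```python
def duplicate_report(columns):
    """Return (indices of repeated rows, names of repeated columns).

    The first occurrence of a row or column is kept as the original;
    only later repeats are reported.
    """
    names = list(columns)
    rows = list(zip(*(columns[name] for name in names)))

    seen_rows = set()
    duplicate_rows = []
    for index, row in enumerate(rows):
        if row in seen_rows:
            duplicate_rows.append(index)
        else:
            seen_rows.add(row)

    seen_columns = set()
    duplicate_columns = []
    for name in names:
        key = tuple(columns[name])
        if key in seen_columns:
            duplicate_columns.append(name)
        else:
            seen_columns.add(key)

    return duplicate_rows, duplicate_columns


data = {"a": [1, 2, 1], "b": ["x", "y", "x"], "a_copy": [1, 2, 1]}
dup_rows, dup_cols = duplicate_report(data)
print(dup_rows, dup_cols)
```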

slik_wrangler.dqa.missing_value_assessment(dataframe, display_findings=True)

Assesses the missing values in the given dataset and generates a report of its findings.

Parameters

dataframe (pandas Dataframe) – Data set to perform assessment on.
display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.
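
The "missing values count and percentage" mentioned above can be sketched as follows. This is a minimal illustration on a dict of lists, with None standing in for a missing value; `missing_value_report` is a hypothetical helper, not the library function.

```python
def missing_value_report(columns):
    """Return {column: (missing count, missing percentage)} for columns with gaps."""
    report = {}
    for name, values in columns.items():
        missing = sum(1 for v in values if v is None)
        if missing:
            report[name] = (missing, 100.0 * missing / len(values))
    return report


data = {
    "age": [23, None, 35, None],
    "city": ["Lagos", "Abuja", "Ibadan", "Kano"],
}
report = missing_value_report(data)
print(report)
```

Columns without missing values are left out of the report, so only `age` appears.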

slik_wrangler.loadfile

High-level support for loading files.

slik_wrangler.loadfile.read_file(file_path, input_col=None, **kwargs)

Load a file path into a dataframe.

This function takes in a file path (CSV, Excel, or Parquet) and reads the data based on the input columns specified. It can only load one file at a time.

Parameters

file_path (str/file path) – path to where data is stored.
input_col (list) – select columns to be loaded as a pandas dataframe
**kwargs – use keyword arguments from the pandas read file method

Returns

Return type

pandas Dataframe

slik_wrangler.loadfile.split_csv_file(file_path=None, delimiter=',', row_limit=1000000, output_path='.', keep_headers=True)

Split large csv files to small csv files.

The function splits large CSV files into smaller files based on the row_limit specified. The files are stored in the current working directory by default.

Parameters

file_path (str/file path) – path to where data is stored.
delimiter (str. Default is ',') – separator in each row and column
row_limit (int) – split each file by row count
output_path (str) – output path to store split files
keep_headers (Bool. Default is True) – make use of headers for all csv files

Returns

Return type

Split files are stored in output_path
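
The splitting behaviour described above can be approximated with the standard csv module. The sketch below mirrors the documented parameters, but the function name `split_csv`, the `<name>_<n>.csv` output naming, and the returned list of paths are assumptions made for this example.

```python
import csv
import os


def _write_chunk(rows, header, path, delimiter):
    """Write one chunk of rows, preceded by the header row when one is given."""
    with open(path, "w", newline="") as target:
        writer = csv.writer(target, delimiter=delimiter)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return path


def split_csv(file_path, delimiter=",", row_limit=1000000,
              output_path=".", keep_headers=True):
    """Split a CSV file into files of at most `row_limit` data rows each."""
    base = os.path.splitext(os.path.basename(file_path))[0]
    written = []
    with open(file_path, newline="") as source:
        reader = csv.reader(source, delimiter=delimiter)
        header = next(reader, None)  # assumes the first row holds the headers
        header_to_write = header if keep_headers else None
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == row_limit:
                path = os.path.join(output_path, f"{base}_{len(written) + 1}.csv")
                written.append(_write_chunk(chunk, header_to_write, path, delimiter))
                chunk = []
        if chunk:  # write the final, possibly shorter, chunk
            path = os.path.join(output_path, f"{base}_{len(written) + 1}.csv")
            written.append(_write_chunk(chunk, header_to_write, path, delimiter))
    return written
```

Reading row by row keeps memory use bounded by `row_limit`, which is the point of splitting a large file in the first place.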

slik_wrangler.messages

Creates functionality that colors log messages

slik_wrangler.messages.log(*messages, code='normal', sep=' ', end='\n', file=None)

Distinguishes log messages from print statements. Works like a normal print statement, but with colors.

Parameters

messages – Message to be logged
code – Log significance

Returns

distinguished log message
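
A print-like logger with colors can be built on ANSI escape sequences, as in the sketch below. The significance names in `COLORS` (other than the documented default `'normal'`) and the `colorize` helper are assumptions for this example; the codes slik_wrangler accepts are not listed in this reference.

```python
# ANSI escape sequences; the code names other than 'normal' are assumed.
COLORS = {
    "normal": "",
    "success": "\033[92m",
    "warning": "\033[93m",
    "danger": "\033[91m",
}
RESET = "\033[0m"


def colorize(text, code="normal"):
    """Wrap text in the color for `code`, leaving 'normal' text untouched."""
    color = COLORS[code]
    return f"{color}{text}{RESET}" if color else text


def log(*messages, code="normal", sep=" ", end="\n", file=None):
    """Behave like print(), but color the joined message by significance."""
    text = sep.join(str(message) for message in messages)
    print(colorize(text, code), end=end, file=file)


log("Loaded", 3, "files", code="success")
log("2 columns contain missing values", code="warning")
```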

slik_wrangler.pipeline

Build Data and Model pipelines efficiently.

class slik_wrangler.pipeline.DenseTransformer(*args: Any, **kwargs: Any)

Bases: sklearn.base.

Transform sparse matrix to a dense matrix.

fit(X, y=None, **fit_params)

Fit a sparse matrix.

Fit a sparse matrix with the DenseTransformer class

Parameters

X (numpy array) – Sparse matrix to be fitted
y (numpy array) – Target array

transform(X, y=None, **fit_params)

Transform a fitted sparse matrix to a dense matrix.

DenseTransformer transforms a sparse matrix to a dense matrix. Some transformer classes do not work with sparse matrices, hence the transformation.

Parameters

X (numpy array) – Sparse matrix to be fitted
y (numpy array) – Target array

Returns

Output – Dense matrix

Return type

numpy array

slik_wrangler.pipeline.build_data_pipeline(data=None, target_column=None, id_column=None, clean_data=True, project_path=None, numerical_transformer=None, categorical_transformer=None, select_columns=None, pca=True, algorithm=None, grid_search=False, display_inline=False, hashing=False, params=None, hash_size=500, balance_data=False, **kwargs)

Build data and model pipeline.

Build production ready pipelines efficiently. Specify numerical and categorical transformer. Function also helps to clean your data, reduce dimensionality and handle sparse categorical features.

Parameters

data (str/ pandas dataframe) – Data path or Pandas dataframe.
target_column (str) – target column name
id_column (str) – id column name
clean_data (Bool, default is True) – handle missing value, outlier treatment, feature engineering
project_path (str/file path) – file path to processed data
numerical_transformer (sklearn pipeline) – numerical transformer to transform numerical attributes
categorical_transformer (sklearn pipeline) – categorical transformer to transform categorical attributes
select_columns (list) – columns to be passed/loaded as a dataframe
pca (Bool, default is True) – reduce feature dimensionality
algorithm (Default is None) – sklearn estimator
grid_search (Bool. default is False) – select best parameter after hyperparameter tuning
hashing (Bool. default is False) – handle sparse categorical features
params (dict.) – dictionary of keyword arguments.
display_inline (Bool, default is False) – display dataframe print statement
hash_size (int, default is 500) – size for hashing

Returns

sklearn pipeline estimator

Return type

Output

slik_wrangler.pipeline.evaluate_model(model_path=None, eval_data=None, select_columns=None, project_path=None, **kwargs)

Check model strength by validating model with an evaluation data.

Evaluate model based on slik build data pipeline function. Invoke model on transformed data and return evaluation plots in a file path.

Parameters

model_path (str/file path) – file path to model object
eval_data (str/ pandas dataframe) – Data path or Pandas dataframe.
select_columns (list) – columns to be passed/loaded as a dataframe
project_path (str/file path) – path to project

slik_wrangler.pipeline.get_feature_names(column_transformer)

Get feature names after using column transformer object.

Get feature names after transformations from each transformer object in the column transformer class.

Parameters: column_transformer (sklearn column transformer) –
Returns: feature_names – Names of the features produced by transform.
Return type: list of strings

slik_wrangler.pipeline.pipeline_transform_predict(data=None, select_columns=None, project_path=None, model_path=None)

Transform pipeline object and return Predictions.

Transform dataframe based on slik build data pipeline function. Invoke model on transformed data and return predictions

Parameters

data (str/ pandas dataframe) – Data path or Pandas dataframe.
select_columns (list) – columns to be passed/loaded as a dataframe
project_path (str/file path) – path to project
model_path (str/file path) – file path to model object

Returns

results – list of numpy array predictions

Return type

numpy array

slik_wrangler.plot_funcs

slik_wrangler.plot_funcs.confusion_matrix(cm, fp, norm_axis=1)

[TN, FP]
[FN, TP]

The confusion matrix after validating the model on a test set.

Parameters

cm (confusion matrix) –
fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the confusion matrix

slik_wrangler.plot_funcs.corr_matrix(corr, fp)

slik_wrangler.plot_funcs.feature_importance(features, feature_importances, title, fp)

The feature importance indicating features that contribute the most to the predictive power of the model.

Parameters

features (features of the data set) –
feature_importances (the importances of the features that contribute the most to the predictive power of the model) –
title (title of the feature importance chart) –
fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the feature importances

slik_wrangler.plot_funcs.label_share(share, fp)

The distribution of labels in a data set.

Parameters

share (label distribution) –
fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the label distribution

slik_wrangler.plot_funcs.metric(metrics, fp)

slik_wrangler.plot_funcs.plot_nan(data)

Plot the top values from a value count in a dataframe.

Parameters

data (DataFrame or named Series) – Data set to perform plot operation on.

Returns

A bar plot of the top n values.

slik_wrangler.plot_funcs.pr_curve(pre, rec, auc, fp)

The precision-recall curve for model validation.

Parameters

pre (precision) –
rec (recall) –
auc (area under the curve) –
fp (the file path to save the figure.) –

Returns

Return type

The precision-recall curve

slik_wrangler.plot_funcs.roc_curve(fpr, tpr, auc, fp)

The ROC curve for model validation.

Parameters

fpr (false positive rate) –
tpr (true positive rate) –
auc (area under the curve) –
fp (the file path to save the figure.) –

Returns

Return type

The ROC AUC curve

slik_wrangler.plot_funcs.scores(scores, fp)

The average classification score for model validation.

Parameters

scores (test data scores) –
fp (the file path to save the figure.) –

Returns

Return type

A bar plot of the average classification scores

slik_wrangler.preprocessing

slik_wrangler.preprocessing.bin_age(dataframe=None, age_col=None, add_prefix=True)

The age attribute in a DataFrame is binned into 5 categories: (baby/toddler, child, young adult, mid age and elderly).

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
age_col (str.) – The column to perform the operation on.
add_prefix (Bool. Default is set to True) – add prefix to the column name.

Returns

Return type

Dataframe with binned age attribute
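
Binning a numeric age into the five named categories can be sketched with `bisect`. The bin edges below are hypothetical, chosen only to make the example concrete; this reference does not state the edges slik_wrangler uses, and `age_category` is not a library function.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds for the first four bins (an assumption;
# the real edges are not given in this reference).
AGE_EDGES = [3, 13, 36, 56]
AGE_LABELS = ["baby/toddler", "child", "young adult", "mid age", "elderly"]


def age_category(age):
    """Map an age in years to one of the five category labels."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [2, 12, 13, 40, 70]
binned = [age_category(age) for age in ages]
print(binned)
```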

slik_wrangler.preprocessing.change_case(dataframe, columns=None, case='lower', inplace=False)

Change the case of a pandas series to either upper or lower case

Parameters

dataframe (Dataframe or named Series) –
columns (str, list) – The column or list of columns to perform the operation on
case (str. Default is set to lower) – Indicates the type of operation to perform
inplace (bool. Default is set to False) – Indicates if changes should be made within the dataframe or not.

Returns

Return type

Pandas Dataframe

slik_wrangler.preprocessing.check_datefield(dataframe=None, column=None)

Check whether a column is a datefield and return a Boolean.

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
column (str) – The column name to perform the operation on.

Returns

Returns True if the data point is a datefield.

Return type

Boolean

slik_wrangler.preprocessing.check_nan(dataframe=None, plot=False, display_inline=True)

Display missing values as a pandas dataframe and give a proportion in terms of percentages.

Parameters

dataframe (pandas DataFrame or named Series) –
plot (bool, Default False) – Plots missing values in dataset as a heatmap
display_inline (bool, Default True) – shows missing values in the dataset as a dataframe

Returns

Bar plot of missing values

Return type

Matplotlib Figure

slik_wrangler.preprocessing.create_schema_file(dataframe, target_column, id_column, project_path='.', save=True, display_inline=True)

A data schema of column names and types is automatically inferred and saved in a YAML file.

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
target_column (str) – The name of the target column in the dataset.
id_column (str) – Unique Identifier column.
project_path (str.) – The path of the schema file you want to create.
save (Bool. Default is set to True) – save schema file to file path.
display_inline (Bool. Default is set to True) – display dataframe print statements.

Returns

A schema file is created in the data directory

Return type

file path

slik_wrangler.preprocessing.detect_fix_outliers(dataframe=None, target_column=None, n=1, num_features=None, fix_method='mean', display_inline=True)

Detect outliers present in the numerical features and fix the outliers present.

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
num_features (List, Series, Array.) – Numerical features to perform operation on. If not provided, they are automatically inferred from the dataset.
target_column (string) – The target attribute name.
fix_method (mean or log_transformation. Default is 'mean') – Method of fixing outliers present in the data: mean or log_transformation.
n (integer) – A value to determine whether there are multiple outliers in a record, which is highly dependent on the number of features that are being checked.
display_inline (Bool. Default is True.) – Display the outliers present in the data in form of a dataframe.

Returns

dataframe after removing outliers.

Return type

Dataframe

slik_wrangler.preprocessing.drop_duplicate(dataframe=None, columns=None, method='rows', display_inline=True)

Drop duplicate values across rows or columns in the dataframe.

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
columns (List/String.) – list of column names
method ('rows' or 'columns', default is 'rows') – Drop duplicate values across rows or columns.
display_inline (Bool. Default is True.) – Display print statements.

Returns

dataframe after dropping duplicates.

Return type

Dataframe

slik_wrangler.preprocessing.drop_uninformative_fields(dataframe=None, exclude=None, display_inline=True)

Drop fields that have only a single unique value or are all NaN, meaning that they are entirely uninformative.

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
exclude (string/list.) – A column or list of columns you want to exclude from being dropped.
display_inline (Bool. Default is True.) – Display print statements.

Returns

dataframe after dropping uninformative fields.

Return type

Dataframe
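
The rule "a single unique value or all NaN" can be illustrated on a dict of lists, with None standing in for NaN. This sketch assumes missing values are ignored when counting unique values, and `uninformative_columns` is a hypothetical helper, not the library function.

```python
def uninformative_columns(columns, exclude=()):
    """Return names of columns with at most one distinct non-missing value."""
    dropped = []
    for name, values in columns.items():
        if name in exclude:
            continue  # excluded columns are never reported
        distinct = {v for v in values if v is not None}
        if len(distinct) <= 1:
            dropped.append(name)
    return dropped


data = {
    "id": [1, 2, 3],
    "country": ["NG", "NG", "NG"],   # single unique value
    "notes": [None, None, None],     # entirely missing
    "flag": ["y", "y", "y"],         # constant, but excluded below
}
to_drop = uninformative_columns(data, exclude=["flag"])
print(to_drop)
```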

slik_wrangler.preprocessing.featurize_datetime(dataframe=None, column_name=None, date_features=None, drop=True)

Featurize datetime in the dataset to create new fields such as the Year, Month, Day, Day of the week, Day of the year, Week, end of the month, start of the month, end of the quarter, start of a quarter, end of the year, start of the year

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
column_name (String) – The column to perform the operation on.
date_features (List.) – A list of new datetime features to include in the dataset. The list may contain any of the following elements: ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear', 'Week', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Hour', 'Minute', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Date']
drop (Bool. Default is set to True) – drop original datetime column.

Returns

Dataframe with new datetime fields

Return type

Dataframe
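
To make the listed feature names concrete, the sketch below derives most of them from a single `datetime` value using only the standard library. `datetime_features` is a hypothetical helper, and the exact conventions the library follows (for example Monday as day 0, or ISO week numbers) are assumptions.

```python
import calendar
from datetime import datetime


def datetime_features(value):
    """Derive calendar features from one datetime value."""
    last_day = calendar.monthrange(value.year, value.month)[1]
    return {
        "Year": value.year,
        "Month": value.month,
        "Day": value.day,
        "Dayofweek": value.weekday(),            # Monday is 0 (assumed convention)
        "Dayofyear": value.timetuple().tm_yday,
        "Week": value.isocalendar()[1],          # ISO week number (assumed)
        "Is_month_start": value.day == 1,
        "Is_month_end": value.day == last_day,
        "Is_quarter_start": value.day == 1 and value.month in (1, 4, 7, 10),
        "Is_quarter_end": value.day == last_day and value.month in (3, 6, 9, 12),
        "Is_year_start": (value.month, value.day) == (1, 1),
        "Is_year_end": (value.month, value.day) == (12, 31),
        "Hour": value.hour,
        "Minute": value.minute,
    }


features = datetime_features(datetime(2021, 3, 31, 14, 30))
print(features)
```

For 31 March 2021 this marks both the month end and the quarter end, but not the year end.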

slik_wrangler.preprocessing.get_attributes(data=None, target_column=None)

Returns the categorical features and numerical features (in a pandas dataframe) as a list.

Parameters

data (DataFrame or named Series) – Data set to perform operation on.
target_column (str) – Label or Target column

Returns

A list of all the categorical features and numerical features in a dataset.

Return type

List

slik_wrangler.preprocessing.handle_nan(dataframe=None, target_name=None, strategy='mean', fillna='mode', drop_outliers=True, thresh_y=75, thresh_x=75, display_inline=True, **kwargs)

Handle missing values present in a pandas dataframe.

Take care of missing values in both categorical and numerical features by dropping or filling them. Using the threshold parameters you can also drop rows or columns with missing values. Outliers are treated before handling missing values by default.

Parameters

dataframe (DataFrame or named Series) – Data set to perform operation on.
target_name (str) – Name of the target column
strategy (str. Default is 'mean') – Method of filling numerical features
fillna (str. Default is 'mode') – Method of filling categorical features
drop_outliers (bool, Default True) – Drops outliers present in the data.
thresh_x (Int, Default is 75.) – Threshold for dropping rows with missing values.
thresh_y (Int, Default is 75.) – Threshold for dropping columns with missing values.
display_inline (Bool. default is True.) – display pandas dataframe print statements

Returns

Dataframe without missing values

Return type

Pandas Dataframe
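
The default filling choices named above ('mean' for numerical features, 'mode' for categorical ones) amount to the following, shown here on plain lists with None as the missing value. `fill_missing` is a hypothetical helper; outlier treatment and the row and column thresholds are deliberately left out of this sketch.

```python
from statistics import mean, mode


def fill_missing(values, strategy="mean"):
    """Replace None with the mean or the mode of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to compute a fill value from
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in values]


ages = fill_missing([20, None, 40], strategy="mean")
cities = fill_missing(["Lagos", None, "Lagos", "Abuja"], strategy="mode")
print(ages, cities)
```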

slik_wrangler.preprocessing.identify_columns(dataframe=None, target_column=None, id_column=None, high_dim=100, display_inline=True, project_path=None)

Identifies numerical attributes, categorical attributes with sparse features, and categorical attributes with lower features present in the data, and saves the output in a YAML file.

Parameters

dataframe (DataFrame or named Series) –
target_column (str) – Label or Target column.
id_column (str) – unique identifier column.
high_dim (int, default 100) – Integer to identify categorical attributes greater than 100 observations
display_inline (Bool, default=True) – display print statement
project_path (str) – path to where the yaml file is saved.

slik_wrangler.preprocessing.manage_columns(dataframe=None, columns=None, select_columns=False, drop_columns=False, drop_duplicates=None)

Manage operations on a pandas dataframe based on columns. Operations include selecting columns, dropping columns, and dropping duplicates.

Parameters

dataframe (DataFrame or named Series) –
columns (list) – Used to specify columns to be selected, dropped, or used in dropping duplicates.
select_columns (Boolean True or False, default is False) – The columns you want to select from your dataframe. Requires a list to be passed into the columns param
drop_columns (Boolean True or False, default is False) – The columns you want to drop from your dataset. Requires a list to be passed into the columns param
drop_duplicates ('rows' or 'columns', default is None) – Drop duplicate values across rows, columns. If columns, a list is required to be passed into the columns param

Returns

A new dataframe after dropping/selecting/removing duplicate columns or the original dataframe if params are left as default

Return type

Pandas Dataframe

slik_wrangler.preprocessing.map_column(dataframe=None, column_name=None, items=None, add_prefix=True)

Map values in a pandas dataframe column with a dict.

Parameters

dataframe (DataFrame or named Series) –
column_name (str.) – Name of pandas dataframe column to be mapped
items (Dict, default is None) – A dict with key and value to be mapped
add_prefix (Bool, default is True) – Include a prefix of the target column in the dataset

Returns

A new dataframe with mapped features.

Return type

Pandas Dataframe

slik_wrangler.preprocessing.map_target(dataframe=None, target_column=None, add_prefix=True, drop=False, display_inline=True)

Map target column in a pandas dataframe column with a dict. This can be applied to both binary and multi-class target

Parameters

dataframe (DataFrame or named Series) –
target_column (str) – Name of the target column
add_prefix (Bool. Default is True) – Include a prefix of the target column in the dataset
drop (Bool. Default is False) – drop original target column name
display_inline (Bool. Default is True) –

Returns

A new dataframe with mapped target column

Return type

Pandas Dataframe

slik_wrangler.preprocessing.preprocess(data=None, target_column=None, train=False, select_columns=None, display_inline=True, project_path=None, **kwargs)

Automatically preprocess a dataframe or file path. Handles missing values, outlier treatment, and feature engineering.

Parameters

data (DataFrame or named Series) – Dataframe or path to the data
target_column (String) – Name of pandas dataframe target column
train (Bool, default is False) –
select_columns (List) – List of columns to be used
project_path (Str) – Path to where the preprocessed data will be stored
display_inline (Bool. Default is True) –

Returns

Returns a clean dataframe in the filepath

Return type

Pandas Dataframe

slik_wrangler.preprocessing.rename_similar_values(dataframe, column_name, cut_off=0.75, n=None)

Use Sequence Matcher to check for best “good enough” matches.

Rename values based on similar matches.

Parameters

dataframe (Pandas Series) –
column_name (str.) – Name of pandas column to perform operation on
cut_off (float. Default 0.75) – Possibilities that don’t score at least that similar to the word are ignored
n (int, optional. Default 2.) – The maximum number of close matches to return. n must be > 0.

Returns

Return type

Pandas Dataframe.

Example

>>> pd.DataFrame(["Lagos", "Lag", "Abuja", "Abuja FCT", "Ibadan"], columns=["column_name"])

Applying the function to this column yields

["Lagos", "Lagos", "Abuja", "Abuja", "Ibadan"]
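
The idea of renaming values by "good enough" matches can be sketched with `difflib.get_close_matches`, which is built on SequenceMatcher. This is one possible approach, not the library's implementation: `rename_similar` is a hypothetical helper that treats the first-seen spelling as canonical. The demonstration uses a cutoff of 0.7, an assumption made here because the plain SequenceMatcher ratio between "Abuja FCT" and "Abuja" is about 0.71.

```python
from difflib import get_close_matches


def rename_similar(values, cut_off=0.75, n=1):
    """Replace each value with the first earlier value it closely matches."""
    canonical = []  # spellings accepted so far, in first-seen order
    renamed = []
    for value in values:
        matches = get_close_matches(value, canonical, n=n, cutoff=cut_off)
        if matches:
            renamed.append(matches[0])   # best-scoring earlier spelling
        else:
            canonical.append(value)      # new spelling becomes canonical
            renamed.append(value)
    return renamed


cities = ["Lagos", "Lag", "Abuja", "Abuja FCT", "Ibadan"]
result = rename_similar(cities, cut_off=0.7)
print(result)
```

Because the first-seen spelling wins, the order of the input values affects the outcome of this sketch.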

slik_wrangler.preprocessing.trim_all_columns(dataframe)

Trim whitespace from ends of each value across all series in dataframe

Parameters: dataframe (Pandas dataframe) –
Returns
Return type: Pandas Dataframe

slik_wrangler.utils

class slik_wrangler.utils.HiddenPrints

Bases: object

Hide prints of a function

Parameters: None –
Returns: None
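
A class that hides prints is commonly written as a context manager that swaps `sys.stdout` for a null device. The sketch below shows that pattern under the assumption that HiddenPrints is used in a `with` block; it is named `QuietPrints` to make clear it is an illustration, not the library class.

```python
import os
import sys


class QuietPrints:
    """Silence print() inside a `with` block by redirecting sys.stdout."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        self._devnull = open(os.devnull, "w")
        sys.stdout = self._devnull
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        sys.stdout = self._original_stdout  # always restore the real stream
        self._devnull.close()
        return False  # do not swallow exceptions raised in the block


with QuietPrints():
    print("this line is hidden")
print("this line is shown")
```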

slik_wrangler.utils.get_scores(y_true, y_pred)

Get metrics of model performance such as accuracy, precision, recall and f1.

Parameters

y_true – the target value of test/validation data.
y_pred – the predicted value

Returns

Accuracy, precision, recall and f1
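
For reference, the four metrics can be computed from true and predicted labels as below. The sketch assumes a binary problem with 1 as the positive class; `binary_scores` is a hypothetical helper, and how the library handles multi-class targets or averaging is not covered here.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```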

slik_wrangler.utils.load_pickle(fp)

Load pickle file(data, model or pipeline object).

Parameters: fp – the file path of the pickle files.
Returns: Loaded pickle file

slik_wrangler.utils.log_plot(args, plot_func, fp)

Log the plots of your metrics and save output in a specified file path.

Parameters

args – A tuple. Arguments required to plot the required metrics
plot_func – A function. Contains different methods for plotting metrics such as ROC-AUC and PR-curve
fp – File path. The path to write the output logs of the plot

Returns

None

slik_wrangler.utils.print_divider(title)

Expand print function with a clear differentiator using -.

Parameters: title – the title of the print statement
Returns: None

slik_wrangler.utils.store_attribute(dict_file, output_path)

Store attributes of a dataframe as a dict.

Parameters

dict_file – the dictionary.
output_path – the path where the file is saved

Returns

None

slik_wrangler.utils.store_pipeline(pipeline_object, pipeline_path)

Store the column transformer pipeline object.

Parameters

pipeline_object – the pipeline object.
pipeline_path – the path where the pipeline is saved

Returns

None
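
Storing a pipeline object and loading it back later is a pickle round trip. The sketch below shows that round trip with the standard pickle module; `store_object` and `load_object` are hypothetical stand-ins, and whether slik_wrangler uses pickle or another serializer for pipelines is an assumption.

```python
import os
import pickle
import tempfile


def store_object(obj, path):
    """Serialize any picklable object (data, model or pipeline) to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Load an object previously written with store_object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dict stands in for a fitted pipeline in this example.
pipeline_like = {"steps": ["imputer", "scaler"], "hash_size": 500}
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "pipeline.pkl")
    store_object(pipeline_like, target)
    restored = load_object(target)
print(restored == pipeline_like)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.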
