slik_wrangler package
Submodules
slik_wrangler.dqa
Module for assessing data quality.
- slik_wrangler.dqa.consistent_structure_assessement(dataframe, display_findings=True)
Checks the consistency of each feature column.
It checks whether the dtype within each feature column is consistent, i.e. it flags a column that holds both integer and string values.
- Parameters
dataframe (pandas Dataframe) – Data set to perform assessment on.
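The kind of check described above can be illustrated without the library. The sketch below is not slik_wrangler's implementation: it assumes a plain dict of column name to values as a stand-in for a dataframe, and the helper name `inconsistent_columns` is hypothetical.

```python
# Hypothetical stand-in for a dataframe: column name -> list of values.
columns = {
    "age": [23, 31, "unknown", 45],
    "city": ["Lagos", "Abuja", "Ibadan", "Lagos"],
}


def inconsistent_columns(table):
    """Return {column: sorted type names} for columns holding more than one type."""
    findings = {}
    for name, values in table.items():
        type_names = sorted({type(value).__name__ for value in values})
        if len(type_names) > 1:
            findings[name] = type_names
    return findings


report = inconsistent_columns(columns)
print(report)
```

Here only the `age` column is reported, because it mixes integers with a string.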
- slik_wrangler.dqa.data_cleanness_assessment(dataframe, display_findings=True)
Checks the overall cleanness of the dataframe (missing values in the dataset, duplicates, and any inconsistent feature columns) and gives a report.
- Parameters
dataframe (pandas Dataframe) – Data set to perform assessment on.
display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.
- slik_wrangler.dqa.duplicate_assessment(dataframe, display_findings=True)
Assesses the duplicate values in the given dataset and generates a report of its findings. It does this assessment for both rows and feature columns.
- Parameters
dataframe (pandas Dataframe) – Data set to perform assessment on.
display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.
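A duplicate check over both rows and feature columns could look like the sketch below. It is an illustration under assumptions, not the library's code: the table is a plain dict of equal-length lists, and `duplicate_report` is a hypothetical helper.

```python
# Hypothetical stand-in for a dataframe: column name -> list of values.
table = {
    "a": [1, 2, 1, 3],
    "b": ["x", "y", "x", "z"],
    "a_copy": [1, 2, 1, 3],
}


def duplicate_report(data):
    """Return (indices of repeated rows, names of repeated columns)."""
    names = list(data)
    rows = list(zip(*(data[name] for name in names)))

    seen_rows, duplicate_rows = set(), []
    for index, row in enumerate(rows):
        if row in seen_rows:
            duplicate_rows.append(index)
        seen_rows.add(row)

    seen_columns, duplicate_columns = set(), []
    for name in names:
        key = tuple(data[name])
        if key in seen_columns:
            duplicate_columns.append(name)
        seen_columns.add(key)

    return duplicate_rows, duplicate_columns


duplicate_rows, duplicate_columns = duplicate_report(table)
print(duplicate_rows, duplicate_columns)
```

The first occurrence of a row or column is kept as the reference, so only later repeats are reported.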
- slik_wrangler.dqa.missing_value_assessment(dataframe, display_findings=True)
Assesses the missing values in the given dataset and generates a report of its findings.
- Parameters
dataframe (pandas Dataframe) – Data set to perform assessment on.
display_findings (boolean, Default True) – Whether or not to display a dataframe highlighting the missing values count and percentage.
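The "missing values count and percentage" mentioned above can be computed as in this minimal sketch. It assumes `None` marks a missing value and uses a plain dict in place of a dataframe; `missing_value_report` is a hypothetical name, not part of slik_wrangler.

```python
# Hypothetical stand-in for a dataframe; None marks a missing value.
table = {
    "name": ["Ada", "Bayo", None, "Chi"],
    "score": [90.0, None, None, 75.5],
    "city": ["Lagos", "Abuja", "Ibadan", "Lagos"],
}


def missing_value_report(data):
    """Return {column: (missing_count, missing_percent)} for columns with gaps."""
    report = {}
    for name, values in data.items():
        missing_count = sum(1 for value in values if value is None)
        if missing_count:
            percent = round(100 * missing_count / len(values), 2)
            report[name] = (missing_count, percent)
    return report


missing = missing_value_report(table)
for name, (count, percent) in missing.items():
    print(f"{name}: {count} missing ({percent}%)")
```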
slik_wrangler.loadfile
High-level support for loading files.
- slik_wrangler.loadfile.read_file(file_path, input_col=None, **kwargs)
Load a file path into a dataframe.
This function takes a file path (CSV, Excel, or parquet) and reads the data based on the input columns specified. It can only load one file at a time.
- Parameters
file_path (str/file path) – path to where data is stored.
input_col (list) – select columns to be loaded as a pandas dataframe
**kwargs – keyword arguments accepted by the pandas file-reading method
- Returns
- Return type
pandas Dataframe
- slik_wrangler.loadfile.split_csv_file(file_path=None, delimiter=',', row_limit=1000000, output_path='.', keep_headers=True)
Split a large CSV file into smaller CSV files.
The function splits a large CSV file into smaller files based on the row_limit specified. The files are stored in the current working directory by default.
- Parameters
file_path (str/file path) – path to where data is stored.
delimiter (str. Default is ',') – separator between the fields of each row
row_limit (int) – split each file by row count
output_path (str) – output path to store the split files
keep_headers (Bool. Default is True) – make use of headers for all csv files
- Returns
- Return type
Split files are stored in output_path
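The row-limit splitting described above can be sketched with the standard `csv` module. This is an illustration, not the library's implementation: the `part_<n>.csv` naming scheme, the helper `write_part`, and treating the first row as the header are all assumptions.

```python
import csv
import os
import tempfile


def write_part(rows, header, index, output_path, delimiter, keep_headers):
    """Write one chunk to a numbered file and return its path."""
    # The part_<n>.csv naming scheme is an assumption made for this sketch.
    target = os.path.join(output_path, f"part_{index}.csv")
    with open(target, "w", newline="") as out:
        writer = csv.writer(out, delimiter=delimiter)
        if keep_headers:
            writer.writerow(header)
        writer.writerows(rows)
    return target


def split_csv(file_path, delimiter=",", row_limit=1000000, output_path=".", keep_headers=True):
    """Split file_path into files holding at most row_limit data rows each."""
    parts = []
    with open(file_path, newline="") as source:
        reader = csv.reader(source, delimiter=delimiter)
        header = next(reader, None)  # first row is treated as the header
        if header is None:
            return parts
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == row_limit:
                parts.append(write_part(chunk, header, len(parts) + 1, output_path, delimiter, keep_headers))
                chunk = []
        if chunk:  # leftover rows that did not fill a whole chunk
            parts.append(write_part(chunk, header, len(parts) + 1, output_path, delimiter, keep_headers))
    return parts


# Demo on a five-row file split into chunks of two rows.
with tempfile.TemporaryDirectory() as workdir:
    source_file = os.path.join(workdir, "big.csv")
    with open(source_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        writer.writerows([[i, i * i] for i in range(5)])
    parts = split_csv(source_file, row_limit=2, output_path=workdir)
    part_names = [os.path.basename(path) for path in parts]
    with open(parts[-1], newline="") as handle:
        last_rows = list(csv.reader(handle))

print(part_names, last_rows)
```

With `keep_headers=True`, every output file repeats the header row, so each part can be loaded on its own.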
slik_wrangler.messages
Creates functionality that colors log messages
- slik_wrangler.messages.log(*messages, code='normal', sep=' ', end='\n', file=None)
Distinguishes log messages from print statements. Works like a normal print statement, but with colors.
- Parameters
messages – Message to be logged
code – Log significance
- Returns
Distinguished log message
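A print-like function that colors its output can be sketched with ANSI escape codes. The mapping from `code` values to colors below is an assumption for illustration (only 'normal' appears in the signature above), and `format_log` is a hypothetical helper.

```python
import sys

# Assumed mapping from a significance code to an ANSI color; not the library's table.
ANSI_COLORS = {
    "normal": "\033[0m",
    "info": "\033[94m",
    "success": "\033[92m",
    "warning": "\033[93m",
    "danger": "\033[91m",
}
RESET = "\033[0m"


def format_log(*messages, code="normal", sep=" "):
    """Join the messages like print() would and wrap them in a color code."""
    color = ANSI_COLORS.get(code, ANSI_COLORS["normal"])
    return f"{color}{sep.join(str(message) for message in messages)}{RESET}"


def log(*messages, code="normal", sep=" ", end="\n", file=None):
    """Print colored messages; unknown codes fall back to the normal color."""
    print(format_log(*messages, code=code, sep=sep), end=end, file=file or sys.stdout)


log("Dropped", 3, "duplicate rows", code="warning")
```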
slik_wrangler.pipeline
Build Data and Model pipelines efficiently.
- class slik_wrangler.pipeline.DenseTransformer(*args: Any, **kwargs: Any)
Bases:
sklearn.base.
Transform sparse matrix to a dense matrix.
- fit(X, y=None, **fit_params)
Fit a sparse matrix.
Fit a sparse matrix with the DenseTransformer class
- Parameters
X (numpy array) – Sparse matrix to be fitted
y (numpy array) – Target array
- transform(X, y=None, **fit_params)
Transform a fitted sparse matrix to a dense matrix.
DenseTransformer transforms a sparse matrix to a dense matrix. Some transformer classes do not work with sparse matrices, hence the transformation.
- Parameters
X (numpy array) – Sparse matrix to be fitted
y (numpy array) – Target array
- Returns
Output – Dense matrix
- Return type
numpy array
- slik_wrangler.pipeline.build_data_pipeline(data=None, target_column=None, id_column=None, clean_data=True, project_path=None, numerical_transformer=None, categorical_transformer=None, select_columns=None, pca=True, algorithm=None, grid_search=False, display_inline=False, hashing=False, params=None, hash_size=500, balance_data=False, **kwargs)
Build data and model pipeline.
Build production ready pipelines efficiently. Specify numerical and categorical transformer. Function also helps to clean your data, reduce dimensionality and handle sparse categorical features.
- Parameters
data (str/ pandas dataframe) – Data path or Pandas dataframe.
target_column (str) – target column name
id_column (str) – id column name
clean_data (Bool, default is True) – handle missing value, outlier treatment, feature engineering
project_path (str/file path) – file path to processed data
numerical_transformer (sklearn pipeline) – numerical transformer to transform numerical attributes
categorical_transformer (sklearn pipeline) – categorical transformer to transform categorical attributes
select_columns (list) – columns to be passed/loaded as a dataframe
pca (Bool, default is True) – reduce feature dimensionality
algorithm (Default is None) – sklearn estimator
grid_search (Bool. default is False) – select best parameter after hyperparameter tuning
hashing (Bool. default is False) – handle sparse categorical features
params (dict.) – dictionary of keyword arguments.
display_inline (Bool, default is False) – display dataframe print statement
hash_size (int, default is 500) – size for hashing
- Returns
sklearn pipeline estimator
- Return type
Output
- slik_wrangler.pipeline.evaluate_model(model_path=None, eval_data=None, select_columns=None, project_path=None, **kwargs)
Check model strength by validating model with an evaluation data.
Evaluate model based on slik build data pipeline function. Invoke model on transformed data and return evaluation plots in a file path.
- Parameters
model_path (str/file path) – file path to model object
eval_data (str/ pandas dataframe) – Data path or Pandas dataframe.
select_columns (list) – columns to be passed/loaded as a dataframe
project_path (str/file path) – path to project
- slik_wrangler.pipeline.get_feature_names(column_transformer)
Get feature names after using column transformer object.
Get feature names after transformations from each transformer object in the column transformer class.
- Parameters
column_transformer (sklearn column transformer) –
- Returns
feature_names – Names of the features produced by transform.
- Return type
list of strings
- slik_wrangler.pipeline.pipeline_transform_predict(data=None, select_columns=None, project_path=None, model_path=None)
Transform pipeline object and return Predictions.
Transform dataframe based on slik build data pipeline function. Invoke model on transformed data and return predictions
- Parameters
data (str/ pandas dataframe) – Data path or Pandas dataframe.
select_columns (list) – columns to be passed/loaded as a dataframe
project_path (str/file path) – path to project
model_path (str/file path) – file path to model object
- Returns
results – list of numpy array predictions
- Return type
numpy array
slik_wrangler.plot_funcs
- slik_wrangler.plot_funcs.confusion_matrix(cm, fp, norm_axis=1)
The confusion matrix after validating the model on a test set, laid out as [TN, FP] [FN, TP].
- Parameters
cm (confusion matrix) –
fp (the file path to save the figure.) –
- Returns
- Return type
A bar plot of the confusion matrix
- slik_wrangler.plot_funcs.corr_matrix(corr, fp)
- slik_wrangler.plot_funcs.feature_importance(features, feature_importances, title, fp)
The feature importance indicating features that contribute the most to the predictive power of the model.
- Parameters
features (features of the data set) –
feature_importances (the importances of the features that contribute the most to the predictive power of the model) –
title (title of the feature importance chart) –
fp (the file path to save the figure.) –
- Returns
- Return type
A bar plot of the feature importances
The distribution of label in a data set.
- Parameters
share (label distribution) –
fp (the file path to save the figure.) –
- Returns
- Return type
A bar plot of the label distribution
- slik_wrangler.plot_funcs.metric(metrics, fp)
- slik_wrangler.plot_funcs.plot_nan(data)
Plot the top values from a value count in a dataframe.
- Parameters
data (DataFrame or named Series.) – Data set to perform plot operation on.
- Returns
A bar plot of the top n values.
- slik_wrangler.plot_funcs.pr_curve(pre, rec, auc, fp)
The precision-recall curve for model validation.
- Parameters
pre (precision) –
rec (recall) –
auc (area under the curve) –
fp (the file path to save the figure.) –
- Returns
- Return type
The precision-recall curve
- slik_wrangler.plot_funcs.roc_curve(fpr, tpr, auc, fp)
The roc_curve for model validation.
- Parameters
fpr (false positive rate) –
tpr (true positive rate) –
auc (area under the curve) –
fp (the file path to save the figure.) –
- Returns
- Return type
The ROC AUC curve
- slik_wrangler.plot_funcs.scores(scores, fp)
The average classification score for model validation.
- Parameters
scores (test data scores) –
fp (the file path to save the figure.) –
- Returns
- Return type
A bar plot of the average classification scores
slik_wrangler.preprocessing
- slik_wrangler.preprocessing.bin_age(dataframe=None, age_col=None, add_prefix=True)
The age attribute in a DataFrame is binned into 5 categories: (baby/toddler, child, young adult, mid age and elderly).
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
age_col (str.) – The column to perform the operation on.
add_prefix (Bool. Default is set to True) – add prefix to the column name.
- Returns
- Return type
Dataframe with binned age attribute
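Binning a numeric age into the five named categories can be sketched with `bisect`. The cut points in `AGE_EDGES` are assumptions chosen for illustration; the source does not state the boundaries the library uses.

```python
from bisect import bisect_right

# Assumed exclusive upper bounds for the first four bins; the real cut points may differ.
AGE_EDGES = [3, 18, 36, 61]
AGE_LABELS = ["baby/toddler", "child", "young adult", "mid age", "elderly"]


def bin_age(age):
    """Map a numeric age to one of the five category labels."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [1, 10, 25, 47, 80]
binned = [bin_age(age) for age in ages]
print(binned)
```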
- slik_wrangler.preprocessing.change_case(dataframe, columns=None, case='lower', inplace=False)
Change the case of a pandas series to either upper or lower case
- Parameters
dataframe (Dataframe or named Series) –
columns (str, list) – The column or list of columns to perform the operation on
case (str. Default is set to lower) – Indicates the type of operation to perform
inplace (bool. Default is set to False) – Indicates if changes should be made within the dataframe or not.
- Returns
- Return type
Pandas Dataframe
- slik_wrangler.preprocessing.check_datefield(dataframe=None, column=None)
Check whether a column is a date field and return a Boolean.
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
column (str) – The column name to perform the operation on.
- Returns
Returns True if the data point is a datefield.
- Return type
Boolean
- slik_wrangler.preprocessing.check_nan(dataframe=None, plot=False, display_inline=True)
Display missing values as a pandas dataframe and give a proportion in terms of percentages.
- Parameters
dataframe (pandas DataFrame or named Series) –
plot (bool, Default False) – Plots missing values in dataset as a heatmap
display_inline (bool, Default True) – shows missing values in the dataset as a dataframe
- Returns
Bar plot of missing values
- Return type
Matplotlib Figure
- slik_wrangler.preprocessing.create_schema_file(dataframe, target_column, id_column, project_path='.', save=True, display_inline=True)
A data schema of column names and types is automatically inferred and saved in a YAML file.
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
target_column (str) – The name of the target column in the dataset.
id_column (str) – Unique Identifier column.
project_path (str.) – The path of the schema file you want to create.
save (Bool. Default is set to True) – save schema file to file path.
display_inline (Bool. Default is set to True) – display dataframe print statements.
- Returns
A schema file is created in the data directory
- Return type
file path
- slik_wrangler.preprocessing.detect_fix_outliers(dataframe=None, target_column=None, n=1, num_features=None, fix_method='mean', display_inline=True)
Detect outliers present in the numerical features and fix the outliers present.
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
num_features (List, Series, Array.) – Numerical features to perform operation on. If not provided, we automatically infer from the dataset.
target_column (string) – The target attribute name.
fix_method ('mean' or 'log_transformation'. Default is 'mean') – Method of fixing outliers present in the data.
n (integer) – A value to determine whether there are multiple outliers in a record, which is highly dependent on the number of features that are being checked.
display_inline (Bool. Default is True.) – Display the outliers present in the data in form of a dataframe.
- Returns
dataframe after removing outliers.
- Return type
Dataframe
- slik_wrangler.preprocessing.drop_duplicate(dataframe=None, columns=None, method='rows', display_inline=True)
Drop duplicate values across rows, columns in the dataframe.
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
columns (List/String.) – list of column names
method ('rows' or 'columns', default is 'rows') – Drop duplicate values across rows, columns.
display_inline (Bool. Default is True.) – Display print statements.
- Returns
dataframe after dropping duplicates.
- Return type
Dataframe
- slik_wrangler.preprocessing.drop_uninformative_fields(dataframe=None, exclude=None, display_inline=True)
Drop fields that have only a single unique value or are all NaN, meaning that they are entirely uninformative.
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
exclude (string/list.) – A column or list of columns you want to exclude from being dropped.
display_inline (Bool. Default is True.) – Display print statements.
- Returns
dataframe after dropping uninformative fields.
- Return type
Dataframe
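The rule described above (drop a field with a single unique value or with nothing but missing values) can be sketched as follows. This is not the library's code: the dict-of-lists table, the use of `None` for missing values, and the helper name `drop_uninformative` are assumptions.

```python
def drop_uninformative(data, exclude=None):
    """Keep columns with more than one distinct non-missing value, plus excluded ones."""
    if isinstance(exclude, str):
        excluded = {exclude}
    else:
        excluded = set(exclude or [])
    kept = {}
    for name, values in data.items():
        # None is ignored, so an all-missing column has zero distinct values.
        distinct = {value for value in values if value is not None}
        if name in excluded or len(distinct) > 1:
            kept[name] = values
    return kept


table = {
    "country": ["NG", "NG", "NG"],   # single unique value
    "notes": [None, None, None],      # entirely missing
    "age": [21, 34, 21],
    "source": ["web", "web", "web"],  # single value, but excluded below
}
cleaned = drop_uninformative(table, exclude="source")
print(list(cleaned))
```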
- slik_wrangler.preprocessing.featurize_datetime(dataframe=None, column_name=None, date_features=None, drop=True)
Featurize datetime in the dataset to create new fields such as the Year, Month, Day, Day of the week, Day of the year, Week, end of the month, start of the month, end of the quarter, start of a quarter, end of the year, start of the year
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
column_name (String) – The column to perform the operation on.
date_features (List.) – A list of new datetime features to include in the dataset. The list should contain any of the following elements: ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear', 'Week', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Hour', 'Minute', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Date']
drop (Bool. Default is set to True) – drop original datetime column.
- Returns
Dataframe with new datetime fields
- Return type
Dataframe
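To make the listed feature names concrete, the sketch below derives most of them from a single timestamp using only the standard library. It is an illustration of what each field means, not the library's implementation; the function name `featurize` is hypothetical.

```python
import calendar
from datetime import datetime


def featurize(moment):
    """Derive calendar features from one datetime value."""
    last_day = calendar.monthrange(moment.year, moment.month)[1]
    return {
        "Year": moment.year,
        "Month": moment.month,
        "Day": moment.day,
        "Dayofweek": moment.weekday(),  # Monday is 0
        "Dayofyear": moment.timetuple().tm_yday,
        "Week": moment.isocalendar()[1],  # ISO week number
        "Is_month_end": moment.day == last_day,
        "Is_month_start": moment.day == 1,
        "Is_quarter_end": moment.month in (3, 6, 9, 12) and moment.day == last_day,
        "Is_quarter_start": moment.month in (1, 4, 7, 10) and moment.day == 1,
        "Is_year_end": (moment.month, moment.day) == (12, 31),
        "Is_year_start": (moment.month, moment.day) == (1, 1),
        "Hour": moment.hour,
        "Minute": moment.minute,
    }


features = featurize(datetime(2021, 3, 31, 14, 30))
print(features)
```

For 31 March 2021 the month-end and quarter-end flags are both true, while the year-end flag is false.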
- slik_wrangler.preprocessing.get_attributes(data=None, target_column=None)
Returns the categorical features and numerical features (in a pandas dataframe) as a list.
- Parameters
data (DataFrame or named Series) – Data set to perform operation on.
target_column (str) – Label or Target column
- Returns
A list of all the categorical features and numerical features in a dataset.
- Return type
List
- slik_wrangler.preprocessing.handle_nan(dataframe=None, target_name=None, strategy='mean', fillna='mode', drop_outliers=True, thresh_y=75, thresh_x=75, display_inline=True, **kwargs)
Handle missing values present in a pandas dataframe.
Take care of missing values in both categorical and numerical features by dropping or filling them. Using the threshold parameters, you can also drop missing values present in the data. Outliers are treated before handling missing values by default.
- Parameters
dataframe (DataFrame or named Series.) – Data set to perform operation on.
target_name (str) – Name of the target column
strategy (str. Default is 'mean') – Method of filling numerical features
fillna (str. Default is 'mode') – Method of filling categorical features
drop_outliers (bool, Default True) – Drops outliers present in the data.
thresh_x (Int, Default is 75.) – Threshold for dropping rows with missing values.
thresh_y (Int, Default is 75.) – Threshold for dropping columns with missing values.
display_inline (Bool. default is True.) – display pandas dataframe print statements
- Returns
Dataframe without missing values
- Return type
Pandas Dataframe
- slik_wrangler.preprocessing.identify_columns(dataframe=None, target_column=None, id_column=None, high_dim=100, display_inline=True, project_path=None)
Identifies numerical attributes, categorical attributes with sparse features, and categorical attributes with lower features present in the data, and saves the output in a YAML file.
- Parameters
dataframe (DataFrame or named Series) –
target_column (str) – Label or Target column.
id_column (str) – unique identifier column.
high_dim (int, default 100) – Integer to identify categorical attributes greater than 100 observations
display_inline (Bool, default=True) – display print statement
project_path (str) – path to where the yaml file is saved.
- slik_wrangler.preprocessing.manage_columns(dataframe=None, columns=None, select_columns=False, drop_columns=False, drop_duplicates=None)
Manage operations on pandas dataframe based on columns. Operations include selecting of columns, dropping column and dropping duplicates.
- Parameters
dataframe (DataFrame or named Series) –
columns (used to specify columns to be selected, dropped or used in dropping duplicates.) –
select_columns (Boolean True or False, default is False) – The columns you want to select from your dataframe. Requires a list to be passed into the columns param
drop_columns (Boolean True or False, default is False) – The columns you want to drop from your dataset. Requires a list to be passed into the columns param
drop_duplicates ('rows' or 'columns', default is None) – Drop duplicate values across rows, columns. If columns, a list is required to be passed into the columns param
- Returns
A new dataframe after dropping/selecting/removing duplicate columns or the original dataframe if params are left as default
- Return type
Pandas Dataframe
- slik_wrangler.preprocessing.map_column(dataframe=None, column_name=None, items=None, add_prefix=True)
Map values in a pandas dataframe column with a dict.
- Parameters
dataframe (DataFrame or named Series) –
column_name (str.) – Name of pandas dataframe column to be mapped
items (Dict, default is None) – A dict with key and value to be mapped
add_prefix (Bool, default is True) – Include a prefix of the target column in the dataset
- Returns
A new dataframe with mapped features.
- Return type
Pandas Dataframe
- slik_wrangler.preprocessing.map_target(dataframe=None, target_column=None, add_prefix=True, drop=False, display_inline=True)
Map target column in a pandas dataframe column with a dict. This can be applied to both binary and multi-class target
- Parameters
dataframe (DataFrame or named Series) –
target_column (str) – Name of the target column
add_prefix (Bool. Default is True) – Include a prefix of the target column in the dataset
drop (Bool. Default is False) – drop original target column name
display_inline (Bool. Default is True) –
- Returns
A new dataframe with mapped target column
- Return type
Pandas Dataframe
- slik_wrangler.preprocessing.preprocess(data=None, target_column=None, train=False, select_columns=None, display_inline=True, project_path=None, **kwargs)
Automatically preprocess a dataframe or file path. Handles missing values, outlier treatment, and feature engineering.
- Parameters
data (DataFrame or named Series) – Dataframe or data path to the data
target_column (String) – Name of pandas dataframe target column
train (Bool, default is False) –
select_columns (List) – List of columns to be used
project_path (Str) – Path to where the preprocessed data will be stored
display_inline (Bool. Default is True) –
- Returns
Returns a clean dataframe in the filepath
- Return type
Pandas Dataframe
- slik_wrangler.preprocessing.rename_similar_values(dataframe, column_name, cut_off=0.75, n=None)
Use Sequence Matcher to check for best “good enough” matches.
Rename values based on similar matches.
- Parameters
dataframe (Pandas Series) –
column_name (str.) – Name of pandas column to perform operation on
cut_off (float, default 0.75) – Possibilities that don’t score at least this similar to the word are ignored
n(optional) (int. default 2.) – The maximum number of close matches to return. n must be > 0.
- Returns
- Return type
Pandas Dataframe.
Example
>>> import pandas as pd
>>> df = pd.DataFrame(["Lagos", "Lag", "Abuja", "Abuja FCT", "Ibadan"], columns=["column_name"])
Applying the function to this column yields
["Lagos", "Lagos", "Abuja", "Abuja", "Ibadan"]
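The "good enough" matching mentioned above is the wording used by `difflib.get_close_matches` in the standard library. The sketch below shows one way such renaming could work, assuming the first occurrence of a value becomes the canonical spelling; that rule and the helper `rename_similar` are assumptions, not the library's confirmed behavior. The demo uses a cutoff of 0.7 because difflib scores "Abuja FCT" against "Abuja" at roughly 0.71.

```python
from difflib import get_close_matches


def rename_similar(values, cut_off=0.75):
    """Replace each value with the first earlier value it closely resembles."""
    canonical = []  # spellings accepted so far, in order of first appearance
    renamed = []
    for value in values:
        match = get_close_matches(value, canonical, n=1, cutoff=cut_off)
        if match:
            renamed.append(match[0])
        else:
            canonical.append(value)
            renamed.append(value)
    return renamed


cities = ["Lagos", "Lag", "Abuja", "Abuja FCT", "Ibadan"]
cleaned = rename_similar(cities, cut_off=0.7)
print(cleaned)
```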
- slik_wrangler.preprocessing.trim_all_columns(dataframe)
Trim whitespace from ends of each value across all series in dataframe
- Parameters
dataframe (Pandas dataframe) –
- Returns
- Return type
Pandas Dataframe
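Trimming whitespace from every string value while leaving other types untouched can be sketched like this. The dict-of-lists table and the helper name `trim_all` are assumptions for illustration only.

```python
def trim_all(data):
    """Strip leading and trailing whitespace from string values in every column."""
    return {
        name: [value.strip() if isinstance(value, str) else value for value in values]
        for name, values in data.items()
    }


table = {"city": ["  Lagos", "Abuja  ", " Ibadan "], "population": [15, 3, 4]}
trimmed = trim_all(table)
print(trimmed)
```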
slik_wrangler.utils
- class slik_wrangler.utils.HiddenPrints
Bases:
object
Hide prints of a function
- Parameters
None –
- Returns
None
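One common way to hide the prints of a function is a context manager that temporarily redirects standard output. The sketch below assumes HiddenPrints is used in a `with` block, which the source does not confirm, and it re-implements the idea rather than reproducing the library's class.

```python
import contextlib
import io
import os


class HiddenPrints:
    """Silence print() calls made inside the with block."""

    def __enter__(self):
        self._devnull = open(os.devnull, "w")
        self._redirect = contextlib.redirect_stdout(self._devnull)
        self._redirect.__enter__()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._redirect.__exit__(exc_type, exc, tb)
        self._devnull.close()
        return False  # do not swallow exceptions


def noisy_sum(a, b):
    print("adding", a, b)
    return a + b


# Capture stdout so the effect of HiddenPrints can be inspected.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    with HiddenPrints():
        total = noisy_sum(2, 3)
    print("visible again")

print(total, repr(captured.getvalue()))
```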
- slik_wrangler.utils.get_scores(y_true, y_pred)
Get metrics of model performance such as accuracy, precision, recall and f1.
- Parameters
y_true – the target value of test/validation data.
y_pred – the predicted value
- Returns
Accuracy, precision, recall and f1
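For reference, the four metrics named above can be computed by hand for a binary problem as in this sketch. It assumes a single positive class labelled 1 and is not the library's implementation; `binary_scores` is a hypothetical helper.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for truth, pred in pairs if truth == positive and pred == positive)
    false_pos = sum(1 for truth, pred in pairs if truth != positive and pred == positive)
    false_neg = sum(1 for truth, pred in pairs if truth == positive and pred != positive)

    accuracy = sum(1 for truth, pred in pairs if truth == pred) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```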
- slik_wrangler.utils.load_pickle(fp)
Load pickle file(data, model or pipeline object).
- Parameters
fp – the file path of the pickle files.
- Returns
Loaded pickle file
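Loading a pickled object from a file path typically comes down to the few lines below. This is a minimal sketch using the standard `pickle` module; the file name and the stored dictionary are placeholders, and the function body is an assumption about how such a loader might be written.

```python
import os
import pickle
import tempfile


def load_pickle(fp):
    """Read and return the object stored in the pickle file at fp."""
    with open(fp, "rb") as handle:
        return pickle.load(handle)


# Round trip: store a placeholder object, then load it back.
artifact = {"model": "placeholder", "threshold": 0.5}
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "artifact.pkl")
    with open(path, "wb") as handle:
        pickle.dump(artifact, handle)
    restored = load_pickle(path)

print(restored)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.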
- slik_wrangler.utils.log_plot(args, plot_func, fp)
Log the plots of your metrics and save output in a specified file path.
- Parameters
args – A tuple. Arguments required to plot the required metrics
plot_func – A function. Contains different methods for plotting metrics such as ROC-AUC, PR-Curve
fp – File path. The path to write the output logs of the plot
- Returns
None
- slik_wrangler.utils.print_divider(title)
Expand print function with a clear differentiator using -.
- Parameters
title – the title of the print statement
- Returns
None
- slik_wrangler.utils.store_attribute(dict_file, output_path)
Store attributes of a dataframe as a dict.
- Parameters
dict_file – the dictionary.
output_path – the path where the file is saved
- Returns
None
- slik_wrangler.utils.store_pipeline(pipeline_object, pipeline_path)
Store the column transformer pipeline object.
- Parameters
pipeline_object – the pipeline object.
pipeline_path – the path where the pipeline is saved
- Returns
None