slik-wrangler Preprocessing API
Brief description on how to use the preprocessing module in the slik-wrangler package. This sample notebook explains some very important methods in the preprocessing module
[1]:
dataset_path = 'data/titanic.csv'
Using Slik-wrangler load file module you only need to specify your data path.
Slik-wrangler can infer the file type that was passed and read it as a pandas dataframe.
slik-wrangler.loadfile.read_file
function makes use of the same keyword arguments as pandas read functions.
[2]:
from slik_wrangler import loadfile as lf
You can get a brief summary of the rows and column that was loaded by Slik-wrangler
[3]:
train = lf.read_file(dataset_path)
CSV file read sucessfully
Data has 891 rows and 12 columns
Working with a large csv file and you can not load the whole data to Excel or with pandas, with slik-wrangler you can split a csv into multiple csv files.
Specify the number of rows that should be present in each csv file
[4]:
lf.split_csv_file(dataset_path,row_limit=200)
[5]:
train.head()
[5]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Before any step in preprocessing your dataset, it is essential to look at the overall state of your dataset.
Pandas provide a means to achieve this by its info()
method, which gives us an insight into the missing values, data type, data size, and data memory usage.
While this is useful, with Slik-wrangler you could quickly get an overview of all you need to adjust to make a balanced dataset. By balanced dataset here, we’re implying a dataset void of missing values, duplicate values, and inconsistency in the data type of one or more feature columns.
Slik-wrangler provides a data quality assessment module (slik-wrangler.dqa
) for this purpose entirely.
[6]:
from slik_wrangler.dqa import data_cleanness_assessment
With the slik-wrangler.dqa.data_cleanness_assessment
You could get a general overview of how balanced your dataset is.
[7]:
data_cleanness_assessment(train)
Checking for missing values
Dataframe contains missing values that you should address.
columns=['Age', 'Cabin', 'Embarked']
missing_counts | missing_percent | |
---|---|---|
features | ||
Age | 177 | 19.9 |
Cabin | 687 | 77.1 |
Embarked | 2 | 0.2 |
Checking for duplicate variables
No duplicate values in both rows and columns!!!
Checking for inconsistent values
No inconsistent feature columns values!!!
Knowing this, you can proceed to preprocess your dataset using the slik-wrangler.preprocessing
module.
[8]:
from slik_wrangler import preprocessing as pp
[9]:
pp.check_datefield(train,'Ticket')
[9]:
False
[10]:
pp.change_case(train,'Name','lower').head()
[10]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | braund, mr. owen harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | cumings, mrs. john bradley (florence briggs th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | heikkinen, miss. laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | futrelle, mrs. jacques heath (lily may peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | allen, mr. william henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
slik-wrangler will identify the data type of each data point, data points with high cardinality and save it in a file. With slik-wrangler, data integrity can be done efficiently to validate downstream data points
[11]:
pp.identify_columns(train,'Survived','PassengerId',project_path='./data')
--------------- Identifying columns present in the data ---------------
Target column is Survived. Attribute in target column:[0, 1]
Features with high cardinality:['Name', 'Ticket', 'Cabin']
{'cat_feat': ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'],
'high_card_feat': ['Name', 'Ticket', 'Cabin'],
'id_column': 'PassengerId',
'input_columns': ['Pclass',
'Name',
'Sex',
'Age',
'SibSp',
'Parch',
'Ticket',
'Fare',
'Cabin',
'Embarked'],
'lower_cat': ['Sex', 'Embarked'],
'num_feat': ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'],
'parse_dates': [],
'target_column': 'Survived'}
Attributes are stored in data\metadata
You can map your data observations more efficiently with Slik by passing the dictionary of the observation you want to map/rename
[12]:
pp.map_column(train,column_name='Sex',items={'male':1,'female':0}).head()
--------------- Mapping passed column ---------------
male was mapped to 1
female was mapped to 0
[12]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | transformed_Sex | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
slik-wrangler currently relies on the Interquartile range approach to detect outliers present in a data point. slik-wrangler also fixes the outlier present in the data using different methods like replacing an outlier with the mean of the data point. You can also select the numerical features you want to perform the operation on. You can also display a table identifying at least ‘n’ outliers in a row.
[13]:
det = pp.detect_fix_outliers(train,target_column='Survived',n=2)
--------------- Table identifying at least 2 outliers in a row ---------------
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
745 | 746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.00 | B22 | S |
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
As we have seen with the data quality assessment module, with slik you can also check the mssing values in your data and even plot a percentage distribution to see the top 30 missing values in your dataset
[14]:
pp.check_nan(train,plot=True)

--------------- Count and Percentage of missing value ---------------
features | missing_counts | missing_percent | |
---|---|---|---|
0 | PassengerId | 0 | 0.0 |
1 | Survived | 0 | 0.0 |
2 | Pclass | 0 | 0.0 |
3 | Name | 0 | 0.0 |
4 | Sex | 0 | 0.0 |
5 | Age | 177 | 19.9 |
6 | SibSp | 0 | 0.0 |
7 | Parch | 0 | 0.0 |
8 | Ticket | 0 | 0.0 |
9 | Fare | 0 | 0.0 |
10 | Cabin | 687 | 77.1 |
11 | Embarked | 2 | 0.2 |
slik-wrangler helps to handle the missing values in your data intelligently and efficiently. You can choose a strategy to handle your numerical features
and pass a value for fillna params to handle your categorical features or fill it with the mode by default.
You can also drop missing values across the rows and columns using threshold parameters.
[15]:
data = pp.handle_nan(dataframe=train,target_name='Survived',strategy='mean',fillna='mode',
drop_outliers=True,thresh_x=75,thresh_y=50,display_inline=True)
Dropping rows with 75% missing value: Number of records dropped is 0
Dropping Columns with 50% missing value: ['Cabin']
New data shape is (891, 11)
Beyond slik-wrangler preprocessing abilities, you can also engineer new features intelligently. slik-wrangler can help you bin/discretize your age column intelligently and creating new data points with the transformations
[16]:
pp.bin_age(data,'Age')
[16]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | binned_Age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | 1.0 | 0.000000 | A/5 21171 | 7.250000 | S | Young Adult |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | 1.0 | 0.000000 | PC 17599 | 32.204208 | C | Mid-Age |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | 0.0 | 0.000000 | STON/O2. 3101282 | 7.925000 | S | Young Adult |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 1.0 | 0.000000 | 113803 | 53.100000 | S | Mid-Age |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 0.0 | 0.000000 | 373450 | 8.050000 | S | Mid-Age |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.000000 | 0.0 | 0.000000 | 211536 | 13.000000 | S | Young Adult |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.000000 | 0.0 | 0.000000 | 112053 | 30.000000 | S | Young Adult |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 29.081737 | 1.0 | 0.381594 | W./C. 6607 | 23.450000 | S | Young Adult |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.000000 | 0.0 | 0.000000 | 111369 | 30.000000 | C | Young Adult |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.000000 | 0.0 | 0.000000 | 370376 | 7.750000 | Q | Mid-Age |
891 rows × 12 columns
with Slik you can infer the schema of your pandas dataframe and save the schema file in a project path you define
[17]:
# import ,yaml
pp.create_schema_file(train,target_column='Survived',id_column='PassengerId',save=False)
--------------- Creating Schema file ---------------
{'dtype': {'PassengerId': 'int64',
'Survived': 'int64',
'Pclass': 'int64',
'Name': 'object',
'Sex': 'object',
'Age': 'float64',
'SibSp': 'int64',
'Parch': 'int64',
'Ticket': 'object',
'Fare': 'float64',
'Cabin': 'object',
'Embarked': 'object'}}
with Slik you can also drop uninformative field in your pandas dataframe
[18]:
pp.drop_uninformative_fields(train,exclude='Parch')
--------------- Dropping uninformative fields ---------------
uninformative fields dropped: []
[18]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
Slik helps you manage your data points better by handling different management operations techniques on pandas dataframe based on columns.
Operations include selecting of columns, dropping column and dropping duplicates. By selecting the list data points that you need to perform the transformation on and choosing the particular transformation you want
[19]:
pp.manage_columns(train,['PassengerId'],drop_duplicates='columns').head()
--------------- Dropping duplicates across the columns ---------------
New datashape is (891, 12)
[19]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Trim whitespaces from ends of each value across all data points in a pandas dataframe
[20]:
pp.trim_all_columns(train)
[20]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
Slik can clean your data in one line of code. The slik.preprocessing.preprocess
function cleans your data by removing outliers present in the data, handing missing values, featurizing datetime columns, and mapping relevant columns. The functions saves the preprocessed file in a project path that you specify.
[21]:
pp.preprocess(data=train,target_column='Survived',train=True,display_inline=True,project_path='./Titanic',logging='display')
The task for preprocessing is classification
Dropping rows with 75% missing value: Number of records dropped is 0
Dropping Columns with 75% missing value: ['Cabin']
New data shape is (891, 11)
--------------- Mapping target columns ---------------
0 was mapped to 0
1 was mapped to 1
--------------- Bucketize Age columns ---------------
Inferred age column: [Age]
--------------- Mapping passed column ---------------
male was mapped to 0
female was mapped to 1
--------------- Dropping uninformative fields ---------------
uninformative fields dropped: []
--------------- Creating Schema file ---------------
{'dtype': {'PassengerId': 'int64',
'Pclass': 'int64',
'Name': 'object',
'Age': 'float64',
'SibSp': 'float64',
'Parch': 'float64',
'Ticket': 'object',
'Fare': 'float64',
'Embarked': 'object',
'transformed_Survived': 'int64',
'binned_Age': 'object',
'transformed_Sex': 'int64'}}
Schema file stored in Titanic\data\metadata
--------------- Preview the preprocessed data ---------------
PassengerId | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Embarked | transformed_Survived | binned_Age | transformed_Sex | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | Braund, Mr. Owen Harris | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.250000 | S | 0 | Young Adult | 0 |
1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1.0 | 0.0 | PC 17599 | 32.204208 | C | 1 | Mid-Age | 1 |
2 | 3 | 3 | Heikkinen, Miss. Laina | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.925000 | S | 1 | Young Adult | 1 |
3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1.0 | 0.0 | 113803 | 53.100000 | S | 1 | Mid-Age | 1 |
4 | 5 | 3 | Allen, Mr. William Henry | 35.0 | 0.0 | 0.0 | 373450 | 8.050000 | S | 0 | Mid-Age | 0 |
--------------- Preprocessed data saved ---------------
Input data preprocessed successfully and stored in ./Titanic\data\train_data.pkl
[ ]: