{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## slik-wrangler Preprocessing API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A brief guide to using the preprocessing module in the slik-wrangler package. This sample notebook walks through some of the most important functions in the preprocessing module." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "dataset_path = 'data/titanic.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With slik-wrangler's load file module, you only need to specify the path to your data. \n", "\n", "slik-wrangler infers the type of the file that was passed and reads it as a pandas DataFrame.\n", "\n", "The `slik_wrangler.loadfile.read_file` function accepts the same keyword arguments as the pandas read functions. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from slik_wrangler import loadfile as lf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When it reads the file, slik-wrangler prints a brief summary of the rows and columns that were loaded." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m\n", "CSV file read sucessfully\n", "\u001b[36m\n", "Data has 891 rows and 12 columns\n" ] } ], "source": [ "train = lf.read_file(dataset_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are working with a CSV file that is too large to open in Excel or load with pandas, slik-wrangler can split it into multiple smaller CSV files. \n", "\n", "Specify the number of rows that each output CSV file should contain." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "lf.split_csv_file(dataset_path,row_limit=200)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before any step in preprocessing your dataset, it is essential to look at the overall state of your dataset. \n", "\n", "Pandas provide a means to achieve this by its `info()` method, which gives us an insight into the missing values, data type, data size, and data memory usage.\n", "\n", "While this is useful, with Slik-wrangler you could quickly get an overview of all you need to adjust to make a balanced dataset. By balanced dataset here, we're implying a dataset void of missing values, duplicate values, and inconsistency in the data type of one or more feature columns.\n", "\n", "Slik-wrangler provides a data quality assessment module (`slik-wrangler.dqa`) for this purpose entirely. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from slik_wrangler.dqa import data_cleanness_assessment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the `slik-wrangler.dqa.data_cleanness_assessment` You could get a general overview of how balanced your dataset is." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[36mChecking for missing values\n", "\n", "\u001b[33mDataframe contains missing values that you should address. \n", "\n", "columns=['Age', 'Cabin', 'Embarked']\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " missing_counts missing_percent\n", "features \n", "Age 177 19.9\n", "Cabin 687 77.1\n", "Embarked 2 0.2" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "\n", "\u001b[36mChecking for duplicate variables\n", "\n", "\u001b[32mNo duplicate values in both rows and columns!!!\n", "\u001b[39m\n", "\n", "\u001b[36mChecking for inconsistent values\n", "\n", "\u001b[32mNo inconsistent feature columns values!!!\n", "\u001b[39m\n", "\n" ] } ], "source": [ "data_cleanness_assessment(train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Knowing this, you can proceed to preprocess your dataset using the `slik-wrangler.preprocessing` module." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from slik_wrangler import preprocessing as pp" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.check_datefield(train,'Ticket')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 braund, mr. owen harris male 22.0 1 \n", "1 cumings, mrs. john bradley (florence briggs th... female 38.0 1 \n", "2 heikkinen, miss. laina female 26.0 0 \n", "3 futrelle, mrs. jacques heath (lily may peel) female 35.0 1 \n", "4 allen, mr. william henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.change_case(train,'Name','lower').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "slik-wrangler will identify the data type of each data point, data points with high cardinality and save it in a file. With slik-wrangler, data integrity can be done efficiently to validate downstream data points" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Identifying columns present in the data ---------------\n", "\n", "Target column is Survived. 
Attribute in target column:[0, 1]\n", "\n", "Features with high cardinality:['Name', 'Ticket', 'Cabin']\n", "\n", "{'cat_feat': ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'],\n", " 'high_card_feat': ['Name', 'Ticket', 'Cabin'],\n", " 'id_column': 'PassengerId',\n", " 'input_columns': ['Pclass',\n", " 'Name',\n", " 'Sex',\n", " 'Age',\n", " 'SibSp',\n", " 'Parch',\n", " 'Ticket',\n", " 'Fare',\n", " 'Cabin',\n", " 'Embarked'],\n", " 'lower_cat': ['Sex', 'Embarked'],\n", " 'num_feat': ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'],\n", " 'parse_dates': [],\n", " 'target_column': 'Survived'}\n", "\n", "Attributes are stored in data\\metadata\n", "\n" ] } ], "source": [ "pp.identify_columns(train,'Survived','PassengerId',project_path='./data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With slik-wrangler you can map the values of a column to new values by passing a dictionary that pairs each existing value with the value it should be mapped to. In this example the mapped values appear in a new `transformed_Sex` column." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Mapping passed column ---------------\n", "\n", "male was mapped to 1\n", "\n", "female was mapped to 0\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked transformed_Sex \n", "0 0 A/5 21171 7.2500 NaN S 1 \n", "1 0 PC 17599 71.2833 C85 C 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 \n", "3 0 113803 53.1000 C123 S 0 \n", "4 0 373450 8.0500 NaN S 1 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.map_column(train,column_name='Sex',items={'male':1,'female':0}).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "slik-wrangler currently relies on the Interquartile range approach to detect outliers present in a data point. slik-wrangler also fixes the outlier present in the data using different methods like replacing an outlier with the mean of the data point. You can also select the numerical features you want to perform the operation on. You can also display a table identifying at least 'n' outliers in a row." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Table identifying at least 2 outliers in a row ---------------\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Name Sex \\\n", "745 746 0 1 Crosby, Capt. Edward Gifford male \n", "27 28 0 1 Fortune, Mr. Charles Alexander male \n", "88 89 1 1 Fortune, Miss. Mabel Helen female \n", "159 160 0 3 Sage, Master. Thomas Henry male \n", "180 181 0 3 Sage, Miss. Constance Gladys female \n", "201 202 0 3 Sage, Mr. Frederick male \n", "324 325 0 3 Sage, Mr. George John Jr male \n", "341 342 1 1 Fortune, Miss. Alice Elizabeth female \n", "792 793 0 3 Sage, Miss. Stella Anna female \n", "846 847 0 3 Sage, Mr. Douglas Bullen male \n", "863 864 0 3 Sage, Miss. Dorothy Edith \"Dolly\" female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "745 70.0 1 1 WE/P 5735 71.00 B22 S \n", "27 19.0 3 2 19950 263.00 C23 C25 C27 S \n", "88 23.0 3 2 19950 263.00 C23 C25 C27 S \n", "159 NaN 8 2 CA. 2343 69.55 NaN S \n", "180 NaN 8 2 CA. 2343 69.55 NaN S \n", "201 NaN 8 2 CA. 2343 69.55 NaN S \n", "324 NaN 8 2 CA. 2343 69.55 NaN S \n", "341 24.0 3 2 19950 263.00 C23 C25 C27 S \n", "792 NaN 8 2 CA. 2343 69.55 NaN S \n", "846 NaN 8 2 CA. 2343 69.55 NaN S \n", "863 NaN 8 2 CA. 2343 69.55 NaN S " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "det = pp.detect_fix_outliers(train,target_column='Survived',n=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have seen with the data quality assessment module, with slik you can also check the mssing values in your data and even plot a percentage distribution to see the top 30 missing values in your dataset" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Count and Percentage of missing value ---------------\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " features missing_counts missing_percent\n", "0 PassengerId 0 0.0\n", "1 Survived 0 0.0\n", "2 Pclass 0 0.0\n", "3 Name 0 0.0\n", "4 Sex 0 0.0\n", "5 Age 177 19.9\n", "6 SibSp 0 0.0\n", "7 Parch 0 0.0\n", "8 Ticket 0 0.0\n", "9 Fare 0 0.0\n", "10 Cabin 687 77.1\n", "11 Embarked 2 0.2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pp.check_nan(train,plot=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "slik-wrangler helps to handle the missing values in your data intelligently and efficiently. You can choose a strategy to handle your numerical features \n", "\n", "and pass a value for fillna params to handle your categorical features or fill it with the mode by default. \n", "\n", "You can also drop missing values across the rows and columns using threshold parameters." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Dropping rows with 75% missing value: Number of records dropped is 0\n", "\n", "Dropping Columns with 50% missing value: ['Cabin']\n", "\n", "New data shape is (891, 11)\n" ] } ], "source": [ "data = pp.handle_nan(dataframe=train,target_name='Survived',strategy='mean',fillna='mode',\n", " drop_outliers=True,thresh_x=75,thresh_y=50,display_inline=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beyond slik-wrangler preprocessing abilities, you can also engineer new features intelligently. \n", "slik-wrangler can help you bin/discretize your age column intelligently and creating new data points with the transformations" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age \\\n", "0 Braund, Mr. Owen Harris male 22.000000 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.000000 \n", "2 Heikkinen, Miss. Laina female 26.000000 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.000000 \n", "4 Allen, Mr. William Henry male 35.000000 \n", ".. ... ... ... \n", "886 Montvila, Rev. Juozas male 27.000000 \n", "887 Graham, Miss. Margaret Edith female 19.000000 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female 29.081737 \n", "889 Behr, Mr. Karl Howell male 26.000000 \n", "890 Dooley, Mr. Patrick male 32.000000 \n", "\n", " SibSp Parch Ticket Fare Embarked binned_Age \n", "0 1.0 0.000000 A/5 21171 7.250000 S Young Adult \n", "1 1.0 0.000000 PC 17599 32.204208 C Mid-Age \n", "2 0.0 0.000000 STON/O2. 3101282 7.925000 S Young Adult \n", "3 1.0 0.000000 113803 53.100000 S Mid-Age \n", "4 0.0 0.000000 373450 8.050000 S Mid-Age \n", ".. ... ... ... ... ... ... \n", "886 0.0 0.000000 211536 13.000000 S Young Adult \n", "887 0.0 0.000000 112053 30.000000 S Young Adult \n", "888 1.0 0.381594 W./C. 
6607 23.450000 S Young Adult \n", "889 0.0 0.000000 111369 30.000000 C Young Adult \n", "890 0.0 0.000000 370376 7.750000 Q Mid-Age \n", "\n", "[891 rows x 12 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.bin_age(data,'Age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With slik-wrangler you can infer the schema of your pandas DataFrame and save the schema file in a project path you define." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Creating Schema file ---------------\n", "\n" ] }, { "data": { "text/plain": [ "{'dtype': {'PassengerId': 'int64',\n", " 'Survived': 'int64',\n", " 'Pclass': 'int64',\n", " 'Name': 'object',\n", " 'Sex': 'object',\n", " 'Age': 'float64',\n", " 'SibSp': 'int64',\n", " 'Parch': 'int64',\n", " 'Ticket': 'object',\n", " 'Fare': 'float64',\n", " 'Cabin': 'object',\n", " 'Embarked': 'object'}}" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Infer and display the schema of train (called here with save=False)\n", "pp.create_schema_file(train,target_column='Survived',id_column='PassengerId',save=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With slik-wrangler you can also drop uninformative fields from your pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Dropping uninformative fields ---------------\n", "\n", "uninformative fields dropped: []\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas male 27.0 0 \n", "887 Graham, Miss. Margaret Edith female 19.0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n", "889 Behr, Mr. Karl Howell male 26.0 0 \n", "890 Dooley, Mr. Patrick male 32.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", ".. ... ... ... ... ... \n", "886 0 211536 13.0000 NaN S \n", "887 0 112053 30.0000 B42 S \n", "888 2 W./C. 6607 23.4500 NaN S \n", "889 0 111369 30.0000 C148 C \n", "890 0 370376 7.7500 NaN Q \n", "\n", "[891 rows x 12 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.drop_uninformative_fields(train,exclude='Parch')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slik helps you manage your data points better by handling different management operations techniques on pandas dataframe based on columns. \n", "\n", "Operations include selecting of columns, dropping column and dropping duplicates. 
You select the list of columns to operate on and choose the particular operation you want to apply. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Dropping duplicates across the columns ---------------\n", "\n", "New datashape is (891, 12)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.manage_columns(train,['PassengerId'],drop_duplicates='columns').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trim whitespaces from ends of each value across all data points in a pandas dataframe" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas male 27.0 0 \n", "887 Graham, Miss. Margaret Edith female 19.0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n", "889 Behr, Mr. Karl Howell male 26.0 0 \n", "890 Dooley, Mr. Patrick male 32.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", ".. ... ... ... ... ... \n", "886 0 211536 13.0000 NaN S \n", "887 0 112053 30.0000 B42 S \n", "888 2 W./C. 6607 23.4500 NaN S \n", "889 0 111369 30.0000 C148 C \n", "890 0 370376 7.7500 NaN Q \n", "\n", "[891 rows x 12 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.trim_all_columns(train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slik-wrangler can clean your data in one line of code. The `slik_wrangler.preprocessing.preprocess` function cleans your data\n", "by removing outliers, handling missing values, featurizing datetime columns, and mapping relevant columns.\n", "The function saves the preprocessed file in a project path that you specify." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The task for preprocessing is classification\n", "\n", "Dropping rows with 75% missing value: Number of records dropped is 0\n", "\n", "Dropping Columns with 75% missing value: ['Cabin']\n", "\n", "New data shape is (891, 11)\n", "\u001b[39m\n", "--------------- Mapping target columns ---------------\n", "\n", "0 was mapped to 0\n", "\n", "1 was mapped to 1\n", "\n", "\u001b[39m\n", "--------------- Bucketize Age columns ---------------\n", "\n", " Inferred age column: [Age]\n", "\u001b[39m\n", "--------------- Mapping passed column ---------------\n", "\n", "male was mapped to 0\n", "\n", "female was mapped to 1\n", "\n", "\u001b[39m\n", "--------------- Dropping uninformative fields ---------------\n", "\n", "uninformative fields dropped: []\n", "\u001b[39m\n", "--------------- Creating Schema file ---------------\n", "\n" ] }, { "data": { "text/plain": [ "{'dtype': {'PassengerId': 'int64',\n", " 'Pclass': 'int64',\n", " 'Name': 'object',\n", " 'Age': 'float64',\n", " 'SibSp': 'float64',\n", " 'Parch': 'float64',\n", " 'Ticket': 'object',\n", " 'Fare': 'float64',\n", " 'Embarked': 'object',\n", " 'transformed_Survived': 'int64',\n", " 'binned_Age': 'object',\n", " 'transformed_Sex': 'int64'}}" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Schema file stored in Titanic\\data\\metadata\n", "\u001b[39m\n", "--------------- Preview the preprocessed data ---------------\n", "\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>PassengerId</th><th>Pclass</th><th>Name</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Embarked</th><th>transformed_Survived</th><th>binned_Age</th><th>transformed_Sex</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>22.0</td><td>1.0</td><td>0.0</td><td>A/5 21171</td><td>7.250000</td><td>S</td><td>0</td><td>Young Adult</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>38.0</td><td>1.0</td><td>0.0</td><td>PC 17599</td><td>32.204208</td><td>C</td><td>1</td><td>Mid-Age</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>26.0</td><td>0.0</td><td>0.0</td><td>STON/O2. 3101282</td><td>7.925000</td><td>S</td><td>1</td><td>Young Adult</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>35.0</td><td>1.0</td><td>0.0</td><td>113803</td><td>53.100000</td><td>S</td><td>1</td><td>Mid-Age</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>3</td><td>Allen, Mr. William Henry</td><td>35.0</td><td>0.0</td><td>0.0</td><td>373450</td><td>8.050000</td><td>S</td><td>0</td><td>Mid-Age</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " PassengerId Pclass Name \\\n", "0 1 3 Braund, Mr. Owen Harris \n", "1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... \n", "2 3 3 Heikkinen, Miss. Laina \n", "3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) \n", "4 5 3 Allen, Mr. William Henry \n", "\n", " Age SibSp Parch Ticket Fare Embarked \\\n", "0 22.0 1.0 0.0 A/5 21171 7.250000 S \n", "1 38.0 1.0 0.0 PC 17599 32.204208 C \n", "2 26.0 0.0 0.0 STON/O2. 3101282 7.925000 S \n", "3 35.0 1.0 0.0 113803 53.100000 S \n", "4 35.0 0.0 0.0 373450 8.050000 S \n", "\n", " transformed_Survived binned_Age transformed_Sex \n", "0 0 Young Adult 0 \n", "1 1 Mid-Age 1 \n", "2 1 Young Adult 1 \n", "3 1 Mid-Age 1 \n", "4 0 Mid-Age 0 " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[39m\n", "--------------- Preprocessed data saved ---------------\n", "\n", "\n", " Input data preprocessed successfully and stored in ./Titanic\\data\\train_data.pkl\n", "\n" ] } ], "source": [ "pp.preprocess(data=train,target_column='Survived',train=True,display_inline=True,project_path='./Titanic',logging='display')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }