###############################################
Quickstart to ML with Slik-Wrangler
###############################################

Let's dive right in! We'll start with the classic: you have the ``titanic.csv`` file and want to predict whether a passenger survived based on the information about the passenger in that file. For tabular data like this, pandas is our friend. Clearly we need to start by loading our data. slik-wrangler has a ``read_file`` function that reads CSV, Excel, or parquet files:

>>> import pandas as pd
>>> import slik_wrangler
>>> from slik_wrangler import loadfile as lf
>>> titanic = lf.read_file('titanic.csv')
CSV file read successfully
Data has 891 rows and 12 columns

Let's familiarize ourselves with the data a bit: what's the shape, what are the columns, and what do they look like?

>>> titanic.shape
(891, 12)
>>> titanic.head()  # doctest: +ELLIPSIS
   PassengerId  Survived  Pclass                          Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3       Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1    Cumings, Mrs. John Bradley  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3        Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1  Futrelle, Mrs. Jacques Heath  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3      Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
<BLANKLINE>
[5 rows x 12 columns]

So far so good! There's already a lot going on in the data that we can see here. Let's ask slik-wrangler to check whether there are missing values present in the dataset:

>>> slik_wrangler.preprocessing.check_nan(titanic, plot=True, verbose=False)

.. image:: images/missing_values.png
   :align: center

This gives us a lot of information about the percentage of missing values in each column. In this case we might have been able to figure this out quickly from the call to ``head``, but in larger datasets this can be tricky. For example, we can see that there are several dirty columns with ``NaN`` values in them.
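If you want to cross-check ``check_nan``'s summary yourself, plain pandas can compute the same per-column missing-value percentages. A minimal sketch, using a tiny hypothetical stand-in frame rather than the real Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with Titanic-style columns and deliberate gaps
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", None],
})

# Fraction of NaNs per column, scaled to percent and sorted worst-first
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)  # Cabin 75.0, Age 50.0, Embarked 25.0
```

Replacing ``df`` with the ``titanic`` frame loaded above gives the same numbers that ``check_nan`` plots.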
Let's continue with what slik-wrangler does automatically for now. We can handle missing values with slik-wrangler:

>>> slik_wrangler.preprocessing.handle_nan(dataframe=titanic, target_name='Survived', strategy='mean',
...                                        fillna='mode', thresh_x=50, thresh_y=50)
Dropping rows with 50% missing value: Number of records dropped from 891 to 891
Dropping Columns with 50% missing value: ['Cabin']
New data shape is (891, 11)

With slik-wrangler, we can also identify each column present in the data:

>>> slik_wrangler.preprocessing.identify_columns(titanic, 'Survived', id_column='PassengerId', project_path='Titanic')
--------------- Identifying columns present in the data ---------------
Target column is Survived. Attribute in target column: [0, 1]
Features with high cardinality: ['Name', 'Ticket', 'Cabin']
{'cat_feat': ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'],
 'high_card_feat': ['Name', 'Ticket', 'Cabin'],
 'id_column': 'PassengerId',
 'input_columns': ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
 'lower_cat': ['Sex', 'Embarked'],
 'num_feat': ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'],
 'parse_dates': [],
 'target_column': 'Survived'}
Attributes are stored in data/metadata

slik-wrangler can clean your data in one line of code. The ``slik_wrangler.preprocessing.preprocess`` function cleans your data by removing outliers, handling missing values, featurizing datetime columns, and mapping relevant columns. The function saves the preprocessed file in a project path that you specify.
>>> slik_wrangler.preprocessing.preprocess(data=titanic, target_column='Survived', train=True,
...                                        display_inline=True, project_path='./Titanic', logging='display')
The task for preprocessing is classification
Dropping rows with 75% missing value: Number of records dropped is 0
Dropping Columns with 20% missing value: []
New data shape is (891, 11)
--------------- Mapping target columns ---------------
0 was mapped to 0
1 was mapped to 1
--------------- Bucketize Age columns ---------------
Inferred age column: [Age]
--------------- Mapping passed column ---------------
male was mapped to 0
female was mapped to 1
--------------- Creating Schema file ---------------
{'dtype': {'PassengerId': 'int64', 'Pclass': 'int64', 'Name': 'object', 'Age': 'float64',
           'SibSp': 'float64', 'Parch': 'float64', 'Ticket': 'object', 'Fare': 'float64',
           'Embarked': 'object', 'transformed_Survived': 'int64', 'binned_Age': 'object',
           'transformed_Sex': 'int64'}}
Schema file stored in Titanic/data/metadata
--------------- Preview the preprocessed data ---------------
   PassengerId  Pclass                                               Name   Age  SibSp  Parch            Ticket       Fare Embarked  transformed_Survived   binned_Age  transformed_Sex
0            1       3                            Braund, Mr. Owen Harris  22.0    1.0    0.0         A/5 21171   7.250000        S                     0  Young Adult                0
1            2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0    1.0    0.0          PC 17599  32.204208        C                     1      Mid-Age                1
2            3       3                             Heikkinen, Miss. Laina  26.0    0.0    0.0  STON/O2. 3101282   7.925000        S                     1  Young Adult                1
3            4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0    1.0    0.0            113803  53.100000        S                     1      Mid-Age                1
4            5       3                           Allen, Mr. William Henry  35.0    0.0    0.0            373450   8.050000        S                     0      Mid-Age                0
--------------- Preprocessed data saved ---------------
Input data preprocessed successfully and stored in ./Titanic/data/train_data.pkl
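From here, the preprocessed frame is ready for modelling. A minimal sketch of the next step: in practice you would load the pickle that ``preprocess`` saved (e.g. with ``pd.read_pickle('./Titanic/data/train_data.pkl')``); the frame below is a tiny hypothetical stand-in with the transformed target column, used to compute a majority-class baseline that any real classifier should beat.

```python
import pandas as pd

# In practice: train = pd.read_pickle('./Titanic/data/train_data.pkl')
# Hypothetical stand-in mirroring the transformed columns shown above
train = pd.DataFrame({
    "transformed_Sex": [0, 1, 1, 0, 0],
    "transformed_Survived": [0, 1, 1, 1, 0],
})

X = train.drop(columns=["transformed_Survived"])
y = train["transformed_Survived"]

# Majority-class baseline: the accuracy of always predicting the most
# frequent target value -- the floor any trained model must clear
baseline = y.value_counts(normalize=True).max()
print(f"majority-class baseline accuracy: {baseline:.2f}")  # 0.60 here
```

Fitting an actual model (a scikit-learn classifier, for instance) on ``X`` and ``y`` then proceeds exactly as with any other pandas DataFrame.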