Unleash the power of Python for your data analysis projects with For Dummies!
Python is the preferred programming language for data scientists and combines the best features of Matlab, Mathematica, and R into libraries specific to data analysis and visualization. Python for Data Science For Dummies shows you how to take advantage of Python programming to acquire, organize, process, and analyze large amounts of information and use basic statistics concepts to identify trends and patterns. You’ll get familiar with the Python development environment, manipulate data, design compelling visualizations, and solve scientific computing challenges as you work your way through this user-friendly guide.
- Covers the fundamentals of Python data analysis programming and statistics to help you build a solid foundation in data science concepts like probability, random distributions, hypothesis testing, and regression models
- Explains objects, functions, modules, and libraries and their role in data analysis
- Walks you through some of the most widely used libraries, including NumPy, SciPy, BeautifulSoup, Pandas, and MatPlotLib
Whether you’re new to data analysis or just new to Python, Python for Data Science For Dummies is your practical guide to getting a grip on data overload and doing interesting things with the oodles of information you uncover.
Part I: Getting Started with Python for Data Science....... 7
Chapter 1: Discovering the Match between Data Science and Python . 9
Defining the Sexiest Job of the 21st Century...............................................11
Considering the emergence of data science.....................................11
Outlining the core competencies of a data scientist........................12
Linking data science and big data......................................................13
Understanding the role of programming...........................................13
Creating the Data Science Pipeline...............................................................14
Preparing the data................................................................................14
Performing exploratory data analysis................................................15
Learning from data...............................................................................15
Visualizing..............................................................................................15
Obtaining insights and data products................................................15
Understanding Python’s Role in Data Science............................................16
Considering the shifting profile of data scientists............................16
Working with a multipurpose, simple, and efficient language........17
Learning to Use Python Fast.........................................................................18
Loading data..........................................................................................18
Training a model...................................................................................18
Viewing a result.....................................................................................20
Chapter 2: Introducing Python’s Capabilities and Wonders . 21
Why Python?...................................................................................................22
Grasping Python’s core philosophy...................................................23
Discovering present and future development goals........................23
Working with Python.....................................................................................24
Getting a taste of the language............................................................24
Understanding the need for indentation...........................................25
Working at the command line or in the IDE......................................25
Performing Rapid Prototyping and Experimentation................................29
Considering Speed of Execution...................................................................30
Visualizing Power...........................................................................................32
Using the Python Ecosystem for Data Science...........................................33
Accessing scientific tools using SciPy................................................33
Performing fundamental scientific computing using NumPy..........34
Performing data analysis using pandas.............................................34
Implementing machine learning using Scikit‐learn...........................35
Plotting the data using matplotlib......................................................35
Parsing HTML documents using Beautiful Soup...............................35
Chapter 3: Setting Up Python for Data Science . 37
Considering the Off‐the‐Shelf Cross‐Platform Scientific Distributions................................................................................................38
Getting Continuum Analytics Anaconda............................................39
Getting Enthought Canopy Express...................................................40
Getting pythonxy..................................................................................40
Getting WinPython................................................................................41
Installing Anaconda on Windows.................................................................41
Installing Anaconda on Linux........................................................................45
Installing Anaconda on Mac OS X.................................................................46
Downloading the Datasets and Example Code...........................................47
Using IPython Notebook......................................................................47
Defining the code repository...............................................................48
Understanding the datasets used in this book.................................54
Chapter 4: Reviewing Basic Python . 57
Working with Numbers and Logic................................................................59
Performing variable assignments.......................................................60
Doing arithmetic...................................................................................61
Comparing data using Boolean expressions.....................................62
Creating and Using Strings............................................................................65
Interacting with Dates....................................................................................66
Creating and Using Functions.......................................................................68
Creating reusable functions................................................................68
Calling functions in a variety of ways.................................................70
Using Conditional and Loop Statements.....................................................73
Making decisions using the if statement............................................73
Choosing between multiple options using nested decisions..........74
Performing repetitive tasks using for.................................................75
Using the while statement...................................................................76
Storing Data Using Sets, Lists, and Tuples..................................................77
Performing operations on sets............................................................77
Working with lists.................................................................................78
Creating and using Tuples...................................................................80
Defining Useful Iterators................................................................................81
Indexing Data Using Dictionaries..................................................................82
Chapter 5: Working with Real Data . 85
Uploading, Streaming, and Sampling Data..................................................86
Uploading small amounts of data into memory................................87
Streaming large amounts of data into memory.................................88
Sampling data........................................................................................89
Accessing Data in Structured Flat‐File Form...............................................90
Reading from a text file........................................................................91
Reading CSV delimited format............................................................92
Reading Excel and other Microsoft Office files.................................94
Sending Data in Unstructured File Form.....................................................95
Managing Data from Relational Databases..................................................98
Interacting with Data from NoSQL Databases..........................................100
Accessing Data from the Web.....................................................................101
Chapter 6: Conditioning Your Data . 105
Juggling between NumPy and pandas.......................................................106
Knowing when to use NumPy............................................................106
Knowing when to use pandas............................................................106
Validating Your Data....................................................................................107
Figuring out what’s in your data.......................................................108
Removing duplicates..........................................................................109
Creating a data map and data plan...................................................110
Manipulating Categorical Variables...........................................................112
Creating categorical variables..........................................................113
Renaming levels..................................................................................114
Combining levels.................................................................................115
Dealing with Dates in Your Data.................................................................116
Formatting date and time values......................................................117
Using the right time transformation.................................................117
Dealing with Missing Data...........................................................................118
Finding the missing data....................................................................119
Encoding missingness........................................................................119
Imputing missing data........................................................................120
Slicing and Dicing: Filtering and Selecting Data........................................122
Slicing rows..........................................................................................122
Slicing columns...................................................................................123
Dicing....................................................................................................123
Concatenating and Transforming...............................................................124
Adding new cases and variables.......................................................125
Removing data.....................................................................................126
Sorting and shuffling...........................................................................127
Aggregating Data at Any Level....................................................................128
Chapter 7: Shaping Data . 131
Working with HTML Pages..........................................................................132
Parsing XML and HTML.....................................................................132
Using XPath for data extraction........................................................133
Working with Raw Text................................................................................134
Dealing with Unicode.........................................................................134
Stemming and removing stop words................................................136
Introducing regular expressions.......................................................137
Using the Bag of Words Model and Beyond..............................................140
Understanding the bag of words model...........................................141
Working with n‐grams........................................................................142
Implementing TF‐IDF transformations.............................................144
Working with Graph Data............................................................................145
Understanding the adjacency matrix...............................................146
Using NetworkX basics......................................................................146
Chapter 8: Putting What You Know in Action 149
Contextualizing Problems and Data...........................................................150
Evaluating a data science problem...................................................151
Researching solutions........................................................................151
Formulating a hypothesis..................................................................152
Preparing your data............................................................................153
Considering the Art of Feature Creation...................................................153
Defining feature creation...................................................................153
Combining variables...........................................................................154
Understanding binning and discretization......................................155
Using indicator variables...................................................................155
Transforming distributions...............................................................156
Performing Operations on Arrays..............................................................156
Using vectorization.............................................................................157
Performing simple arithmetic on vectors and matrices................157
Performing matrix vector multiplication.........................................158
Performing matrix multiplication.....................................................159
Part III: Visualizing the Invisible................................ 161
Chapter 9: Getting a Crash Course in MatPlotLib 163
Starting with a Graph...................................................................................164
Defining the plot..................................................................................164
Drawing multiple lines and plots......................................................165
Saving your work................................................................................165
Setting the Axis, Ticks, Grids......................................................................166
Getting the axes..................................................................................167
Formatting the axes............................................................................167
Adding grids........................................................................................168
Defining the Line Appearance.....................................................................169
Working with line styles.....................................................................170
Using colors.........................................................................................170
Adding markers...................................................................................172
Using Labels, Annotations, and Legends...................................................173
Adding labels.......................................................................................174
Annotating the chart..........................................................................174
Creating a legend................................................................................175
Chapter 10: Visualizing the Data . 179
Choosing the Right Graph...........................................................................180
Showing parts of a whole with pie charts........................................180
Creating comparisons with bar charts............................................181
Showing distributions using histograms.........................................183
Depicting groups using box plots.....................................................184
Seeing data patterns using scatterplots..........................................185
Creating Advanced Scatterplots.................................................................187
Depicting groups.................................................................................187
Showing correlations..........................................................................188
Plotting Time Series.....................................................................................189
Representing time on axes................................................................190
Plotting trends over time...................................................................191
Plotting Geographical Data.........................................................................193
Visualizing Graphs........................................................................................195
Developing undirected graphs..........................................................195
Developing directed graphs..............................................................197
Chapter 11: Understanding the Tools . 199
Using the IPython Console..........................................................................200
Interacting with screen text..............................................................200
Changing the window appearance...................................................202
Getting Python help............................................................................203
Getting IPython help...........................................................................205
Using magic functions........................................................................205
Discovering objects............................................................................207
Using IPython Notebook..............................................................................208
Working with styles............................................................................208
Restarting the kernel..........................................................................210
Restoring a checkpoint......................................................................210
Performing Multimedia and Graphic Integration.....................................212
Embedding plots and other images..................................................212
Loading examples from online sites.................................................212
Obtaining online graphics and multimedia.....................................212
Part IV: Wrangling Data............................................ 215
Chapter 12: Stretching Python’s Capabilities . 217
Playing with Scikit‐learn..............................................................................218
Understanding classes in Scikit‐learn..............................................218
Defining applications for data science.............................................219
Performing the Hashing Trick.....................................................................222
Using hash functions..........................................................................223
Demonstrating the hashing trick......................................................223
Working with deterministic selection..............................................225
Considering Timing and Performance.......................................................227
Benchmarking with timeit.................................................................228
Working with the memory profiler...................................................230
Running in Parallel.......................................................................................232
Performing multicore parallelism.....................................................232
Demonstrating multiprocessing.......................................................233
Chapter 13: Exploring Data Analysis . 235
The EDA Approach.......................................................................................236
Defining Descriptive Statistics for Numeric Data.....................................237
Measuring central tendency..............................................................238
Measuring variance and range..........................................................239
Working with percentiles...................................................................239
Defining measures of normality........................................................240
Counting for Categorical Data.....................................................................241
Understanding frequencies...............................................................242
Creating contingency tables..............................................................243
Creating Applied Visualization for EDA.....................................................243
Inspecting boxplots............................................................................244
Performing t‐tests after boxplots......................................................245
Observing parallel coordinates.........................................................246
Graphing distributions.......................................................................247
Plotting scatterplots...........................................................................248
Understanding Correlation..........................................................................250
Using covariance and correlation.....................................................250
Using nonparametric correlation.....................................................252
Considering chi‐square for tables.....................................................253
Modifying Data Distributions......................................................................253
Using the normal distribution...........................................................254
Creating a Z‐score standardization..................................................254
Transforming other notable distributions......................................254
Chapter 14: Reducing Dimensionality . 257
Understanding SVD......................................................................................258
Looking for dimensionality reduction..............................................259
Using SVD to measure the invisible..................................................260
Performing Factor and Principal Component Analysis...........................261
Considering the psychometric model..............................................262
Looking for hidden factors................................................................262
Using components, not factors.........................................................263
Achieving dimensionality reduction................................................264
Understanding Some Applications.............................................................264
Recognizing faces with PCA..............................................................265
Extracting Topics with NMF..............................................................267
Recommending movies......................................................................270
Chapter 15: Clustering 273
Clustering with K‐means..............................................................................275
Understanding centroid‐based algorithms......................................275
Creating an example with image data..............................................277
Looking for optimal solutions...........................................................278
Clustering big data..............................................................................281
Performing Hierarchical Clustering...........................................................282
Moving Beyond the Round-Shaped Clusters: DBScan.............................286
Chapter 16: Detecting Outliers in Data 289
Considering Detection of Outliers..............................................................290
Finding more things that can go wrong...........................................291
Understanding anomalies and novel data.......................................292
Examining a Simple Univariate Method.....................................................292
Leveraging on the Gaussian distribution.........................................294
Making assumptions and checking out............................................295
Developing a Multivariate Approach.........................................................296
Using principal component analysis................................................297
Using cluster analysis.........................................................................298
Automating outliers detection with SVM.........................................299
Part V: Learning from Data........................................ 301
Chapter 17: Exploring Four Simple and Effective Algorithms . 303
Guessing the Number: Linear Regression.................................................304
Defining the family of linear models.................................................304
Using more variables..........................................................................305
Understanding limitations and problems........................................307
Moving to Logistic Regression....................................................................307
Applying logistic regression..............................................................308
Considering when classes are more.................................................309
Making Things as Simple as Naïve Bayes..................................................310
Finding out that Naïve Bayes isn’t so naïve.....................................312
Predicting text classifications...........................................................313
Learning Lazily with Nearest Neighbors....................................................315
Predicting after observing neighbors..............................................316
Choosing your k parameter wisely...................................................317
Chapter 18: Performing Cross‐Validation, Selection, and Optimization 319
Pondering the Problem of Fitting a Model................................................320
Understanding bias and variance.....................................................321
Defining a strategy for picking models.............................................322
Dividing between training and test sets..........................................325
Cross‐Validating............................................................................................328
Using cross‐validation on k folds......................................................329
Sampling stratifications for complex data.......................................329
Selecting Variables Like a Pro.....................................................................331
Selecting by univariate measures.....................................................331
Using a greedy search........................................................................333
Pumping Up Your Hyperparameters..........................................................334
Implementing a grid search...............................................................335
Trying a randomized search.............................................................339
Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks 341
Using Nonlinear Transformations..............................................................341
Doing variable transformations........................................................342
Creating interactions between variables.........................................344
Regularizing Linear Models.........................................................................348
Relying on Ridge regression (L2)......................................................349
Using the Lasso (L1)...........................................................................349
Leveraging regularization..................................................................350
Combining L1 & L2: Elasticnet..........................................................350
Fighting with Big Data Chunk by Chunk....................................................351
Determining when there is too much data......................................351
Implementing Stochastic Gradient Descent....................................351
Understanding Support Vector Machines.................................................354
Relying on a computational method................................................355
Fixing many new parameters............................................................358
Classifying with SVC...........................................................................360
Going nonlinear is easy......................................................................365
Performing regression with SVR.......................................................366
Creating a stochastic solution with SVM.........................................368
Chapter 20: Understanding the Power of the Many 373
Starting with a Plain Decision Tree............................................................374
Understanding a decision tree..........................................................374
Creating classification and regression trees...................................376
Making Machine Learning Accessible........................................................379
Working with a Random Forest classifier........................................381
Working with a Random Forest regressor.......................................382
Optimizing a Random Forest.............................................................383
Boosting Predictions....................................................................................384
Knowing that many weak predictors win........................................384
Creating a gradient boosting classifier............................................385
Creating a gradient boosting regressor...........................................386
Using GBM hyper‐parameters...........................................................387
Part VI: The Part of Tens............................................ 389
Chapter 21: Ten Essential Data Science Resource Collections . 391
Gaining Insights with Data Science Weekly...............................................392
Obtaining a Resource List at U Climb Higher...........................................392
Getting a Good Start with KDnuggets........................................................392
Accessing the Huge List of Resources on Data Science Central.............393
Obtaining the Facts of Open Source Data Science from Masters...........394
Locating Free Learning Resources with Quora.........................................394
Receiving Help with Advanced Topics at Conductrics............................394
Learning New Tricks from the Aspirational Data Scientist.....................395
Finding Data Intelligence and Analytics Resources at AnalyticBridge......................................................................................396
Zeroing In on Developer Resources with Jonathan Bower.....................396
Chapter 22: Ten Data Challenges You Should Take 397
Meeting the Data Science London + Scikit‐learn Challenge....................398
Predicting Survival on the Titanic..............................................................399
Finding a Kaggle Competition that Suits Your Needs..............................399
Honing Your Overfit Strategies...................................................................400
Trudging Through the MovieLens Dataset...............................................401
Getting Rid of Spam Emails.........................................................................401
Working with Handwritten Information.....................................................402
Working with Pictures..................................................................................403
Analyzing Amazon.com Reviews................................................................404
Interacting with a Huge Graph....................................................................405