Speaker Series – Vaibhav Srivastav

Vaibhav is a Data Scientist working with Deloitte Consulting LLP. He works with Fortune Technology 10 clients to help them make data-driven (read: profitable) decisions. Prior to this, he worked with startups across India to build Social Media Analytics Dashboards, Chatbots, Recommendation Engines and Forecasting Models.

His core interests lie in Natural Language Processing, Machine Learning/Statistics and Product Development.

In his free time, Vaibhav gives talks and participates in local PyData/PyUserGroup meetups. He has also previously given talks at the Gartner Data and Analytics Summit, PyCon India, PyCon APAC (Philippines), PyCon Korea, PyCon Malaysia and Google Cloud Summit!

Topic: Machine Learning with Text

It can be difficult to figure out how to work with text in scikit-learn, even if you’re already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What’s the difference between a “fit” and a “transform”? What’s a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on…
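As a small taste of the first few questions, here is a minimal sketch of "fit" versus "transform" using scikit-learn's `CountVectorizer`. The example sentences and variable names are made up for illustration and are not taken from the tutorial material.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, invented for this illustration
train_docs = ["call you tonight", "call me a cab", "please call me please"]
test_docs = ["please don't call me"]

vect = CountVectorizer()

# "fit" learns the vocabulary from the training text only
vect.fit(train_docs)
vocab = sorted(vect.vocabulary_)  # terms in column order

# "transform" uses the fitted vocabulary to build a document-term matrix:
# one row per document, one column per vocabulary term
train_dtm = vect.transform(train_docs)

# Unseen words ("don") are ignored, because they are not in the fitted vocabulary
test_dtm = vect.transform(test_docs)

print(vocab)
print(train_dtm.toarray())
print(test_dtm.toarray())
print(type(train_dtm))  # a SciPy sparse matrix, since most entries are zero
```

Each document uses only a handful of the words in the vocabulary, so most cells are zero. That is why the matrix is stored in a sparse format, and the effect grows with the size of the vocabulary.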

In this tutorial, we’ll answer all of those questions, and more! We’ll start by walking through the vectorization process in order to understand the input and output formats. Then we’ll read a simple dataset into pandas, and immediately apply what we’ve learned about vectorization. We’ll move on to the model building process, including a discussion of which model is most appropriate for the task. We’ll evaluate our model in a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we’ll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
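The workflow described above might look roughly like the sketch below. The tiny dataset, the column names, and the choice of Multinomial Naive Bayes are all assumptions made for illustration; the tutorial itself may use a different dataset and model.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy dataset: 1 = spam, 0 = not spam
df = pd.DataFrame({
    "message": [
        "win a free prize now",
        "free entry to win cash",
        "claim your free cash prize",
        "urgent: you win a prize",
        "are we meeting for lunch",
        "see you at the meeting tonight",
        "can you call me later",
        "lunch later today?",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Split BEFORE vectorizing, so the vocabulary is learned from training text only
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"],
    test_size=0.25, random_state=1, stratify=df["label"],
)

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn vocabulary and build the matrix
X_test_dtm = vect.transform(X_test)        # reuse the training vocabulary

# Assumed model choice: Multinomial Naive Bayes, a common baseline for word counts
model = MultinomialNB()
model.fit(X_train_dtm, y_train)
y_pred = model.predict(X_test_dtm)

# Evaluate in more than one way
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Examine the model: per-class log-probabilities for each token
tokens = pd.DataFrame(
    {"not_spam": model.feature_log_prob_[0], "spam": model.feature_log_prob_[1]},
    index=sorted(vect.vocabulary_),
)

print(accuracy)
print(cm)
print(tokens.sort_values("spam", ascending=False).head())
```

With only eight messages the scores mean very little; the point is the order of the steps, in particular that the vectorizer is fitted on the training split and only applied to the test split.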

Target audience: attendees with an intermediate level of Python knowledge.