**Post Graduate Programme in Data Science**

##### This is a mostly online programme offered in association with IBM and IMDR. The programme is modular, so participants can select the modules they require.

**Python (20-25 Hrs)**

• Introduction to Python
• Understanding Operators, Variables and Data Types
• Conditional Statements
• Looping Constructs, Functions
• Data Structures, Lists, Dictionaries
• Understanding Standard Libraries in Python, Reading a CSV File in Python
• Data Frames and basic operations with Data Frames, Indexing a Data Frame
• Libraries in Python: NumPy, SciPy, Matplotlib, Scikit-learn
• Web development frameworks: Django/Flask
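As a flavour of the Python topics listed above, here is a minimal sketch that combines a function, a loop, a conditional, a dictionary and the standard-library `csv` module. The sample data, the column names and the function name `total_hours_by_module` are hypothetical and chosen only for illustration; they are not taken from the course material.

```python
import csv
import io

# Hypothetical sample data; in practice this would usually come from a file on disk.
CSV_TEXT = """name,module,hours
Asha,Python,22
Ravi,Machine Learning,28
Meera,Python,25
"""


def total_hours_by_module(csv_file):
    """Return a dictionary mapping each module name to its total hours."""
    totals = {}
    for row in csv.DictReader(csv_file):  # each row is a dict keyed by column name
        hours = int(row["hours"])         # CSV values are always read as strings
        if row["module"] in totals:       # conditional statement
            totals[row["module"]] += hours
        else:
            totals[row["module"]] = hours
    return totals


# io.StringIO lets the example run without a real file.
totals = total_hours_by_module(io.StringIO(CSV_TEXT))
print(totals)
```

With a real file, the same function could be called as `total_hours_by_module(open("some_file.csv", newline=""))`, where the file name is again only a placeholder.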

**Machine Learning (25-30 Hrs)**

• Supervised, Unsupervised and Reinforcement Learning; geometry (lines, curves and 3D spaces) and visualisation of algebraic concepts
• Regression as a concept, simple one-variable regression line, coefficients of the line, assumptions of linear regression, gradient descent algorithm, cost function (concept and use in finding the ‘beta’ values), local and global minima, concept of learning rate
• Matrix representation of the problem, gradient descent for multiple features, use of feature scaling techniques in gradient descent, types of feature scaling, finding coefficients analytically, normal equation and matrix non-invertibility
• Logistic regression model, matrix representation, general sigmoid function and graphical representation, decision boundary (linear and non-linear), metrics for logistic regression (accuracy, sensitivity, specificity, etc.), receiver operating characteristic (ROC) curve, use of the ROC curve to find the optimum decision boundary, convexity and non-convexity of a group of points
• Optimization objective from logistic regression to support vector machines, large margin classifier, concepts behind large margin classification, kernels (concept, types and graphical explanations), using SVM
• Decision trees and random forests: concept, diagrammatic representation, random forest as a voting committee of decision trees, meaning and explanation of parameters
• Naive Bayes: Venn diagrams, Naive Bayes algorithm, application and problems, Naive Bayes learning, Bayesian inference, retail basket analysis; concept of boosting and bagging
• Unsupervised learning methods/Clustering: K-means algorithm, optimization objective, graphical representation, random initialization, choosing the number of clusters
• Association rule mining, K-nearest neighbours algorithm
• Control flow and Pandas: writing conditional constructs to control the execution of scripts, and getting to know the Pandas DataFrame, the key data structure for Data Science in Python
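To illustrate the regression topics listed above (cost function, gradient descent, learning rate and finding coefficients analytically), here is a minimal sketch in plain Python for a one-variable regression line. The data points, the learning rate of 0.05 and the step count are assumptions picked so that the example converges; the function names are hypothetical and not part of the course material.

```python
# Hypothetical data generated from y = 1 + 2x, so the true coefficients are known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def cost(beta0, beta1):
    """Squared-error cost J = (1 / 2m) * sum((beta0 + beta1 * x - y) ** 2)."""
    m = len(xs)
    return sum((beta0 + beta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(learning_rate=0.05, steps=5000):
    """Minimise the cost by repeatedly stepping against its gradient."""
    beta0, beta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [beta0 + beta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dbeta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dbeta1
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1


def analytic_solution():
    """Closed-form least-squares coefficients for a single feature."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


gd_beta0, gd_beta1 = gradient_descent()
an_beta0, an_beta1 = analytic_solution()
print(gd_beta0, gd_beta1, cost(gd_beta0, gd_beta1))
print(an_beta0, an_beta1)
```

Both approaches should agree on an intercept near 1 and a slope near 2 for this data. A much larger learning rate can make the updates diverge, while a much smaller one needs more steps to reach the same minimum.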

**Big Data and Data Analytics (24-30 Hrs)**

• Hortonworks Data Platform (HDP), Apache Ambari, Hadoop and the Hadoop Distributed File System, MapReduce and YARN, Apache Spark, Storing and Querying data, ZooKeeper, Slider and Knox, Loading data with Sqoop
• DataPlane Service, Stream Computing, Data Science essentials, Drew Conway’s Venn Diagram and those of others, the Scientific Process applied to Data Science, the steps in running a Data Science project
• Languages used for Data Science (Python, R, Scala, Julia, …), Survey of Data Science Notebooks, Markdown language with notebooks, Resources for Data Science, including GitHub, Jupyter Notebook; Essential packages: NumPy, SciPy, Pandas, Scikit-learn, NLTK, BeautifulSoup
• Data visualizations: matplotlib, …, PixieDust; Using Jupyter “Magic” commands
• Using Big SQL to access HDFS data, Creating Big SQL schemas and tables, Querying Big SQL tables, Managing the Big SQL Server, Configuring Big SQL security
• Data federation with Big SQL, IBM Watson Studio, Analyzing data with Watson Studio