Python (20-25 Hrs)
• Introduction to Python
• Understanding Operators, Variables and Data Types
• Conditional Statements
• Looping Constructs, Functions
• Data Structures: Lists, Dictionaries
• Understanding Standard Libraries in Python, Reading a CSV File in Python
• Data Frames and basic operations with Data Frames, Indexing a Data Frame
• Libraries in Python: NumPy, SciPy, Matplotlib, Scikit-learn
• Web development frameworks: Django/Flask
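As a small illustration of several topics in this module (conditional statements, loops, functions, dictionaries and reading CSV data), the sketch below uses only the standard library. The column names, the grade thresholds and the in-memory data are made up for illustration; a course exercise would more likely read a real file and load it into a Data Frame.

```python
import csv
import io

# Hypothetical CSV content; it stands in for a file opened with open(...).
CSV_TEXT = """name,score
Asha,82
Ben,47
Chen,65
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))

def grade(score):
    """Map a numeric score to a label using a conditional statement."""
    if score >= 75:
        return "distinction"
    elif score >= 50:
        return "pass"
    return "fail"

rows = load_rows(CSV_TEXT)

# Loop over the rows and build a dictionary of name -> grade.
grades = {}
for row in rows:
    grades[row["name"]] = grade(int(row["score"]))

print(grades)
```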
Basic Statistics and Statistical Inference (20-25 Hrs)
• Concept of statistics, population, sample, parameter and statistic, examples of the use of statistics, data sources, representation of data, types of statistical analyses, sampling methods, types of variables, measures of central tendency, statistical estimation (point and interval), covariance, coefficient of correlation, formulae
• Permutations and combinations, probability concepts, types of probabilities, collectively exhaustive event set, joint probability, Bayes' theorem, probability distribution for a discrete random variable, probabilistic view of variance and covariance
• Distributions: Bernoulli trial, binomial distribution, Poisson distribution, hypergeometric distribution, Student's t-distribution, chi-square distribution, F-distribution, normal distribution, how population parameters are estimated from samples and the central limit theorem, Z-score
• Hypothesis testing: single-parameter and two-parameter testing, one-sided and two-sided testing, p-value, tests and test statistics and the logic behind them, problems on hypothesis testing, diagnostic tests: goodness of fit, t-test, F-test and chi-square test, contingency table, degrees of freedom, analysis of variance
• Regression and allied concepts, data transformation, linear and matrix algebra concepts
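Two of the formulae named in this module, the Z-score and Bayes' theorem, can be sketched in a few lines. This is only an illustrative example: the data set, the function names and the diagnostic-test figures (prior, sensitivity, false positive rate) are assumptions, not course material.

```python
from statistics import mean, pstdev

def z_score(x, data):
    """Number of population standard deviations that x lies from the mean of data."""
    return (x - mean(data)) / pstdev(data)

def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) computed with Bayes' theorem."""
    # Total probability of a positive test (true positives + false positives).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Illustrative sample: mean 5, population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(9, data)

# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false positive rate.
posterior = bayes_posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)

print(z)
print(round(posterior, 3))
```

With a low prior, the posterior stays well below the test's sensitivity, which is the usual point of working such an example by hand.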
R Programming (20-25 Hrs)
• Introduction to RStudio, mathematical and logical operators in R, data types and data structures, simple operations and programs, matrix operations
• Data frames, string operations, factors, handling categorical data, lists and list operations
• Loops and conditional statements, the switch function and break statement, apply functions
• Statistical problem solving in R, visualizations in R
• Hands-on data manipulation: cleaning, subsetting, sampling, data transformations and allied data operations
Machine Learning (25-30 Hrs)
• Supervised, Unsupervised and Reinforcement Learning, geometry (lines, curves and 3D spaces) and visualisation of algebraic concepts
• Regression as a concept, simple one-variable regression line, coefficients of the line, assumptions of linear regression, gradient descent algorithm, cost function and its use in finding the ‘beta’ values, local and global minima, concept of learning rate
• Matrix representation of the problem, gradient descent for multiple features, use of feature scaling techniques in gradient descent, types of feature scaling, finding coefficients analytically, normal equation and matrix non-invertibility
• Logistic regression model, matrix representation, general sigmoid function and graphical representation, decision boundary (linear and non-linear), metrics for logistic regression (accuracy, sensitivity, specificity and related concepts), receiver operating characteristic (ROC) curve, use of the ROC curve to find the optimum decision boundary, convexity and non-convexity of a group of points
• Optimization objective from logistic regression to support vector machines, large margin classifier, concepts behind large margin classification, kernels (concept, types and graphical explanations), using SVM
• Decision trees and random forests: concept, diagrammatic representation, random forest as a voting committee of decision trees, parameter meaning and explanation
• Naive Bayes: Venn diagrams, Naive Bayes algorithm, application and problems, Naive Bayes learning, Bayesian inference, retail basket analysis; concept of boosting and bagging
• Unsupervised learning methods/Clustering: K-means algorithm, optimization objective, graphical representation, random initialization, choosing the number of clusters
• Association rule mining, K-nearest neighbours algorithm
• Control flow and Pandas: writing conditional constructs to control the execution of scripts, and getting to know the Pandas DataFrame, the key data structure for Data Science in Python
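The gradient descent topic in this module (cost function, ‘beta’ values, learning rate) can be sketched for a one-variable regression line as follows. This is a minimal illustration in plain Python; the toy data, the learning rate and the iteration count are assumptions chosen so that the example converges, not values prescribed by the course.

```python
def cost(xs, ys, beta0, beta1):
    """Mean squared error cost J = 1/(2m) * sum((beta0 + beta1*x - y)^2)."""
    m = len(xs)
    return sum((beta0 + beta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y ~ beta0 + beta1 * x by batch gradient descent on the cost above."""
    m = len(xs)
    beta0, beta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error for every training point at the current betas.
        errors = [beta0 + beta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to beta0 and beta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update, scaled by the learning rate.
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1

# Toy data generated from the line y = 2x + 1 (no noise).
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

beta0, beta1 = gradient_descent(xs, ys)
print(round(beta0, 4), round(beta1, 4))
print(cost(xs, ys, beta0, beta1))
```

Because this cost function is convex, there is a single global minimum; a learning rate that is too large would make the updates diverge instead of converging to it.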
Big Data and Data Analytics (24-30 Hrs)
• Hortonworks Data Platform (HDP), Apache Ambari, Hadoop and the Hadoop Distributed File System, MapReduce and YARN, Apache Spark, Storing and Querying data, ZooKeeper, Slider, and Knox, Loading data with Sqoop
• DataPlane Service, Stream Computing, Data Science essentials, Drew Conway’s Venn Diagram and those of others, The Scientific Process applied to Data Science, the steps in running a Data Science project
• Languages used for Data Science (Python, R, Scala, Julia, …), Survey of Data Science Notebooks, Markdown language with notebooks, Resources for Data Science, including GitHub, Jupyter Notebook, Essential packages: NumPy, SciPy, Pandas, Scikit-learn, NLTK, BeautifulSoup
• Data visualizations: matplotlib, …, PixieDust, Using Jupyter “Magic” commands
• Using Big SQL to access HDFS data, Creating Big SQL schemas and tables, Querying Big SQL tables, Managing the Big SQL Server, Configuring Big SQL security
• Data federation with Big SQL, IBM Watson Studio, Analyzing data with Watson Studio
Prerequisite Skills