Data Analytics (BCA Science: VI)
By Prof. Vrushali Solanke
What is Data Science?
• Data science is the in-depth study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data using scientific methods, different technologies, and algorithms.
• It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can find something new and meaningful.
• Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems. It is often described as the future of artificial intelligence.
Introduction of Data Science and Data Analytics
Data science components:
Data science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.
Data science is all about:
Examples:
• I’m sure you have seen smart watches, or maybe you use one yourself. These smart gadgets can measure your sleep quality, how much you walk, your heart rate, etc.
• Tesla is famous for using data science (e.g. deep learning) for its self-driving cars.
Need for data science:
The following are some of the main reasons for using data science technology:
• With the help of data science technology, we can convert massive amounts of raw and unstructured data into meaningful insights.
• Data science technology is being adopted by companies of all kinds, from big brands to start-ups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms to give consumers a better experience.
• Data science is helping to automate transportation, for example by creating self-driving cars, which are the future of transportation.
• Data science can help with different kinds of prediction, such as surveys, elections, flight ticket confirmation, etc.
• Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinctive business advantage.
• Data science can help you detect fraud using advanced machine learning algorithms, which helps you prevent significant monetary losses.
• It allows you to build intelligence into machines. For example, you can perform sentiment analysis to gauge customer brand loyalty.
• It enables you to make better and faster decisions.
• It helps you recommend the right product to the right customer to enhance your business.
Basics of data:
• Data: Data are characteristics or information, usually numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.
• There are 3 main categories of data:
1) Structured data
2) Unstructured data
3) Semi-structured data
Structured data:
• Structured data usually resides in relational databases (RDBMS). Fields store length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length, like names, are contained in records, making them simple to search. Data may be human- or machine-generated as long as it is created within an RDBMS structure. This format is eminently searchable, both with human-generated queries and via algorithms that use field names and data types such as alphabetical or numeric, currency or date.
• Common relational database applications with structured data include airline reservation systems, inventory control, sales transactions, and ATM activity. Structured Query Language (SQL) enables queries on this type of structured data within relational databases.
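As a minimal sketch of what "searchable by field name and data type" can look like in practice, the example below stores a few sales transactions in an in-memory SQLite database and queries them with SQL. The table name, column names, and records are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory relational database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, sale_date TEXT)"
)

# Each record follows the same fixed set of named, typed fields.
rows = [
    ("Asha", 120.50, "2024-01-05"),
    ("Ravi", 75.00, "2024-01-06"),
    ("Asha", 30.25, "2024-01-09"),
]
conn.executemany(
    "INSERT INTO sales (customer, amount, sale_date) VALUES (?, ?, ?)", rows
)

# Because the structure is known in advance, SQL can filter and aggregate by field.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The same query would work unchanged on a much larger table, which is one reason structured data is considered easy to search.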
What Is Unstructured Data?
• Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational (NoSQL) database.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: It is sometimes referred to as semi-structured. However, its message field is unstructured and traditional analytical tools cannot parse it.
• Social media: Data from Facebook, Twitter, LinkedIn.
• Websites: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, phone recordings, collaboration software (like Microsoft Teams, Google Docs, etc.).
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data:
Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention.
• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Semi-structured data:
Semi-structured data maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies.
Email is a very common example of a semi-structured data type. Email’s native metadata enables classification and keyword searching without any additional tools.
Sharing sensor data is a growing use case, as are Web-based data sharing and transport: electronic data interchange (EDI), many social media platforms, document markup languages, and NoSQL databases.
Examples of Semi-structured Data
• Markup language XML: This is a semi-structured document language. XML is a set of document encoding rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web.
• Open standard JSON (JavaScript Object Notation): JSON is another semi-structured data interchange format. Its structure consists of name/value pairs (an object, hash table, etc.) and ordered value lists (an array, sequence, or list). Since the structure is interchangeable among languages, JSON excels at transmitting data between web applications and servers.
• NoSQL: Semi-structured data is also an important element of many NoSQL (“not only SQL”) databases. NoSQL databases differ from relational databases because they do not separate the organization (schema) from the data. This makes NoSQL a better choice for storing information that does not easily fit into the record and table format, such as text of varying lengths. It also allows for easier data exchange between databases. Some newer NoSQL databases like MongoDB and Couchbase also incorporate semi-structured documents by natively storing them in the JSON format.
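To make the "name/value pairs plus ordered lists" idea concrete, here is a small sketch that parses a JSON document with Python's standard `json` module. The record and its field names are invented for illustration; the point is that the tags travel with the data, so fields can be nested or absent without a fixed table schema.

```python
import json

# A hypothetical record: name/value pairs, an ordered list, and a nested object.
raw = '{"name": "Asha", "tags": ["premium", "mobile"], "address": {"city": "Pune"}}'
record = json.loads(raw)

name = record["name"]              # value looked up by its name
first_tag = record["tags"][0]      # ordered list, accessed by position
city = record["address"]["city"]   # nesting gives a hierarchy

# A field may simply be missing; there is no fixed column that must be filled.
phone = record.get("phone")

# Serialising back to text gives a form that other languages can read.
round_trip = json.dumps(record, sort_keys=True)
print(name, first_tag, city, phone)
```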
Differences between Structured, Semi-structured and Unstructured data:
Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Mature transactions and various concurrency techniques | Transactions are adapted from DBMS; not mature | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | Most flexible; there is no schema
Scalability | Very difficult to scale the DB schema | Scaling is simpler than for structured data | Most scalable
Robustness | Very robust | New technology, not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
Basics of Data Science:
Data science provides some sort of understanding of the data; the term is used in the sense of information extraction.
Insight: It is gained by analysing data and information to understand what is going on in a particular situation.
Data: It is a raw, unorganized set of information.
Data → Information → Insights
Need for Data Science:
• Today, most data is unstructured or semi-structured.
• Sources of this data include financial logs, text files, multimedia forms, sensors, and instruments.
• Simple BI tools are not capable of processing this huge volume and variety of data.
• This is why we require more complex and advanced analytical tools and algorithms for processing, analyzing, and drawing meaningful insights from it.
• Data science is all about uncovering findings from data.
• Example: Netflix.
What is Data Science?
• Turning raw data into insights to make better decisions.
• Data science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.
• It is the art and science of extracting actionable insights from raw data.
• Data science is also known as data-driven science.
Data science as defined by the famous Venn diagram by Drew Conway.
Basic areas in Data science:
• Mathematics and statistics
• Computer programming
• Domain knowledge
Data science can add value in the following ways:
1. It empowers management and officers to make better decisions.
2. It helps to direct actions based on trends, which in turn helps in defining goals.
3. It helps staff to adopt best practices and focus on issues that matter.
4. It helps to identify opportunities and supports decision making with quantifiable data.
5. It helps in identifying and refining the target audience for a business.
What is Data Science? Is it Statistics or Machine Learning?
• Statistics is a tool or method for data science, while data science is a wide domain in which statistical methods are an essential component.
• Not every statistician is a data scientist, and not every data scientist is a statistician.
• Machine learning can be defined as the practice of using algorithms to take in data, learn from it, and then forecast future trends for that topic.
• Data science uses machine learning as a tool to provide insights.
• Data science is the multidisciplinary blend of data inference, algorithm development, and technology used to solve analytically complex problems.
Sr. No. | Features | BI | Data Science
1 | Data sources | Structured (usually SQL, often Data...) | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text)
2 | Approach | Statistics and visualization | Statistics, machine learning, graph analysis, NLP
3 | Focus | Past and present | Present and future
4 | Tools | Pentaho, Microsoft BI, QlikView, R | RapidMiner, BigML, Weka, R
Sr. No. | Machine Learning | Data Science
1 | A subset of AI that focuses on a narrow range of activities | Not exactly a subset of machine learning, but it uses machine learning to analyze data and make predictions about the future
2 | Develops new (individual) models | Explores many models; builds and tunes hybrids
3 | Proves mathematical properties of models | Understands empirical properties of models
4 | Improves/validates on a few relatively clean, small datasets | Develops/uses tools that can handle massive datasets
5 | Produces predictions | Produces insights
6 | Is a part of data science | An all-encompassing term that includes aspects of machine learning for functionality
Data Scientist:
• A data scientist is a professional responsible for collecting, analysing, and interpreting large amounts of data to identify ways to help a business improve its operations.
• A data scientist has sufficient business understanding, domain knowledge, analytical skills, and programming expertise to manage the end-to-end scientific method at each stage of a big data project.
• Responsibilities of a data scientist:
1) Collecting large amounts of unruly data and transforming it into a more usable format.
2) Solving business-related problems using data-driven techniques.
3) Working with a variety of programming languages.
4) Having a solid grasp of statistics, including statistical tests and distributions.
5) Staying on top of analytical techniques such as machine learning, deep learning, and text analytics.
6) Looking for order and patterns in data, as well as spotting trends.
Categories of Data Scientist:
• Data scientists are classified into 4 categories:
1) Data Developer: Developer, Engineer
2) Data Researcher: Researcher, Scientist, Statistician
3) Data Creative: Jack of all trades, Artist, Hacker
4) Data Businessperson: Leader, Businessperson, Entrepreneur
Skills for a Data Scientist:
• Programming skills: A data scientist should have command of a programming language such as R or Python and a database query language such as SQL.
• Statistics: A data scientist should know about statistical tests, distributions, and maximum likelihood estimators.
• Machine learning: At large companies with huge amounts of data (e.g. Netflix, Google, Amazon), it may be essential to be familiar with ML methods like k-nearest neighbors, random forests, etc.
• Multivariable calculus and linear algebra: These concepts are most important at companies where the product is defined by data.
• Data visualization and communication: This skill is important for younger companies that are making data-driven decisions for the first time. It is important not just to be familiar with visualization tools (e.g. Tableau) but also to understand the principles behind visually encoding data.
Data Science Process:
• The process of discovering useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
• The data science process involves:
1) Understanding the problem
2) Preparing the data samples
3) Developing the model
4) Applying the model to a dataset to see how it may work in the real world
5) Deploying and maintaining the models
Step-1: Prior knowledge:
i) It helps to define what problem is being solved, how it fits in the business context, and what data is needed in order to solve the problem.
ii) The data science process starts with the need for analysis, a question, or a business objective.
iii) Without a well-defined statement of the problem, it is impossible to come up with the right dataset.
iv) The data science process is explained using a hypothetical use case.
Step-2: Data Preparation:
i) Preparing a dataset that suits the task is the most time-consuming part of the process.
ii) This phase consists of three sub-phases: data cleaning removes false values from a data source and inconsistencies across data sources; data integration enriches a data source by combining information from multiple data sources; and data transformation ensures that the data is in a suitable format for use in your model. Data exploration is concerned with building a deeper understanding of your data.
• Sampling: It is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. Sampling reduces the amount of data that needs to be processed and speeds up the model build process.
• Model build process: In this process, it is necessary to segment the dataset into training and test samples.
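The sampling and train/test segmentation described above can be sketched with Python's standard `random` module. The dataset here is just a list of record IDs, and the sample size (200) and 80/20 split ratio are assumptions made for illustration, not values prescribed by the process.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch gives the same split every run

# Hypothetical dataset: 1,000 record IDs standing in for full records.
dataset = list(range(1000))

# Sampling: select a subset of records to represent the original dataset.
sample = rng.sample(dataset, k=200)

# Model build: segment the sample into training and test portions (80/20 here).
rng.shuffle(sample)
split_at = int(len(sample) * 0.8)
train, test = sample[:split_at], sample[split_at:]

print(len(sample), len(train), len(test))
```

Shuffling before splitting avoids accidentally putting, say, only the oldest records in the training portion.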
Step-3: Modeling:
i) A model is an abstract representation of the data. This step creates a representative model inferred from the data.
ii) Training dataset: The dataset used to create the model, with known attributes and target, is called the training dataset.
iii) Test dataset or validation dataset: The validity of the created model also needs to be checked with another known dataset, called the test or validation dataset.
iv) Building the model is an iterative process that involves selecting the variables for the model, executing the model, and running model diagnostics.
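As a rough illustration of creating a model from a training dataset and then checking it against a separate test dataset, the sketch below fits a straight line by ordinary least squares. The choice of a linear model, the helper name `fit_line`, and the data points are all assumptions for the example; the slides do not prescribe a particular model.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: x is the known attribute, y is the known target.
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.0)]
test = [(6, 13.1), (7, 14.8)]

# The model is inferred from the training dataset only.
slope, intercept = fit_line(train)

# Validity is checked on records the model has not seen.
test_errors = [abs(y - (slope * x + intercept)) for x, y in test]
print(slope, intercept, test_errors)
```

If the test errors were much larger than the errors on the training data, that would be a signal to revisit the variables or the model, which is the iterative part of the step.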
Step-4: Application:
i) Deployment: Here the model becomes production-ready.
ii) The model deployment stage has to deal with assessing model readiness, technical integration, response time, model maintenance, and assimilation.
iii) Evaluation: It is the part of the process where you test whether you have a good model or not, before deploying or presenting it.
Step-5: Knowledge:
i) The data science process starts with prior knowledge and ends with posterior knowledge.
Stages in a Data Science project:
Stage-1: Data Acquisition:
i) A data science project begins with identifying various data sources (e.g. logs from web servers, social media data, data from online repositories such as census datasets, data streamed from online sources via APIs, web scraping).
ii) Data acquisition involves acquiring data from all the identified internal and external sources that can answer the business questions.
iii) The main job of the data scientist in this step is to track where each data slice comes from and whether the acquired data slice is up to date. It is important to track this information during the entire lifecycle of a data science project.
Stage-2: Data Preparation:
i) Data acquired in the first stage is usually not in a usable format for running the required analysis and might contain missing entries, inconsistencies, and semantic errors.
ii) Next, data scientists have to clean and reformat the data by manually editing it in a spreadsheet or by writing code. This step does not produce any meaningful insights.
iii) Through regular data cleaning, data scientists can easily identify what faults exist in the data acquisition process, what assumptions they should make, and what models they can apply to produce analysis results.
iv) After reformatting, the data can be converted to JSON, CSV, or any other format that makes it easy to load into one of the data science tools.
v) Exploratory data analysis: It forms an integral part of this stage, as summarization of the clean data can help to identify outliers, anomalies, and patterns.
vi) Through this step, data scientists come to know what they actually need to do with the data. This stage is the most time-consuming.
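The cleaning and exploratory steps above can be sketched with Python's standard `statistics` module. The column of ages is invented, and the 1.5 × IQR rule used to flag outliers is one common convention assumed here for illustration; the slides do not name a specific rule.

```python
import statistics

# Hypothetical acquired values; None marks a missing entry.
raw_ages = [34, 29, None, 41, 38, 250, 31, None, 36, 33]

# Cleaning: drop the missing entries.
ages = [a for a in raw_ages if a is not None]

# Summarization of the clean data.
median_age = statistics.median(ages)
q1, _, q3 = statistics.quantiles(ages, n=4)  # quartile cut points
iqr = q3 - q1

# Exploration: flag values far outside the middle half of the data
# (the 1.5 * IQR rule is an assumption of this sketch).
outliers = [a for a in ages if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
print(median_age, outliers)
```

Whether a flagged value is a data-entry fault or a genuine observation is a judgment the data scientist still has to make.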
Stage-3: Hypothesis and Modelling:
i) This stage requires writing, running, and refining programs to analyse the data and derive meaningful insights from it. These programs may be written in Python, R, MATLAB, or Perl.
ii) Different machine learning techniques are applied to the data to identify the machine learning model that best fits the business needs. All the machine learning models are trained on the training dataset.
Stage-4: Evaluation and interpretation:
i) Different kinds of problems call for different evaluation metrics. For example, if the model predicts daily stock prices, RMSE (root mean squared error) should be considered for evaluation. If the model classifies spam email messages, performance metrics such as average accuracy, AUC, and log loss should be considered.
ii) Machine learning model performance should be measured and compared using validation and test sets to identify the best model based on model accuracy and overfitting.
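A small sketch of three of the metrics named above, written in plain Python so the formulas are visible. The function names and the sample predictions are hypothetical, the 0.5 threshold for turning probabilities into labels is an assumption, and AUC is left out to keep the example short.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for a regression model."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def accuracy(labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    return sum(l == p for l, p in zip(labels, predicted_labels)) / len(labels)


def log_loss(labels, probabilities, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Regression case: hypothetical daily prices and model predictions.
prices = [100.0, 102.0, 101.0, 105.0]
predicted_prices = [101.0, 101.0, 103.0, 104.0]
price_rmse = rmse(prices, predicted_prices)

# Classification case: 1 = spam, 0 = not spam, with predicted spam probabilities.
is_spam = [1, 0, 1, 0]
spam_prob = [0.9, 0.2, 0.6, 0.4]
spam_accuracy = accuracy(is_spam, [int(p >= 0.5) for p in spam_prob])
spam_log_loss = log_loss(is_spam, spam_prob)
print(price_rmse, spam_accuracy, spam_log_loss)
```

Accuracy only looks at the final label, while log loss also penalises predictions that are right but not confident, which is why the two are often reported together.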
Stage-5: Deployment:
i) A machine learning model might have to be recoded before deployment, for example because data scientists favour Python while the production environment supports Java.
ii) Models are first deployed in a pre-production or test environment before they are actually deployed in production.
Stage-6: Operation/Maintenance:
i) This step involves developing a plan for monitoring and maintaining the data science project in the long run.
Stage-7: Optimization:
i) It involves retraining the machine learning model in production whenever a new data source comes in.
Thank you.


Introduction of Data Science and Data Analytics

  • 2. What is Data Science?  Data science is the deep study of the massive amount of data which involves extracting meaningful insights from raw ,structured and unstructured data that is processed using scientific methods, different technologies, and algorithm.  It is multidisciplinary field that uses tools and techniques to manipulate the data so that you can find something new and meaningful.  Data science uses most powerful hardware ,programming system, and most efficient algorithm to solve the data related problems. It is future of artificial intelligence.
  • 5. Data science refer to emerging area of work concerned with collection, preparation, analysis, visualization, management, and preservation of large collection of information.
  • 6. Data science is all about:
  • 7. Examples:  I’m sure you have seen smart watches — or maybe you use one, too. These smart gadgets can measure your sleep quality, how much you walk, your heart rate, etc.
  • 8. Tesla is famous for using data science – e.g. deep learning – for their self-driving
  • 9. Need of the data science:
  • 10. Following are some main reasons for using data science technology:  With the help of data science technology, we can convert massive amount of raw and unstructured data into meaningful insights.  Data science technology is opting by various companies, whether it is big brand or start up. Google Amazon ,Netflix etc. which handle huge amount of data, are using data science algorithm for better consumers experience.  Data science is working for automating transportation such as creating self driving car, which is feature of transportation.  Data science can help in different prediction such as various survey ,election ,flight ticket confirmation, etc.
  • 11.  Data is the oil for today's world. With the right tools, technologies, algorithms, we can use data and convert it into a distinctive business advantage  Data Science can help you to detect fraud using advanced machine learning algorithms It helps you to prevent any significant monetary losses  Allows to build intelligence ability in machines.You can perform sentiment analysis to gauge customer brand loyalty  It enables you to take better and faster decisions  Helps you to recommend the right product to the right customer to enhance your business
  • 12. Basics of data:  Data:Data are characteristics or information, usually numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.  There are 3 main category of data: 1)Structured data 2)Unstructured data 3)Semi structured data.
  • 15. Structured data:  Structured data usually resides in relational databases RDBMS. Fields store length- delineated data phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length like names are contained in records, making it a simple matter to search. Data may be human- or machine-generated as long as the data is created within an RDBMS structure. This format is eminently searchable both with human generated queries and via algorithms using type of data and field names, such as alphabetical or numeric, currency or date.  Common relational database applications with structured data include airline reservation systems, inventory control, sales transactions, and ATM activity. Structured Query Language (SQL) enables queries on this type of structured data within relational databases.
  • 16. What Is Unstructured Data?  Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via pre-defined data models or schema. It may be textual or non- textual, and human- or machine-generated. It may also be stored within a non-relational database like NoSQL.  Typical human-generated unstructured data includes: • Text files: Word processing, spreadsheets, presentations, email, logs. • Email: we sometimes refer to it as semi structured. However, its message field is unstructured and traditional analytical tools cannot parse it. • Social Media: Data from Facebook, Twitter, LinkedIn. • Website: YouTube, Instagram, photo sharing sites. • Mobile data: Text messages, locations. • Communications: Chat, phone recordings, collaboration software(like Microsoft Teams, Google Docs etc.). • Media: MP3, digital photos, audio and video files. • Business applications: MS Office documents, productivity applications.
  • 17.  Typical machine-generated unstructured data: Machine generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. • Satellite imagery: Weather data, land forms, military movements. • Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data. • Digital surveillance: Surveillance photos and video. • Sensor data: Traffic, weather, oceanographic sensors.
  • 20. Semi-structured data : Semi-structured data maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies. Email is a very common example of a semi-structured data type. Email’s native metadata enables classification and keyword searching without any additional tools. Sharing sensor data is a growing use case, as are Web-based data sharing and transport: electronic data interchange (EDI), many social media platforms, document markup languages, and NoSQL databases.
  • 21. Examples of Semi-structured Data • Markup language XML This is a semi-structured document language. XML is a set of document encoding rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web. • Open standard JSON (JavaScript Object Notation) JSON is another semi-structured data interchange format. Its structure consists of name/value pairs (or object, hash table, etc.) and an ordered value list (or array, sequence, list). Since the structure is interchangeable among languages, JSON excels at transmitting data between web applications and servers. • NoSQL Semi-structured data is also an important element of many NoSQL (“not only SQL”) databases. NoSQL databases differ from relational databases because they do not separate the organization (schema) from the data. This makes NoSQL a better choice to store information that does not easily fit into the record and table format, such as text with varying lengths. It also allows for easier data exchange between databases. Some newer NoSQL databases like MongoDB and Couchbase also incorporate semi-structured documents by natively storing them in the JSON format.
  • 22. Differences between structured, semi-structured, and unstructured data:
    Technology: Structured data is based on relational database tables. Semi-structured data is based on XML/RDF (Resource Description Framework). Unstructured data is based on character and binary data.
    Transaction management: Structured data has mature transaction management and various concurrency techniques. Semi-structured data adapts transactions from the DBMS, and they are not yet mature. Unstructured data has no transaction management and no concurrency.
    Version management: Structured data is versioned over tuples, rows, and tables. Semi-structured data can be versioned over tuples or graphs. Unstructured data is versioned as a whole.
    Flexibility: Structured data is schema dependent and less flexible. Semi-structured data is more flexible than structured data but less flexible than unstructured data. Unstructured data is the most flexible because it has no schema.
    Scalability: Structured data is very difficult to scale at the DB schema level. Semi-structured data is simpler to scale than structured data. Unstructured data is the most scalable.
    Robustness: Structured data is very robust. Semi-structured data is a new technology and not yet widespread. Unstructured data: -
    Query performance: Structured queries allow complex joins. Semi-structured data allows queries over anonymous nodes. Unstructured data allows only textual queries.
  • 23. Basics of Data Science: Data science provides an understanding of the data; in this sense the term is used to mean information extraction. Insight: It is gained by analysing data and information to understand what is going on in a particular situation. Data: It is a raw, unorganized set of information. Data → Information → Insights
  • 24. Need of the Data Science: • Today, most data is unstructured or semi-structured. • Sources of this data include financial logs, text files, multimedia forms, sensors, and instruments. • Simple BI tools are not capable of processing this huge volume and variety of data. • This is why we require more complex and advanced analytical tools and algorithms for processing, analyzing, and drawing meaningful insights from it. • Data science is all about uncovering findings from data. • Example: Netflix.
  • 25. What is Data Science? • Turning raw data into insights to make better decisions. • Data science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. • It is the art and science of extracting actionable insights from raw data. • Data science is also known as data-driven science.
  • 26. Definition of data science by the famous Venn diagram (by Drew Conway).
  • 27. Basic areas in Data science: • Mathematics and statistics • Computer programming • Domain knowledge. Data science can add value in the following ways: 1. It empowers management and officers to make better decisions. 2. It helps direct actions based on trends, which helps in defining goals. 3. It helps staff adopt best practices and focus on issues that matter. 4. It helps identify opportunities and supports decision making with quantifiable data. 5. It helps identify and refine the target audience for a business.
  • 28. What is Data science? Is it statistics or Machine Learning? • Statistics is a tool or method for data science, while data science is a wide domain in which statistical methods are an essential component. • Not every statistician is a data scientist, and not every data scientist is a statistician. • Machine learning can be defined as the practice of using algorithms to process data, learn from it, and then forecast future trends for that topic. • Data science uses machine learning as a tool to provide insights. • Data science is the multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems.
  • 29. Features of BI and Data Science:
    1. Data sources. BI: Structured (usually SQL, often Data. Data Science: Both structured and unstructured (logs, cloud data, SQL, NoSQL, text).
    2. Approach. BI: Statistics and visualization. Data Science: Statistics, Machine Learning, Graph Analysis, NLP.
    3. Focus. BI: Past and present. Data Science: Present and future.
    4. Tools. BI: Pentaho, Microsoft BI, QlikView, R. Data Science: RapidMiner, BigML, Weka, R.
  • 30. Machine Learning and Data Science:
    1. Machine Learning: A subset of AI that focuses on a narrow range of activities. Data Science: Not exactly a subset of machine learning, but it uses machine learning to analyze data and make future predictions.
    2. Machine Learning: Develop new (individual) models. Data Science: Explore many models, build and tune hybrids.
    3. Machine Learning: Prove mathematical properties of models. Data Science: Understand empirical properties of models.
    4. Machine Learning: Improve/validate on a few, relatively clean, small datasets. Data Science: Develop/use tools that can handle massive datasets.
    5. Machine Learning: It produces predictions. Data Science: It produces insights.
    6. Machine Learning: It is a part of data science. Data Science: An all-encompassing term that includes aspects of machine learning for functionality.
  • 31. Data Scientist: • A data scientist is a professional responsible for collecting, analysing, and interpreting large amounts of data to identify ways to help a business improve its operations. • A data scientist has sufficient understanding of business needs, domain knowledge, analytical skills, and programming expertise to manage end-to-end scientific methods at each stage of a big data project. • Responsibilities of a data scientist: 1) Collecting large amounts of unruly data and transforming it into a more usable format. 2) Solving business-related problems using data-driven techniques. 3) Working with a variety of programming languages. 4) Having a solid grasp of statistics, including statistical tests and distributions. 5) Keeping on top of analytical techniques such as machine learning, deep learning, and text analytics. 6) Looking for order and patterns in data, as well as spotting trends.
  • 32. Category of Data Scientist: • Data scientists are classified into 4 categories: 1) Data Developer: Developer, Engineer 2) Data Researcher: Researcher, Scientist, Statistician 3) Data Creative: Jack of all trades, Artist, Hacker 4) Data Businessperson: Leader, Businessperson, Entrepreneur
  • 33. Skills for Data Scientist: • Programming skills: A data scientist should have command over a programming language like R or Python and a database querying language like SQL. • Statistics: A data scientist should know statistical tests, distributions, and maximum likelihood estimators. • Machine Learning: At large companies with huge amounts of data (e.g. Netflix, Google, Amazon), it may be essential to be familiar with ML methods like k-nearest neighbors, random forests, etc. • Multivariable calculus and Linear Algebra: These concepts are most important at companies where the product is defined by data. • Data Visualization and Communication: These skills are important at younger companies that are making data-driven decisions for the first time. It is important to be familiar not just with visualization tools (e.g. Tableau) but also with the principles behind visually encoding data.
  • 35. • The process of discovering useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the Data Science Process. • The data science process involves: 1) Understanding the problem 2) Preparing the data samples 3) Developing the model 4) Applying the model on a dataset to see how the model may work in the real world 5) Deploying and maintaining the models.
  • 36. Step-1: Prior knowledge: i) It helps to define what problem is being solved, how it fits in the business context, and what data is needed in order to solve the problem. ii) The data science process starts with the need for analysis, a question, or a business objective. iii) Without a well-defined statement of the problem, it is impossible to come up with the right dataset. iv) The data science process is explained using a hypothetical use case. Step-2: Data Preparation: i) Preparing a dataset that suits the task is the most time-consuming part of the process. ii) This phase consists of three sub-phases: data cleaning removes false values from a data source and inconsistencies across data sources; data integration enriches a data source by combining information from multiple data sources; and data transformation ensures that the data is in a suitable format for use in your model. Data exploration is concerned with building a deeper understanding of your data.
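The three sub-phases of data preparation named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (customer_id, amount, city), and the rule that a negative amount counts as a false value are all hypothetical.

```python
# Hypothetical records from two sources (names and fields are illustrative).
sales = [
    {"customer_id": "1", "amount": "250.0"},
    {"customer_id": "2", "amount": ""},       # missing value
    {"customer_id": "3", "amount": "-40.0"},  # assumed invalid: negative amount
    {"customer_id": "4", "amount": "120.5"},
]
customers = {"1": "Pune", "2": "Mumbai", "4": "Nagpur"}


def clean(rows):
    """Data cleaning: drop rows whose amount is missing or negative."""
    kept = []
    for row in rows:
        if row["amount"] == "":
            continue
        if float(row["amount"]) < 0:
            continue
        kept.append(row)
    return kept


def integrate(rows, city_by_id):
    """Data integration: enrich each row with the customer's city."""
    return [
        {**row, "city": city_by_id.get(row["customer_id"], "unknown")}
        for row in rows
    ]


def transform(rows):
    """Data transformation: convert text fields to numeric types."""
    return [
        {
            "customer_id": int(row["customer_id"]),
            "amount": float(row["amount"]),
            "city": row["city"],
        }
        for row in rows
    ]


prepared = transform(integrate(clean(sales), customers))
for row in prepared:
    print(row)
```

Keeping each sub-phase in its own function makes it easier to rerun a single step when a fault is found in one data source.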
  • 37. • Sampling: It is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. Sampling reduces the amount of data that needs to be processed and speeds up the model build process. • Model build process: In this process, it is necessary to segment the dataset into training and test samples. • Step-3: Modeling: i) A model is an abstract representation of the data. This step creates a representative model inferred from the data. ii) Training dataset: The dataset used to create the model, with known attributes and target, is called the training dataset. iii) Test dataset or validation dataset: The validity of the created model also needs to be checked with another known dataset called the test or validation dataset. iv) Building the model is an iterative process that involves selecting the variables for the model, executing the model, and running model diagnostics.
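Sampling and the segmentation into training and test samples can be sketched as follows. This is a minimal illustration using Python's standard random module; the function names sample_records and split_train_test, the 50% sample, and the 80/20 split are assumptions made for the example, not values from the slides.

```python
import random


def sample_records(records, fraction, seed=0):
    """Select a random subset of records without replacement."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: 100 record identifiers.
dataset = list(range(100))
subset = sample_records(dataset, 0.5)
train_set, test_set = split_train_test(subset, 0.2)

print(len(subset), len(train_set), len(test_set))
```

A fixed seed keeps the split reproducible, so a model can be rebuilt and compared against the same test records.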
  • 38. • Step-4: Application: i) Deployment: Here the model becomes production ready. ii) The model deployment stage has to deal with assessing model readiness, technical integration, response time, model maintenance, and assimilation. iii) Evaluation: It is the part of the process where you test whether or not you have a good model before deploying or presenting it. Step-5: Knowledge: i) The data science process starts with prior knowledge and ends with posterior knowledge.
  • 39. Stages in a Data Science project:
  • 40. Stage-1: Data Acquisition: i) A data science project begins with identifying various data sources (e.g. logs from web servers, social media data, data from online repositories like census datasets, data streams from online sources via APIs, web scraping). ii) Data acquisition involves acquiring data from all the identified internal and external sources that answer the business questions. iii) The main job of the data scientist in this step is to track where each data slice comes from and whether the acquired data slice is up to date or not. It is important to track this information during the entire lifecycle of a data science project.
  • 41. Stage-2: Data Preparation: i) Data acquired in the first step is not in a usable format for running the required analysis and might contain missing entries, inconsistencies, and semantic errors. ii) Next, data scientists have to clean and reformat the data by manually editing it in a spreadsheet or by writing code. This step does not produce any meaningful insights. iii) Through regular data cleaning, data scientists can easily identify what faults exist in the data acquisition process, what assumptions they should make, and what models they can apply to produce analysis results. iv) After reformatting, the data can be converted to JSON, CSV, or any other format that makes it easy to load into one of the data science tools. v) Exploratory data analysis: It forms an integral part of this stage, as summarization of the clean data can help to identify outliers, anomalies, and patterns. vi) Through this step, data scientists come to know what they actually need to do with this data. This stage is time-consuming.
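As a rough sketch of this stage, the example below drops missing entries, summarizes the clean data to flag an outlier, and converts the result to JSON. It uses only Python's standard csv, io, json, and statistics modules. The sensor readings are made up, and the outlier rule (more than 5 times the median absolute deviation from the median) is one common heuristic chosen for illustration, not a method given in the slides.

```python
import csv
import io
import json
import statistics

# Hypothetical raw data with one missing entry (sensor b).
raw_csv = "sensor,reading\na,10.0\nb,\nc,11.0\nd,9.5\ne,10.5\nf,48.0\n"

# Cleaning: skip rows with a missing reading and convert text to numbers.
rows = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    if row["reading"] == "":
        continue
    rows.append({"sensor": row["sensor"], "reading": float(row["reading"])})

# Exploratory summary: median and median absolute deviation (MAD).
readings = [row["reading"] for row in rows]
median = statistics.median(readings)
mad = statistics.median([abs(value - median) for value in readings])

# Assumed rule of thumb: flag readings far from the median as outliers.
outliers = [
    row["sensor"] for row in rows if abs(row["reading"] - median) > 5 * mad
]

# Reformatting: convert the clean rows to JSON for loading into other tools.
clean_json = json.dumps(rows)

print(median, mad, outliers)
```

The median-based summary is used here because a single extreme reading would distort a mean-based summary of such a small sample.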
  • 42. Stage-3: Hypothesis and Modelling: i) This stage requires writing, running, and refining programs to analyse the data and derive meaningful insights from it. These programs may be written in Python, R, MATLAB, or Perl. ii) Different machine learning techniques are applied to the data to identify the machine learning model that best fits the business needs. All the machine learning models are trained on the training dataset. Stage-4: Evaluation and interpretation: i) Different problems call for different evaluation metrics. E.g. if the model predicts daily stock prices, then the RMSE (Root Mean Squared Error) has to be considered for evaluation. If the model classifies spam email messages, then performance metrics like average accuracy, AUC, and log loss have to be considered. ii) Machine learning model performance should be measured and compared using validation and test sets to identify the best model based on model accuracy and overfitting.
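Three of the metrics named on this slide can be written out directly. The sketch below is a plain Python illustration of the standard definitions of RMSE, accuracy, and binary log loss; the sample values are made up, and the eps clipping in log_loss is an assumption added to avoid taking the logarithm of zero.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error for a regression model."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def accuracy(labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predicted_labels) if y == p)
    return correct / len(labels)


def log_loss(labels, probabilities, eps=1e-15):
    """Binary log loss; labels are 0 or 1, probabilities are P(label == 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Hypothetical values for illustration only.
print(rmse([100.0, 102.0, 101.0], [101.0, 102.0, 99.0]))
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(log_loss([1, 0], [0.9, 0.2]))
```

RMSE suits numeric predictions such as prices, while accuracy and log loss suit classification tasks such as spam detection; log loss also penalizes confident wrong probabilities, which accuracy ignores.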
  • 43. Stage-5: Deployment: i) Machine learning models might have to be recoded, because data scientists favour the Python programming language but the production environment may support Java. ii) Models are first deployed in a pre-production or test environment before they are actually deployed in production. Stage-6: Operation/Maintenance: i) This step involves developing a plan for monitoring and maintaining the data science project in the long run. Stage-7: Optimization: i) It involves retraining the machine learning model in production whenever new data sources become available.