Physics Program
Instructor: Carlos Andrés Vidal Betancourt
Computational Physics 1
S2 - Programming with Data
Overview
Chapter 2 – Programming with Data
2.1 Introduction
2.2 The computing environment
2.3 Best practices
2.4 Data-centric coding
2.5 Getting help
2.6 Conclusion
Sean Raleigh
Westminster College
… a quote …
2.1 Introduction
<<The most important tool in the data science tool belt is the computer. No amount of statistical or mathematical
knowledge will help you analyze data if you cannot store, load, and process data using technology.>>
<<The aim of this chapter is to introduce you to some aspects of computing and computer programming that are
important for data science applications.>>
<<A project that can be reproduced is one that bundles together the raw data along with all the code used to
take that data through the entire pipeline of loading, processing, cleaning, transforming, exploring,
summarizing, visualizing, and analyzing it.>>
https://github.com/VectorPosse/Programming_with_Data
2.2 The computing environment
<<The choice of hardware for doing data science depends heavily on the task at hand.>>
Hardware
Motherboard: ASUS ROG Strix X399
Processor: AMD Ryzen Threadripper, 16 cores / 32 threads, 4.5 GHz, 132 MB cache
RAM: DDR5, 64 GB, 3.2 GHz
Video card: NVIDIA Titan 2070X, 8 GB DDR6, 21 GHz, 2304 cores
SSD: 1 TB, read/write 3.5 GB/s
Cooling: Master cooler (Graphene)
UPS: 10 kVA
Power supply: EVGA, 800 W
2.2 The computing environment
<<One common definition of big data is any data that is too big to fit in the memory your computer has.>>
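When data does not fit in memory, one common workaround is to stream it and keep only a running summary. The sketch below assumes a hypothetical text file with one number per line; an in-memory `io.StringIO` object stands in for it here, and the function name `running_mean` is purely illustrative.

```python
import io


def running_mean(lines):
    """Return the mean of one number per line, reading the input lazily."""
    count = 0
    total = 0.0
    for line in lines:  # only one line is held in memory at a time
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += float(line)
        count += 1
    return total / count if count else float("nan")


# Stand-in for a large file on disk; an object returned by open(...)
# could be passed to running_mean in exactly the same way.
fake_file = io.StringIO("1.0\n2.0\n3.0\n4.0\n")
mean_value = running_mean(fake_file)
print(mean_value)
```

The same pattern works for any statistic that can be updated one record at a time, such as a sum, a count, a minimum, or a maximum.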
Running a series of sequential simulations of VASP on Miztli
2.2 The computing environment
<<A lot of serious computing is still done at the command line.>>
How to crop pages of a PDF to the greatest enclosing box?
https://www.baeldung.com/linux/pdf-files-crop-cli
2.2 The computing environment
1. Easy to learn
2. Free and open source
3. Third party modules
4. Strong community
5. Compatibility
6. Libraries
7. Speed
<<Python is a general-purpose programming language that was
designed to emphasize code readability and simplicity. While
not originally built for data science applications per se, various
libraries augment Python’s native capabilities: for example,
pandas for storing and manipulating tabular data, NumPy for
efficient arrays, SciPy for scientific programming, and scikit-
learn for machine learning.>>
2.2 The computing environment
IDE:
1. Syntax highlighting
2. Linters clean up code
3. Debugging tools
4. Project management
5. Code completion
6. Version control
<< Notebooks are especially valuable in educational settings. Rather than
having two documents—one containing code and the other explaining the
code, usually requiring awkward references to line numbers in a different
file, notebooks allow students to see code embedded in narrative
explanations that appear right before and after the code. >>
2.3 Best practices
“Coding like poetry should be short and concise.” ― Santosh Kalwar
“Code is like humor. When you must explain it, it’s bad.” – Cory House
“Make it work, make it right, make it fast.” – Kent Beck
1. Write readable code
2. Don’t repeat yourself
<< Abstracting tasks into functions ultimately makes your code more
readable; rather than seeing the guts of a function repeated throughout
a computation, we see them defined and named once, and then that
descriptive name repeated throughout the computation, which makes
the meaning of the code much more obvious.>>
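As a small illustration of this point, the sketch below defines a root-mean-square calculation once and reuses it by name. The function name `rms` and the sample values are made up for the example.

```python
import math


def rms(values):
    """Root-mean-square of a non-empty sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Illustrative sample data (not from the slides).
voltages = [3.0, -4.0, 3.0, -4.0]
currents = [0.5, -0.5, 0.5, -0.5]

# The descriptive name replaces two copies of the same formula.
v_rms = rms(voltages)
i_rms = rms(currents)
print(v_rms, i_rms)
```

If the formula ever needs a fix, there is now exactly one place to change it.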
3. Set seeds for random processes
<< Two computers running the same pseudorandom-
number-generating algorithm starting with the same seed
will produce identical sequences of pseudorandom values.>>
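A minimal sketch of seeding, using Python's standard `random` module. The random-walk function `simulate_walk` is a hypothetical example, not something from the chapter.

```python
import random


def simulate_walk(seed, steps=20):
    """Return the positions of a 1D random walk driven by a seeded generator."""
    rng = random.Random(seed)  # private generator; global state is untouched
    position = 0
    path = []
    for _ in range(steps):
        position += rng.choice((-1, 1))
        path.append(position)
    return path


run_a = simulate_walk(seed=42)
run_b = simulate_walk(seed=42)

# Same algorithm and same seed give the same sequence.
print(run_a == run_b)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the seed local to the simulation, so other code that draws random numbers cannot disturb the result.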
2.3 Best practices
4. Profile, benchmark and optimize judiciously
5. Test your code
6. Don’t rely on black boxes
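One possible way to combine testing and benchmarking is sketched below with the standard `timeit` module: first check that a faster version agrees with a simple reference version, then measure both. The two functions are illustrative, and the timings will vary from machine to machine.

```python
import timeit


def sum_squares_loop(n):
    """Reference implementation: add up k*k for k = 0 .. n-1."""
    total = 0
    for k in range(n):
        total += k * k
    return total


def sum_squares_formula(n):
    """Closed form for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


# Test first: both versions must agree before speed matters.
for n in (0, 1, 2, 10, 1000):
    assert sum_squares_loop(n) == sum_squares_formula(n)

# Then benchmark each version over the same number of calls.
loop_time = timeit.timeit(lambda: sum_squares_loop(1000), number=200)
formula_time = timeit.timeit(lambda: sum_squares_formula(1000), number=200)
print(f"loop: {loop_time:.4f} s, formula: {formula_time:.4f} s")
```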
2.3 Best practices
<< The key problem is that there is no single algorithm that performs best under all circumstances. Every data problem
presents unique challenges. A good data scientist will be familiar not only with a variety of algorithms, but also with the
circumstances under which those algorithms are appropriate. They need to understand any assumptions or conditions
that must apply. They must know how to interpret the output and ascertain to what degree the results are reliable and
valid. Many of the chapters of this book are specifically designed to help the reader avoid some of the pitfalls of using the
wrong algorithms at the wrong times. >>
<< It may not be necessary in all cases to scrutinize every line of code that implements an algorithm. But it is worthwhile
to find a paper that explains the main ideas and theory behind it. By virtue of their training, physicists are in a great
position to read technical literature and make sense of it. >>
<< Another suggestion for using algorithms appropriately is to use some fake data—perhaps data that is simulated to
have certain properties—to test algorithms that are new to you. That way you can check that the algorithms generate the
results you expect in a more controlled environment. >>
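The fake-data idea can be sketched as follows: simulate points from a line with a known slope and intercept, add noise, and check that a least-squares fit recovers the known values. The true parameters, noise level, and the helper `fit_line` are assumptions chosen for the illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Simulated data with known properties (values chosen for the example).
rng = random.Random(0)
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, -1.0
xs = [i / 10 for i in range(200)]
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0, 0.1) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

If the fitted values were far from the known ones, that would point to a bug in the fit or a misunderstanding of the algorithm, which is much harder to spot with real data.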
Physics: theoretical, experimental, computational
2.4 Data-centric coding
<< XML stands for eXtensible Markup Language and uses tags, like HTML
does. It’s a fun exercise to rename a Microsoft Excel file to have a .zip
extension, unzip it, and explore the underlying XML files. >>
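The same exploration can be done without renaming anything, since Python's standard `zipfile` module can open the archive directly. The sketch below builds a small stand-in archive in memory so that it runs without a real workbook; its entry names imitate the usual layout of an .xlsx file, and `list_xml_parts` is a hypothetical helper.

```python
import io
import zipfile


def list_xml_parts(workbook):
    """Return the sorted names of the XML files inside a zip-based workbook."""
    with zipfile.ZipFile(workbook) as archive:
        return sorted(name for name in archive.namelist() if name.endswith(".xml"))


# Stand-in archive, NOT a valid Excel file: just a zip with similar entry names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    archive.writestr("docProps/thumbnail.jpeg", b"")
buffer.seek(0)

xml_parts = list_xml_parts(buffer)
print(xml_parts)
```

Passing the path of a real .xlsx file to `list_xml_parts` should work the same way, because `zipfile.ZipFile` accepts either a path or a file-like object.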
read_excel: the pandas function for reading Excel files in Python
<<Even easier, if you can open the spreadsheet in Excel, you can export it in a
plain text format that’s easier to parse.>>
1. Obtaining data
<< The same file shown in three different plain text formats: CSV (left),
TSV (center), and fixed-width (right) with fields of size 13, 6, and 10.>>
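A sketch of reading the three formats with the standard library. The sample table is made up for the example; only the field widths 13, 6, and 10 come from the caption above. The helpers `read_delimited` and `read_fixed_width` are illustrative names.

```python
import csv
import io

# Illustrative sample data in CSV and TSV form.
csv_text = "name,mass,unit\nelectron,0.511,MeV\nmuon,105.7,MeV\n"
tsv_text = "name\tmass\tunit\nelectron\t0.511\tMeV\nmuon\t105.7\tMeV\n"
WIDTHS = (13, 6, 10)


def read_delimited(text, delimiter):
    """Parse delimiter-separated text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def read_fixed_width(text, widths):
    """Parse fixed-width text by slicing each line at the given field widths."""
    rows = []
    for line in text.splitlines():
        fields, start = [], 0
        for width in widths:
            fields.append(line[start:start + width].strip())
            start += width
        rows.append(fields)
    return rows


# Build the fixed-width version of the same table by padding each field.
fixed_text = "\n".join(
    "".join(field.ljust(width) for field, width in zip(row, WIDTHS))
    for row in read_delimited(csv_text, ",")
)

rows_csv = read_delimited(csv_text, ",")
rows_tsv = read_delimited(tsv_text, "\t")
rows_fixed = read_fixed_width(fixed_text, WIDTHS)
print(rows_csv == rows_tsv == rows_fixed)
```

All three parsers return strings, so numeric columns still need an explicit conversion such as `float(row[1])`.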
<< The go-to web scraping tool in Python is Beautiful Soup.>>
Database: SQL, short for Structured Query Language.
NoSQL databases use a variety of systems to store data, including key-value pairs, document stores, and graphs.
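To give a flavor of SQL from Python, the sketch below uses the standard `sqlite3` module with an in-memory database. The table name, column names, and values are invented for the illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE measurements (sample TEXT, temperature_k REAL)")
connection.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("A", 295.0), ("A", 297.0), ("B", 310.0)],
)

# Average temperature per sample, computed by the database engine.
rows = connection.execute(
    "SELECT sample, AVG(temperature_k) FROM measurements "
    "GROUP BY sample ORDER BY sample"
).fetchall()
connection.close()
print(rows)
```

The `?` placeholders let the driver handle quoting of the inserted values, which is safer than building the SQL string by hand.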
2.4 Data-centric coding
2. Data structures
2.4 Data-centric coding
<< Processing matrices is easy due to advanced linear algebra libraries that make matrix operations very efficient. Python
has the numpy library that defines array-like structures like matrices. >>
A list in Python with three elements: a list of ten numbers, a list of
two strings, and a dictionary with five key-value pairs.
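A list of that shape could be written as in the sketch below; the actual values are placeholders chosen for the example.

```python
# A list with three elements of different types (placeholder values).
record = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],            # a list of ten numbers
    ["alpha", "beta"],                           # a list of two strings
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5},    # a dictionary with five pairs
]

# Unpack the three elements and index into the nested structures.
numbers, labels, mapping = record
print(len(record), numbers[-1], labels[0], mapping["c"])
```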
<< The pandas library was built to handle tabular data. Each column (of one specific type) is called a Series, and a
collection of Series is called a DataFrame. >>
2. Data structures
2.4 Data-centric coding
<< When we obtain data, it’s almost never in a form that is suitable for doing immediate analysis.>>
1. Each set of related observations forms a table.
2. Each row in a table represents an observation.
3. Each column in a table represents a variable.
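As an illustration of these three rules, the sketch below reshapes a "wide" table, where two columns hold the same variable at different times, into a tidy one with a single observation per row. The data and the helper `to_tidy` are hypothetical; in practice a data-frame library would usually do this reshaping.

```python
# Wide layout: the columns t0 and t1 are really values of one variable (time).
wide = [
    {"sample": "A", "t0": 1.0, "t1": 1.5},
    {"sample": "B", "t0": 2.0, "t1": 2.4},
]


def to_tidy(rows, id_key, variable_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy_rows.append(
                {id_key: row[id_key], variable_name: key, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, id_key="sample", variable_name="time", value_name="signal")
for row in tidy:
    print(row)
```

After reshaping, each row is one observation (one sample at one time) and each column is one variable.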
<< Often, the most time-consuming task in the data pipeline is tidying the data, also called cleaning, wrangling, munging,
or transforming. Every dataset comes with its own unique data-cleaning challenges, but there are a few common
problems one can look for.>>
3. Cleaning data: tidy data, missing data, data values, outliers, other issues
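Two of these problems, missing data and outliers, can be sketched with the standard library alone. The raw values are invented, and the rule used to flag outliers (more than five median absolute deviations from the median) is just one assumption among many reasonable choices.

```python
import statistics

# Invented raw readings: two missing entries and one suspicious value.
raw = ["9.79", "9.81", "", "9.80", "NA", "98.1", "9.82"]


def parse(value):
    """Convert text to float, returning None when the entry is not a number."""
    try:
        return float(value)
    except ValueError:
        return None


values = [parse(v) for v in raw]
present = [v for v in values if v is not None]  # drop missing entries

# Flag outliers by their distance from the median, in units of the
# median absolute deviation (MAD). The factor 5 is an assumed threshold.
median = statistics.median(present)
mad = statistics.median(abs(v - median) for v in present)
outliers = [v for v in present if abs(v - median) > 5 * mad]
clean = [v for v in present if v not in outliers]
print(outliers, clean)
```

Whether a flagged value should be dropped, corrected, or kept is a judgment that depends on the data; here 98.1 looks like a misplaced decimal point rather than a real measurement.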
2.4 Data-centric coding
<< The matplotlib library is somewhat analogous to base R graphics: hard to use, but with much more flexibility and fine
control. Other popular and easier-to-use options are seaborn and Bokeh. You can also use the ggplot package,
implemented to emulate R’s ggplot2. >>
4. Exploratory Data Analysis (EDA)
2.5 Getting help
2.6 Conclusion
Find a project and start working on it…
<< Find some data and try to clean it. Raw data is plentiful on the Internet. You might try Kaggle or the U.S. government
site data.gov. (The latter has lots of data in weird formats to give you some practice importing from a wide variety of
file types.) You can find data on any topic by typing that topic in any search engine and appending the word “data.” >>
<< Try some web scraping. Find a web page that interests you, preferably one with some cool data presented in a
tabular format. (Be sure to check that it’s okay to scrape that site. “Open” projects like Wikipedia are safe places to
start.) Find a tutorial for a popular web scraping tool and mimic the code you see there, adapting it to the website
you’ve chosen. Along the way, you’ll likely have to learn a little about HTML and CSS. Store the scraped data in a data
format like a data frame that is idiomatic in your language of choice.>>
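To show what scraping a table involves, the sketch below parses a small HTML string with the standard `html.parser` module instead of a dedicated scraping library, and stores the result as a list of dictionaries (a simple stand-in for a data frame). The HTML snippet and the class `TableParser` are made up for the example, and no network request is made.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented page fragment standing in for a downloaded web page.
html = """
<table>
  <tr><th>planet</th><th>moons</th></tr>
  <tr><td>Earth</td><td>1</td></tr>
  <tr><td>Mars</td><td>2</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Real pages are messier (nested tables, merged cells, markup inside cells), which is one reason dedicated tools such as Beautiful Soup are popular for this job.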