PythonPro

Divya Anne Selvaraj
04 Feb 2025
11 min read

PythonPro #61: Meta’s Llama Flaw, Codon’s NumPy Boost, and Web2Vec for Automated Website Analysis

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#61: Meta’s Llama Flaw, Codon’s NumPy Boost, and Web2Vec for Automated Website Analysis

Hi,

In today’s Expert Insight we bring you an excerpt from the recently published book, Generative AI on Google Cloud with LangChain, which discusses how LLMs generate plausible but sometimes false responses (hallucinations), and demonstrates how structured prompting with LangChain can help mitigate the issue.

News Highlights: Meta's Llama flaw exposes AI servers to remote code execution via Python’s pickle; Codon’s 2025 update brings a faster, open-source NumPy with GPU support; Codegen, a Python library for code refactoring and analysis, is now open source.

My top 5 picks from today’s learning resources:
- Decorator JITs - Python as a DSL⚡
- 100 Common Python Mistakes🐍
- 9 Statistical Distributions Every Data Scientist Should Know📊
- Create an Adaptive Customer Behavior Analytics Dashboard with Claude AI and Python📈
- Security and cryptography algorithms: A guide🔐

And, in From the Cutting Edge, we introduce Web2Vec, a Python library that automates website analysis by extracting over 200 structured parameters through web crawling and direct feature extraction, with potential applications in cybersecurity, SEO, and machine learning.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
- Meta's Llama Framework Flaw Exposes AI Systems to Remote Code Execution Risks: The critical vulnerability exposed AI inference servers to remote code execution due to unsafe deserialization with Python’s pickle module.
- Codon in 2025: New compiler-optimized NumPy implementation. Switching to an open source license: This update leverages Codon’s multithreading, GPU capabilities, and compiler optimizations, achieving significant speed improvements over standard NumPy.
- Codegen is now open source: Codegen is a Python library for advanced code manipulation, enabling large-scale refactoring, pattern enforcement, and static analysis without requiring deep AST knowledge.

💼Case Studies and Experiments🔬
- How I Built a Python RL Trading Bot That Simulated 1150% Profit: Describes building a bot that identifies potential short squeezes using Financial Modeling Prep’s Fail-to-Deliver and historical stock price data.
- An empirical study of developers’ challenges in implementing Workflows as Code: A case study on Apache Airflow: Analyzes 1,000 Stack Overflow posts to categorize challenges into workflow definition, execution, environment setup, quality assurance, security, and optimization, identifying key pain points.

📊Analysis
- Observations: Using Python with DeepSeek-R1: Explores using DeepSeek-R1 for AI tasks, covering API integration, response streaming, and Retrieval-Augmented Generation (RAG) while analyzing its reasoning process.
- Decorator JITs - Python as a DSL: Covers AST-based, bytecode-based, and tracing JITs, showing how they convert Python code into optimized LLVM IR for execution.

🎓Tutorials and Guides🤓
- Security and cryptography algorithms: A guide: Covers block and stream ciphers, hashing, key exchange, public key encryption, and cryptographic libraries, including practical examples using Python.
- TLS and networking: Explains TLS handshakes, encryption, certificate verification, networking layers, HTTP protocols, sockets, firewalls, and secure app deployment. Also discusses mutual TLS, DNS, and network security.
- On Shared Birthdays (and a Bit on Pythagorean Triplets) • Pythonic Code: Demonstrates probability concepts with itertools, collections, and datetime, simulates birthday collisions, analyzes probability with brute-force, and more.
- Make Sick Beats with Python: Explains how to build a simple drum machine in Python using the pygame library, covering setting up the environment, storing and playing sounds, representing music in code, and more.
- Create an Adaptive Customer Behavior Analytics Dashboard with Claude AI and Python: Demonstrates building a dashboard which analyzes uploaded CSV data, generates Python scripts, executes them, and creates visualizations.
- Nine Pico PIO Wats with MicroPython (Part 2): Through debugging techniques and real-world examples, it demonstrates workarounds for PIO limitations while building a theremin-like musical instrument.
- Managing Magento Configurations with PyGento: A Powerful Python CLI Tool for Developers: Explains how the tool integrates with PyGento, provides database access via SQLAlchemy, and automates tasks like searching, viewing, and editing Magento settings without using the admin panel.

🔑Best Practices and Advice🔏
- AI Python Libraries: A centralized resource listing 1,037 libraries with descriptions and use cases for AI development, covering deep learning, machine learning, NLP, and data science.
- Python Code for Automated Log Analysis & Alerting: Covers parsing system, firewall, and web server logs to detect suspicious activity like brute force attacks and malicious IPs.
- 9 Statistical Distributions Every Data Scientist Should Know: Introduces key statistical distributions, explains their characteristics, and provides practical examples.
- Crafting a Winning Conference Talk: Lessons from a PyCon US Reviewer: Provides guidance based on the author's experience as a PyCon US reviewer, outlining common mistakes and recommendations for writing a strong conference talk proposal.
- 100 Common Python Mistakes: Covers logic bugs, inefficient code, non-Pythonic practices, and best practices for readability and performance, using clear examples for each mistake.

🔍From the Cutting Edge: Web2Vec - A Python Library for Website-to-Vector Transformation💥

In "Web2Vec: A Python Library for Website-to-Vector Transformation," D. Frąszczak and E. Frąszczak introduce Web2Vec, a Python library that converts websites into structured vector representations. The library automates feature extraction from web pages, integrating over 200 parameters from website content and structure to enable efficient analysis.

Context

Website processing involves extracting and transforming web data for analysis. This includes web crawling, which systematically navigates websites to collect data, and web scraping, which extracts specific information from web pages. Website feature extraction identifies key attributes such as structure, security settings, and external references, while vectorisation converts unstructured data into numerical formats for computational analysis.

The internet hosts over a billion websites, with millions actively generating data daily. Extracting insights is essential for market research, cybersecurity, and machine learning. While many research papers discuss web data collection, they often rely on custom scripts, leading to inefficiencies. Existing services like WHOIS, SimilarWeb, and Google Search Index provide valuable data but restrict free API access, requiring users to parse raw HTML instead.
Web2Vec addresses these challenges with a unified, open-source solution for automated website data extraction and analysis.

Key Features

Web2Vec offers a structured approach to web data collection and analysis through:
- Automated Website Crawling: Uses Scrapy-based spiders to extract data from single pages or entire websites.
- Comprehensive Feature Extraction: Captures 211 parameters, including URL lexical features, HTML content, HTTP response details, SSL certificates, WHOIS data, and traffic metrics.
- Flexible Data Processing: Supports active crawling and pre-generated HTML snapshots.
- Cybersecurity Applications: Detects phishing sites, misinformation, and suspicious activity by integrating services like PhishTank and OpenPhish.
- Graph-Based Analysis: Visualises website relationships through network graphs for deeper insights.
- Open-Source & Extensible: Available on PyPI (pip install web2vec), with community support for updates and improvements.

What This Means for You

Web2Vec is a valuable tool for professionals and researchers working with web data. Data scientists can automate website feature extraction for large-scale analysis, while cybersecurity professionals can detect phishing and misinformation using structured data and threat intelligence services. SEO and marketing professionals can benefit from its ability to analyse rankings, metadata, and traffic sources. Developers and web scraping practitioners can replace custom scripts with Web2Vec’s unified framework, reducing redundancy and improving efficiency.

Examining the Details

Web2Vec processes websites using two primary methods: direct extraction from pre-generated HTML snapshots and active web crawling with Scrapy-based spiders. It extracts 211 parameters across multiple categories, converting website characteristics into structured vector representations. The extracted data, stored in JSON format, includes details such as URL, title, HTML response, and status codes.
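To make the idea of "URL lexical features" and website-to-vector conversion concrete, here is a minimal sketch using only the Python standard library. It does not use Web2Vec and does not reflect its API: the function names, the choice of features, and the example URL are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def url_lexical_features(url: str) -> dict:
    """Hypothetical helper: compute a few lexical features of a URL.

    This is NOT Web2Vec's API, only an illustration of the kind of
    feature a website-to-vector tool might derive from a URL string.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_path_segments": len([s for s in parsed.path.split("/") if s]),
        "uses_https": int(parsed.scheme == "https"),
        "has_query": int(bool(parsed.query)),
    }


def to_vector(features: dict) -> list:
    """Flatten the feature dict into a numeric vector with a fixed (sorted) key order."""
    return [float(features[name]) for name in sorted(features)]


# Example URL invented for this sketch
features = url_lexical_features("https://example.com/blog/2025/post?id=42")
vector = to_vector(features)
print(features)
print(vector)
```

Sorting the keys keeps the vector layout stable across URLs, so every position means the same thing for every website in a dataset.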
Users can either crawl entire websites using the Web2VecSpider or extract specific parameters through direct method invocation. Web2Vec also supports visualising website relationships using directed graphs, making it particularly useful for cybersecurity applications such as phishing detection and misinformation analysis.

You can learn more by reading the entire paper or accessing the library on GitHub.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 3: Grounding Responses” in the book, Generative AI on Google Cloud with LangChain by Leonid Kuligin, Jorge Zaldívar, and Maximilian Tschochohei.

Overcoming hallucinations in LLMs

LLMs are trained on large amounts of publicly available data (please take a look at Appendix 1 for more details on how LLMs are trained). By design, they absorb the information they are explicitly given in the prompt and the information they have seen during the training.

By default, LLMs don’t have access to any external information (except for what they have already memorized), and in most cases, they’re autoregressive models (they predict output words or tokens one by one by looking at the previous input, and this limits their reasoning capabilities). We’ll see some examples of how we can expand an LLM’s reasoning capabilities with agentic workflows in Chapters 9, 10, and 11.

Simply put, LLMs will use the information from their training to respond to a prompt. LLMs effectively reproduce human language and, therefore, their answers sound very credible even if they are just a probabilistic continuation of the prompt. This can be correct if the training corpus has the right information, or it can just be plausible, but factually wrong. If it’s the latter, the LLM is said to hallucinate.

Let’s take a simple example:

Prompt: Tell me about the black cucumbers.

Response:

Black Cucumbers

Black cucumbers, scientifically known as Cucumis sativus var. Nigra, are an intriguing variety of cucumber characterized by their unique dark green to black skin. They are not genetically modified but instead occur naturally due to a higher concentration of chlorophyll.

Origin and Cultivation

Black cucumbers are native to Europe and Asia, particularly Russia and China. They are grown in warm, moist climates and require well-drained soil rich in organic matter. The fruits typically mature within 60-70 days after planting.

Physical Characteristics

Skin: Deep green to black, with occasional lighter... (The response was truncated because it has reached the token limit. Try to increase the token limit if you need a longer response.)

At least to my knowledge, black cucumbers don’t exist. But the LLM has produced a response based on a description of actual cucumbers that looks plausible and real. In the summer of 2023, a US lawyer used an LLM to answer legal questions. The LLM cited non-existent cases, but they looked so trustworthy that the lawyer used them in court and got into trouble [1].

Sometimes hallucinations can be overcome by prompt engineering. Let’s look at the following prompt template:

    from langchain.chains import LLMChain
    from langchain.prompts.prompt import PromptTemplate
    from langchain_google_vertexai import VertexAI

    # Vertex AI model wrapper; requires Google Cloud credentials to run
    llm = VertexAI(model_name="gemini-1.0-pro", temperature=0.8, max_output_tokens=128)

    # The template asks the model to check existence before describing
    template = """Describe {plant}.
    First, think whether {plant} exist.
    If {plant} don't exist, answer "I don't have enough information about {plant}".
    Otherwise, give their title, a short summary and then talk about origin and cultivation.
    After that, describe their physical characteristics.
    """

    prompt_template = PromptTemplate(
        input_variables=["plant"],
        template=template,
    )

    chain = LLMChain(llm=llm, prompt=prompt_template)
    chain.run(plant="black cucumbers")

If we run this chain, we’ll get a relevant answer:

    I don't have enough information about black cucumbers.

You can double-check and ask the question about green cucumbers to make sure that the LLM will give a correct answer with this prompt.

Hallucinations

Hallucinations are one of the key problems that the industry is facing at the moment.

The good news: There are ways to significantly reduce hallucination rates, and we’re going to discuss them in this and the next chapters.

The bad news: Any generative AI (GenAI) might produce hallucinations, and you need to evaluate and monitor them during application development. We’ll talk about evaluation in Chapter 14.

Generative AI on Google Cloud with LangChain was published in December 2024. Packt library subscribers can continue reading the entire book for free.

eBook: $24.99 (was $35.99). Print book: $44.99.

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!

Divya Anne Selvaraj
28 Aug 2024
14 min read

PythonPro #44: Generative AI with PyTorch, uv Update, Choosing the Best Visualization Type, and FastAPI for Rapid Development

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#44: Generative AI with PyTorch, uv Update, Choosing the Best Visualization Type, and FastAPI for Rapid Development

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Generative AI Foundations in Python, which provides a hands-on guide to implementing generative AI models (GANs, diffusion models, and transformers) using PyTorch and the diffusers library.

News Highlights: The uv Python packaging tool now offers comprehensive project management, tool installation, and support for single-file scripts; and Tach, written in Rust, enforces strict interfaces and dependency management for Python.

Here are my top 5 picks from our learning resources today:
- Visualisation 101 - Choosing the Best Visualisation Type📊
- Using ffmpeg, yt-dlp, and gpt-4o to Automate Extraction and Explanation of Python Code from YouTube Videos🎥🤖
- FastAPI Tutorial - Build APIs with Python in Minutes🚀
- Flatten JSON data with different methods using Python📦
- Linear Algebra Concepts Every Data Scientist Should Know📐

And, in today’s Featured Study, we introduce PyRoboCOP, a Python-based package designed for optimizing robotic control and collision avoidance in complex environments.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: We have covered all requests made so far this month in this issue.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
- uv: Unified Python packaging: The tool now offers end-to-end project management, tool installation, Python bootstrapping, and support for single-file scripts with embedded dependencies, all within a unified, fast, and reliable interface.
- Tach - Strict interfaces and dep management for Python, written in Rust: Inspired by modular monolithic architecture, Tach allows you to define dependencies and ensures that modules only import from authorized dependencies.

💼Case Studies and Experiments🔬
- Using ffmpeg, yt-dlp, and gpt-4o to Automate Extraction and Explanation of Python Code from YouTube Videos: Details downloading video segments, capturing screenshots, extracting code from images using GPT, and then explaining the code with an LLM.
- Packaging Python and PyTorch for a Machine Learning Application: Discusses the challenges of packaging Python and PyTorch for the Transformer Lab application, aiming for a seamless user experience across various operating systems.

📊Analysis
- 🎥Charlie Marsh on Astral, uv, and the Python packaging ecosystem: Discusses insights on the development of Astral's uv tool, a cargo-like tool for Python, following a significant upgrade.
- CPython Compiler Hardening: Outlines the author’s process of selecting and testing compiler options, addressing challenges like excessive warnings, performance impacts, and developing tools to track and manage these warnings.

🎓Tutorials and Guides🤓
- Flatten JSON data with different methods using Python: Techniques discussed include using pandas' json_normalize, recursive functions, the flatten_json library, custom functions, and tools like PySpark and SQL.
- FastAPI Tutorial - Build APIs with Python in Minutes: Guides you through setting up a development environment, creating a FastAPI app, building a logistic regression classifier, defining data models with Pydantic, and setting up API endpoints for predictions.
- What's the deal with setuptools, setup.py, pyproject.toml, and wheels?: Provides a detailed explanation of Python packaging tools and practices, offering insights and recommendations for how to approach packaging in modern projects.
- Python's Preprocessor: Debunks the myth that Python lacks a preprocessor by demonstrating how Python can be extended and customized through the use of custom codecs and path configuration files.
- 📖Open Access Book | Kalman and Bayesian Filters in Python: Addresses the need for a practical introduction to Kalman filtering, offering accessible explanations and examples, along with exercises with answers and supporting libraries.
- Python Backend Development - A Complete Guide for Beginners: Provides a step-by-step guide to building web applications, including advanced topics like asynchronous programming, performance optimization, and real-time data handling.
- Working with Excel Spreadsheets in Python: Focuses on automating tasks using the openpyxl module. Read to learn about reading, writing, modifying, and formatting Excel files, and advanced features like plotting charts and integrating images.

🔑Best Practices and Advice🔏
- Visualisation 101 - Choosing the Best Visualisation Type: Explores how visualizations improve data-driven decisions, focusing on understanding context, audience, and visual perception. Read to learn how to implement visualizations.
- Simone's Creative Cooking Club • If You Haven't Got a Clue What "Pass by Value" or "Pass by Reference" Mean, Read On…: Demonstrates how Python handles function arguments, particularly mutable and immutable objects.
- How I ask GPT-4 to make tiny Python scripts in practice: Succinctly describes starting with a basic script, then converting it into a command-line interface using click, and adding features like stdin/stdout handling and error logging.
- Linear Algebra Concepts Every Data Scientist Should Know: Introduces key concepts such as vectors, vector operations, vector spaces, and matrices, with visual explanations and code examples to demonstrate their application in real-world data science tasks.
- 🎥Python From a Java Developer's Perspective: Provides guidance for Java developers to write Python code effectively. Watch to learn how to smoothly transition between Java and Python while leveraging your existing Java knowledge.

🔍Featured Study: Mastering Robotic Control with PyRoboCOP for Complex Tasks💥

In “PyRoboCOP: Python-based Robotic Control & Optimization Package for Manipulation and Collision Avoidance” Raghunathan et al. introduce a Python-based software package designed for the optimisation and control of robotic systems.
The package excels in handling complex interactions like contact and collision avoidance, crucial for autonomous robotic manipulation.

Context

Robotic systems often operate in environments with numerous obstacles and objects, making it essential to model and optimise these interactions mathematically. These interactions, defined by complementarity constraints, are challenging to manage because they do not follow standard optimisation assumptions. Most existing physics engines simulate these interactions but do not offer real-time optimisation capabilities. PyRoboCOP addresses this gap by providing a flexible and user-friendly package that allows robots to reason about their environment and optimise their behaviour, which is critical for achieving autonomous manipulation tasks.

Key Features of PyRoboCOP

PyRoboCOP is characterised by its ability to automatically reformulate complex mathematical constraints and integrate seamlessly with powerful optimisation tools. Key features include:
- Automatic Reformulation of Complementarity Constraints: Handles difficult constraints that describe object interactions.
- Direct Transcription via Orthogonal Collocation: Converts differential-algebraic equations (DAEs) into a solvable set of nonlinear equations.
- Integration with ADOL-C and IPOPT: Supports automatic differentiation and efficient optimisation.
- Built-in Support for Contact and Obstacle Avoidance Constraints: Simplifies the setup of complex robotic tasks.
- Flexible User Interface: Allows for customisation and adaptation to various robotic systems.

What This Means for You

The package is particularly relevant for researchers, developers, and engineers working in the field of robotics, especially those involved in designing autonomous systems that require precise control and optimisation.
PyRoboCOP’s ability to handle complex robotic interactions makes it a valuable tool for developing real-time, model-based control solutions in environments where contact and collision avoidance are critical.

Examining the Details

PyRoboCOP's performance was rigorously tested across several robotic scenarios, including planar pushing, car parking, and belt drive unit assembly. In a planar pushing task, PyRoboCOP optimised the robot's trajectory, balancing a normal force of 0.5 N and a friction coefficient of 0.3, successfully navigating from (0, 0, 0) to (0.5, 0.5, 0) and (−0.1, −0.1, 3π/2). In a car parking scenario, the software optimised movement from (1, 4, 0, 0) to (2, 2.5, π/2, 0), effectively avoiding obstacles. PyRoboCOP also managed the complex task of assembling a belt drive unit, demonstrating its ability to handle intricate manipulations. When benchmarked against CasADi and Pyomo, PyRoboCOP showed comparable performance, solving an acrobot system in a mean time of 2.282 seconds with 1,296 variables, versus CasADi's 1.175 seconds with 900 variables and Pyomo's 2.374 seconds with 909 variables.

You can learn more by reading the entire paper or accessing the package here.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 2: Surveying GenAI Types and Modes: An Overview of GANs, Diffusers, and Transformers” in the book, Generative AI Foundations in Python by Carlos Rodriguez, published in July 2024.

Applying GAI models: image generation using GANs, diffusers, and transformers

In this hands-on section…

You’ll get a first-hand experience and deep dive into the actual implementation of generative models, specifically GANs, diffusion models, and transformers….

We’ll be utilizing the highly versatile PyTorch library, a popular choice among machine learning practitioners, to facilitate our operations. PyTorch provides a powerful and dynamic toolset to define and compute gradients, which is central to training these models.

In addition, we’ll also use the diffusers library. It’s a specialized library that provides functionality to implement diffusion models. This library enables us to reproduce state-of-the-art diffusion models directly from our workspace. It underpins the creation, training, and usage of denoising diffusion probabilistic models at an unprecedented level of simplicity, without compromising the models’ complexity.

Through this practical session, we’ll explore how to operate and integrate these libraries and implement and manipulate GANs, diffusers, and transformers using the Python programming language. This hands-on experience will complement the theoretical knowledge we have gained in the chapter, enabling us to see these models in action in the real world….

Working with Jupyter Notebook and Google Colab

Jupyter notebooks enable live code execution, visualization, and explanatory text, suitable for prototyping and data analysis. Google Colab, conversely, is a cloud-based version of Jupyter Notebook, designed for machine learning prototyping. It provides free GPU resources and integrates with Google Drive for file storage and sharing. We’ll leverage Colab as our prototyping environment going forward.

Stable diffusion transformer

We begin with a pre-trained stable diffusion model, a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI, and LAION (Patil et al., 2022). The diffusion process is used to draw samples from complex, high-dimensional distributions, and when it interacts with the text embeddings, it creates a powerful conditional image synthesis model.

The term “stable” in this context refers to the fact that during training, a model maintains certain properties that stabilize the learning process.
Stable diffusion models offer rich potential to create entirely new samples from a given data distribution, based on text prompts.

Again, for our practical example, we will use Google Colab to avoid a lot of initial setup. Colab also provides all of the computational resources needed to begin experimenting right away. We start by installing some libraries, and with three simple functions, we will build out a minimal StableDiffusionPipeline using a well-established open-source implementation of the stable diffusion method.

First, let’s navigate to our pre-configured Python environment, Google Colab, and install the diffusers open-source library, which will provide most of the key underlying components we need for our experiment.

In the first cell, we install all dependencies using the following bash command. Note the exclamation point at the beginning of the line, which tells our environment to reach down to its underlying process and install the packages we need:

    !pip install pytorch-fid torch diffusers clip transformers accelerate

Next, we import the libraries we’ve just installed to make them available to our Python program:

    from typing import List
    import torch
    import matplotlib.pyplot as plt
    from PIL import Image
    from diffusers import StableDiffusionPipeline, DDPMScheduler

Now, we’re ready for our three functions, which will execute the three tasks: loading the pre-trained model, generating the images based on prompting, and rendering the images:

    def load_model(model_id: str) -> StableDiffusionPipeline:
        """Load model with provided model_id."""
        return StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            revision="fp16",
            use_auth_token=False,
        ).to("cuda")

    def generate_images(pipe: StableDiffusionPipeline, prompts: List[str]) -> List[Image.Image]:
        """Generate images based on provided prompts."""
        with torch.autocast("cuda"):
            # The pipeline returns a list of PIL images, one per prompt
            images = pipe(prompts).images
        return images

    def render_images(images: List[Image.Image]):
        """Plot the generated images."""
        plt.figure(figsize=(10, 5))
        for i, img in enumerate(images):
            # One row, two columns: this layout assumes two prompts
            plt.subplot(1, 2, i + 1)
            plt.imshow(img)
            plt.axis("off")
        plt.show()

In summary, load_model loads a machine learning model identified by model_id onto a GPU for faster processing. The generate_images function takes this model and a list of prompts to create our images. Within this function, you will notice torch.autocast("cuda"), which is a special command that allows PyTorch (our underlying machine learning library) to perform operations faster while maintaining accuracy. Lastly, the render_images function displays these images in a simple grid format, making use of the matplotlib visualization library to render our output.

With our functions defined, we select our model version, define our pipeline, and execute our image generation process:

    # Execution
    model_id = "CompVis/stable-diffusion-v1-4"
    prompts = [
        "A hyper-realistic photo of a friendly lion",
        "A stylized oil painting of a NYC Brownstone",
    ]
    pipe = load_model(model_id)
    images = generate_images(pipe, prompts)
    render_images(images)

The output in Figure 2.1 is a vivid example of the imaginativeness and creativity we typically expect from human art, generated entirely by the diffusion process. But how do we measure whether the model was faithful to the text provided?

Figure 2.1: Output for the prompts “A hyper-realistic photo of a friendly lion” (left) and “A stylized oil painting of a NYC Brownstone” (right)

The next step is to evaluate the quality and relevance of our generated images in relation to the prompts. This is where CLIP comes into play. CLIP is designed to measure the alignment between text and images by analyzing their semantic similarities, giving us a true quantitative measure of the fidelity of our synthetic images to the prompts.

Scoring with the CLIP model

CLIP is trained to understand the relationship between text and images by learning to place similar images and text near each other in a shared space.
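The notion of a "shared space" can be illustrated with a toy sketch: if an image and each caption are represented as vectors, closeness can be measured with cosine similarity. The three-dimensional vectors and the helper function below are invented for illustration and are not produced by CLIP. Real CLIP embeddings have many more dimensions, and the model additionally scales the similarities by a learned factor before reporting them as logits.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, made up for this sketch (not real CLIP outputs)
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a lion": [0.8, 0.2, 0.1],
    "a painting of a house": [0.1, 0.9, 0.3],
}

# Score the image against every caption and pick the closest one
scores = {
    text: cosine_similarity(image_embedding, vec)
    for text, vec in text_embeddings.items()
}
best_match = max(scores, key=scores.get)
print(scores)
print(best_match)
```

The caption whose vector points in nearly the same direction as the image vector gets the highest score, which is the intuition behind "placing similar images and text near each other".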
When evaluating a generated image, CLIP checks how closely the image aligns with the textual description provided. A higher score indicates a better match, meaning the image accurately represents the text. Conversely, a lower score suggests a deviation from the text, indicating a lesser quality or fidelity to the prompt, providing a quantitative measure of how well the generated image adheres to the intended description.

Again, we will import the necessary libraries:

    from typing import List, Tuple
    from PIL import Image
    import requests
    from transformers import CLIPProcessor, CLIPModel
    import torch

We begin by loading the CLIP model, processor, and necessary parameters:

    # Constants
    CLIP_REPO = "openai/clip-vit-base-patch32"

    def load_model_and_processor(model_name: str) -> Tuple[CLIPModel, CLIPProcessor]:
        """Loads the CLIP model and processor."""
        model = CLIPModel.from_pretrained(model_name)
        processor = CLIPProcessor.from_pretrained(model_name)
        return model, processor

Next, we define a processing function to adjust the textual prompts and images, ensuring that they are in the correct format for CLIP inference:

    def process_inputs(processor: CLIPProcessor, prompts: List[str],
                       images: List[Image.Image]) -> dict:
        """Processes the inputs using the CLIP processor."""
        return processor(text=prompts, images=images,
                         return_tensors="pt", padding=True)

In this step, we initiate the evaluation process by inputting the images and textual prompts into the CLIP model. This is done in parallel across multiple devices to optimize performance. The model then computes similarity scores, known as logits, for each image-text pair. These scores indicate how well each image corresponds to the text prompts.
To interpret these scores more intuitively, we convert them into probabilities, which indicate how likely an image is to match each of the given prompts relative to the others:

    def get_probabilities(model: CLIPModel, inputs: dict) -> torch.Tensor:
        """Computes the probabilities using the CLIP model."""
        outputs = model(**inputs)
        logits = outputs.logits_per_image
        # Define temperature - higher temperature will make the distribution more uniform.
        T = 10
        # Apply temperature to the logits
        temp_adjusted_logits = logits / T
        probs = torch.nn.functional.softmax(temp_adjusted_logits, dim=1)
        return probs

Lastly, we display the images along with their scores, visually representing how well each image adheres to the provided prompts:

    def display_images_with_scores(images: List[Image.Image], scores: torch.Tensor) -> None:
        """Displays the images alongside their scores."""
        # Set print options for readability
        torch.set_printoptions(precision=2, sci_mode=False)
        for i, image in enumerate(images):
            print(f"Image {i + 1}:")
            display(image)  # display() is provided by IPython/Jupyter notebooks
            print(f"Scores: {scores[i, :]}")
            print()

With everything detailed, let’s execute the pipeline as follows:

    # Load CLIP model
    model, processor = load_model_and_processor(CLIP_REPO)
    # Process image and text inputs together
    inputs = process_inputs(processor, prompts, images)
    # Extract the probabilities
    probs = get_probabilities(model, inputs)
    # Display each image with corresponding scores
    display_images_with_scores(images, probs)

We now have scores for each of our synthetic images that quantify the fidelity of the synthetic image to the text provided, based on the CLIP model, which interprets both image and text data as one combined mathematical representation (or geometric space) and can measure their similarity.

Figure 2.2: CLIP scores

For our “friendly lion,” we computed scores of 83% and 17% for each prompt, which we can interpret as an 83% likelihood that the image aligns with the first prompt.

Packt library subscribers can continue reading the entire book for free.
You can buy Generative AI Foundations in Python by Carlos Rodriguez, here.

Get the eBook for $31.99 $21.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

Divya Anne Selvaraj
24 Sep 2024
12 min read

PythonPro #48: Python 3.13 JIT, Boosting Model Inference, and FastAPI Best Practices

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#48: Python 3.13 JIT, Boosting Model Inference, and FastAPI Best Practices

3 Days. 25+ AI Experts. 30+ Sessions.

Join the Generative AI In Action conference from Nov 11-13 (LIVE | Virtual) and gain insights from top AI leaders across over 30 sessions. Explore key topics including GenAI tools, AI Agents, Open-Source LLMs, Small Language Models, LLM fine-tuning, and many more! This is your opportunity to dive deep into cutting-edge AI strategies and technologies.

Save 40% with our Early Bird offer using code BIGSAVE40 – don’t miss out!

Secure Your Seat Today!

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Machine Learning and Generative AI for Marketing, which discusses how to create effective prompts for Zero-Shot Learning to generate high-quality marketing content.

News Highlights: Opik, a new open-source LLM evaluation tool, integrates with CI/CD, and Model2Vec, a newly launched library, boosts CPU inference 500x and cuts model size by 15x.

Here are my top 5 picks from our learning resources today:
Frankenstein’s Ice cream shop🍦
Python 3.13 Preview: Free Threading and a JIT Compiler⚙️
Graph RAG into Production — Step-by-Step🧩
FastAPI Best Practices and Design Patterns - Building Quality Python APIs🛠️
From Spreadsheets to SDMX Effortless with Python and .Stat Suite📊

And today’s Featured Study examines the performance of open-source models like Mistral and LLaMa and provides insights into the hardware needed for efficient deployment, using GPUs and optimisation techniques such as quantisation.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: With this issue, we have finished covering all content requests made through the September feedback survey. Stay tuned for next month's survey.

Sign Up | Advertise

What changed in the way you code for 2024?
What has happened in the tech world in the last months?

Take this shorter version of the Developer Nation survey, learn about new tools, influence the future of development and share your insights with the world!

What’s in it for you?
A virtual goody bag packed with cool resources
The more questions you answer the more chances you have to win amazing prizes including a Samsung Galaxy Watch 7!

Take the Survey now!

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

Opik, an open source LLM evaluation framework: The platform can be used for developing, evaluating, and monitoring LLM applications and offers features such as LLM call tracing, annotation, automated evaluation, and integration into CI/CD pipelines.
Model2Vec: Distill a Small Fast Model from any Sentence Transformer: Model2Vec is a Python library that distills sentence transformers into small static embeddings, making inference 500x faster on CPU and reducing model size by 15x.

💼Case Studies and Experiments🔬

Integrated Python and GIS Approach for Geomorphometric Investigation of Man River Basin, Western Madhya Pradesh, India: Analyzes the tectonic influence on the Man River Basin's development using satellite imagery, GIS software, and Python to compute and study geomorphometric indices.
Frankenstein’s Ice cream shop: Details how to automate the cleaning of messy Excel sheets using Python's Pandas library, focusing on a made-up ice cream sales commission dataset.

📊Analysis

The Python Package Index Should Get Rid Of Its Training Wheels: Discusses the challenges of PyPI's exponentially growing storage needs, particularly due to prebuilt binaries, and suggests leveraging modern build tools.
UV — I am (somewhat) sold: Initially skeptical, the author of this article found UV useful for handling multiple Python versions, dependency management, and simplifying their development setup.

🎓Tutorials and Guides🤓

Python 3.13 Preview: Free Threading and a JIT Compiler: Demonstrates the key new features in Python 3.13, including free threading, which
makes the GIL optional, and a JIT compiler that compiles Python code into machine code.
Graph RAG into Production — Step-by-Step: Discusses how to implement Graph Retrieval-Augmented Generation (Graph RAG) in production using a fully serverless, parallelized approach without using a graph database.
Python Virtual Environments: A Primer: Covers how to create, activate, and manage virtual environments, explaining their importance for isolating dependencies, avoiding conflicts, and ensuring reproducibility.
Python for Network Programming — A Beginner’s Overview: Explains key concepts such as sockets, TCP, and UDP protocols, and walks you through practical examples of building TCP and UDP client-server applications.
Mastering ChatGPT’s Function Call API - The Smart Way and the… Not-So-Smart Way (in Python): Explains how to use ChatGPT's function call API for automating tasks in Python.
Git With Python HowTo GitPython Tutorial And PyGit2 Tutorial: Covers installation, exception handling, and common tasks like cloning, committing, branching, tagging, and pushing changes.
🎥Program a RAG LLM Chat App with LangChain + Streamlit + *o1, GTP-4o and Claude 3.5 API: Covers loading custom documents, integrating website content into LLM queries, and creating a web app that enables users to interact with GPT-4 and Claude models.

🔑Best Practices and Advice🔏

Counting Sheep with Contracts in Python: Discusses using code contracts to enhance software development by ensuring preconditions and postconditions are met, making the code safer and easier to maintain.
FastAPI Best Practices and Design Patterns - Building Quality Python APIs: Discusses applying SOLID principles and design patterns like DAO and Service Layer to build clean, maintainable, and scalable APIs using FastAPI.
Recently I read a few articles and have a few questions: Covers managing dependencies without tools like Poetry, and handling Python version installations, particularly when a preferred version lacks an official
installer.
Unlocking the Magic of Docstrings: Introduces the power of Python docstrings for documenting code, enhancing readability, and providing functionality like automatic documentation generation and testing.
From Spreadsheets to SDMX Effortless with Python and .Stat Suite: Highlights the importance of SDMX adoption for efficient data sharing among institutions and presents a step-by-step case study using World Bank data.

🔍Featured Study: Deploying Open-Source Large Language Models Efficiently💥

The study "Deploying Open-Source Large Language Models: A Performance Analysis", conducted by Bendi-Ouis et al., compares the performance of open-source large language models. The study aims to assist organisations in evaluating the hardware requirements for efficiently deploying models like Mistral and LLaMa.

Context

Since the release of ChatGPT in November 2022, there has been growing interest in deploying large language models. Many organisations and institutions are keen to harness LLMs, but the computational demands remain a challenge. While proprietary models require substantial resources, open-source models like Mistral and LLaMa provide alternatives that may be deployed with less hardware.
This study explores how different hardware configurations and optimisation techniques, such as quantisation, can make these models more accessible for public and private entities.

Key Findings

The study used two types of GPUs: NVIDIA V100 16GB and NVIDIA A100 40GB, with tests conducted on models like Mistral-7B, Codestral-22B, Mixtral-8x7B, Mixtral-8x22B, and LLaMa-3-70B.
Mistral-7B generated 119 tokens in 1.9 seconds with one request, but took 72.1 seconds with 128 requests on two V100 16GB GPUs.
Codestral-22B produced 63 tokens in 2.3 seconds with one request but took 96.2 seconds with 128 requests on an A100 40GB GPU.
Larger models like Mixtral-8x22B and LLaMa-3-70B faced slower generation times as context size and simultaneous requests increased.
Quantising models to 4 or 6 bits helped reduce the memory load while maintaining performance, with negligible loss in accuracy for models with up to 70 billion parameters.

What This Means for You

For organisations and developers seeking to deploy LLMs, this analysis provides valuable insights into the hardware requirements and optimisation techniques necessary for efficient deployment. With moderate hardware investments, open-source models can perform competitively, reducing dependency on proprietary systems and enabling better control over digital resources. This supports digital sovereignty and cost-effective deployment of advanced AI technologies.

Examining the Details

The researchers focused on GPU performance and model quantisation to measure how efficiently LLMs could be deployed. Using vLLM, a Python library designed for inference optimisation, the study tested multiple models and configurations.
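Why lowering the number of bits per weight matters can be seen with simple arithmetic: the weights alone need roughly (number of parameters × bits per weight) / 8 bytes. The sketch below is a back-of-the-envelope estimate, not a figure from the study. It assumes the nominal parameter counts implied by the model names (7 billion and 70 billion) and ignores activations, the KV cache, and framework overhead, so real requirements are higher. The function name is hypothetical.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights only, in GB (10**9 bytes)."""
    bytes_needed = num_parameters * bits_per_weight / 8
    return bytes_needed / 1e9


# Nominal parameter counts taken from the model names (an approximation)
models = [("Mistral-7B", 7e9), ("LLaMa-3-70B", 70e9)]

for name, params in models:
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{size:.1f} GB for weights")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is what lets a model that would otherwise need several GPUs fit on fewer or smaller ones.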
For instance, Mistral-7B, when run on two V100 16GB GPUs, showed an increase in response time with higher numbers of simultaneous requests, highlighting the challenge of scaling for larger user bases.

Quantisation emerged as a key method to reduce computational load, allowing models to use less memory by lowering precision from 16 or 32 bits to 4 or 8 bits. This method was effective for larger models, maintaining performance without significant loss in accuracy.

The study concluded that, although proprietary solutions like ChatGPT require significant resources, open-weight models like Mistral and LLaMa can deliver strong performance with commercially available GPUs. By deploying these models with vLLM and quantisation techniques, organisations can achieve scalable, efficient AI deployment without excessive hardware costs.

You can learn more by reading the entire paper here.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 9: Creating Compelling Content with Zero-Shot Learning” in the book, Machine Learning and Generative AI for Marketing by Yoon Hyup Hwang and Nicholas C. Burtch, published in August 2024.

Creating an effective prompt

Creating an effective prompt is the most crucial step in leveraging ZSL for marketing copy. In ZSL, the prompt effectively becomes the instruction manual for a model, telling it what kind of content to generate, as well as its style, tone, and substance.

The following are some guidelines around how to formulate prompts that will elicit the best possible marketing copy content from the model:

Clarity: Ensure that your prompt is specific about what you want, whether it’s a product description, headline, or call to action.
Contextual: Provide sufficient background to guide a model.
For eco-friendly products, mention key selling points like sustainability or biodegradability.
Creative: While clarity is crucial, leaving room for creativity can yield surprising and innovative results. Phrases like “Imagine...” or “Create a story where...” can be particularly powerful.
Concise: Lengthy prompts can dilute the focus. Aim for brevity while including essential details, ensuring that a model stays on topic.

In the following sections, we will illustrate the impact of prompt quality through examples, with different types of marketing copy. While good prompts elicit detailed, relevant, and engaging content, poor prompts can lead to vague and uninspiring outputs. To generate these responses, we will define the following function:

    from openai import OpenAI

    # Client setup, assumed to have been done earlier in the chapter;
    # OpenAI() reads the OPENAI_API_KEY environment variable.
    client = OpenAI()

    def generate_response(prompt, model="gpt-4", max_tokens=100, temperature=0.7, n=1):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            n=n,
            temperature=temperature)
        return response.choices[0].message.content

This function will be used with different prompt types in the examples that follow.

Example 1: Product descriptions

In this example, we will generate product descriptions for our e-commerce brand, which is launching new lines of eco-friendly kitchenware.

The following is an example of a poor prompt:

    poor_product_description = "Talk about bamboo cutlery."
    generate_response(poor_product_description)

This produces:

Bamboo cutlery is a type of eating utensil made from bamboo, a fast-growing renewable resource. This form of cutlery includes items such as forks, knives, spoons, and chopsticks. They are an eco-friendly alternative to plastic disposable cutlery due to their reusable nature and biodegradable properties.
Bamboo cutlery is lightweight and durable, making it a popular choice for camping, picnics, and travel.

Now, let’s look at the following example of a good prompt:

    good_product_description = "Write a captivating description for a bamboo cutlery set designed for eco-conscious consumers, emphasizing its sustainability and style."
    generate_response(good_product_description)

This prompt produces the following output:

Discover the perfect harmony of sophistication and sustainability with our Bamboo Cutlery Set. Designed for the eco-conscious consumer, this set is not only a statement of your commitment to the environment but also a reflection of your exquisite taste. Each piece is crafted from 100% organic bamboo, a renewable resource that grows significantly faster than traditional hardwoods. This set boasts a sleek, minimalist design that perfectly complements any table setting, from casual picnic to elegant dinner parties.

From a marketing perspective, this example demonstrates the significance of creating detailed and audience-specific prompts with clear requirements when using ZSL for product descriptions, as well as how this leads to more specificity in the generated response. However, it is worth noting that older consumers may value more straightforward, factual information and, therefore, may favor the more generic prompt’s response from an engagement standpoint. Tailoring GenAI outputs at the level of the individual consumer can be crucial as well and is a topic discussed in Chapter 11.

Packt library subscribers can continue reading the entire book for free. You can buy Machine Learning and Generative AI for Marketing, here.

Get the eBook for $39.99 $27.98!

Other Python titles from Packt at 30% off
Get the eBook for $35.99 $24.99!
Get the eBook for $25.99 $17.99!
Get the eBook for $35.99 $24.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!

Divya Anne Selvaraj
29 Oct 2024
11 min read

PythonPro #53: FastAPI on Docker, Python-CUDA Integration with Numbast, and Concurrent Requests with httpx vs aiohttp

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#53: FastAPI on Docker, Python-CUDA Integration with Numbast, and Concurrent Requests with httpx vs aiohttp

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, FastAPI Cookbook, which explains how to deploy FastAPI apps using Docker, covering Dockerfile creation, image building, and container generation.

News Highlights: Numbast simplifies Python-CUDA C++ integration by auto-generating Numba bindings for CUDA functions; and DJ Beat Drop enhances Django’s new developer onboarding with a streamlined project initializer.

Concurrent Requests in Python: httpx vs aiohttp🚦
Python Thread Safety: Using a Lock and Other Techniques🔐
Time-Series Data Meets Blockchain: Storing Time-Series Data with Solidity, Ganache and Python⛓️
Let's Eliminate General Bewilderment • Python's LEGB Rule, Scope, and Namespaces🧩
Optimization of Iceberg Table In AWS Glue🧊

And today’s Featured Study introduces LSS-SKAN, a Kolmogorov–Arnold Network (KAN) variant that uses a single-parameter function (Shifted Softplus) to improve both accuracy and speed.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: Thank you to those who participated in this month's survey. With this issue, we have tried to fulfill at least one request made by each participant.
Keep an eye out for next month's survey.

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

Bridging the CUDA C++ Ecosystem and Python Developers with Numbast: Numbast streamlines the integration of CUDA C++ libraries with Python by automatically generating Numba bindings for CUDA functions.
Improving the New Django Developer Experience: Introduces DJ Beat Drop as a streamlined project initializer to improve the onboarding experience for new Django developers.

💼Case Studies and Experiments🔬

Concurrent Requests in Python: httpx vs aiohttp: Describes how switching from the httpx to aiohttp library resolved high-concurrency issues and improved stability in a computer vision application.
From Python to CPU instructions: Part 1: Explains how rewriting a Python program in C exposes low-level details Python abstracts away, particularly highlighting the manual effort required for tasks like input handling.

📊Analysis

Python 3.13, what didn't make the headlines: Highlights Python 3.13's understated but impactful improvements, focusing on debugging enhancements, filesystem fixes, and minor concurrency updates.
When should you upgrade to Python 3.13?: Advises waiting until December 2024 for Python 3.13 upgrades to ensure compatibility with libraries, tools, and bug-fix improvements.

🎓Tutorials and Guides🤓

Python Thread Safety: Using a Lock and Other Techniques: Explains how to address issues like race conditions and introduces synchronization techniques such as semaphores to ensure safe, concurrent code execution.
Time-Series Data Meets Blockchain: Storing Time-Series Data with Solidity, Ganache and Python: Walks you through the steps to set up Ethereum locally, deploy a smart contract, and store and retrieve data points.
Beautiful Soup: Build a Web Scraper With Python: Covers how to inspect site structure, scrape HTML content, and parse data using Requests and Beautiful Soup to build a script that extracts and displays job listings.
🎥Advanced Web Scraping Tutorial!
(w/ Python Beautiful Soup Library): Covers Requests to retrieve and parse data, especially from dynamic pages like Walmart's, with enhancements like using modified headers.
Fuzzy regex matching in Python: Introduces the orc library to simplify fuzzy matching by providing a human-friendly interface that highlights edits and can invert changes, enhancing usability for complex text correction tasks.
Achieving Symmetrical ManyToMany Filtering in Django Admin: Covers using Django's RelatedFieldWidgetWrapper and a custom ModelForm, allowing for consistent filtering on both sides of a ManyToMany relationship.
Get started with the free-threaded build of Python 3.13: Details installation, usage in Python programs, compatibility with C extensions, and how to detect GIL status programmatically.

🔑Best Practices and Advice🔏

Let's Eliminate General Bewilderment • Python's LEGB Rule, Scope, and Namespaces: Details how variables are resolved in local, enclosing, global, and built-in scopes, using accessible examples to clarify potential pitfalls.
🎥Robust LLM pipelines (Mathematica, Python, Raku): Given the unreliable and often slow nature of LLMs, this presentation outlines methods to enhance pipeline efficiency, robustness, and usability.
A new way of Python Debugging with the Frame Evaluation API: Introduces Python's Frame Evaluation API, a tool that allows real-time monitoring and control of program execution at the frame level.
Buffers on the edge: Python and Rust: Explains how Python's buffer protocol, which enables memory sharing between objects, can lead to undefined behavior due to data races in C, and the challenges Rust faces in maintaining soundness.
Optimization of Iceberg Table In AWS Glue: Discusses how AWS Glue offers built-in optimization, but a Python-based solution using boto3 and Athena SQL scripts provides customizable, cost-effective automation.

🔍Featured Study: LSS-SKAN💥

In "LSS-SKAN: Efficient Kolmogorov–Arnold Networks based on Single-Parameterized Function," Chen
and Zhang from South China University of Technology present a refined Kolmogorov–Arnold Network (KAN) variant. Their study introduces a new design principle for neural networks that improves accuracy and computational speed while preserving model interpretability.

Context

KANs are neural networks based on the Kolmogorov–Arnold theorem, which breaks down complex, multivariate functions into simpler univariate ones, aiding visualisation and interpretability. This makes them valuable in critical decision-making applications, where understanding a model's decision process is crucial. Unlike typical neural networks like Multilayer Perceptrons (MLPs), which rely on opaque linear and activation functions, KANs place learnable univariate functions on network edges, creating a more interpretable structure. Over time, several KAN variants, such as FourierKAN and FastKAN, have emerged, each with its own basis functions to balance speed and accuracy.

LSS-SKAN builds on these advancements with the Efficient KAN Expansion (EKE) Principle, a new approach that scales networks using fewer complex basis functions, allocating parameters to the network's size instead.
This principle is central to LSS-SKAN's efficiency and demonstrates how a simpler basis function can yield high accuracy with reduced computational cost.

Key Features of LSS-SKAN

EKE Principle: Scales the network by prioritising size over basis function complexity, making LSS-SKAN faster and more efficient.
Single-Parameter Basis Function: Utilises the Shifted Softplus function, requiring only one learnable parameter for each function, which simplifies the network and reduces training time.
Superior Accuracy: Outperforms KAN variants, showing a 1.65% improvement over Spl-KAN, 2.57% over FastKAN, 0.58% over FourierKAN, and 0.22% over WavKAN on the MNIST dataset.
Reduced Training Time: Achieves significant reductions in training time, running 502.89% faster than MLP+rKAN and 41.78% faster than MLP+fKAN.

What This Means for You

For those working in machine learning or fields requiring interpretable AI, LSS-SKAN offers a practical solution to enhance neural network accuracy and speed while maintaining transparency in model decision-making. LSS-SKAN is particularly beneficial in applications involving image classification, scientific computing, or scenarios demanding high interpretability, such as medical or financial sectors where model explainability is crucial.

Examining the Details

The researchers conducted detailed experiments using the MNIST dataset to measure LSS-SKAN’s performance against other KAN variants. They tested both short-term (10-epoch) and long-term (30-epoch) training cycles, focusing on two key metrics: accuracy and execution speed.

Through these tests, LSS-SKAN consistently outperformed other KAN models in accuracy, achieving a 1.65% improvement over Spl-KAN, 2.57% over FastKAN, and 0.58% over FourierKAN, while also running 502.89% faster than MLP+rKAN and 41.78% faster than MLP+fKAN.

The LSS-SKAN Python library is available on GitHub, along with experimental code, so you can replicate and build on their findings.
They recommend a learning rate between 0.0001 and 0.001 for best results, particularly due to KANs’ sensitivity to learning rate adjustments.

You can learn more by reading the entire paper and accessing LSS-SKAN.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 12: Deploying and Managing FastAPI Applications” in the book, FastAPI Cookbook by Giunio De Luca, published in August 2024.

Running FastAPI applications in Docker containers

Docker is a useful tool that lets developers wrap applications with their dependencies into a container. This method makes sure that the application operates reliably in different environments, avoiding the common “works on my machine” issue. In this recipe, we will see how to make a Dockerfile and run a FastAPI application inside a Docker container. By the end of this guide, you will know how to put your FastAPI application into a container, making it more flexible and simpler to deploy.

Getting ready

You will benefit from some knowledge of container technology, especially Docker, to follow the recipe better. But first, check that Docker Engine is set up properly on your machine. You can see how to do it at this link: https://docs.docker.com/engine/install/.

If you use Windows, it is better to install Docker Desktop, which is a Docker virtual machine distribution with a built-in graphical interface.

Whether you have Docker Engine or Docker Desktop, make sure the daemon is running by typing this command:

    $ docker images

If you don’t see any error about the daemon, that means that Docker is installed and working on the machine. The way to start the Docker daemon depends on the installation you choose. Look at the related documentation to see how to do it.

You can use the recipe for your applications or follow along with the Live Application application that we introduced in the first recipe, which we are using throughout the chapter.

How to do it…

It is not very complicated to run a simple FastAPI application in a Docker container.
The process consists of three steps:

Create the Dockerfile.
Build the image.
Generate the container.

Then, you just have to run the container to have the application working.

Creating the Dockerfile

The Dockerfile contains the instructions needed to build the image from an operating system and the files we want to specify.

It is good practice to create a separate Dockerfile for the development environment. We will name it Dockerfile.dev and place it under the project root folder.

We start the file by specifying the base image, which will be as follows:

    FROM python:3.10

This will pull an image from the Docker Hub, which already comes with Python 3.10 integrated. Then, we create a folder called /code that will host our code:

    WORKDIR /code

Next, we copy requirements.txt into the image and install the packages inside the image:

    COPY ./requirements.txt /code/requirements.txt
    RUN pip install --no-cache-dir -r /code/requirements.txt

The pip install command runs with the --no-cache-dir parameter to avoid pip caching operations that wouldn’t be beneficial inside a container. Also, in a production environment, for larger applications, it is recommended to pin fixed versions of the packages in requirements.txt to avoid potential compatibility issues due to package upgrades.

Then, we can copy the app folder containing the application into the image with the following command:

    COPY ./app /code/app

Finally, we define the server startup instruction as follows:

    CMD ["fastapi", "run", "app/main.py", "--port", "80"]

This is all we need to create our Dockerfile.dev file.

Building the image

Once we have Dockerfile.dev, we can build the image. We can do it by running the following from the command line at the project root folder level:

    $ docker build -f Dockerfile.dev -t live-application .

Since we named our Dockerfile Dockerfile.dev, we must pass it explicitly with the -f argument.
Once the build is finished, you can check that the image has been correctly built by running the following:

    $ docker images live-application

You should see the details of the image printed in the output, like this:

    REPOSITORY         TAG      IMAGE ID       CREATED          SIZE
    live-application   latest   7ada80a535c2   43 seconds ago   1.06GB

With the image built, we can proceed with creating the container.

Creating the container

To create the container and run it, simply run the following:

    $ docker run -p 8000:80 live-application

This will create the container and run it. We can see the container by running the following:

    $ docker ps -a

Since we didn’t specify a container name, Docker automatically assigns a random one. Mine, for example, is bold_robinson.

Open the browser on http://localhost:8000 and you will see the home page response of our application.

This is all you need to run a FastAPI application inside a Docker container. Running a FastAPI application in a Docker container is a great way to use the advantages of both technologies. You can easily scale, update, and deploy your web app with minimal configuration.

See also

The Dockerfile can be used to specify several features of the image. Check the list of commands in the official documentation:

Dockerfile reference: https://docs.docker.com/reference/dockerfile/
Docker CLI documentation: https://docs.docker.com/reference/cli/docker/
FastAPI in Containers - Docker: https://fastapi.tiangolo.com/deployment/docker/

FastAPI Cookbook was published in August 2024.

Get the eBook for $35.99 $24.99!
Get the Print Book for $44.99 $30.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!

Divya Anne Selvaraj
22 Oct 2024
11 min read

PythonPro #52: AI-Powered Vulnhuntr for Python, SageMaker Core SDK, and Exploring User Behaviour with Python

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Building AI Applications with OpenAI APIs - Second Edition, which discusses how to create a language translation desktop app using OpenAI's ChatGPT API and Microsoft Word.

News Highlights: Protect AI to release Vulnhuntr, an AI tool for detecting Python zero-day vulnerabilities; Amazon launches SageMaker Core, a Python SDK simplifying machine learning with object-oriented interfaces; and PyCharm becomes the official IDE of OpenCV as JetBrains joins as a Silver Member.

• Comprehensive Python Cheatsheet 📚
• Exploring User Behavior: A Python Case Study of Bike-Sharing Company Dataset 🚴‍♂️
• Python's property(): Add Managed Attributes to Your Classes 🔧
• Python approach to the Semantic Web: exploring linked data and RDF 🌐
• Assert vs. Raise: When to Use Each in Your ML/AI Projects ⚠️

And today’s Featured Study presents ChangeGuard, a tool designed to compare code behaviour before and after changes to detect functionality modifications.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
• Open source LLM tool primed to sniff out Python zero-days: Researchers with Seattle-based Protect AI will soon release Vulnhuntr, an AI-powered open-source tool that uses Claude AI to detect zero-day vulnerabilities in Python codebases by analyzing entire call chains for security issues.
• Introducing SageMaker Core: A new object-oriented Python SDK for Amazon SageMaker: The SDK will simplify the machine learning lifecycle by replacing complex JSON structures with object-oriented interfaces.
• Press Release: PyCharm Becomes Official IDE of OpenCV, JetBrains Joins as Silver Member: As a Silver Member, JetBrains will financially support OpenCV, ensuring its resources remain free.

💼Case Studies and Experiments🔬
• Part 2: Data Quality Dashboard: A Visual Approach to Monitoring Expectations in Databricks: Explains how to quickly identify issues using graphical representations like pie charts and bar charts.
• Exploring User Behavior: A Python Case Study of Bike-Sharing Company Dataset: Uses Python to uncover user behaviour patterns and develop strategies to convert casual riders into annual members.

📊Analysis
• 🎥 Russell Keith-Magee on Beeware, packaging, GUI & money in Python: Focuses on the challenges of cross-platform Python packaging, particularly for desktop and mobile platforms, and discusses how BeeWare helps developers.
• Should you use uv’s managed Python in production?: Advises careful consideration of uv’s production readiness, noting recent improvements but recommending thorough evaluation based on project-specific risks.

🎓Tutorials and Guides🤓
• Python's property(): Add Managed Attributes to Your Classes: Covers creating read-only, read-write, and computed properties, logging, and more, while maintaining a stable public API for your classes.
• A Multi-Agent AI Chatbot App using Databutton and Swarm: Explains how different agents can collaborate and hand off tasks, with an example of a multi-agent healthcare chatbot that connects users to specialized agents.
• Understanding Pluggable Authentication Module (PAM) and Creating a Custom One in Python: Covers PAM’s architecture, module stacks, and control flags, and walks you through building and integrating a custom PAM.
• Python approach to the Semantic Web: exploring linked data and RDF: Covers creating RDF triples, querying SPARQL endpoints, and visualizing relationships using NetworkX.
• Understanding Web Scraping in Python and Scrapy: Explains what web scraping is, its significance, and the tools required, such as BeautifulSoup, Requests, and Scrapy.
• 🎥 A hand-holding guide to writing FUSE-based filesystems in Python: Covers the process of creating Python-based FUSE file systems, from basic functionality to more advanced features like file attributes.
• Adding syntax to the cpython interpreter: Demonstrates how to add new syntax to Python, specifically making ternary statements default to None when no else condition is provided, similar to Ruby.

🔑Best Practices and Advice🔏
• What I Learned from Making the Python Backend for YouTube Transcript Optimizer: Explains the process of building the Python backend for a YouTube Transcript Optimizer using FastAPI and SQLModel.
• Comprehensive Python Cheatsheet: An extensive resource covering a wide array of Python topics, including syntax, data structures, and advanced concepts.
• How to Use Lambda Functions in Python: Covers their syntax, common use cases with functions like map(), filter(), and sorted(), along with advantages, limitations, and best practices for effective use in simplifying code.
• Assert vs. Raise: When to Use Each in Your ML/AI Projects: Discusses when to use assert for internal checks during development and raise for handling user-facing errors in ML/AI projects to ensure robust error handling.
• Structural Pattern Matching in Python: Explores customizing pattern matching for classes, extracting nested data, and common limitations in Python’s implementation.

🔍Featured Study: ChangeGuard - Validating Code Changes via Pairwise Learning-Guided Execution💥
In "ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution," Gröninger et al. present a tool called ChangeGuard, which compares code behaviour before and after changes to determine whether the modifications alter functionality.

Context
Validating whether code changes preserve intended behaviour is a key challenge in software development, particularly when changes are deep within complex projects. Developers may make modifications to improve readability or performance, or to fix bugs, but unintended changes in functionality can lead to errors. Current methods, such as regression testing, often fail to catch these subtle changes. This study is relevant because it introduces a more reliable approach, ChangeGuard, which uses pairwise learning-guided execution. This approach runs the old and the new version of a code snippet side by side and predicts values for anything that is missing, so that both versions can execute even in complex scenarios.

Key Features of ChangeGuard
• Pairwise learning-guided execution: Simultaneously executes old and new versions of code to compare their runtime behaviour.
• Value injection: Predicts and injects missing or uninitialised values, ensuring the code executes smoothly and reaches all relevant paths.
• High precision and recall: Achieves 77.1% precision and 69.5% recall in identifying behaviour-altering code changes.
• Extensive evaluation: Tested on 224 manually annotated code changes and datasets generated by automated refactoring tools.
• Outperforms regression tests: Traditional regression tests only achieved 7.6% recall in identifying semantics-changing code modifications.

What This Means for You
This paper will be most useful for software developers, especially those working with large and complex codebases. It provides practical insights into validating code changes more effectively than existing methods, offering a way to catch unintended behaviour early in the development process. Developers using automated refactoring tools or large language models like GPT-4 will particularly benefit from ChangeGuard's ability to detect subtle, behaviour-altering modifications.

Examining the Details
ChangeGuard's methodology is based on pairwise learning-guided execution, an extension of an existing technique. It predicts missing values dynamically, so more execution paths are covered than with previous approaches. The tool was evaluated on 224 annotated code changes from popular Python open-source projects, showing high accuracy in detecting semantic changes. Additionally, ChangeGuard was applied to code changes produced by automated refactoring tools and by large language models like GPT-3.5 and GPT-4, where it found that 87 out of 187 and 143 out of 258 code changes, respectively, unexpectedly altered behaviour.
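The core idea of comparing two versions of a function on the same inputs can be illustrated with a few lines of plain Python. This is a deliberately simplified sketch and not ChangeGuard's implementation: it leaves out the learning-guided value prediction entirely, and the names run_safely, behaviour_differences, mean_old, and mean_new are hypothetical.

```python
from typing import Any, Callable


def run_safely(func: Callable[..., Any], args: tuple) -> tuple:
    """Run func and capture either its return value or the exception type."""
    try:
        return ("ok", func(*args))
    except Exception as exc:  # record the failure instead of stopping the comparison
        return ("error", type(exc).__name__)


def behaviour_differences(old: Callable, new: Callable, inputs: list) -> list:
    """Return the inputs on which the two versions behave differently."""
    diffs = []
    for args in inputs:
        before, after = run_safely(old, args), run_safely(new, args)
        if before != after:
            diffs.append((args, before, after))
    return diffs


# Hypothetical "before" and "after" versions of a refactored function.
def mean_old(values):
    return sum(values) / len(values)


def mean_new(values):
    # The refactoring added a guard, which changes behaviour for empty input.
    if not values:
        return 0.0
    return sum(values) / len(values)


inputs = [([1, 2, 3],), ([10],), ([],)]
for args, before, after in behaviour_differences(mean_old, mean_new, inputs):
    print(f"input={args!r}: old={before!r}, new={after!r}")
```

Only the empty list is reported here, because that is the one input where the old version raises and the new version returns a value.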
The evaluation across these datasets provides strong evidence for ChangeGuard's reliability and robustness.

You can learn more by reading the entire paper and accessing ChangeGuard.

🧠 Expert insight💥
Here’s an excerpt from “Chapter 6: Language Translation Desktop App with the ChatGPT API and Microsoft Word” in the book, Building AI Applications with OpenAI APIs - Second Edition by Martin Yanev, published in October 2024.

Integrating the ChatGPT API with Microsoft Office
In this section, we will explore how to set up our project and install the docx Python library to extract text from Word documents. The docx library is a Python package that allows us to read and write Microsoft Word (.docx) files and provides a convenient interface to access information stored in these files.

The first step is to create a new directory called Translation App and open it in VSCode. This gives you a dedicated place to write and organize your translation app code. Activate your virtual environment from the terminal window following the steps outlined in Chapter 1, Getting Started with the ChatGPT API for NLP Tasks.

To run the language translation desktop app, you will need to install the following libraries:
• openai: The openai library allows you to interact with the OpenAI API and perform various NLP tasks
• docx: The docx library allows you to read and write Microsoft Word .docx files using Python
• tkinter: The tkinter library is a built-in Python library that allows you to create Graphical User Interfaces (GUIs) for your desktop app

As tkinter is a built-in library, there is no need for installation since it already exists within your Python environment. To install the openai and docx libraries, access the VSCode terminal, and then execute the following commands:
    pip install openai
    pip install python-docx

To access and read the contents of a Word document, you will need to create a sample Word file inside your project. Here are the steps to create a new Word file:
1. In your project, right-click on the project directory, select New Folder, and name it files.
2. Right-click on the files folder and select New File.
3. In the edit field that appears, enter a filename with the .docx extension – for example, info.docx.
4. Press the Enter key to create the file.
5. Once the file is created, open it using Microsoft Word.

You can now add some text or content to this file, which we will later access and read using the docx library in Python. For this example, we have created an article about New York City. You can find the complete article here: https://en.wikipedia.org/wiki/New_York_City. However, you can choose any Word document containing text that you want to analyze:

The United States’ most populous city, often referred to as New York City or NYC, is New York. In 2020, its population reached 8,804,190 people across 300.46 square miles, making it the most densely populated major city in the country and over two times more populous than the nation’s second-largest city, Los Angeles. The city’s population also exceeds that of 38 individual U.S. states. Situated at the southern end of New York State, New York City serves as the Northeast megalopolis and New York metropolitan area’s geographic and demographic center - the largest metropolitan area in the country by both urban area and population. Over 58 million people also live within 250 miles of the city. A significant influencer on commerce, health care and life sciences, research, technology, education, politics, tourism, dining, art, fashion, and sports, New York City is a global cultural, financial, entertainment, and media hub. It houses the headquarters of the United Nations, making it a significant center for international diplomacy, and is often referred to as the world’s capital.

Now that you have created the Word file inside your project, you can move on to the next step, which is to create a new Python file called app.py inside the Translation App root directory. This file will contain the code to read and manipulate the contents of the Word file using the docx library. With the Word file and the Python file in place, you are ready to start writing the code to extract data from the document and use it in your application.

To test whether we can read Word files with the python-docx library, we can implement the following code in our app.py file:
    import docx  # provided by the python-docx package

    # Open the Word document and collect the text of every paragraph
    doc = docx.Document("<full_path_to_docx_file>")
    text = "\n".join(para.text for para in doc.paragraphs)
    print(text)

Make sure to replace <full_path_to_docx_file> with the actual path to your Word document file. You can obtain the path by right-clicking on your .docx file in VSCode and selecting the Copy Relative Path option from the drop-down menu; a relative path works as long as you run app.py from the project root.

Once you have done that, run the app.py file and verify the output. This code will read the contents of your Word document and print them to the console. If the text extraction works correctly, you should see the text of your document printed in the console (see Figure 6.1). The text variable now holds the data from info.docx as a Python string.

Figure 6.1 – Word text extraction console output

Packt library subscribers can continue reading the entire book for free. You can buy Building AI Applications with OpenAI APIs - Second Edition, here.
Get the eBook for $31.99 $21.99! Get the Print Book for $39.99!

And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!

Divya Anne Selvaraj
01 Oct 2024
10 min read

PythonPro #49: Cool Python 3.13 Features, Azure LLM Deployment, and Great Expectations vs Pandas profiling

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Python Data Cleaning and Preparation Best Practices, which compares Pandas profiling and Great Expectations for data profiling and analysis.

News Highlights: DJP, a Pluggy-based plugin system for Django, launches, easing integration; and PondRAT malware, hidden in Python packages, targets developers in a supply chain attack.

Here are my top 5 picks from our learning resources today:
• Python 3.13: Cool New Features for You to Try ✨
• Deploy Python LLM Apps on Azure Web App (GPT-4o Azure OpenAI and SSO auth) 🤖
• Data Visualization with Matplotlib and Seaborn - A Comprehensive Guide to Plot Types 🎨
• The Anna Karenina Principle in Code Quality - Addressing PySpark Challenges with PyASTrX 🔥
• Refactoring Python with 🌳 Tree-sitter & Jedi 🧙‍♂️

And today’s Featured Study introduces sbijax, a Python package built on JAX for efficient neural simulation-based inference (SBI), offering a wide range of algorithms, a user-friendly interface, and tools for efficient and scalable Bayesian analysis.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month's survey is now live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
• DJP - A plugin system for Django: This new system, based on Pluggy, simplifies plugin integration by automating configuration. Read to learn how to set up DJP, create plugins, and view examples like django-plugin-blog.
• New PondRAT Malware Hidden in Python Packages Targets Software Developers: North Korean-linked threat actors are using poisoned Python packages to gain access to supply chains via developers' systems.

💼Case Studies and Experiments🔬
• Python for Inversive and Hyperbolic Geometry: Introduces a Python library which provides classes and utilities for visualizing inversive and hyperbolic geometry using the Poincaré disc model.
• Detecting Marathon Cheaters - Using Python to Find Race Anomalies: Covers scraping race data, using speed thresholds and z-scores to filter participants with "superhuman" splits, and analyzing these splits for suspicious activity.

📊Analysis
• Python 3.13: Cool New Features for You to Try: Python 3.13 introduces several improvements, including an enhanced REPL, clearer error messages, and progress on removing the GIL.
• Understanding Inconsistencies in IP Address Classification Across Programming Languages: Discusses how these inconsistencies can cause security vulnerabilities, particularly in cloud environments prone to SSRF.

🎓Tutorials and Guides🤓
• 🎥 Deploy Python LLM Apps on Azure Web App (GPT-4o Azure OpenAI and SSO auth): Explains how to deploy a Streamlit web application into Azure Cloud using Azure App Service Plan and Azure Web App.
• How Data Platforms Work: Uses Python with Apache Arrow to demonstrate data models, builds an example data system through query plans, and provides code examples for creating, filtering, and projecting datasets.
• Data Visualization with Matplotlib and Seaborn - A Comprehensive Guide to Plot Types: Covers line plots, bar plots, scatter plots, histograms, box plots, heatmaps, and pair plots, each illustrated with examples.
• Instrumenting CPython with DTrace and SystemTap: Covers enabling embedded markers (or probes) in CPython for tracing function calls, garbage collection, and module imports, and provides examples and scripts.
• Forecasting in Excel using Techtonique's Machine Learning APIs under the hood: Discusses how to use Techtonique's machine learning APIs through Excel for tasks like forecasting, data visualization, and predictive analytics.
• Implementing Anthropic's Contextual Retrieval with Async Processing: Explains Anthropic's Contextual Retrieval technique, which enhances RAG systems by adding context to document chunks to improve search accuracy.
• What’s Inside a Neural Network?: Explains how to visualize the error surface of a neural network using PyTorch and Plotly, walking you through everything from generating synthetic data to visualizing training steps.

🔑Best Practices and Advice🔏
• What Can A Coffee Machine Teach You About Python's Functions?: Explains how Python functions work, from defining parameters to calling functions and handling return values, through an accessible, relatable analogy.
• Refactoring Python with 🌳 Tree-sitter & Jedi: Explores a method to refactor Python code across multiple files by renaming a pytest fixture, using Tree-sitter to parse function definitions and Jedi to rename identifiers.
• Ensuring a block is overridden in a Django template: Shows how to prevent missing titles in Django templates by adding a custom template tag that raises an exception if a block is not overridden.
• The Anna Karenina Principle in Code Quality - Addressing PySpark Challenges with PyASTrX: Discusses how to identify and block bad coding practices in PySpark, such as using withColumn within loops.
• What is a Pure Function in Python?: Explains pure functions in Python, which produce the same output for the same input without affecting external variables and enable writing clean, predictable, and easy-to-test code.

🔍Featured Study: Simulation-based Inference with the Python Package sbijax💥
"Simulation-based Inference with the Python Package sbijax" by Dirmeier et al. introduces sbijax, a Python package for neural simulation-based inference (SBI). The paper outlines the package’s implementation of advanced Bayesian inference methodologies using JAX for computational efficiency.

Context
SBI is a technique for Bayesian inference when the likelihood function is too complex to compute directly. By using neural networks as surrogates, SBI approximates complex Bayesian posterior distributions, which describe the probability of model parameters given observed data. Neural density estimation, a modern approach to SBI, refers to using neural networks to model these complex distributions accurately. The sbijax package enables this inference process by offering a range of neural inference methods, and it is built on JAX. JAX is a Python library that provides efficient automatic differentiation and parallel computation on both CPUs and GPUs. This makes sbijax particularly relevant for statisticians, data scientists, and modellers working with complex Bayesian models.

Key Features of sbijax
• Wide Range of SBI Algorithms: sbijax implements state-of-the-art methods, including Neural Likelihood Estimation (NLE), Neural Posterior Estimation (NPE), Neural Likelihood-Ratio Estimation (NRE), and Approximate Bayesian Computation (ABC).
• Computational Efficiency with JAX: Written entirely in JAX, sbijax achieves rapid neural network training and parallel execution on hardware like CPUs and GPUs, often outperforming PyTorch.
• User-Friendly Interface: Provides simple APIs to construct and train models, simulate data, perform inference, and visualise results.
• Diagnostic Tools: Offers model diagnostics and visualisation via ArviZ InferenceData objects for easy exploration and analysis of posterior samples.
• Flexible Model Specification: Supports customisable neural networks and integration with the broader JAX ecosystem for advanced model building.

What This Means for You
sbijax is most useful for computational modellers, data scientists, and statisticians who require efficient and flexible tools for Bayesian inference. Its user-friendly interface, coupled with computational efficiency, makes it practical for those working with high-dimensional or complex simulation models.

Examining the Details
The authors validate sbijax by showcasing its implementation in different SBI methods and comparing performance against conventional tools. The package provides sequential inference capabilities, combining both neural density estimation techniques and traditional ABC. The authors demonstrate sbijax’s functionality by training models using real and synthetic data, then sampling from the posterior distributions. In a benchmark example with a bivariate Gaussian model, sbijax successfully approximates complex posterior distributions using various algorithms like NLE and SMC-ABC.

The paper details the efficiency and accuracy of sbijax, backed by empirical evaluations that show JAX's computational advantage over other libraries like PyTorch. Its consistent performance across various SBI tasks underscores its reliability and broad applicability in Bayesian analysis.

You can learn more by reading the entire paper or accessing the sbijax documentation here.

🧠 Expert insight💥
Here’s an excerpt from “Chapter 3: Data Profiling – Understanding Data Structure, Quality, and Distribution” in the book, Python Data Cleaning and Preparation Best Practices by Maria Zervou, published in September 2024.

Comparing Great Expectations and pandas profiler – when to use what
Pandas profiling and Great Expectations are both valuable tools for data profiling and analysis, but they have different strengths and use cases. Here’s a comparison between the two tools.

Table 3.2 – Great Expectations and pandas profiler comparison

Pandas profiling is well suited for quick data exploration and initial insights, while Great Expectations excels in data validation, documentation, and enforcing data quality rules. Pandas profiling is more beginner-friendly and provides immediate insights, while Great Expectations offers more advanced customization options and scalability for larger datasets. The choice between the two depends on the specific requirements of the project and the level of data quality control needed.

As the volume of data increases, we need to make sure that the choice of tools we’ve made can scale as well. Let’s have a look at how we can do this with Great Expectations.

Great Expectations and big data
• Distributed processing frameworks: Great Expectations integrates seamlessly with popular distributed processing frameworks, such as Apache Spark. By leveraging the parallel processing capabilities of these frameworks, Great Expectations can distribute the data validation workload across a cluster, allowing for efficient processing and scalability.
• Partitioning and sampling: Great Expectations simplifies the process of partitioning and sampling large datasets, enhancing performance and scalability. Unlike the manual partitioning required in tools such as pandas profiling, Great Expectations automates the creation of data subsets or partitions for profiling and validation. This feature allows you to validate specific subsets or partitions of the data, rather than processing the entire dataset at once. By automating the partitioning process, Great Expectations streamlines the profiling workflow and eliminates the need for manual chunk creation, saving time and effort.
• Incremental validation: Instead of revalidating the entire big dataset every time, Great Expectations supports incremental validation. This means that as new data is ingested or processed, only the relevant portions or changes need to be validated, reducing the overall validation time and effort. This is a great trick to reduce the time it takes to check the whole dataset and to optimize for cost!
• Caching and memoization: Great Expectations incorporates caching and memoization techniques to optimize performance when repeatedly executing the same validations. This can be particularly beneficial when working with large datasets, as previously computed results can be stored and reused, minimizing redundant computations.
• Cloud-based infrastructure: Leveraging cloud-based infrastructure and services can enhance scalability for Great Expectations. By using cloud computing platforms, such as AWS or Azure, you can dynamically scale resources to handle increased data volumes and processing demands.
• Efficient data storage: Choosing appropriate data storage technologies optimized for big data, such as distributed file systems or columnar databases, can improve the performance and scalability of Great Expectations. These technologies are designed to handle large-scale data efficiently and provide faster access for validation and processing tasks.

Note
While Great Expectations offers scalability options, the specific scalability measures may depend on the underlying infrastructure, data storage systems, and distributed processing frameworks employed in your big data environment.

Packt library subscribers can continue reading the entire book for free. You can buy Python Data Cleaning and Preparation Best Practices, here.
Get the eBook for $35.99 $24.99!

And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
Divya Anne Selvaraj
13 May 2025
9 min read

PythonPro #70: Python Hits All-Time High, New Type Checker ‘ty’, SQL-Ready ML Pipelines, and Debugging RAG with raggy

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

Live Webinar | Scale AppSec with Security Champions – May 15
Security Champions programs are a proven way to scale AppSec across dev teams. Join Snyk’s live webinar on May 15 @ 11AM ET:
✓ Defining the role of security champions
✓ Designing a scalable, tailored program
✓ Recognizing, rewarding & growing your champions
🎓 BONUS: Earn CPE credits for attending!
Save your spot now!

Hi,

Welcome to a brand new issue of PythonPro!

News Highlights: Python hits an all-time high in the Tiobe Index, solidifying its dominance; Astral unveils ty, a fast new type checker built to scale alongside Ruff and uv; Python 3.14 enters beta with t-strings and key PEPs for type checking and debugging; Orbital lets developers run scikit-learn pipelines as pure SQL directly inside databases.

My top 5 picks from today’s learning resources:
• What’s Happening to Embeddings During Training? 🧠
• How to Build an MCP Server in 5 Lines of Python 🔌
• Unleashing gst-python-ml: Python-powered ML analytics for GStreamer pipelines 🎥
• Engineer Python projects like a PRO 🛠️
• Top Python Code Quality Tools to Improve Your Development Workflow 🧹

And, in From the Cutting Edge, we introduce raggy, a developer tool that enables real-time, interactive debugging of Retrieval-Augmented Generation (RAG) pipelines by combining a Python library of composable components with a visual interface for rapid iteration and evaluation.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

Practical workshops and technical sessions with 20+ ML engineers and researchers.
• Sebastian Raschka: Live AMA on Large Language Models
• Khuyen Tran: GPTs for time series forecasting
• Luca Massaron, Thomas Nield, and others: Applied ML at scale
Use code EARLY40 for 40% off.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
• Python popularity climbs to highest ever – Tiobe: Python has reached its highest-ever Tiobe Index rating at 25.35% in May 2025, the highest rating for any language since Java’s 2001 peak.
• ty: Astral's New Type Checker (Formerly Red-Knot) - Talk Python to Me Ep. 506: Developed as a complement to Astral’s popular tools Ruff and uv, ty aims to offer faster, scalable, and more beginner-friendly type checking. It focuses on performance, better editor integration, and smoother adoption in large codebases. ty will be released as a standalone tool, not a drop-in replacement for MyPy or Pyright.
• Python's T-Strings Coming Soon and Other Python News for May 2025: Python 3.14 enters beta with PEP 750 introducing reusable template strings (t-strings) and PEPs 751, 768, and 781 enhancing dependency tracking, debugging safety, and type-checking support.
• Orbital for Python released: Orbital converts trained scikit-learn pipelines into pure SQL, enabling machine learning model execution directly within databases, with no Python runtime needed.

💼Case Studies and Experiments🔬
• An Empirical Study on the Performance and Energy Usage of Compiled Python Code: Evaluates Python compilers across seven benchmarks using eight compilation tools. Codon, PyPy, and Numba showed over 90% improvement in speed and energy, while Nuitka reduced memory use consistently.
• I Taught My Fridge Inventory to Text Me When I’m Out of Milk: Combines a Raspberry Pi, Python, OCR (Tesseract), and Twilio to automate fridge inventory tracking.

📊Analysis
• What’s Happening to Embeddings During Training?: Investigates how embedding vectors evolve during training by analyzing metrics like Gini index, Hoyer sparsity, vector entropy, and spectral entropy.
• PyTorch Tensors Explained: Explains how PyTorch handles tensors, covering memory layout, strides, and autograd, to help developers understand efficient tensor operations and automatic differentiation.

🎓Tutorials and Guides🤓
• How to Build an MCP Server in 5 Lines of Python: Shows you how to turn a Python function into an LLM-compatible tool by launching an MCP server using Gradio in just five lines of code. It covers setup, deployment, and integration with MCP clients like Claude Desktop and Cursor.
• Data Profiling in Python: common ways to explore your data (part 2): Introduces practical techniques for data profiling, focusing on using value_counts() to analyze categorical variables and understand dataset composition.
• 5 steps to N-body simulation: Teaches beginners to build efficient N-body gravity simulations in Python through initial setup, implementing gravity, basic simulation, higher-order methods, and adaptive time-stepping.
• Unleashing gst-python-ml: Python-powered ML analytics for GStreamer pipelines: This new Python framework enables real-time video analytics using Python tools, and supports object detection, tracking, captioning, and more.
• The Python Profilers: Explains how to use Python’s deterministic profilers, cProfile and profile, to analyze performance by measuring function call frequency and duration, and covers usage examples.
• Automating code deletion with Gemini (and a little Python): Details how the author used Gemini 2.0 Flash and Python to automate the removal of outdated docgen
code from 235 GN build files after migrating Pigweed’s documentation system to Bazel.📖Open Source Book | Causal Inference for The Brave and True by Matheus Facure Alves: Offers a Python-based, practical introduction to causal inference, balancing rigorous theory with humour and real-world examples. Part I covers foundational methods like causal graphs and regression; Part II explores modern, tech-focused approaches like CATE and meta-learners.🔑Best Practices and Advice🔏Engineer Python projects like a PRO: Guides AI engineers on structuring Python projects using modern tools like uv, ruff, and Docker Compose, while advocating for a monorepo setup to improve code quality, reproducibility, and scalability in real-world development.Top Python Code Quality Tools to Improve Your Development Workflow: Covers linters, formatters, type checkers, security scanners, test coverage, profiling, and CI/CD integration.Kate and Python language server:Explains how to configure the python-lsp-server in the Kate editor to work smoothly with Python virtual environments by using a custom bash script (pylsp_in_env) and enabling the ruff plugin for linting."AI Coffee" Grand Opening This Monday • A Story About Parameters and Arguments in Python Functions:Uses a coffee shop analogy to explain Python function parameters, covering positional and keyword arguments,*args and**kwargs, default values, and more.What does @Slot() do?: Explains the role of the @Slot() decorator in PySide6, showing that while it's optional for most signal-slot connections, it's required for thread-safe execution and slightly improves memory efficiency.🔍From the Cutting Edge: Raggy–RAG Without the Lag💥In RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines, Lauro et al. introduce raggy, a developer tool designed to simplify debugging and iterative development of Retrieval-Augmented Generation (RAG) pipelines. 
The study comes from researchers at the University of Pittsburgh and UC Berkeley.

Context
RAG is a technique that combines a retriever and an LLM to generate responses based on external documents. It's widely used to build AI assistants that require domain-specific knowledge, with 86% of enterprise LLM deployments reportedly using it as of 2024.

However, RAG pipelines are notoriously hard to debug. Retrieval and generation are deeply intertwined, and developers must tune many parameters (chunk size, retrieval method, prompt wording, etc.) while enduring long feedback loops, often involving time-intensive re-indexing. Existing tools don’t support rapid iteration or show how changes in one part affect the whole pipeline.

Key Features of raggy
  • Composable Python primitives for defining RAG pipelines (e.g., Query, Retriever, LLM, Answer).
  • Interactive debugging interface that visualises chunk retrieval quality and generated outputs.
  • Real-time parameter editing for chunk size, retrieval methods, LLM prompts, and more.
  • Versioned checkpoints to rollback and test alternative pipeline states.
  • Support for manual overrides, allowing direct selection of chunks or editing of LLM responses.
  • Evaluation tools, including the ability to save “golden” answers and compare outputs.

What This Means for You
raggy is especially relevant for machine learning engineers, LLM application developers, and data scientists working on question-answering systems, enterprise chatbots, or knowledge-intensive assistants. With raggy, you can debug your RAG pipeline interactively, isolate root causes of errors, and iterate without costly delays. It is designed to fit within Python-based workflows and support both experienced and novice developers alike.

Examining the Details
To evaluate raggy’s effectiveness, the authors conducted a user study involving 12 developers with prior experience building production-grade RAG pipelines.
Participants were asked to improve a baseline question-answering system over a corpus of 220 hospital documents. The study followed a think-aloud protocol, with participants engaging in tasks such as debugging poorly performing queries, handling noisy inputs, and rejecting irrelevant questions. The authors observed that developers consistently started by validating the retrieval component—manually inspecting and adjusting chunk size, retrieval methods, or number of chunks—before moving on to LLM generation. This retriever-first strategy persisted even when LLM components preceded retrieval in the pipeline, underscoring the centrality of retrieval quality in RAG debugging.raggy’s low-latency feedback was particularly well received. On average, 71.3% of parameter changes would have required document re-indexing in traditional workflows, yet participants could implement and test these changes instantly within raggy. The tool’s pre-materialisation of hundreds of vector indexes (across chunk sizes and retrieval methods) and its checkpointing mechanism for preserving intermediate pipeline states enabled this rapid iteration. Participants also appreciated how the tool integrated seamlessly with their existing Python code, automatically generating an interactive UI without requiring manual configuration. This reduced context switching and allowed them to stay focused on the debugging task.You can learn more by reading the entire paper or looking at the source code on GitHub.And that’s a wrap.We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most usefulhere. 
The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!

Divya Anne Selvaraj
17 Sep 2024
12 min read

PythonPro #47: Python 3.13 Features, AI Debugging with Copilot, and Building Neural Networks from Scratch

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#47: Python 3.13 Features, AI Debugging with Copilot, and Building Neural Networks from Scratch

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, AI-Assisted Programming for Web and Machine Learning, which discusses how Copilot can assist in debugging and troubleshooting by adding error-handling features.

News Highlights: DBOS Transact launches with durable Python workflow recovery; Python in Excel now live for data analysis; Python 3.13 is coming October 2024 with new interpreter, JIT, and more; and Hackers use fake coding tests on GitHub to target Python developers.

Here are my top 5 picks from our learning resources today:
  • How Does AI Work? Create a Neural Network from Scratch🤖
  • Spam Mail Detection - Machine Learning with Python✉️
  • Django from first principles🌱
  • How to Use Conditional Expressions With NumPy where()🔄
  • Why Learn Python Concurrency⚙️

And today’s Featured Study introduces ComplexCodeEval, a benchmark designed to evaluate large code models (LCMs) in complex development environments.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month’s survey is still live.
Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
  • DBOS Transact: Ultra-Lightweight Durable Execution for Python Workflows launched: The library ensures programs automatically resume from their last completed step after crashes or interruptions.
  • Python in Excel – Available Now: Microsoft 365 users can now integrate Python libraries for advanced data analysis, visualization, and machine learning within Excel.
  • What’s New In Python 3.13: The version, releasing on October 1, 2024, will include a new interactive interpreter, experimental free-threaded mode, a JIT compiler, enhanced error messages, and updates to the standard library.
  • Fake password manager coding test used to hack Python developers: Posing as recruiters, the hackers use GitHub-hosted projects to infect victims' systems and pressure them to bypass security checks.

💼Case Studies and Experiments🔬
  • How Does AI Work?
Create a Neural Network from Scratch: Explains how to build a basic neural network using Python, to predict house prices, while covering core concepts like gradient descent, backpropagation, and more.Text mining in Python - case-study with “Romeo and Juliet” from Project Gutenberg:Walks you through the steps of accessing the text, cleaning it, tokenizing words, analyzing word frequency, and visualizing the results.📊AnalysisStreamlit vs Gradio - The Ultimate Showdown for Python Dashboards: Evaluates their ease of use, customization options, deployment flexibility, and suitability for complex data visualization or rapid prototyping tasks.It’s time to stop using Python 3.8: Emphasizes the importance of upgrading from Python 3.8, which reaches end-of-life in October 2024, meaning no more bug or security fixes.🎓Tutorials and Guides🤓Understanding Proximal Policy Optimization (PPO) - A Game-Changer in AI Decision-Making Explained for RL Newcomers: explains PPO, detailing its key concepts, practical implementation, and how it improves decision-making stability and efficiency in AI systems.Use Python for Earth Engine Analysis, Save Directly to Your Local Drive: Explains how to use the Python library geemap for interacting with Google Earth Engine (GEE) to process and analyze satellite imagery.Django from first principles: A series on building a Django project starting with a single file, gradually expanding as necessary to manage complexity, simplifying Django for beginners by focusing on essential components first.Injecting syscall faults in Python and Ruby: Discusses how to simulate syscall failures in Python and Ruby using Cirron, a tool that integrates with strace to inject errors, delays, and signals into system calls.Deploying a Django app with Kamal, AWS ECR, and Github Actions: Covers setting up a VPS and preparing it for Kamal, creating a Dockerfile for containerized apps, and configuring a deployment pipeline.Implementing the Singleton Pattern in FastAPI for 
Efficient Database Management: Demonstrates how to implement the Singleton Pattern for efficient database management, particularly while handling expensive resources.Spam Mail Detection - Machine Learning with Python: Explains how to use a supervised learning approach with a dataset from Kaggle, analyzing email length, applying logistic regression, and creating a scanner to detect spam.🔑Best Practices and Advice🔏Let’s build and optimize a Rust extension for Python: Explains how to build and optimize a Rust extension for Python to improve performance and memory efficiency.Why Learn Python Concurrency: Explains how concurrent, parallel, and asynchronous execution allow programs to fully utilize modern hardware, improve performance, and scale more effectively.Therac-25, LLMs and the Zen of Python: Discusses the dangers of relying on LLMs to rewrite code across languages without understanding the underlying principles and context, drawing a parallel to the infamous Therac-25 disaster.Using Python's pip to Manage Your Projects' Dependencies: discusses using Python's pip to for installing and uninstalling packages, and handling errors.How to Use Conditional Expressions With NumPy where(): Explains how to work with multiple conditions, array broadcasting, and common pitfalls when using np.where() in data manipulation.🔍Featured Study: ComplexCodeEval - Benchmarking Large Code Models in Practice💥In ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code, Feng et al. introduce a new benchmark for assessing large code models (LCMs). The paper focuses on evaluating LCMs in real-world coding scenarios involving complex tasks and avoiding data leakage.ContextLCMs are AI models trained to handle coding tasks like code generation, completion, test case creation, and API recommendation. Existing benchmarks tend to evaluate LCMs on limited tasks, such as standalone code generation, without capturing the broader, more diverse challenges developers face. 
Additionally, they often overlook data leakage, where models are tested on data already seen during training, resulting in inflated performance scores.

ComplexCodeEval is a comprehensive benchmark designed to test LCMs on multiple coding tasks and scenarios, reflecting real-world programming challenges. It assesses how well LCMs perform in contexts that include dependencies on third-party libraries and the need to create test functions and recommend APIs.

Key Features of ComplexCodeEval
  • Real-World Data: Uses 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.
  • Multiple Tasks: Evaluates LCMs on code generation, completion, API recommendation, and test case generation.
  • Rich Context: Each sample includes function signatures, docstrings, API references, and test functions.
  • Data Leakage Prevention: Multiple timestamps (creation, update) ensure the benchmark avoids testing on training data.
  • Variety of Models Tested: Ten popular LCMs, including StarCoder2, CodeLlama, DeepSeek-Coder, and GPT-3.5-Turbo, were evaluated.

What This Means for You
This study is valuable for programmers and software engineers who use AI coding tools. ComplexCodeEval highlights which models perform best for tasks like generating Java code or recommending Python APIs, making it easier to select the right tools for complex programming tasks. It provides a realistic assessment of LCMs, avoiding inflated scores from limited or synthetic benchmarks.

For developers working on AI models, the study offers insights into how additional contextual information, such as dependencies and function histories, can significantly improve model performance.

Examining the Details
To create ComplexCodeEval, the authors sourced Java and Python samples from GitHub repositories that relied on popular third-party libraries.
Each sample was annotated with relevant metadata like API references, docstrings, and timestamps, simulating real-world coding tasks.

Ten LCMs, including StarCoder2, CodeLlama, DeepSeek-Coder, and GPT-3.5-Turbo, were tested on four tasks: code generation, code completion, API recommendation, and test case generation. CodeLlama-34B achieved the highest CodeBLEU score of 34.08 for Java code generation, and Python API recommendation saw an F1 score of 52.24.

The researchers tested the impact of adding context to the inputs provided to LCMs. Starting with basic function signatures and docstrings, they added more context (e.g., dependencies and library imports) and found that full context improved average CodeBLEU scores by 70.73% in Java and 31.90% in Python.

To assess data leakage, the team compared model performance on data created before and after the models’ knowledge cut-off dates. They found models performed better on leaked data, with average CodeBLEU scores increasing by 1.22 points in Java and 3.10 points in Python, demonstrating the importance of preventing data leakage in evaluations.

You can learn more by reading the entire paper and accessing the ComplexCodeEval GitHub repository.

🧠 Expert insight💥
Here’s an excerpt from “Chapter 20: Increasing Efficiency with GitHub Copilot” in the book, AI-Assisted Programming for Web and Machine Learning by Christoffer Noring, Anjali Jain, Marina Fernandez, Ayşe Mutlu, and Ajit Jaokar, published in August 2024.

Debugging and troubleshooting
Being able to debug and troubleshoot code is an important skill to have. When you develop code, it seldom does what you need it to do the first time. You need to be able to debug and troubleshoot code to figure out what’s wrong with it. The same applies if you have a reported bug; you need to be able to diagnose at what point it goes wrong.
Copilot can help, and a good approach is using Copilot chat and asking it to add support for debugging and troubleshooting.

The following is a piece of sample code you’ve been given. It’s a REST API in the framework Flask for Python. It’s working code, but it might not be so easy to work with from a debugging and troubleshooting standpoint:

    from flask import Flask, jsonify, request
    import sqlite3

    app = Flask(__name__)

    # Create a connection to the database
    conn = sqlite3.connect('products.db')
    c = conn.cursor()

    # Create a table for products if it doesn't exist
    c.execute('''CREATE TABLE IF NOT EXISTS products
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  name TEXT,
                  price REAL)''')
    conn.commit()

    @app.route('/products', methods=['GET'])
    def get_products():
        # Retrieve all products from the database
        c.execute('SELECT * FROM products')
        products = c.fetchall()
        # Convert the products to a JSON response
        response = []
        for product in products:
            response.append({
                'id': product[0],
                'name': product[1],
                'price': product[2]
            })
        return jsonify(response)

    @app.route('/products', methods=['POST'])
    def add_product():
        # Retrieve the product data from the request body
        data = request.get_json()
        name = data['name']
        price = data['price']
        # Insert the product into the database
        c.execute('INSERT INTO products (name, price) VALUES (?, ?)', (name, price))
        conn.commit()
        # Return a success message
        return jsonify({'message': 'Product added successfully'})

    if __name__ == '__main__':
        app.run(debug=True)

The above code works, but it doesn’t support debugging and troubleshooting; let’s see how we can add support for that by using Copilot Chat:

[Prompt]
Add support for troubleshooting and debugging
[End of prompt]

Copilot chat answers us with both text and code:

[Prompt response]
Sure, here’s an updated version of the Flask API that includes support for troubleshooting and debugging:

    from flask import Flask, jsonify, request
    import sqlite3

    app = Flask(__name__)

    # Create a connection to the database
    conn = sqlite3.connect('products.db')
    c = conn.cursor()

    # Create a table for products if it doesn't exist
    c.execute('''CREATE TABLE IF NOT EXISTS products
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  name TEXT,
                  price REAL)''')
    conn.commit()

    @app.route('/products', methods=['GET'])
    def get_products():
        try:
            # Retrieve all products from the database
            c.execute('SELECT * FROM products')
            products = c.fetchall()
            # Convert the products to a JSON response
            response = []
            for product in products:
                response.append({
                    'id': product[0],
                    'name': product[1],
                    'price': product[2]
                })
            return jsonify(response)
        except Exception as e:
            # Log the error and return an error message
            app.logger.error(str(e))
            return jsonify({'error': 'An error occurred while retrieving products'})

    @app.route('/products', methods=['POST'])
    def add_product():
        try:
            # Retrieve the product data from the request body
            data = request.get_json()
            name = data['name']
            price = data['price']
            # Insert the product into the database
            c.execute('INSERT INTO products (name, price) VALUES (?, ?)', (name, price))
            conn.commit()
            # Return a success message
            return jsonify({'message': 'Product added successfully'})
        except Exception as e:
            # Log the error and return an error message
            app.logger.error(str(e))
            return jsonify({'error': 'An error occurred while adding the product'})

    if __name__ == '__main__':
        app.run(debug=True)

The improvements are:
  • Added a try/except block around the code in the get_products() function.
  • Added logger statements to log errors.
  • Added debug=True to the app.run() statement.

These changes will help us troubleshoot and debug the code.
[End of response]

Packt library subscribers can continue reading the entire book for free.
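As a side note on the excerpt above, the try/except pattern from the Copilot response can be factored into a reusable decorator. The sketch below is deliberately framework-free so that it runs on its own; `log_errors` and this version of `get_products` are hypothetical names used for illustration and are not part of the book's code or of Flask. It also makes two choices the excerpt does not: it uses the standard library's `Logger.exception`, which records the traceback along with the message, and it returns an explicit 500 status alongside the error body.

```python
import functools
import logging

logger = logging.getLogger("products_api")


def log_errors(message):
    """Wrap a handler so failures are logged with a traceback (hypothetical helper)."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            try:
                # Success path: pair the handler's result with a 200 status.
                return handler(*args, **kwargs), 200
            except Exception:
                # logger.exception logs at ERROR level and appends the traceback.
                logger.exception("%s failed", handler.__name__)
                return {"error": message}, 500
        return wrapper
    return decorator


@log_errors("An error occurred while retrieving products")
def get_products(rows):
    # Convert (id, name, price) tuples into dictionaries.
    return [{"id": r[0], "name": r[1], "price": r[2]} for r in rows]


ok_body, ok_status = get_products([(1, "pen", 1.5)])
bad_body, bad_status = get_products([(1,)])  # malformed row: IndexError is caught and logged
```

In a real Flask app the same idea would wrap the route functions, and the status code would be returned together with the `jsonify` response; that wiring is left out here to keep the sketch self-contained.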
You can buy AI-Assisted Programming for Web and Machine Learning here.

Get the eBook for $38.99 $26.99!

Other Python titles from Packt at up to 30% off
  • Get the eBook for $35.99 $24.99! Get the Print Book for $44.99 $32.99
  • Get the eBook for $35.99 $24.99! Get the Print Book for $44.99 $32.99
  • Get the eBook for $43.99 $29.99! Get the Print Book for $54.99 $40.99

Print discounts end in 5 days on the 22nd of September, 2024.

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!

Divya Anne Selvaraj
08 Oct 2024
12 min read

PythonPro #50: Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook Tips

Bite-sized actionable content, practical tutorials, and resources for Python programmers.#50:Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook TipsHi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published book, Offensive Security Using Python, which briefly discusses key practices such as input validation, secure authentication, session management, secure coding techniques, and the implementation of security headers.News Highlights: Python 3.13.0, released yesterday, adds an interactive interpreter, free-threaded mode, JIT compiler, and iOS/Android support; and Rev's Reverb models for ASR and diarization outperform other open-source models.Here are my top 5 picks from our learning resources today:10 Jupyter Notebook Features You Didn’t Know Exist📓A Guide to Modern Python String Formatting Tools🔠Modeling customers' decisions in Python with the Choice-Learn package🛍️Understanding Logarithmic Plots in Matplotlib: semilogx, semilogy, and loglog📈Best practices for securely consuming open source in Python — Ciara Carey🔐And, today’s Featured Study, evaluates the performance of AI models in geospatial code generation, revealing significant challenges in handling complex tasks, specific data formats, and specialised libraries.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.:This month's survey is still live, do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.Sign Up|Advertise🐍 Python in the Tech 💻 Jungle 🌳🗞️NewsPython3.13.0 Is Released: Released on October 7, 2024, the version includes a new interactive interpreter, free-threaded mode, and JIT compiler, and support for iOS and Android platforms.Introducing Reverb: The Future of Open-Source automatic speech recognition (ASR) and Diarization: Rev's new open-source models for ASR and speech diarization, built using Rev’s extensive human-transcribed English speech 
dataset, outperforms existing open-source models.💼Case Studies and Experiments🔬Using Kolmogorov-Arnold Networks (KAN) and Backtesting to Predict Stock Prices: Discusses predicting stock prices, focusing on deep learning models trained on historical data from Yahoo Finance.🎥Marketing Media Mix Models with Python & PyMC: a Case Study [PyCon DE & PyData Berlin 2024]: discusses how machine learning models can optimize marketing investments by analyzing various channels.📊Analysis10 Jupyter Notebook Features You Didn’t Know Exist: Discusses features including magic commands, interactive widgets, auto-reload for modules, in-notebook documentation, and collapsible headings.I Used Claude.ai to Create a Discord Bot — Here’s What I Learned About the State of AI Code Writing: Discusses the author's experience using Claude to rapidly generate Python code for a bot that deletes old Discord messages.🎓Tutorials and Guides🤓A Guide to Modern Python String Formatting Tools: Explains how to format values, create custom format specifiers, and embed expressions in strings. 
Read to learn practical techniques for dynamic string manipulation.DuckDB in Python in the Browser with Pyodide, PyScript, and JupyterLite: Shows you how to run DuckDB in Python within a browser environment and embed interactive Python environments in web pages.Tutorial: Creating a Twitter (X) Bot using Python: Explains how to build and deploy a Python-based Twitter (X) bot that autonomously tweets updates, including progress graphs, using the X API.Distilling python functions into LLM: Explains how to use the Instructor library to distill Python functions into a language model, enabling fine-tuning for function emulation using Pydantic type hints.Getting Started with Powerful Data Tables in Your Python Web Apps: Demonstrates building a finance app that fetches stock data, displays it interactively, and includes features like sorting, and graph visualization.Modeling customers decisions in Python with the Choice-Learn package: Introduces the Choice-Learn Python package, which simplifies implementing discrete choice models like Conditional Logit to predict customer decisions.Optimizing Inventory Management with Reinforcement Learning: A Hands-on Python Guide:Outlines how Q-learning helps balance holding and stockout costs by developing an optimal ordering policy.🔑Best Practices and Advice🔏Speeding up CRC-32 calculations in Mojo: Discusses speeding up CRC-32 calculations in Mojo, achieving an 18x improvement over Python's native implementation and reaching 3x slower performance compared to zlib library.Bad Schemas could break your LLM Structured Outputs: Explains how choosing the right response model dramatically impacts the performance of language models like GPT-4o and Claude, especially when using JSON mode or Tool Calling.Implementing a Python Singleton with Decorators: Explains how a decorator ensures only one instance of a class is created, using a _SingletonWrapper class to handle instantiation and simplifies global access.🎥Best practices for securely 
consuming open source in Python — Ciara Carey: Introduces a framework called Secure Supply Chain Consumption Framework (S2C2F) to help organizations improve open-source security.Understanding Logarithmic Plots in Matplotlib: semilogx, semilogy, and loglog: Walks you through plotting data with a logarithmic x-axis, y-axis, and both axes, respectively, and provides code snippets to generate these plots.🔍Featured Study: Current AI Models Fall Short in Geospatial Code Generation💥In "Evaluation of Code LLMs on Geospatial Code Generation," Gramacki et al. introduce a benchmark to assess LLMs' ability to handle tasks involving spatial reasoning and data processing.ContextLLMs generate code based on natural language inputs and are effective in general programming tasks, particularly in data science. Geospatial data science is a field focused on analysing spatial data tied to locations. It relies on libraries like GeoPandas and Shapely for tasks such as geo-coding, spatial analysis, and data visualisation. However, the domain poses unique challenges for LLMs due to the need for spatial reasoning and the use of specialised tools, making evaluation in this area crucial. 
As geospatial applications expand in industries such as urban planning and environmental science, reliable AI assistance is becoming increasingly important.

Key Findings
  • LLMs underperform in geospatial tasks: Models like Code Llama and Starcoder2 show reduced accuracy compared to their performance in general coding.
  • Starcoder2-7B leads but struggles: It achieved a pass@1 score of 32.47%, highlighting the difficulty of geospatial tasks even for top-performing models.
  • Complex tasks pose a challenge: Single-step tasks had a 45.45% pass@1 success rate, but multi-step tasks were far more difficult, scoring only 15.15%.
  • Data format matters: Models handled GeoDataFrames better than other formats like GeoJSON, showing varying levels of tool proficiency.
  • Limited tool support: Libraries like MovingPandas and OSMNX, crucial for geospatial analysis, were inadequately supported by the models.

What This Means for You
This study is relevant for geospatial programmers and data scientists seeking to automate coding tasks. Current LLMs are not yet reliable for complex geospatial tasks, highlighting a need for models specifically trained for the domain. Developers and researchers can benefit by focusing on improving AI models to better support geospatial data science workflows.

Examining the Details
The authors created a benchmark dataset categorising tasks by complexity, data format, and tool usage. The dataset includes 77 samples to test LLM performance on tasks like spatial reasoning and tool implementation. Evaluation metrics focused on accuracy and pass@1, with the results highlighting the models' struggles in handling geospatial problems. Libraries like GeoPandas and H3 were used to evaluate the models, while more complex tools like MovingPandas exposed the models' weaknesses.

This rigorous benchmark, publicly available for future research, sets a foundation for improving geospatial code generation in LLMs.
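For readers unfamiliar with the metric, pass@1 can be read as the share of tasks for which a generated solution passes the task's tests on the first attempt. The sketch below is a minimal illustration, assuming the widely used unbiased pass@k estimator from the code-generation literature; the paper's exact evaluation procedure may differ, and the per-task results are made-up toy values, not data from the study.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task, given n generated samples of which c are correct."""
    if n - c < k:
        # Every possible draw of k samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average the per-task estimates over a list of (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Toy data (hypothetical): four tasks, one sample each, two of them correct.
toy_results = [(1, 1), (1, 0), (1, 0), (1, 1)]
score = benchmark_pass_at_k(toy_results, k=1)
print(f"pass@1 = {score:.2%}")
```

If each task is scored on a single sample, a benchmark of 77 tasks moves pass@1 by about 1.3 percentage points (1/77) per task solved, which is worth keeping in mind when comparing models whose scores are close.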
The study’s methodology ensures it reflects real-world geospatial coding challenges, offering valuable insights for the development of more domain-specific AI tools.

You can learn more by reading the entire paper and accessing the benchmark dataset: geospatial-code-llms-dataset.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 3: An Introduction to Web Security with Python” in the book, Offensive Security Using Python by Rejah Rehim and Manindar Mohan, published in September 2024.

Proactive web security measures with Python

Python has become a versatile, widely used programming language in modern software development. Its ease of use, readability, and rich library support have made it a popular choice for developing web-based applications in a variety of industries. Python frameworks such as Django, Flask, and Pyramid have enabled developers to create dynamic and feature-rich web applications with speed and agility.

However, as Python web apps become more popular, there is a corresponding increase in the sophistication and diversity of attacks targeting these applications. Cybersecurity breaches can jeopardize valuable user data, interfere with corporate operations, and damage an organization’s brand. Python web applications are exposed to a variety of security vulnerabilities, including SQL injection, cross-site scripting (XSS), and cross-site request forgery (CSRF). The consequences of these vulnerabilities can be severe, demanding an effective cybersecurity strategy.

Developers must be proactive to counteract this.
By implementing security practices such as input validation, output encoding, and other secure coding guidelines early in the development lifecycle, developers can reduce the attack surface and improve the resilience of their Python web applications.

Although we are only discussing Python-based applications here, these practices are universal and should be implemented in web applications built with any technology stack.

To protect against a wide range of cyber threats, it is critical to implement strong best practices. This section explains key security practices that developers should follow while developing web apps.

Input validation and data sanitization

User input validation is essential for preventing code injection attacks. Malicious inputs can exploit vulnerabilities and cause unwanted commands to be executed. Proper data sanitization guarantees that user inputs are handled as data rather than executable code by eliminating or escaping special characters. Every entry point needs this treatment: Python’s built-in input() function returns raw, unvalidated text, and frameworks such as Flask expose incoming data through the request object, where it can be validated and sanitized before use.

Secure authentication and authorization

Restricting unauthorized access requires effective authentication and authorization procedures. Password hashing, which uses algorithms such as bcrypt or Argon2, adds an extra degree of security by ensuring that plaintext passwords are never saved. Two-factor authentication (2FA) adds an additional verification step to user authentication, increasing security. Role-Based Access Control (RBAC) allows developers to provide specific permissions to different user roles, guaranteeing that users only access functionality relevant to their responsibilities.

Secure session management

Keeping user sessions secure is critical for avoiding session fixation and hijacking attempts. Using secure cookies with the HttpOnly and Secure attributes prohibits client-side script access and ensures that cookies are only sent over HTTPS.
Session timeouts and measures such as session rotation can improve session security even further.

Secure coding practices

Following secure coding practices reduces a slew of possible vulnerabilities. Parameterized queries, made possible by libraries such as sqlite3, protect against SQL injection by separating data from SQL commands. Output encoding, achieved with techniques such as html.escape(), avoids XSS threats by converting user inputs to innocuous text. Similarly, avoiding functions such as eval() and exec() prevents uncontrolled code execution, lowering the likelihood of code injection attacks.

Implementing security headers

Security headers are a fundamental component of web application security. They are HTTP response headers that instruct web browsers on how to behave when interacting with the web application. Properly configured security headers can mitigate various web vulnerabilities, enhance privacy, and protect against common cyber threats.

Here is an in-depth explanation of implementing security headers to enhance web application security:

Content Security Policy (CSP): CSP is a security feature that helps prevent XSS attacks. By defining and specifying which resources (scripts, styles, images, etc.) can be loaded, CSP restricts script execution to trusted sources. Implementing CSP involves configuring the Content-Security-Policy HTTP header in your web server. This header helps prevent inline scripts and unauthorized script sources from being executed, reducing the risk of XSS attacks significantly. An example of the CSP header is as follows:

Content-Security-Policy: default-src 'self'; script-src 'self' www.google-analytics.com;

HTTP Strict Transport Security (HSTS): HSTS is a security feature that ensures secure, encrypted communication between the web browser and the server. It prevents Man-in-the-Middle (MITM) attacks by enforcing the use of HTTPS.
Once a browser has visited a website with HSTS enabled, it will automatically establish a secure connection for all future visits, even if the user attempts to access the site via HTTP. An example HSTS header is as follows:

Strict-Transport-Security: max-age=31536000; includeSubDomains; preload;

X-Content-Type-Options: The X-Content-Type-Options header prevents browsers from interpreting files as a different media type, also known as a Multipurpose Internet Mail Extensions (MIME) type, than the one the server declares. It mitigates attacks such as MIME sniffing, where an attacker can trick a browser into interpreting content in an unintended way, potentially leading to security vulnerabilities. An example X-Content-Type-Options header is as follows:

X-Content-Type-Options: nosniff

X-Frame-Options: The X-Frame-Options header prevents clickjacking attacks by denying the browser permission to display a web page in a frame or iframe. This header ensures that your web content cannot be embedded within malicious iframes, protecting against UI redressing attacks. An example X-Frame-Options header is as follows:

X-Frame-Options: DENY

Referrer-Policy: The Referrer-Policy header controls what information is included in the Referer request header when a user follows a link to another page. By setting an appropriate referrer policy, you can protect sensitive information, enhance privacy, and reduce the risk of data leakage. An example Referrer-Policy header is as follows:

Referrer-Policy: strict-origin-when-cross-origin

Packt library subscribers can continue reading the entire book for free. You can buy Offensive Security Using Python, here.

Get the eBook for $39.99 $27.98! Get the Print Book for $49.99 $34.98!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
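As a follow-up to the security headers listed in the excerpt above, here is a minimal sketch of how the five example headers could be attached to every response of a Python web application. It is not taken from the book: the `SecurityHeadersMiddleware` class, the `SECURITY_HEADERS` dictionary, and `demo_app` are hypothetical names chosen for illustration, and the sketch assumes a plain WSGI application so that only the standard library is needed. The header values are the examples quoted in the excerpt (without the trailing semicolons) and would need tuning for a real site.

```python
# Hypothetical WSGI middleware (illustrative names, not from the book).
# Header values are the examples quoted in the excerpt above.
SECURITY_HEADERS = {
    "Content-Security-Policy": "default-src 'self'; script-src 'self' www.google-analytics.com",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}


class SecurityHeadersMiddleware:
    """Wrap a WSGI app and add security headers it has not set itself."""

    def __init__(self, app, headers=None):
        self.app = app
        self.headers = dict(SECURITY_HEADERS if headers is None else headers)

    def __call__(self, environ, start_response):
        def patched_start_response(status, response_headers, exc_info=None):
            # Header names are case-insensitive, so compare in lower case.
            present = {name.lower() for name, _ in response_headers}
            extra = [
                (name, value)
                for name, value in self.headers.items()
                if name.lower() not in present
            ]
            return start_response(status, list(response_headers) + extra, exc_info)

        return self.app(environ, patched_start_response)


def demo_app(environ, start_response):
    """Tiny stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello"]


app = SecurityHeadersMiddleware(demo_app)
```

Wrapping the application once keeps the policy in a single place instead of repeating it in every view. Frameworks such as Flask and Django offer their own hooks for the same purpose, which are usually preferable to a hand-written wrapper.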

Divya Anne Selvaraj
11 Sep 2024
2 min read

Master Python for Data, AI, and API Development

New Python books—designed for today’s needsMaster Python for Data, AI, and API DevelopmentHi ,Python powers some of the fastest-growing fields in tech today. According to the latest Python Developer Survey results, 47% of Python users apply it in data analysis, 42% in machine learning, and 39% in web development. With Python’s influence only expanding, staying ahead means mastering these key areas.Packt's August 2024 releases offer the practical expertise you need to enhance your Python skills, whether you're working with big data, building machine learning models, or developing high-performance APIs.Python Feature Engineering Cookbook - Third Editionby Soledad GalliA complete guide to crafting powerful features for your machine learning modelsEquips you with practical techniques for handling complex datasets, to craft features that will improve model performance.Learn to impute missing values, transform numerical variables, and extract powerful features from complex datasets like time series and transactional data.Get the eBook for $35.99 $24.99!Get the Print Book for $44.99!Polars Cookbook by Yuki KakegawaOver 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.xOptimise data analysis tasks with Python Polars, a blazingly fast alternative to pandas.Ideal for data professionals looking to improve performance across a variety of datasets, solve common data problems, perform complex transformations, and analyse time-series data.Get the eBook for $35.99 $24.99!Get the Print Book for $44.99!FastAPI Cookbook by Giunio De LucaDevelop high-performance APIs and web applications with PythonFastAPI is gaining ground rapidly, with 25% of Python developers now using it for web development.Learn how to use FastAPI’s modern, async-friendly features, and take your backend development to the next level with custom middleware and WebSockets.Get the eBook for $35.99 $24.99!Get the Print Book for 
$44.99!
Divya Anne Selvaraj
15 Oct 2024
11 min read

PythonPro #51: Python 3.13 REPL Enhancements, Python 3.12 vs. 3.13, and Visualizing Named Entities in Text

Bite-sized actionable content, practical tutorials, and resources for Python programmers.#51Notion for StartupsThousands of startups use Notion as a connected workspace to create and share docs, take notes, manage projects, and organize knowledge—all in one place.We’re offering 6 months of new Plus plans, including unlimited Notion AI so you can try it all for free!To redeem the Notion for Startups offer:1. Submit an application using our custom link: https://ntn.so/packt and select Packt on the partner list.2. Include our partner key: STARTUP4110P19151Get your Free 6-month Notion Plus Acceess!Hi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published book, Python Natural Language Processing Cookbook - Second Edition, which explains how to use the displaCy library from spacy to visualize named entities in text.News Highlights: PEP 762 in Python 3.13 adds multi-line editing, syntax highlighting, and custom commands to the REPL, and Pyinstrument 5 introduces a flamegraph timeline view for better code execution visualization.Here are my top 5 picks from our learning resources today:Python 3.12 vs Python 3.13 – performance testing⚡️Exploring Infrastructure as Code (IaC) with Python: AWS CDK, Terraform CDK, and Pulumi🏗️lintsampler : a new way to quickly get random samples from any distribution🎲Python and SysV shared memory🧠Gradient-Boosting anything (alert: high performance)🚀And, today’s Featured Study, presents a method using LLMs to generate precise, transparent code transformations, improving accuracy and efficiency for compiler optimizations and legacy refactoring.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.:This month's survey is still live, do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.Your cloud deserves dedicated data protection94% of cloud tenants were targeted last year, and 62% were 
successfully compromised.The hard truth is that organizations are having a hard time securing their cloud data—and cyberattackers are ready to exploit that challenge.Here’s a handy resource you’ll want with you as you map out your plan: Orchestrating the Symphony of Cloud Data Security.You’ll learn how to: Overcome the challenges of securing data in the cloud, Navigate multi cloud data security, and Balance data security with cloud economicsDownload Your Complimentary Copy NowSign Up|Advertise🐍 Python in the Tech 💻 Jungle 🌳🗞️NewsPEP 762 – REPL-acing the default REPL: As of Python 3.13, the default REPL has been replaced with a Python-based version (PEP 762), offering modern features like multi-line editing, syntax highlighting, and custom commands.Pyinstrument 5 - Flamegraphs for Python: The new version of the Python statistical profiler introduces a new flamegraph-style timeline view for visualizing code execution, improves on previous timeline modes, and more.💼Case Studies and Experiments🔬Moving all our Python code to a monorepo: pytendi: Describes the migration of Attendi’s Python codebase into a monorepo using the Polylith architecture to improve code discoverability, reusability, and developer experience.How Maintainable is Proficient Code? 
A Case Study of Three PyPI Libraries: Aims to help you recognize when proficient coding might hinder future maintenance efforts.📊AnalysisIn the Making of Python Fitter and Faster: Provides insights into how Python's evolving interpreter architecture enhances execution speed, memory efficiency, and overall performance for modern applications.Python 3.12 vs Python 3.13 – performance testing: Tests on AMD Ryzen 7000 and Intel 13th-gen processors show Python 3.13 generally performs faster, especially in asynchronous tasks, but there are slowdowns in certain areas.🎓Tutorials and Guides🤓Build a Contact Book App With Python, Textual, and SQLite: Covers creating the app’s text-based interface (TUI), setting up a SQLite database for contact storage, and integrating both elements.Syntactic Sugar: Why Python Is Sweet and Pythonic: Covers various Pythonic constructs like operators, assignment expressions, loops, comprehensions, and decorators, and shows how they simplify code.The Ultimate Guide to Error Handling in Python: Provides a comprehensive guide to Python error handling, exploring common patterns like "Look Before You Leap" (LBYL) and "Easier to Ask Forgiveness than Permission" (EAFP).Exploring Infrastructure as Code (IaC) with Python: AWS CDK, Terraform CDK, and Pulumi: Explains how Python integrates with IaC tools to automate cloud infrastructure management.Web scraping of a dynamic website using Python with HTTP Client: Walks you through analyzing sites with JavaScript-rendered content and using the Crawlee framework to extract data in JSON format.lintsampler : a new way to quickly get random samples from any distribution: Introduces a Python package designed to easily and efficiently generate random samples from any probability distribution.Mastering Probability with Python: A Step-by-Step Guide with Simulations:Through examples like coin tosses, dice rolls, and event probabilities, this tutorial guides you on how to simulate and analyze real-world scenarios.🔑Best 
Practices and Advice🔏What's In A List—Yes, But What's *Really* In A List: Explains common pitfalls when multiplying lists and why it matters when working with mutable versus immutable data types.Yes, you need to duplicate your frontend business logic on the server: Explains why backend validation is essential to protect data integrity, regardless of frontend sophistication.Python and SysV shared memory: Explains how to wrap C functions like shmget, shmat, and shmctl for shared memory management, handling void pointers, and performing basic operations like writing to shared memory.Gradient-Boosting anything (alert: high performance): Explores using Gradient Boosting with various machine learning models, adapting LSBoost in the Python package mlsauce for both regression and classification tasks.Code Generation with ChatGPT o1-preview as a Story of Human-AI Collaboration: Through experiments in Python and C++, the author demonstrates that human-AI collaboration improves code generation, specifically in building sentiment analysis tools.🔍Featured Study: Don't Transform the Code, Code the Transforms💥In "Don't Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs," researchers from Meta, Cummins et al., introduce a novel method called Code the Transforms (CTT), which leverages LLMs to generate precise code transformations rather than directly rewriting code.ContextCode transformation refers to rewriting or optimising existing code, a task essential for compiler optimisations, legacy code refactoring, or performance improvements. Traditional rule-based approaches to code transformations are difficult to implement and maintain. LLMs offer the potential to automate this process, but direct code rewriting by LLMs lacks precision and is challenging to debug. 
This study introduces the CTT method, where LLMs generate the transformation logic, making the process more transparent and adaptable.

Key Features of the CTT Method

Chain-of-thought process: The method synthesises code transformations by iterating through input/output examples to create a precise transformation logic rather than rewriting code directly.
Improved transparency and adaptability: The generated transformations are explicit, making them easier to inspect, debug, and modify when necessary.
Higher precision: The method achieved perfect precision in 7 out of 16 Python code transformations, significantly outperforming traditional direct rewriting approaches.
Reduced computational costs: By generating transformation logic instead of rewriting code, the method requires less compute and review effort compared to direct LLM rewriting.
Iterative feedback loop: The method incorporates execution and feedback to ensure the generated transformations work as expected, leading to more reliable outcomes.

What This Means for You

This study is particularly beneficial for software engineers, developers, and those working on compiler optimisations or legacy code refactoring. By using this method, teams can reduce the time spent on manual code review and debugging, while improving the precision of code transformations.

Examining the Details

The study's methodology involved testing 16 different Python code transformations across a variety of tasks, ranging from simple operations like constant folding to more complex transformations such as converting dot products to PyTorch API calls. The CTT method achieved an overall F1 score of 0.97, compared to the 0.75 achieved by the direct rewriting method. The precision of transformations ranged from 93% to 100%, with tasks like dead code elimination and redundant function elimination reaching near-perfect performance.
In contrast, the traditional direct LLM rewriting approach showed an average precision of 60% and was prone to more frequent errors, requiring manual correction.

You can learn more by reading the entire paper.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 7: Visualizing Text Data” in the book, Python Natural Language Processing Cookbook - Second Edition by Zhenya Antić and Saurabh Chakravarty, published in September 2024.

Visualizing NER

Named entity recognition, or NER, is a very useful tool for quickly finding people, organizations, locations, and other entities in texts. In order to visualize them better, we can use the displacy package to create compelling and easy-to-read images.

After working through this recipe, you will be able to create visualizations of named entities in a text using different formatting options and save the results in a file.

Getting ready

The displaCy library is part of the spacy package. You need at least version 2.0.12 of the spacy package for displaCy to work. The version in the poetry environment and requirements.txt file is 3.6.1.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.3_ner.ipynb.

How to do it...

We will use spacy to parse the sentence and then the displacy engine to visualize the named entities:

1. Import both spacy and displacy:

import spacy
from spacy import displacy

2. Run the language utilities file:

%run -i "../util/lang_utils.ipynb"

3. Define the text to process:

text = """iPhone 12: Apple makes jump to 5G
Apple has confirmed its iPhone 12 handsets will be its first to work on faster 5G networks.
The company has also extended the range to include a new "Mini" model that has a smaller 5.4in screen.
The US firm bucked a wider industry downturn by increasing its handset sales over the past year.
But some experts say the new features give Apple its best opportunity for growth since 2014, when it revamped its line-up with the iPhone 6.
"5G will bring a new level of performance for downloads and uploads, higher quality video streaming, more responsive gaming,
real-time interactivity and so much more," said chief executive Tim Cook.
There has also been a cosmetic refresh this time round, with the sides of the devices getting sharper, flatter edges.
The higher-end iPhone 12 Pro models also get bigger screens than before and a new sensor to help with low-light photography.
However, for the first time none of the devices will be bundled with headphones or a charger."""

4. In this step, we process the text using the small model. This gives us a Doc object. We then modify the object to contain a title. This title will be part of the NER visualization:

doc = small_model(text)
doc.user_data["title"] = "iPhone 12: Apple makes jump to 5G"

5. Here, we set up color options for the visualization display. We set green for the ORG-labeled text and yellow for the PERSON-labeled text. We then set the options variable, which contains the colors. Finally, we use the render command to display the visualization. As arguments, we provide the Doc object and the options we previously defined. We also set the style argument to "ent", as we would like to display just entities. We set the jupyter argument to True in order to display directly in the notebook:

colors = {"ORG": "green", "PERSON": "yellow"}
options = {"colors": colors}
displacy.render(doc, style='ent', options=options, jupyter=True)

The output should look like that in Figure 7.4.

Figure 7.4 – Named entities visualization

6. Now we save the visualization to an HTML file. We first define the path variable. Then, we use the same render command, but we set the jupyter argument to False this time and assign the output of the command to the html variable.
We then open the file, write the HTML, and close the file:

path = "../data/ner_vis.html"
html = displacy.render(doc, style="ent", options=options, jupyter=False)
html_file = open(path, "w", encoding="utf-8")
html_file.write(html)
html_file.close()

This will create an HTML file with the entities visualization.

Packt library subscribers can continue reading the entire book for free. You can buy Python Natural Language Processing Cookbook - Second Edition, here.

Get the eBook for $35.99 $17.99! Get the Print Book for $44.99 $30.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!

Divya Anne Selvaraj
10 Sep 2024
12 min read

PythonPro #45: Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for Django

Bite-sized actionable content, practical tutorials, and resources for Python programmers.#46:Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for DjangoHi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published, Python Feature Engineering Cookbook - Third Edition, which discusses using boxplots and the inter-quartile range (IQR) proximity rule to visualize outliers in data distributions.Related TitlesCovers numerous tools for mastering visualization including NumPy, Pandas, SQL, Matplotlib, and SeabornIncludes an introductory chapter on Python 3 basicsFeatures companion files with numerous Python code samples and figuresGet the eBook for $54.99 $37.99!Explores cutting-edge techniques using ChatGPT/GPT-4 in harmony with Python for generating visuals that tell more compelling data storiesTackles actual data scenarios and builds your expertise as you apply learned concepts to real datasetsGet the eBook for $54.99 $37.99!Covers Python-based data visualization libraries and techniquesIncludes practical examples and Gemini-generated code samples for efficient learningIntegrates Google Gemini for advanced data visualization capabilitiesGet the eBook for $51.99 $35.99!News Highlights: Python 3.13.0rc2 released with new interpreter, free-threaded build, JIT, and incremental garbage collection; Python survey shows pip dominance, rising interest in Conda, Poetry, and uv; and PSF expands CNA role to cover Pallets Projects like Flask and Jinja.Here are my top 5 picks from our learning resources today:Breaking Bell's Inequality with Monte Carlo Simulations in Python🔗Python QuickStart for People Learning AI🤖Integrating Stripe Into A One-Product Django Python Shop🛒Python HTTP Clients -Requests vs. HTTPX vs. AIOHTTP🌐A comparison of hosts / providers for Pythonserverless functions (a.k.a. 
FaaS)☁️And, today’s Featured Study, explores how ChatGPT can automate and streamline Python-based federated learning algorithm development, reducing human effort and improving coding efficiency.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.: This month’ssurvey is live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.Sign Up|Advertise @media only screen and (max-width: 100%;} #pad-desktop {display: none !important;} } 🐍 Python in the Tech 💻 Jungle 🌳🗞️NewsPython 3.13.0rc2 released: This version introduces several major features such as a new interactive interpreter, an experimental free-threaded build mode, preliminary JIT for performance, and incremental garbage collection.Packaging Trends in Python: Highlights from the 2023 Developer Survey: Results show a strong preference for pip, with emerging interest in Conda and Poetry, and a new player, uv.Python Software Foundation (PSF) Expands CNA Scope to Include Pallets Projects: The PSF has expanded its CVE Numbering Authority role to include Pallets Projects like Flask and Jinja, ensuring better vulnerability management.💼Case Studies and Experiments🔬Lessons learnt building a real-time audio application in Python: Key learnings covered include accepting inherent latency issues, leveraging modern operating systems' efficient memory management, and utilizing web browsers as effective interfaces for real-time applications.Breaking Bell's Inequality with Monte Carlo Simulations in Python: Discusses the use of Monte Carlo simulations in Python to challenge Bell's inequality through a quantum mechanics game.📊AnalysisRust for the small things?... 
but what about Python?: Explores the enduring relevance of Python in data engineering, despite the allure of Rust for performance and safety.Multiversion Python Thoughts: Delves into the complexities of implementing multi-version package imports in Python, motivated by the desire to handle incompatible library versions concurrently.🎓Tutorials and Guides🤓Python QuickStart for People Learning AI: Covers Python fundamentals, including data types, loops, and functions, and provides a concrete AI project example using the OpenAI API for summarizing research papers.Lists vs Tuples in Python: Explores the characteristics, uses, and differences between lists and tuples in Python, emphasizing their ordered nature, content diversity, mutability, and appropriate usage scenarios.Layman's Guide to Python Built-in Functions: Simplifies Python's built-in functions for beginners, providing plain English explanations and straightforward examples.🎥Some tricks with UV: Demonstrates how UV not only facilitates quicker installations but also supports running Python scripts with on-the-fly dependency management.Python 3 Module of the Week: A series of articles detailing diverse library functionalities ranging from text handling, data structures, and algorithms to more complex areas like cryptography and network communication.Integrating Stripe Into A One-Product Django Python Shop: Part two of a series on creating a one-product shop using Django, htmx, and Stripe. 
Covers creating a Stripe account, defining a product, and configuring a webhook for transaction notifications.Practical Introduction to Polars: Compares Polars' key functionalities with Pandas, offering practical examples to help users transition from Pandas to Polars for more efficient data analysis.🔑Best Practices and Advice🔏Understanding Python's __new__ Method Through a Magical Example: Introduces Python's lesser-known .__new__()method, used for creating instances before they're initialized with .__init__().Some fun with Python Enum: Explores the Enum class introduced in Python 3.4, detailing its benefits over using literal types for type-safety and avoiding errors in code.A comparison of hosts / providers for Pythonserverless functions (a.k.a. FaaS): Discusses various providers that support Python, their development experience (DevEx), pricing models, runtime limits, and other platform products.Python HTTP Clients -Requests vs. HTTPX vs. AIOHTTP: Details each library's strengths and appropriate use cases, helping developers choose the right tool based on project needs.Shades of testing HTTP requests in Python: Covers different techniques including mocking with AsyncMockand respx, parameterizing HTTP clients for flexible testing setups, and using integration tests with a Starlette server.🔍Featured Study: Streamlining Federated Learning with Python and ChatGPT💥In PTB-FLA Development Paradigm Adaptation for ChatGPT, Popovic et al. explore how AI can be used to streamline the development of federated learning algorithms (FLAs). The study adapts a Python-based development paradigm to leverage ChatGPT for improved speed and efficiency in coding for machine learning tasks.ContextFederated Learning (FL) allows machine learning algorithms to train across decentralized data sources, such as edge devices, without sharing the raw data. PTB-FLA is a Python framework designed to ease this process by providing a structured way for developers to create these algorithms. 
Traditionally, this has required significant human input. With ChatGPT, the authors of this paper aimed to reduce human effort by automating much of the coding work. This study is important because it shows how LLMs can help build complex systems like FL algorithms, particularly in environments such as edge computing, where efficiency and reduced human oversight are key.

Key Findings
The adapted four-phase paradigm reduced human labour by 50%, achieving double the speed of the original development method.
A new two-phase paradigm further streamlined the process, cutting human effort by 6 times compared to the original approach.
ChatGPT-generated code was of higher quality, showing fewer errors compared to human-generated versions in comparable tasks.
The study demonstrated a significant reduction in costs by reducing the size of ChatGPT prompts by 2.75 times.
Both adapted paradigms were successfully validated using logistic regression as a case study for federated learning.

What This Means for You
If you work with machine learning, particularly in decentralized systems like IoT or edge computing, this research is highly relevant. Using ChatGPT to develop federated learning algorithms can save you substantial time by automating coding tasks that would otherwise require significant effort. By adopting the two-phase paradigm, developers can expect faster, more efficient development cycles, allowing you to focus on innovation rather than repetitive coding. This also reduces costs when using AI-assisted tools like ChatGPT, as it optimises the prompt size.

Examining the Details
The study's methodology revolves around adapting an existing four-phase development process for federated learning into two paradigms tailored for ChatGPT. The original phases involved creating sequential code, transforming it into federated code, incorporating callbacks, and generating the final PTB-FLA code. 
The new two-phase paradigm simplifies this further by merging phases, allowing ChatGPT to generate the final federated code directly from the sequential code, bypassing intermediary steps. The team validated both paradigms through a case study using logistic regression. They iteratively refined the ChatGPT prompts to find the minimal context needed to achieve correct outputs, ensuring efficiency while maintaining code accuracy. The final results showed ChatGPT could develop high-quality code faster than humans, with far fewer resources.

You can learn more by reading the entire paper and accessing the PTB-FLA GitHub repository.

🧠 Expert insight💥
Here’s an excerpt from “Chapter 5: Working with Outliers” in the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, published in August 2024.

Visualizing outliers with boxplots and the inter-quartile proximity rule

A common way to visualize outliers is by using boxplots. Boxplots provide a standardized display of the variable’s distribution based on quartiles. The box contains the observations within the first and third quartiles, known as the Inter-Quartile Range (IQR). The first quartile is the value below which 25% of the observations lie (equivalent to the 25th percentile), while the third quartile is the value below which 75% of the observations lie (equivalent to the 75th percentile). The IQR is calculated as follows:

IQR = 3rd quartile - 1st quartile

Boxplots also display whiskers, which are lines that protrude from each end of the box toward the minimum and maximum values and up to a limit. These limits are given by the minimum or maximum value of the distribution or, in the presence of extreme values, by the following equations:

upper limit = 3rd quartile + IQR × 1.5
lower limit = 1st quartile - IQR × 1.5

According to the IQR proximity rule, we can consider a value an outlier if it falls beyond the whisker limits determined by the previous equations. 
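To make the IQR proximity rule concrete, here is a minimal sketch that applies the two limit equations by hand. It is not taken from the book: the iqr_limits helper, its fold parameter, and the sample values are hypothetical, and it relies on the standard library's statistics.quantiles with the "inclusive" method, which we assume matches the linear interpolation that pandas uses by default.

```python
from statistics import quantiles


def iqr_limits(values, fold=1.5):
    """Return (lower, upper) whisker limits under the IQR proximity rule.

    Hypothetical helper for illustration; `fold` is the 1.5 multiplier
    from the equations above.
    """
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Made-up sample: eight ordinary values and one extreme value.
incomes = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 15.0]

lower, upper = iqr_limits(incomes)
outliers = [v for v in incomes if v < lower or v > upper]

print(lower, upper)  # -0.5 7.5
print(outliers)      # [15.0]
```

For this sample the quartiles are 2.5 and 4.5, so the IQR is 2.0 and only 15.0 falls beyond the upper limit. With real data, the same limits could be derived from a pandas Series through its quantile() method.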
In boxplots, outliers are indicated as dots.

Note
If the variable has a normal distribution, about 99% of the observations will be located within the interval delimited by the whiskers. Hence, we can treat values beyond the whiskers as outliers. Boxplots are, however, non-parametric, which is why we also use them to visualize outliers in skewed variables.

In this recipe, we’ll begin by visualizing the variable distribution with boxplots, and then we’ll calculate the whiskers’ limits manually to identify the points beyond which we could consider a value an outlier.

How to do it...

We will create boxplots utilizing the seaborn library.

Let’s import the Python libraries and the dataset:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import fetch_california_housing

Modify the default background from seaborn (it makes prettier plots, but that’s subjective, of course):

    sns.set(style="darkgrid")

Load the California house prices dataset from scikit-learn:

    X, y = fetch_california_housing(
        return_X_y=True, as_frame=True)

Make a boxplot of the MedInc variable to visualize its distribution:

    plt.figure(figsize=(8, 3))
    sns.boxplot(data=X["MedInc"], orient="y")
    plt.title("Boxplot")
    plt.show()

In the following boxplot, we identify the box containing the observations within the IQR, that is, the observations between the first and third quartiles. We also see the whiskers. On the left, the whisker extends to the minimum value of MedInc; on the right, the whisker goes up to the third quartile plus 1.5 times the IQR. Values beyond the right whisker are represented as dots and could constitute outliers:

Figure 5.1 – Boxplot of the MedInc variable highlighting potential outliers on the right tail of the distribution

Note
As shown in Figure 5.1, the boxplot returns asymmetric boundaries denoted by the varying lengths of the left and right whiskers. 
This makes boxplots a suitable method for identifying outliers in highly skewed distributions. As we’ll see in the coming recipes, alternative methods to identify outliers create symmetric boundaries around the center of the distribution, which may not be the best option for asymmetric distributions.

Let’s now create a function to plot a boxplot next to a histogram:

    def plot_boxplot_and_hist(data, variable):
        f, (ax_box, ax_hist) = plt.subplots(
            2, sharex=True,
            gridspec_kw={"height_ratios": (0.50, 0.85)})
        sns.boxplot(x=data[variable], ax=ax_box)
        sns.histplot(data=data, x=variable, ax=ax_hist)
        plt.show()

Let’s use the previous function to create the plots for the MedInc variable:

    plot_boxplot_and_hist(X, "MedInc")

In the following figure, we can see the relationship between the boxplot and the variable’s distribution shown in the histogram. Note how most of MedInc’s observations are located within the IQR box. MedInc’s potential outliers lie on the right tail, corresponding to people with unusually high-income salaries:

Figure 5.2 – Boxplot and histogram – two ways of displaying a variable’s distribution

...

How it works...

In this recipe, we used the boxplot method from Seaborn to create the boxplots and then we calculated the limits beyond which a value could be considered an outlier based on the IQR proximity rule.

In Figure 5.2, we saw that the box in the boxplot for MedInc extended from approximately 2 to 5, corresponding to the first and third quartiles (you can determine these values precisely by executing X["MedInc"].quantile(0.25) and X["MedInc"].quantile(0.75)). We also saw that the whiskers start at MedInc’s minimum on the left and extend up to 8.013 on the right (we know this value exactly because we calculated it in step 8). MedInc showed values greater than 8.013, which were displayed in the boxplot as dots. Those are the values that could be considered outliers...

Packt library subscribers can continue reading the entire book for free. 
You can buy the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, here.
Get the eBook for $35.99 $24.99!

And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!

Divya Anne Selvaraj
03 Sep 2024
11 min read

PythonPro #45: Converting DataFrames, Python Developer Survey, DBSCAN in 5 Minutes, and Web Scraping with Scrapy

Bite-sized actionable content, practical tutorials, and resources for Python programmers.#45:Converting DataFrames, Python Developer Survey, DBSCAN in 5 Minutes, and Web Scraping with ScrapyHi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published, Polars Cookbook, which shows you how to convert DataFrames and Series between Polars and pandas.News Highlights: Python Developer Survey: 55% use Linux, 6% still on Python 2; SuperTree enables interactive decision tree visuals in Jupyter; and OneBusAway launches Python and JavaScript SDKs for seamless data integration.Here are my top 5 picks from our learning resources today:Exploring the National Park Service API - Harvesting and Visualizing Data for National Parks🌲Web Scraping With Scrapy and MongoDB🕸️DBSCAN, Explained in 5 Minutes🧩Python packaging is a MESS📦Why I Still Use Python Virtual Environments in Docker🛳️And, today’s Featured Study, highlights how process mining, using tools like pm4py, can uncover insights into workflow efficiency, variability, and algorithmic performance.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.: This month’ssurvey is now live. 
Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.Sign Up|Advertise🐍 Python in the Tech 💻 Jungle 🌳🗞️NewsPython Developer Survey - 55% Use Linux, 6% Use Python 2: The 7th annual Python Developers Survey, which gathered responses from over 25,000 developers worldwide also found that Visual Studio Code is the leading IDE.supertree - Interactive Decision Tree Visualization: This Python package is designed to create interactive visualizations of decision trees within Jupyter Notebooks, Jupyter Lab, Google Colab, and similar environments that support HTML rendering.OneBusAway Launches Official Python and JavaScript SDKs: Developed as part of the Google Summer of Code, these SDKs simplify the incorporation of OneBusAway's data, offer consistent API usage across platforms, and include comprehensive documentation.💼Case Studies and Experiments🔬Exploring the National Park Service API - Harvesting and Visualizing Data for National Parks: Provides a step-by-step guide on accessing the API, retrieving data such as park entrance fees, and organizing it into a Pandas DataFrame for analysis.Code Without Any Syntax: Discusses an experiment in which the author uses an LLM to convert natural language instructions into functional Python code without traditional syntax.📊AnalysisMake magic with Mesop - python based web apps: Reviews Mesop, a newly released Python-based framework for building web apps. Read for tips to get started.Why I Prefer Django for My Projects: While acknowledging the strengths of Node.js and Express.js, the author of this article finds Django's holistic, secure, and efficient approach better suited to their needs in web development.🎓Tutorials and Guides🤓Web Scraping With Scrapy and MongoDB: Guides you through setting up a Scrapy project, building a web scraper, extracting data, and storing it in MongoDB. 
Read to also learn about testing and debugging techniques.Generate Images With DALL·E and the OpenAI API: Covers setting up the necessary environment, making API calls to create images from text prompts, handling image variations, and converting Base64 JSON responses to PNG files.Primer on Jinja Templating: Covers installation, basic usage, and advanced features like loops, conditional statements, and macros. Read to learn how to integrate Jinja with Flask to build a basic web project with dynamic web pages.How to Install Python on Your System - A Guide: Provides a comprehensive guide to installing Python on various systems, including Windows, macOS, Linux, iOS, and Android.Adventures building a spreadsheet engine in Python: Demonstrates using the Lark Python package to parse formulas and compute dependencies, employing a topological sort algorithm to determine the order of cell evaluation.How to write your first Genetic Algorithm — Knapsack Problem: Guides you through implementing a genetic algorithm using Python. 
Read to learn how to apply genetic algorithms to solve complex optimization problems.Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Explained in 5 Minutes: Provides a concise explanation of the DBSCAN algorithm, which identifies clusters in data based on spatial distance and detects outliers without needing to predefine the number of clusters.🔑Best Practices and Advice🔏Escaping from Anaconda's Stranglehold on macOS: Provides simple, non-technical instructions to move the .zshrc file, allowing users to switch between Anaconda and official Python installations without terminal commands.Why I Still Use Python Virtual Environments in Docker: Argues that virtual environments simplify the management of Python applications, particularly in production settings, by ensuring consistent and isolated environments across different stages of development.Python Classes - The Power of Object-Oriented Programming: Covers defining classes, creating objects, managing attributes and methods, and the benefits of using classes. Read to learn about advanced topics like inheritance.Python packaging is a MESS: Stress-tests nine Python package managers, including pip, conda, poetry, and newer tools like pixi and hatch, highlighting the historical issues and modern solutions in Python packaging.Use python -m http.server in SSL: Provides a custom script, ssl_server.py, that wraps http.server to enable serving static sites over HTTPS using a self-signed SSL certificate. Read to learn how to serve static content securely.🔍Featured Study: Mastering Robotic Control with PyRoboCOP for Complex Tasks💥In "Navigating Process Mining: A Case Study using pm4py," Kovács et al., explore the application of the pm4py library in analysing road traffic fine management processes. 
The study aims to demonstrate how process mining can uncover key insights into process efficiency and optimisation.ContextProcess mining is a technique that combines data mining and business process management to analyse event logs generated by information systems. It is particularly effective for uncovering hidden patterns, identifying bottlenecks, and optimising workflows. The study focuses on applying the pm4py library, an open-source Python tool, to a real-world road traffic fine management process. This approach offers a deeper understanding of process execution compared to traditional business intelligence tools.Key FindingsThe study's application of process mining to road traffic fine management revealed significant insights into process variability, algorithmic performance, and workflow complexity:Process Variants: The analysis identified 231 distinct process variants, with one variant accounting for 56,482 cases (approximately 37.6% of the total 150,370 cases), indicating a dominant workflow path.Algorithm Performance: Three process mining algorithms were evaluated:Alpha Miner: Revealed causal dependencies between activities, achieving simplicity and precision scores of 0.66.Inductive Miner: Employed a recursive approach to construct process models, scoring 0.62 in simplicity and 0.58 in precision.Heuristic Miner: Utilised heuristics to infer process models from event data, achieving a perfect precision score of 1.0 but a lower simplicity score of 0.54.Start and End Events: The process log analysis showed that 'Create Fine' was the most frequent start event, occurring 150,370 times. Multiple end events, such as 'Send Fine', 'Payment', and 'Send for Credit Collection,' were identified, indicating diverse process pathways.Process Discovery and Visualisation: The discovered models allowed a detailed understanding of workflow structures and dependencies. 
Each mining approach had strengths and limitations in capturing the process dynamics, with pm4py proving effective in facilitating process mining tasks.What This Means for YouThis study is relevant to data scientists, business analysts, and operations managers interested in optimising business processes. The pm4py library, as demonstrated in this case study, provides practical tools for analysing complex workflows, identifying inefficiencies, and improving operational efficiency. The insights gained can be applied to other business processes, making it a valuable resource for those aiming to enhance process performance.Examining the DetailsThe study used the pm4py library to analyse an event log related to the management of road traffic fines, covering activities such as creating fines, sending fines, adding penalties, managing appeals, and handling payments. The analysis involved three process mining algorithms—Alpha Miner, Inductive Miner, and Heuristic Miner—to discover process models from the event log data. The evaluation of simplicity and precision across these algorithms revealed that the Heuristic Miner achieved the highest precision score of 1.0, while the Alpha Miner provided a balance between simplicity and accuracy.You can learn more by reading the entirepaper and accessing the pm4py library.🧠 Expert insight💥Here’s an excerpt from “Chapter 10: Interoperability with Other Python Libraries” in the Polars Cookbook,by Yuki Kakegawa, published in August 2024.Converting to and from a pandas DataFrameMany of you have used pandas before, especially in your day-to-day work. 
Although pandas and Polars are often compared as one-or-the-other tools, you can use these tools to supplement each other. Polars allows you to convert between pandas and Polars DataFrames, which is exactly what we’ll cover in this recipe.

📚Related Titles from Packt
Understand key data science algorithms with Python-based examples
Increase the impact of your data science solutions by learning how to apply existing algorithms
Take your data science solutions to the next level by learning how to create new algorithms
Get the eBook for $35.99 $24.99!
Conduct Bayesian data analysis with step-by-step guidance
Gain insight into a modern, practical, and computational approach to Bayesian statistical modeling
Enhance your learning with best practices through sample problems and practice exercises
Get the eBook for $55.99 $38.99!

Getting ready

You need pandas and pyarrow installed for this recipe to work. Execute the following code to make sure that you have them installed:

    pip install pandas pyarrow

How to do it...

Here’s how to convert to and from pandas DataFrames. 
We’ll first create a Polars DataFrame and then go through ways to convert back and forth between Polars and pandas:

Create a Polars DataFrame from a Python dictionary:

    import polars as pl

    df = pl.DataFrame({
        'a': [1,2,3],
        'b': [4,5,6]
    })
    type(df)

The preceding code will return the following output:

    >> polars.dataframe.frame.DataFrame

Convert a Polars DataFrame to a pandas DataFrame using the .to_pandas() method:

    pandas_df = df.to_pandas()
    type(pandas_df)

The preceding code will return the following output:

    >> pandas.core.frame.DataFrame

Convert a pandas DataFrame to a Polars DataFrame using the .from_pandas() method:

    df = pl.from_pandas(pandas_df)
    type(df)

The preceding code will return the following output:

    >> polars.dataframe.frame.DataFrame

If you want to allow zero copy operations, then you need to enable the use_pyarrow_extension_array parameter:

    df.to_pandas(use_pyarrow_extension_array=True).dtypes

The preceding code will return the following output:

    >>
    a    int64[pyarrow]
    b    int64[pyarrow]
    dtype: object

You can also create a Polars DataFrame by wrapping a pandas DataFrame using pl.DataFrame():

    type(pl.DataFrame(pandas_df))

The preceding code will return the following output:

    >> polars.dataframe.frame.DataFrame

How it works...

Polars has built-in methods to interoperate with pandas such as .from_pandas() and .to_pandas(). Each method is descriptive enough that you can see that .from_pandas() is used for reading data into Polars from pandas, whereas .to_pandas() is used to convert Polars objects into pandas.

The use_pyarrow_extension_array parameter of the .to_pandas() method uses PyArrow-supported arrays instead of NumPy arrays for the columns within the pandas DataFrame. 
This enables zero-copy operations and maintains the integrity of null values.

There’s more...

You can also convert between a pandas Series and a Polars Series:

    s = pl.Series([1,2,3])
    type(s.to_pandas())

The preceding code produces the following:

    >> pandas.core.series.Series

The .from_pandas() method returns a Series object when a pandas Series was passed in:

    type(pl.from_pandas(s.to_pandas()))

The preceding code produces the following:

    >> polars.series.series.Series

Packt library subscribers can continue reading the entire book for free. You can buy the Polars Cookbook, by Yuki Kakegawa, here.
Get the eBook for $35.99 $24.99!

And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
Divya Anne Selvaraj
03 Jun 2025
9 min read

PythonPro #71: Pandas 3.0 Ditches NumPy, Pyrefly vs. ty, and HuggingFace for Object Detection

Bite-sized actionable content, practical tutorials, and resources for Python programmers.#71Pandas 3.0 Ditches NumPy, Pyrefly vs. ty, and HuggingFace for Object DetectionHi ,Welcome to a brand new issue of PythonPro!News Highlights: Pandas 3.0 adopts PyArrow for faster string handling; Meta releases Pyrefly, a Rust-based type checker for large Python codebases; String Grouper gets 8× faster; and Muffin tops new ASGI benchmarks, beating FastAPI on JSON throughput.My top 5 picks from today’s learning resources:Pyrefly vs. ty: Comparing Python’s Two New Rust-Based Type Checkers⚙️Building an MCP server as an API developer🛰️Object Detection with Python and HuggingFace Transformers🖼️Matplotlib Alternatives That Actually Save You Time⏱️What's the Difference Between Zipping and Unzipping Your Jacket? • Unzipping in Python🧥And, in From the Cutting Edge, we introduce dro, a Python library that makes state-of-the-art distributionally robust optimization techniques practical and scalable for machine learning by unifying 79 methods into a single modular framework compatible with scikit-learn and PyTorch.Stay awesome!Divya Anne SelvarajEditor-in-ChiefSign Up|Advertise🐍 Python in the Tech 💻 Jungle 🌳🗞️NewsPython Pandas Ditches NumPy for Speedier PyArrow: Pandas 3.0 introduces PyArrow as a required dependency and default for string data, marking a shift toward faster, columnar data processing—though full replacement of NumPy as the backend remains experimental.Meta Open-Sources Pyrefly, a High-Performance Python Type Checker in Rust: The type checker is designed to replace the OCaml-based Pyre and support responsive, scalable IDE typechecking—especially for large codebases like Instagram.Even Faster String matching in Python: The latest version of String Grouper, a Python library for fuzzy string matching using TF-IDF and cosine similarity, is now 8× faster than its original release.Benchmarks for MicroPie v0.9.9.8: A benchmark comparing seven ASGI frameworks using a simple JSON 
"hello world" response showed that Muffin delivered the highest performance while FastAPI trailed with the lowest throughput.MonsterUI: Bringing Beautiful UI to FastHTML: MonsterUI is a Python library that simplifies frontend development for FastHTML apps by providing pre-styled, responsive UI components with smart defaults.💼Case Studies and Experiments🔬Rhyme Analysis of Virgil’s Æneid in English translation — Part 2: Uses Python and CMUDict to detect rhyme patterns in Edward Fairfax Taylor’s English translation of Virgil’s Æneid, achieving over 92% accuracy in capturing the Spenserian stanza structure.A Python frozenset interpretation of Dependent Type Theory: Illustrates how Python can serve as an intuitive metatheory for understanding complex type-theoretic concepts through executable, computable analogues.📊AnalysisPyrefly vs. ty: Comparing Python’s Two New Rust-Based Type Checkers: Compares two emerging Rust-based Python type checkers—pyrefly (by Meta) and ty (by Astral)—based on speed, design goals, incrementalization strategies, and type inference behavior.From Rows to Vectors: Under the Hood of DFEmbedder — A DataFrame Vector Store: Introduces DFEmbedder, an open source Python library that transforms tabular data into a low-latency vector store using static CPU-based embeddings.🎓Tutorials and Guides🤓Making C and Python Talk to Each Other: Covers locating and including Python.h, initializing and finalizing the Python interpreter, loading Python modules, calling Python functions (with and without arguments), and managing memory using PyObject references.Building an MCP server as an API developer: Walks you through building and deploying a stateless MCP server using Python, FastAPI, and AWS services, illustrating how to integrate OAuth-secured Strava APIs and support Streamable HTTP transport for LLM-assisted applications.Object Detection with Python and HuggingFace Transformers: Walks you through building an object detection pipeline while explaining how 
Transformer-based models like Detection Transformer (DETR) work and demonstrating a complete implementation.Expected Goals on Target (xGOT) 101: Explains a post-shot metric that improves on xG by factoring in shot placement, power, and trajectory—demonstrating how analysts use it to evaluate strikers’ finishing skill and goalkeepers’ shot-stopping, with a Python template.Regression Trees Explained: The Most Intuitive Intoduction: Offers a step-by-step explanation and Python implementation of regression trees, illustrating how they partition feature space and make predictions through recursive variance minimization.Efficiently dissolving adjacent polygons by attributes in a large GIS database: Demonstrates a step-by-step method with SQL and Python to cluster, merge, and reduce over 750,000 land-use records into fewer, generalized geometries.Tracking Urban Expansion Through Satellite Imagery: Covers selecting satellite imagery, preparing training data, computing indices, running classification, interpreting outputs, and validating results.🔑Best Practices and Advice🔏Matplotlib Alternatives That Actually Save You Time: Compares five modern Python visualization libraries—Plotly, Seaborn, Vega-Altair, Bokeh, and Plotnine—as more efficient, interactive, and expressive alternatives to Matplotlib.Automate Your Life: Five Everyday Tasks Made Easy With Python: Showcases five simple, real-world Python scripts—generating QR codes, converting text to speech, translating text, taking screenshots, and censoring profanity.Serving Deep Learning in AdTech: Offers practical guidance on choosing a model-serving approach based on system constraints, latency, and deployment needs.What's the Difference Between Zipping and Unzipping Your Jacket? 
• Unzipping in Python: Explains how Python’s zip() function not only combines multiple iterables into grouped tuples but can also be used in reverse—with unpacking—to "unzip" them back into separate iterables.The Chores Rota (#3 in The `itertools` Series • `cycle()` and Combining Tools): Uses a fictional story to teach Python's itertools.cycle() and zip() functions, illustrating how to create synchronized infinite iterators for task rotation.🔍From the Cutting Edge: DRO for ML💥In "DRO: A Python Library for Distributionally Robust Optimization in Machine Learning," Liu et al. introduce dro, a Python library that brings together state-of-the-art distributionally robust optimization (DRO) techniques into a single, modular, and scalable software package for supervised learning tasks.ContextDRO is a technique used in machine learning to build models that remain reliable under uncertainty—especially when there's a mismatch between training and deployment data distributions. This is crucial in high-stakes domains like healthcare, finance, and supply chain systems. DRO typically addresses this challenge by considering a worst-case loss over an ambiguity set: a collection of distributions close to the empirical training data under some metric.However, despite its theoretical promise, DRO has seen limited practical adoption due to the computational complexity of solving min-max problems and the lack of general-purpose libraries. Existing tools often either focus on a narrow subset of formulations or require users to manually reformulate and solve optimisation problems using external solvers.The dro library directly addresses these gaps. It offers the first comprehensive, ML-ready implementation of diverse DRO formulations within a unified, modular Python package. Compatible with both scikit-learn and PyTorch, dro abstracts away the need for manual optimisation reformulations and enables scalable training, evaluation, and experimentation with robust models. 
This makes cutting-edge DRO techniques accessible to both practitioners and researchers, and usable in real-world workflows.Key Features of droComprehensive coverage: The library supports 79 DRO method combinations across 14 formulations and 9 model backbones, covering linear, kernel-based, tree-based, and neural models.Seamless integration: All components follow the scikit-learn estimator interface and are compatible with PyTorch, enabling easy integration into existing machine learning workflows.Significant speed improvements: The library applies vectorisation, kernel approximation, and constraint reduction techniques to achieve 10× to 1000× speedups over baseline implementations.Flexible customisation: Users can personalise loss functions, model architectures, and robustness parameters through a modular design that supports both exact and approximate optimisation.Built-in diagnostics: The package includes tools to generate worst-case distributions and evaluate out-of-sample performance, supporting principled model assessment under distribution shift.What This Means for YouThe dro library is especially relevant for machine learning researchers, applied data scientists, and engineers working in high-stakes or shift-prone domains such as healthcare, finance, and logistics. It offers a practical pathway to integrate distributional robustness into real-world pipelines without requiring manual optimisation reformulations or deep expertise in convex programming. 
By unifying a wide range of DRO methods within a standardised, high-performance framework, dro enables users to develop models that remain reliable under uncertainty, experiment with robustness techniques at scale, and bridge the gap between theoretical advances and practical deployment.

Examining the Details

The dro library operationalises Distributionally Robust Optimization by solving min–max problems where the outer minimisation spans a model class and the inner maximisation ranges over an ambiguity set of plausible distributions. This ambiguity set is defined using distance metrics such as Wasserstein distances, f-divergences (KL, χ², Total Variation, CVaR), kernel-based distances like Maximum Mean Discrepancy (MMD), and hybrid measures including Sinkhorn and Moment Optimal Transport distances.

Exact optimisation is handled through disciplined convex programming using CVXPY, applicable to linear and kernel-based models with standard losses such as hinge, logistic, ℓ₁, and ℓ₂. For more complex architectures like neural networks and tree ensembles, the library employs approximate optimisation strategies using PyTorch, LightGBM, and XGBoost.

To enhance scalability, the authors implement performance-optimisation techniques such as constraint vectorisation, Nyström kernel approximation, and constraint subsampling or sparsification, significantly reducing computational overhead without sacrificing accuracy. The methodology is underpinned by modular abstractions that isolate model type, loss function, and robustness metric, making the framework both extensible and maintainable.

Additional tooling supports synthetic and real-world dataset generation, worst-case distribution derivation, and corrected out-of-sample evaluation.

You can learn more by reading the entire paper here and accessing the library on GitHub.

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!
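
For readers who want something concrete from the DRO discussion above: the inner maximisation is easiest to see for CVaR, one of the ambiguity sets the article lists. There, an adversary may reweight the n training losses as long as no single example receives more than 1/(αn) of the total probability mass, so the worst case simply piles weight onto the largest losses. The sketch below is plain Python written for illustration only. It is not the dro library's API, and the function name cvar_worst_case and the parameter alpha are our own assumptions.

```python
def cvar_worst_case(losses, alpha):
    """Worst-case expected loss over a CVaR ambiguity set (illustrative only).

    The adversary picks weights q with sum(q) == 1 and
    0 <= q[i] <= 1 / (alpha * n), then maximises sum(q[i] * losses[i]).
    `cvar_worst_case` and `alpha` are hypothetical names, not dro's API.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(losses)
    if n == 0:
        raise ValueError("losses must be non-empty")

    cap = 1.0 / (alpha * n)   # maximum mass allowed on one example
    weights = [0.0] * n
    remaining = 1.0           # probability mass still to be assigned

    # Greedy assignment: fill the largest losses first, up to the cap.
    for i in sorted(range(n), key=lambda i: losses[i], reverse=True):
        if remaining <= 0:
            break
        w = min(cap, remaining)
        weights[i] = w
        remaining -= w

    worst = sum(w * l for w, l in zip(weights, losses))
    return worst, weights


if __name__ == "__main__":
    losses = [1.0, 2.0, 3.0, 4.0]
    robust, q = cvar_worst_case(losses, alpha=0.5)
    plain, _ = cvar_worst_case(losses, alpha=1.0)  # alpha=1 gives the plain mean
    print("worst-case loss:", robust)
    print("average loss:   ", plain)
    print("worst-case weights:", q)
```

With alpha=0.5 the worst case averages only the two largest losses (3.5) instead of all four (2.5). A DRO learner would minimise this reweighted loss over model parameters rather than the plain mean, which is the outer half of the min–max problem.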