AI Tools | 0 articles | Tech News, Tutorials & Expert Insights

article-image-getting-started-with-med-palm-2

07 Sep 2023

5 min read

Getting Started with Med-PaLM 2

07 Sep 2023

Introduction Med-PaLM 2 is a large language model (LLM) from Google Research, designed for the medical domain. It is trained on a massive dataset of text and code, including medical journals, textbooks, and clinical trials. Med-PaLM 2 can answer questions about a wide range of medical topics, including diseases, treatments, and procedures. It can also generate text, translate languages, and write different kinds of creative content. Use Cases Med-PaLM 2 can be used for a variety of purposes in the healthcare industry, including: Medical research: Med-PaLM 2 can be used to help researchers find and analyze medical data. It can also be used to generate hypotheses and test new ideas. Clinical decision support: Med-PaLM 2 can be used to help doctors diagnose diseases and make treatment decisions. It can also be used to provide patients with information about their condition and treatment options. Health education: Med-PaLM 2 can be used to create educational materials for patients and healthcare professionals. It can also be used to answer patients' questions about their health. Drug discovery: Med-PaLM 2 can be used to help researchers identify new drug targets and develop new drugs. Personalized medicine: Med-PaLM 2 can be used to help doctors personalize treatment for individual patients. It can do this by taking into account the patient's medical history, genetic makeup, and other factors. How to Get Started Med-PaLM 2 is currently available to a limited number of Google Cloud customers. To get started, you can visit the Google Cloud website: https://cloud.google.com/ and sign up for a free trial. Once you have a Google Cloud account, you can request access to Med-PaLM 2. Here are the steps on how to get started with using Med-PaLM: 1. Check if Med-PaLM is available in your country. Med-PaLM is currently only available in the following countries: United States Canada United Kingdom Australia New Zealand Singapore India Japan South KoreaYou can check the Med-PaLM website: https://sites.research.google/med-palm/ for the latest list of supported countries. 2. Create a Google Cloud Platform (GCP) account. Med-PaLM is a cloud-based service, so you will need to create a GCP account in order to use it. You can do this by going to the GCP website: https://cloud.google.com/ and clicking on the "Create Account" button. 3. Enable the Med-PaLM API. Once you have created a GCP account, you will need to enable the Med-PaLM API. You can do this by going to the API Library: https://console.cloud.google.com/apis/library and searching for "Med-PaLM". Click on the "Enable" button to enable the API. 4. Create a Med-PaLM service account. A service account is a special type of account that can be used to access GCP resources. You will need to create a service account in order to use Med-PaLM. You can do this by going to the IAM & Admin: https://console.cloud.google.com/iam-admin/ page and clicking on the "Create Service Account" button. 5. Download the Med-PaLM credentials. Once you have created a service account, you will need to download the credentials. The credentials will be a JSON file that contains your service account's email address and private key. You can download the credentials by clicking on the "Download JSON" button. 6. Set up the Med-PaLM client library. There are client libraries available for a variety of programming languages. You will need to install the client library for the language that you are using. You can find the client libraries on the Med-PaLM website: https://sites.research.google/med-palm/. 7. Initialize the Med-PaLM client. Once you have installed the client library, you can initialize the Med-PaLM client. The client will need your service account's email address and private key in order to authenticate with Med-PaLM. You can initialize the client by using the following code: import medpalm client = medpalm.Client( email="your_service_account_email_address", key_file="your_service_account_private_key.json" ) 8. Start using Med-PaLM! Once you have initialized the Med-PaLM client, you can start using it to access Med-PaLM's capabilities. For example, you can use Med-PaLM to answer medical questions, generate text, and translate languages. Key Features Med-PaLM 2 has a number of key features that make it a valuable tool for the healthcare industry. These features include: Accuracy: Med-PaLM 2 is highly accurate in answering medical questions. It has been shown to achieve an accuracy of 85% on a variety of medical question answering datasets. Expertise: Med-PaLM 2 is trained on a massive dataset of medical text and code. This gives it a deep understanding of medical concepts and terminology. Versatility: Med-PaLM 2 can be used for a variety of purposes in the healthcare industry. It can answer questions, generate text, translate languages, and write different kinds of creative content. Scalability: Med-PaLM 2 is scalable and can be used to process large amounts of data. This makes it a valuable tool for research and clinical applications. Conclusion Med-PaLM 2 is a powerful LLM that has the potential to revolutionize the healthcare industry. It can be used to improve medical research, clinical decision support, health education, drug discovery, and personalized medicine. Med-PaLM 2 is still under development, but it has already demonstrated the potential to make a significant impact on healthcare.

0
0
79259

article-image-vertex-ai-workbench-your-complete-guide-to-scaling-machine-learning-with-google-cloud

Jasmeet Bhatia, Kartik Chaudhary

04 Nov 2024

15 min read

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

Jasmeet Bhatia, Kartik Chaudhary

04 Nov 2024

15 min read

1
1
67379

article-image-practical-ai-in-excel-create-a-linear-regression-model

M.T White

28 Jun 2023

12 min read

Practical AI in Excel: Create a Linear Regression Model

M.T White

28 Jun 2023

12 min read

0
0
66166

article-image-top-100-essential-data-science-tools-repos-streamline-your-workflow-today

Merlyn Shelley

27 Jun 2024

14 min read

Top 100+ Essential Data Science Tools & Repos: Streamline Your Workflow Today!

Merlyn Shelley

27 Jun 2024

14 min read

IntroductionAs data professionals, navigating the vast sea of Big Data often leaves us searching for the right tools to harness its potential. Whether we're defining intricate problems, identifying emerging trends, or crafting innovative solutions, the challenge is undeniable. Too often, this quest has us wandering aimlessly through the web, seeking elusive answers. Here at the DataPro Newsletter team, we understand this all too well. That's why, in celebration of our 100th edition, we're thrilled to present a special gift to our valued readers—a thorough reference module brimming with resources. This carefully curated collection features over 100 of the most popular tools and GitHub repositories. Each one is not only widely used and trusted but is also consistently updated with the latest breakthroughs to enhance your data processing capabilities. Think of this module as your treasure chest, designed to streamline your workflow and inspire innovative solutions. Bookmark this page for quick access whenever you encounter challenges in any area of data science and machine learning, from DataOps to Recommender Systems to Quantitative Finance—we've got it all covered! So, dive into this one-stop reference module, explore its depths, and let the spirit of data kinship propel you forward. Here's to more empowering tools and transformative insights from your DataPro team—cheers! DataOps/MLOps kestra-io/kestra: Kestra is an open-source orchestrator for scheduled and event-driven workflows, leveraging Infrastructure as Code for reliable management. open-metadata/OpenMetadata: OpenMetadata is a unified platform for data discovery, observability, and governance, featuring a central repository, column lineage, and team collaboration. dolthub/dolt: Dolt is a SQL database with Git-like version control features, accessible via MySQL or a command line interface. iterative/dvc: DVC is a tool for reproducible machine learning, enabling data and model versioning, lightweight pipelines, experiment tracking, and easy sharing. quiltdata/quilt: Quilt allows creating versioned datasets with Python and an S3 bucket. It supports data-driven teams, aiding rapid experimentation and collaboration. Real-time Data Processing allinurl/goaccess: GoAccess is a real-time web log analyzer for *nix systems and browsers, offering fast HTTP statistics. More details: goaccess.io. feathersjs/feathers: Feathers is a TypeScript/JavaScript framework for building APIs and real-time apps, compatible with various backends and frontends. apache/age: Apache AGE extends PostgreSQL with graph database capabilities, supporting both relational SQL and openCypher graph queries seamlessly. zephyrproject-rtos/zephyr: Real-time OS for diverse hardware, from IoT sensors to smart watches, emphasizing scalability, security, and resource efficiency. hazelcast/hazelcast: Hazelcast integrates stream processing and fast data storage for real-time insights, enabling immediate action on data-in-motion within unified platform. Data Quality Management WeBankFinTech/Qualitis: Qualitis manages data quality through verification, notification, and management across various data sources, solving data processing-related quality issues. raystack/optimus: Optimus is a robust workflow orchestrator for data transformation, modeling, pipelines, and quality management, emphasizing ease of use and reliability. Toloka/crowd-kit: Crowd-Kit is a Python library for crowdsourced annotation, featuring aggregation methods, metrics, and datasets to simplify working with crowd data. ydataai/ydata-profiling: ydata-profiling offers a streamlined, fast EDA solution akin to pandas' df.describe(), providing detailed DataFrame analysis exportable in formats like HTML and JSON. cleanlab/cleanlab: cleanlab automates data and label cleaning by detecting issues in ML datasets, enhancing model training with real-world data. Predictive Analytics spring-cloud/spring-cloud-dataflow: Spring Cloud Data Flow enables microservices-driven data processing pipelines on Cloud Foundry and Kubernetes, supporting diverse use cases like streaming and batch processing. ScottfreeLLC/AlphaPy: AlphaPy, a Python ML framework, caters to speculators and data scientists with scikit-learn, pandas, and additional tools for feature engineering and visualization. retentioneering/retentioneering-tools: Retentioneering simplifies analyzing clickstreams and user paths, offering deeper insights than funnel analysis, benefiting data and marketing analysts. genular/pandora: PANDORA offers advanced analytics for biomedical research, employing machine learning tools like clustering, PCA, UMAP, and interpretable models for discovery. nabeel-oz/qlik-py-tools: Qlik's SSE integrates modern data science into Qlik Sense, enabling business users to leverage advanced analytics through Python-based functions. Deep Learning Lightning-AI/pytorch-lightning: Lightning 2.0 simplifies PyTorch workflows with a stable API, enabling scalable training and deployment of AI models efficiently. ultralytics/yolov5: YOLOv5 by Ultralytics is a leading vision AI model, built on extensive open-source research and development for advanced performance. hpcaitech/ColossalAI: Colossal-AI simplifies distributed deep learning with user-friendly tools, enabling easy parallel training and inference similar to local model development. naptha/tesseract.js: Tesseract.js simplifies OCR with a webassembly-based Tesseract engine, supporting both browser and Node.js environments with easy integration and setup. microsoft/DeepSpeed: DeepSpeed enables efficient training of models like ChatGPT with significant speed improvements and cost reductions across all scales. Reinforcement Learning ray-project/ray: Ray is a unified framework that scales AI and Python applications with a distributed runtime and specialized AI libraries. d2l-ai/d2l-en: An open-source book using Jupyter notebooks to make deep learning accessible, blending concepts, context, and interactive code examples. Unity-Technologies/ml-agents: Unity ML-Agents enables games and simulations for training intelligent agents with deep reinforcement learning and imitation learning, fostering innovation in AI. google/trax: Trax is a Google Brain-endorsed deep learning library known for clear code and speed, demonstrated in a Colab notebook. wandb/wandb: The repository includes a CLI and Python API for visualizing and tracking machine learning experiments effectively. VowpalWabbit/vowpal_wabbit: Vowpal Wabbit advances machine learning with online, hashing, allreduce, and active learning techniques, pushing the frontier of ML capabilities. Time Series Analysis taosdata/TDengine: TDengine is a high-performance, open-source time-series database designed for IoT, connected cars, industrial IoT, and DevOps environments. timescale/timescaledb: An open-source SQL database for time-series data, optimized for rapid data ingestion and complex querying, available as a PostgreSQL extension. influxdata/telegraf: Telegraf is an agent for gathering and processing metrics, logs, and data, featuring 300+ plugins and community-driven development for flexibility. questdb/questdb: QuestDB is an open-source time-series database known for high throughput ingestion, fast SQL queries, and operational simplicity, ideal for various high-cardinality datasets. ccfos/nightingale: Nightingale is an all-in-one, open-source, cloud-native monitoring system combining data collection, visualization, and alerting capabilities seamlessly. Data Engineering PrefectHQ/prefect: Prefect simplifies Python data pipeline orchestration, transforming scripts into dynamic workflows that react to changes and ensure resilience. airbytehq/airbyte: Airbyte, an open-source data integration platform, offers 300+ connectors for seamless ELT pipelines between diverse data sources and destinations. argoproj/argo-workflows: Argo Workflows orchestrates parallel jobs on Kubernetes via container-native workflows, supporting DAGs and accelerating compute-intensive tasks like ML and data processing. dagster-io/dagster: Dagster is a cloud-native data pipeline orchestrator with integrated lineage, observability, declarative programming, and robust testability across the lifecycle. Avaiga/taipy: Taipy simplifies web app development for data scientists & ML engineers using Python, focusing on AI algorithms with no extra languages. Business Intelligence ankane/blazer: SQL-based tool for data exploration, chart creation, dashboard sharing. Supports various data sources, variables, checks, audits, and security integrations. evidence-dev/evidence: Open-source BI tool uses Markdown with SQL queries for data sourcing, rendering charts, and generating templated, dynamic web pages. lightdash/lightdash: Empower teams with self-service data insights using dbt: define metrics, visualize data, and share dashboards seamlessly across your organization. TuiQiao/CBoard: User-friendly open BI platform for self-service reporting and dashboards, simplifying data insights and sharing across teams effortlessly. quarylabs/quary: BI platform for engineers to connect databases, write SQL for table transformations, create charts, dashboards, and reports with collaboration and deployment capabilities. Data Visualization netdata/netdata: Real-time metrics collection and visualization for servers, cloud, Kubernetes, and edge/IoT devices, scaling effortlessly across diverse environments. directus/directus: Open-source API and dashboard for managing SQL database content with REST & GraphQL interfaces, supporting various databases, and customizable for on-premises or cloud deployment. airbnb/visx: Reusable low-level visualization components combining d3's power with React's DOM updating capabilities for dynamic data visualization. uber/react-vis: React component library for diverse data visualizations: line, bar, scatter, heatmaps, pie charts, sunbursts, radar charts, and more. bokeh/bokeh: Interactive visualization library for web browsers, offering versatile graphics creation and high-performance interactivity for large datasets and dashboards. apache/echarts: Free JavaScript library for intuitive, interactive, and customizable charts, ideal for enhancing commercial products with powerful visualizations. Recommender Systems NicolasHug/Surprise: Python scikit for building recommender systems with explicit rating data, emphasizing experiment control, dataset handling, and diverse prediction algorithms. gorse-io/gorse: Open-source recommendation system in Go, designed for universal integration into online services, automating model training based on user interaction data. recommenders-team/recommenders: Recommenders, a Linux Foundation project, offers Jupyter notebooks for building classic and cutting-edge recommendation systems, covering data prep, modeling, evaluation, optimization, and production deployment on Azure. alibaba/Alink: Alink, developed by Alibaba's PAI team, integrates Flink for ML algorithms. PyAlink supports various Flink versions, maintaining compatibility up to Flink 1.13. RUCAIBox/RecBole: RecBole, built on Python and PyTorch, facilitates research with 91 recommendation algorithms across general, sequential, context-aware, and knowledge-based categories. Quantitative Finance AI4Finance-Foundation/FinGPT: FinGPT is a cost-effective, adaptable financial large language model for quick updates and fine-tuning, enhancing accessibility compared to BloombergGPT. google/tf-quant-finance: This library leverages TensorFlow's hardware acceleration and automatic differentiation for high-performance mathematical methods, mid-level functions, and pricing models support. goldmansachs/gs-quant: GS Quant, a Python toolkit by Goldman Sachs, aids in developing quantitative trading strategies and risk management solutions with robust market experience. domokane/FinancePy: A Python finance library specializing in pricing and managing financial derivatives across fixed-income, equity, FX, and credit markets. romanmichaelpaolucci/Q-Fin: QFin is evolving with enhanced object-oriented principles, deprecating old modules like PDEs/SDEs, introducing 'stochastics' for model calibration and option pricing. avhz/RustQuant: This Rust library for quantitative finance covers diverse modules from autodiff and data handling to instruments pricing and stochastic processes. Responsible AI microsoft/responsible-ai-toolbox: Responsible AI Toolbox offers interfaces and libraries for model and data exploration, enabling developers to monitor and improve AI responsibly. Giskard-AI/giskard: Giskard, an open-source Python library, detects performance, bias, and security issues in AI applications, spanning LLMs to traditional ML models. fairlearn/fairlearn: Fairlearn, a Python package, helps developers assess and mitigate fairness issues in AI systems with algorithms and assessment metrics provided. Azure/PyRIT: PyRIT is an open-access Python tool for generative AI, aiding security professionals and ML engineers in identifying system risks. ModelOriented/DALEX: DALEX enhances model transparency to prevent failure through its explainability tools, supporting understanding and trust in complex AI systems. JohnSnowLabs/langtest: LangTest simplifies testing of AI models with over 60 tests in one line, covering robustness, bias, fairness, and accuracy across various NLP frameworks. Explainable AI (XAI) SeldonIO/alibi: Alibi is a Python library focused on machine learning model inspection, offering diverse explanation methods for classification and regression models. Trusted-AI/AIX360: AI Explainability 360 offers an open-source Python toolkit for detailed model interpretability across various data types, supporting diverse explanation methods. dssg/aequitas: Aequitas is an open-source toolkit for bias auditing and Fair ML, aiding data scientists and researchers in assessing and correcting model biases. albermax/innvestigate: iNNvestigate is a Python library providing a unified interface for various methods to analyze neural networks' predictions and understand their internal workings. mindsdb/lightwood: Lightwood is an AutoML framework simplifying machine learning pipelines with JSON-AI syntax, allowing customization and automation across diverse data types. Anomaly Detection SeldonIO/alibi-detect: Alibi Detect is a Python library for detecting outliers, adversarial attacks, and drift in tabular, text, image, and time series data. datamllab/tods: TODS automates outlier detection in multivariate time-series data with modules for data processing, feature analysis, and diverse detection algorithms. pygod-team/pygod: PyGOD is a Python library using PyTorch Geometric for graph outlier detection, offering 10+ algorithms and easy integration with PyOD. Jingkang50/OpenOOD: This repository replicates methods from the Generalized Out-of-Distribution Detection Framework for fair comparison across anomaly, novelty, and out-of-distribution detection methods. yzhao062/pyod: PyOD is a Python library for detecting anomalies in multivariate data, offering diverse algorithms for various project scales and datasets. chaos-genius/chaos_genius: Chaos Genius is an open-source ML-powered analytics engine for outlier detection and root cause analysis at scale. Supply Chain Analytics guacsec/guac: GUAC creates a high fidelity graph database for software security, facilitating organizational outcomes like audit, policy, and risk management. owasp-dep-scan/blint: BLint is a Binary Linter using lief to verify executable security and capabilities, now supporting SBOM generation for compatible binaries. samirsaci/picking-route: This repository focuses on improving warehouse productivity through Python-based tools and methodologies, particularly addressing order batching and optimizing picking routes using the Single Picker Routing Problem (SPRP). ragamarkely/scanalytics: Scanalytics automates Supply Chain Analytics & Design tasks in Python, streamlining analyses and reducing manual spreadsheet work for assignments. aitechtools/SunFlow: SunFlow optimizes supply chain design with comprehensive modeling of materials, components, suppliers, manufacturers, and customers, integrating costs, capacities, and constraints. CIOL-SUST/SupplyGraph: This repository introduces a benchmark dataset for applying Graph Neural Networks (GNNs) to supply chain networks, enabling research in optimization and prediction. Network Optimization ray-project/ray: Ray is a scalable framework with a distributed runtime and AI libraries designed to accelerate AI and Python applications. svg/svgo: SVGO optimizes SVG files by removing redundant metadata, comments, and hidden elements to improve file efficiency and rendering performance. zeux/meshoptimizer: meshoptimizer is a C/C++ library optimizing GPU rendering by reducing mesh complexity and storage overhead, compatible with Rust via meshopt crate. cvxpy/cvxpy: CVXPY is a Python-based modeling language designed for convex optimization problems, providing a natural expression format aligned with mathematical conventions. guofei9987/scikit-opt: The repository provides Python implementations of various swarm intelligence algorithms such as Genetic Algorithm, Particle Swarm Optimization, and others for optimization tasks. Speech Processing espnet/espnet: ESPnet is a detailed speech processing toolkit using PyTorch, covering recognition, synthesis, translation, enhancement, diarization, and understanding tasks. mozilla/DeepSpeech: DeepSpeech is an open-source Speech-To-Text engine based on Baidu's research, implemented using TensorFlow for accessibility and performance. microsoft/SpeechT5: The repository proposes SpeechT5, adapting T5's text-to-text approach for self-supervised speech and text representation learning using shared encoders and modality-specific nets. sloria/TextBlob: Python library simplifying NLP tasks like POS tagging, sentiment analysis, and classification with a straightforward API for textual data. pytorch/audio: Torchaudio integrates PyTorch with audio processing, emphasizing GPU acceleration, trainable features via autograd, and maintaining a consistent tensor-based style. Graph Data Science neo4j/graph-data-science: The Neo4j Graph Data Science (GDS) library offers graph algorithms, transformations, and ML pipelines, accessible via Cypher within Neo4j. cncf/landscape-graph: This repository explores open source project dynamics, evolution, and collaboration using a Graph Data Model for insightful community analysis. BlueBrain/nexus: Blue Brain Nexus organizes and enhances data with a Knowledge Graph ecosystem, featuring various products, libraries, and tools for comprehensive use. lynxkite/lynxkite: LynxKite is a robust graph data science platform with a user-friendly interface and powerful Python API for large datasets. dgraph-io/dgraph: Dgraph is a scalable GraphQL database optimized for performance, offering ACID transactions and distributed architecture for real-time queries. arangodb/arangodb: ArangoDB is a versatile multi-model database supporting documents, graphs, and key-values, empowering high-performance applications with SQL-like queries and JavaScript extensions. ETL/ELT (Extract, Transform, Load / Extract, Load, Transform) redpanda-data/connect: Redpanda Connect is a robust stream processor for seamless data integration, featuring a powerful mapping language and easy deployment options. turbot/steampipe: Steampipe simplifies data access from APIs with CLI, Postgres FDWs, SQLite extensions, export tools, and cloud-based Turbot Pipes. risingwavelabs/risingwave: RisingWave is a cost-efficient streaming database compatible with Postgres, designed for real-time event streaming data processing and analysis. apache/dolphinscheduler: Apache DolphinScheduler is a modern data orchestration platform with low-code workflow creation, robust task management, and cloud-native capabilities. rudderlabs/rudder-server: RudderStack is a privacy-focused, Segment-alternative platform in Golang and React. It simplifies data collection and integrates with warehouses and tools for enriched customer data pipelines. We hope this extensive collection of tools and techniques proves to be a valuable asset in your daily data practice. May it help you achieve smoother workflows and better outcomes!

1
0
65583

article-image-everything-you-need-to-know-about-pinecone-a-vector-database

Avinash Navlani

08 Jun 2023

5 min read

Everything you need to know about Pinecone – A Vector Database

Avinash Navlani

08 Jun 2023

5 min read

In this 21st century of information, we need efficient reliable storage and faster information retrieval. Relational or older databases are the most crucial databases for any computer application, but they are unable to handle the data in different forms such as documents, key-value pairs, and graphs. Vector database is a novel approach that uses vectorization for efficient search, storage, and data analysis. Image 1: Traditional Vs Vector Database Pinecone is one such vector database that is widely accepted across the industry for addressing challenges such as complexity and dimensionality. Pinecone is a cloud-native vector database that handles high-dimensional vector data. The core underlying approach for Pinecone is based on the Approximate Nearest Neighbor (ANN) search that efficiently locates faster matches and ranks them within a large dataset. In this tutorial, our focus will be on the pinecone database, its features, challenges, and use cases. Working Mechanism Traditional databases search for exact query matches while vector databases search for the most similar vector to the input query. It uses ANN (Approximate Nearest Neighbour) search. It provides approximate results at high performance, accuracy, and speed. Let's see the vector database working mechanism. Image 2: Vector Database Query Mechanism Vector databases first convert data into vectors and create indexing for faster searching. Vector database compares the indexed vector query and indexed vector in the database using the nearest neighbor or similarity matrix and computes the nearest most similar results. Finally, it post-processes the most similar results given by the nearest neighbor. Features Pinecone is a cloud-based vector database that offers various features and benefits to the infrastructure community: Fast and fresh vector search: Pinecone provides ultra-low query latency, even with billions of items. This means that users will always get a great experience, even when searching large datasets. Additionally, Pinecone indexes are updated in real-time, so users always have access to the most up-to-date information. Filtered vector search: Pinecone allows you to combine vector search with metadata filters to get more relevant and faster results. For example, you could filter by product category, price, or customer rating. Real-time updates: Pinecone supports real-time data updates, allowing for dynamic changes to the data. This contrasts with standalone vector indexes, which may require a full re-indexing process to incorporate new data. It has reliability, massive scalability, and security capability. Backups and collections: Pinecone handle the routine operation of backing up all the data stored in the database. You can also selectively choose specific indexes that can be backed up in the form of “collections,” which store the data in that index for later use. User-friendly API: Pinecone provides a user-friendly API layer that simplifies the development of high-performance vector search applications. This API layer is also language-agnostic, so you can use it with any programming language. Programming language integration: It supports a wide range of programming languages for integration. Cost-effectiveness: It is cost-effective because it offers cloud-native architecture. It offers pay-per-use based pricing. Challenges Pinecone vector database offers high-performance data search at a higher scale, but it also faces a few challenges such as: Application integration with other applications will evolve over a period. Data privacy is the biggest concern for any database. Organizations need to implement proper authentication and authorization mechanisms. Vector-based models don’t explain the model's interpretability. So, it is challenging to interpret the underlying reason behind those relationships. Use cases Pinecone has a variety of real-life industry applications. Let’s discuss a few applications: Audio/Textual Search: Pinecone offers faster, fully deployment-ready search and similarity functionality for high-dimensional text and audio data. Natural language Processing: Pinecone utilizes AutoGPT to create context-aware solutions for document classification, semantic search, text summarization, sentiment analysis, and question-answering systems. Recommendations: Pinecone enables personalized recommendations with efficient similar items recommendations that improve user experience and satisfaction. Image and Video Analysis: Pinecone also has the capability of faster retrieval of image and video content. It is very useful in real-life surveillance and image recognition. Time series similarity search: Pinecone can detect Time-series patterns in historical time-series data using a similarity search service. such core capability is quite helpful for recommendations, clustering, and labeling applications. Summary Pinecone vector database is a vector-based database that offers high-performance search and similarity matching. It can deal with high-dimensional vector data at a higher scale, easy integration, and faster query results. Pinecone provides a reliable, and faster, option for searching at a higher scale. Author BioAvinash Navlani has over 8 years of experience working in data science and AI. Currently, he is working as a senior data scientist, improving products and services for customers by using advanced analytics, deploying big data analytical tools, creating and maintaining models, and onboarding compelling new datasets. Previously, he was a university lecturer, where he trained and educated people in data science subjects such as Python for analytics, data mining, machine learning, database management, and NoSQL. Avinash has been involved in research activities in data science and has been a keynote speaker at many conferences in India.Link - LinkedIn Python Data Analysis, Third edition

0
0
58209

article-image-microsoft-ais-skeleton-key-automl-with-autogluon-multion-ais-retrieve-api-narrative-bis-hybrid-ai-pythons-duck-typing-gibbs-diffusion

05 Jul 2024

13 min read

Microsoft AI’s Skeleton Key, AutoML with AutoGluon, MultiOn AI's Retrieve API, Narrative BI’s Hybrid AI, Python's Duck Typing, Gibbs Diffusion

05 Jul 2024

13 min read

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,Happy Friday! Welcome to DataPro#101—Your Essential Data Science & ML Update! 🚀 This week, we’ve curated the latest techniques in data extraction, transforming unstructured data into structured formats, best practices for prompt engineering in NL2SQL, and much more. Consider this your all-in-one guide to staying informed in the ever-evolving world of data science and machine learning. Now, dive in and explore these exciting new ideas! ⚡ Tech Highlights: Stay Updated! Prompt Engineering with Claude 3: Learn hands-on techniques on Amazon Bedrock. Accelerated PyTorch: Boost models with torch.compile on AWS Graviton. BigQuery Data Canvas: Perfect your prompts. Skeleton Key AI: New AI jailbreak method. GraphRAG: Complex data discovery tool on GitHub. 📚 New from Packt Library Data Science for Web3 - Guide to blockchain data analysis and ML. 🔍 Latest in LLMs & GPTs NASA-IBM's INDUS Models: Advanced science LLMs. EvoAgent: Evolutionary multi-agent systems. Kyutai's Moshi: Real-time AI model. MultiOn AI's Retrieve API: Accurate web search. Gibbs Diffusion (GDiff): Bayesian image denoising. Narrative BI’s Hybrid AI: Business data analysis. WildGuard: Safe LLM interactions. ProgressGym: Ethical AI alignment. OmniParse: Structuring unstructured data for GenAI. ✨ What's Fresh Claude 3.5 Sonnet Use Cases: Future AI capabilities. Explainability in ML: Make models understandable. Group-By Aggregation: Powerful EDA tool. OpenAI and PandasAI: Series operations. AutoML with AutoGluon: ML in four lines of code. Python's Duck Typing: Flexible coding concept. 🔰 GitHub Finds: Add These Repos fal/AuraSR arcee-ai/Arcee-Spark-GGUF pprp/Pruner-Zero ruiyiw/patient-psi hrishioa/rakis ragapp/ragapp Doriandarko/claude-engineer hao-ai-lab/MuxServe DataPro Newsletter is not just a publication; it’s a complete toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."We appreciate your input and hope you enjoy the book!Share your Feedback!Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🔰 Data Science Tool Kit ➔ ️ fal/AuraSR: AuraSR, a GAN-based super-resolution model for upscaling images. Implemented in PyTorch, it's inspired by the GigaGAN paper, enhancing image quality significantly. ➔ arcee-ai/Arcee-Spark-GGUF: Arcee Spark, a 7B model from Qwen2, excels with fine-tuning and DPO, outperforming GPT-3.5 on tasks, ideal for efficient AI deployment. ➔ pprp/Pruner-Zero: Pruner-Zero automates symbolic pruning metric discovery for Large Language Models, surpassing current methods in language modeling and zero-shot tasks. ➔ ruiyiw/patient-psi: Patient-Ψ uses Large Language Models to simulate patient interactions for training mental health professionals, emphasizing cognitive modeling and practical deployment. ➔ hrishioa/rakis: Rakis is a browser-based permissionless AI inference network enabling decentralized consensus without servers, emphasizing open-source and educational use. ➔ ragapp/ragapp: RAGapp simplifies enterprise use of Agentic RAG models, configurable like OpenAI's custom GPTs, deployable via Docker on cloud infrastructure. ➔ Doriandarko/claude-engineer: Claude Engineer, powered by Anthropic's Claude-3.5-Sonnet, aids software development through an interactive CLI blending AI model capabilities with file operations and web search. ➔ hao-ai-lab/MuxServe: MuxServe efficiently serves multiple LLMs using spatial-temporal multiplexing, optimizing memory and computation resources based on LLM popularity and characteristics. 📚 Expert Insights from Packt CommunityData Science for Web3: A comprehensive guide to decoding blockchain data with data analysis basics and machine learning cases By Gabriela Castillo Areco Understanding the blockchain ingredients If you have a background in blockchain development, you may skip this section. Web3 represents a new generation of the World Wide Web that is based on decentralized databases, permissionless and trustless interactions, and native payments. This new concept of the internet opens up various business possibilities, some of which are still in their early stages. Currently, we are in the Web2 stage, where centralized companies store significant amounts of data sourced from our interactions with apps. The promise of Web3 is that we will interact with Decentralized Apps (dApps) that store only the relevant information on the blockchain, accessible to everyone. As of the time of writing, Web3 has some limitations recognized by the Ethereum organization: Velocity: The speed at which the blockchain is updated poses a scalability challenge. Multiple initiatives are being tested to try to solve this issue. Intuition: Interacting with Web3 is still difficult to understand. The logic and user experience are not as intuitive as in Web2 and a lot of education will be necessary before users can start utilizing it on a massive scale. Cost: Recording an entire business process on the chain is expensive. Having multiple smart contracts as part of a dApp costs a lot for the developer and the user. Blockchain technology is a foundational technology that underpins Web3. It is based on Distributed Ledger Technology (DLT), which stores information once it is cryptographically verified. Once reflected on the ledger, each transaction cannot be modified and multiple parties have a complete copy of it. Two structural characteristics of the technology are the following: It is structured as a set of blocks, where each block contains information (cryptographically hashed – we will learn more about this in this chapter) about the previous block, making it impossible to alter it at a later stage. Each block is chained to the previous one by this cryptographic sharing mechanism. It is decentralized. The copy of the entire ledger is distributed among several servers, which we will call nodes. Each node has a complete copy of the ledger and verifies consistency every time it adds a new block on top of the blockchain. This structure provides the solution to double spending, enabling for the first time the decentralized transfer of value through the internet. This is why Web3 is known as the internet of value. Since the complete version of the ledger is distributed among all the participants of the blockchain, any new transaction that contradicts previously stored information will not be successfully processed (there will be no consensus to add it). This characteristic facilitates transactions among parties that do not know each other without the need for an intermediary acting as a guarantor between them, which is why this technology is known as trustless. The decentralized storage also takes control away from each server and, thus, there is no sole authority with sufficient power to change any data point once the transaction is added to the blockchain. Since taking down one node will not affect the network, if a hacker wants to attack the database, they would require such high computing power that the attempt would be economically unfeasible. This adds a security level that centralized servers do not have. This excerpt is from the latest book, "Data Science for Web3: A comprehensive guide to decoding blockchain data with data analysis basics and machine learning cases” written by Gabriela Castillo Areco. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here!⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ➤ Prompt engineering techniques and best practices: Learn by doing with Anthropic’s Claude 3 on Amazon Bedrock. In this blog post, the focus is on crafting effective prompts for generative AI models to achieve desired outputs. It emphasizes the importance of well-constructed prompts in guiding models like Claude 3 Haiku on Amazon Bedrock to produce accurate and relevant responses, showcasing examples of prompt variations and their impact. ➤ Accelerated PyTorch inference with torch.compile on AWS Graviton processors. In this blog post, AWS optimized PyTorch's torch.compile feature for AWS Graviton3 processors, significantly enhancing performance for Hugging Face and TorchBench model inference compared to the default eager mode. These optimizations, available from PyTorch 2.3.1, aim to streamline model execution on Graviton3-based Amazon EC2 instances. Google➤ How to write prompts for BigQuery data canvas? This blog post focuses on leveraging generative AI, specifically Gemini in BigQuery, to perform data tasks via natural language queries (NL2SQL and NL2Chart). It highlights how refining NL prompts can enhance query accuracy, promoting collaboration and efficiency among data professionals using BigQuery's data canvas tool. Microsoft➤ Microsoft AI Unveils Skeleton Key: A Novel Generative AI Jailbreak Method. This blog post discusses a newly discovered type of attack in generative AI called Skeleton Key, also known as Master Key. It explores how this attack bypasses AI guardrails, allowing models to generate unauthorized content, and outlines Microsoft's mitigation strategies using Prompt Shields in Azure AI. ➤ GraphRAG: New tool for complex data discovery now on GitHub. The update introduces GraphRAG, a graph-based approach to retrieval-augmented generation (RAG), now available on GitHub. It enhances information retrieval and response generation by automating knowledge graph extraction from text datasets, offering structured insights for global queries. An Azure-hosted API facilitates easy deployment without coding. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🔸 NASA-IBM Collaboration Develops INDUS Large Language Models for Advanced Science Research. The blog explores NASA's collaboration with IBM to develop INDUS, a suite of specialized language models (LLMs) tailored for scientific domains. INDUS enhances data analysis, retrieval, and curation across Earth science, heliophysics, and more, advancing research capabilities in diverse scientific disciplines. 🔸 EvoAgent: Expanding Expert Agents to Multi-Agent Systems with Evolutionary Algorithms. EvoAgent automates the extension of expert agents to multi-agent systems using evolutionary algorithms, applicable to any LLM-based agent framework. It enhances agent diversity and performance across tasks, exemplified in debates by generating varied opinions and improving content quality dynamically. 🔸 Kyutai Releases Moshi: A Real-Time AI Model that Understands and Speaks. Kyutai introduces Moshi, a real-time native multimodal foundation model surpassing GPT-4o functionalities. Moshi understands emotions, speaks with accents like French, and handles dual audio streams, enabled by joint pre-training on text and audio. It supports open-source transparency and runs efficiently on consumer hardware. 🔸 MultiOn AI's Retrieve API Boosts Web Search with Real-Time Accuracy for Advanced Applications. MultiOn AI has launched the Retrieve API, a cutting-edge tool for autonomous web information retrieval. It enhances data extraction from web pages with real-time processing, catering to diverse applications such as personalized shopping assistants, automated lead generation, and content creation tools, setting new standards in web data extraction technology. 🔸 Gibbs Diffusion (GDiff): A Bayesian Blind Denoising Method for Images and Cosmology. The study introduces Gibbs Diffusion (GDiff) as an innovative method for blind denoising with deep generative models. It enables simultaneous sampling of signal and noise parameters, improving Bayesian inference for scenarios like natural image denoising and cosmological data analysis, enhancing accuracy in noise characterization and signal recovery. 🔸 Narrative BI Introduces Hybrid AI Approach for Business Data Analysis: The research explores hybrid approaches in business data analysis, combining rule-based systems' precision with Large Language Models' (LLMs) pattern recognition. This integration aims to generate actionable insights from complex datasets, improving efficiency and accuracy in decision-making processes for businesses. 🔸 WildGuard: A Lightweight Moderation Tool for User Safety in LLM Interactions. The paper introduces WildGuard, an open and lightweight moderation tool for enhancing safety in Large Language Models (LLMs). It focuses on identifying malicious intent in user prompts, detecting safety risks in model responses, and evaluating model refusal rates. WildGuard achieves state-of-the-art performance across these tasks, addressing critical gaps in existing moderation tools. 🔸 ProgressGym: ML Framework for Ethical Alignment in Frontier AI. This research addresses the influence of AI systems, particularly large language models (LLMs), on human epistemology and societal values. It introduces progress alignment as a technical solution to prevent AI reinforcement of problematic moral beliefs. ProgressGym, an experimental framework, facilitates learning from historical data to advance real-world moral decision-making challenges. 🔸 OmniParse: AI Platform for Structuring Unstructured Data for GenAI Applications. OmniParse tackles the challenge of managing diverse unstructured data types—documents, images, audio, video, and web content—by converting them into structured formats optimized for AI applications. It integrates various tools like Surya OCR and Florence-2 for accurate data extraction, enhancing workflow efficiency and data usability across platforms. ✨ On the Radar: Catch Up on What's Fresh🔹 10 Use Cases of Claude 3.5 Sonnet: Unveiling the Future of Artificial Intelligence AI with Revolutionary Capabilities. Claude 3.5 Sonnet by Anthropic AI marks a leap forward in AI capabilities, showcasing versatility across diverse domains. It excels in generating n-body particle animations, interactive learning dashboards, escape room experiences, virtual psychiatry, interactive poster designs, educational visual demonstrations, customizable calendar applications, real-time object detection, financial tools, and advanced physics simulations. 🔹 Explainability, Interpretability and Observability in Machine Learning: The article explores the nuances of machine learning (ML) transparency through concepts like explainability, interpretability, and observability. It discusses their definitions, distinctions, and importance in fostering trust, accountability, and effective deployment of ML models across various industries and applications. 🔹 A Powerful EDA Tool: Group-By Aggregation. The article dives into Exploratory Data Analysis (EDA) techniques, focusing on group-by aggregation in Pandas. Using the Metro Interstate Traffic dataset as an example, it demonstrates how to derive insights such as monthly traffic progression, daily traffic profiles, hourly traffic patterns by weekday versus weekend, and identifying top weather conditions associated with congestion rates. 🔹 Using OpenAI and PandasAI for Series Operations: This article explores PandasAI, leveraging AI models like OpenAI to enhance Pandas data manipulation tasks. It covers querying Series values, creating new Series, conditional value setting, and reshaping data using natural language commands. Examples include summarizing statistics, conditional operations, and reshaping COVID-19 and NLS youth study datasets efficiently. 🔹 AutoML with AutoGluon: ML workflow with Just Four Lines of Code. The article explores AutoGluon, an automated machine-learning framework developed by Amazon Web Services (AWS). It discusses how AutoGluon simplifies the entire machine-learning process—from data preprocessing to model selection and hyperparameter tuning—making it accessible and efficient for users across various data types like tabular, text, and image data. 🔹 Understanding Python's Duck Typing: The article explores the concept of duck typing in Python, emphasizing behavior over type. It allows objects to be used based on their methods rather than explicit types, promoting flexibility and polymorphism. Duck typing simplifies code but requires careful handling to avoid runtime errors. See you next time!

0
0
57265

article-image-setting-up-openai-playground

Henry Habib

13 Feb 2024

9 min read

Setting Up OpenAI Playground

Henry Habib

13 Feb 2024

9 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, OpenAI API Cookbook, by Henry Habib. Integrate the ChatGPT API into various domains ranging from simple wrappers to knowledge-based assistants, multi-model, and conversational applicationsIntroductionThe OpenAI Playground is an interactive web-based interface designed to allow users to experiment with OpenAI’s language models, including ChatGPT. It’s a place where you can learn about the capabilities of these models by entering prompts and seeing the responses generated in real time. This platform acts as a sandbox where developers, researchers, and curious individuals alike can experiment, learn, and even prototype their ideas.In the Playground, you have the freedom to engage in a wide range of activities. You can test out different versions of the AI models, experimenting with various prompts to see how the model responds, and you can play around with different parameters to influence the responses generated. It provides a real-time glimpse into how these powerful AI models think, react, and create based on your input.Setting up your OpenAI Playground environmentGetting readyBefore you start, you need to create an OpenAI Platform account.Navigate to https://platform.openai.com/and sign in to your OpenAI account. If you do not have an account, you can sign up for free with an email address. Alternatively, you can log in to OpenAI with a valid Google, Microsoft, or Apple account. Follow the instructions to complete the creation of your account. You may need to verify your identity with a valid phone number.How to do it…1. After you have successfully logged in, navigate to Profile in the top right-hand menu, select Personal, and then select Usage from the left-hand side menu. Alternatively, you can navigate to https://platform.openai.com/account/usage after logging in. This page shows the usage of your API, but more importantly, it shows you how many credits you have available.2. Normally, OpenAI provides you a $5 credit with a new account, which you should be able to see under the Free Trial Usage section of the page. If you do have credits, proceed to step 4. If, however, you do not have any credits, you will need to upgrade and set up a paid account.3. You need not set up a paid account if you have received free credits. If you run out of free credits, however, here is how you can set up a paid account: select Billing from the left-hand side menu and then select Overview. Then, select the Set up paid account button. You will be prompted to enter your payment details and set a dollar threshold, which can be set to any level of spend that you are comfortable with. Note that the amount of credits required to collectively execute every single recipe contained in this book is not likely to exceed $5.4. After you have created an OpenAI Platform account, you should be able to access the Playground by selecting Playground from the top menu bar, or by navigating to https://platform. openai.com/playground.How it works…The OpenAI Playground interface is, in my experience, clean, intuitive, and designed to provide users easy access to OpenAI’s powerful language models. The Playground is an excellent place to learn how the models perform under different settings, allowing you to experiment with parameters such as temperature and max tokens, which influence the randomness and length of the outputs respectively. The changes you make are instantly reflected in the model’s responses, offering immediate feedback.As shown in Figure 1.1, the Playground consists of three sections: the System Message, the Chat Log, and the Parameters. You will learn more about these three features in the Running a completion request in the OpenAI Playground recipe.Figure 1.1 – The OpenAI PlaygroundNow, your Playground is set up and ready to be used. You can use it to run completion requests and see how varying your prompts and parameters affect the response from OpenAI.Running a completion request in the OpenAI PlaygroundIn this recipe, we will actually put the Playground in action and execute a completion request from OpenAI. Here, you will see the power of the OpenAI API and how it can be used to provide completions for virtually any prompt.Getting readyEnsure you have an OpenAI Platform account with available usage credits. If you don’t, please follow the Setting up your OpenAI API Playground environment recipe. All the recipes in this chapter will have this same requirement.How to do it…Let’s go ahead and start testing the model with the Playground. Let’s create an assistant that writes marketing slogans:1. Navigate to the OpenAI Playground.2. In the System Message, type in the following: You are an assistant that creates marketing slogans based on descriptions of companies. Here, we are clearly instructing the model of its role and context.3. In the Chat Log, populate the USER message with the following: A company that writes engaging mystery novels.4. Select the Submit button on the bottom of the page.5. You should now see a completion response from OpenAI. In my case (Figure 1.2), the response isUnlock the Thrilling Pages of Suspense with Our Captivating Mystery Novels!Figure 1.2 – The OpenAI Playground with prompt and completionNoteSince OpenAI’s LLMs are probabilistic, you will likely not see the same outputs as me. In fact, if you run this recipe multiple times, you will likely see different answers, and that is expected because it is built into the randomness of the model.How it works…OpenAI’s text generation models utilize a specific neural network architecture termed a transformer. Before delving deeper into this, let’s unpack some of these terms:Neural network architecture: At a high level, this refers to a system inspired by the human brain’s interconnected neuron structure. It’s designed to recognize patterns and can be thought of as the foundational building block for many modern AI systems.Transformer: This is a type of neural network design that has proven particularly effective for understanding sequences, making it ideal for tasks involving human language. It focuses on the relationships between words and their context within a sentence or larger text segment.In machine learning, unsupervised learning typically refers to training a model without any labeled data, letting the model figure out patterns on its own. However, OpenAI’s methodology is more nuanced. The models are initially trained on a vast corpus of text data, supervised with various tasks. This helps them predict the next word in a sentence, for instance. Subsequent refinements are made using Reinforcement Learning through Human Feedback (RLHF), where the model is further improved based on feedback from human evaluators.Through this combination of techniques and an extensive amount of data, the model starts to capture the intricacies of human language, encompassing context, tone, humor, and even sarcasm.In this case, the completion response is provided based on both the System Message and the Chat Log. The System Message serves a critical role in shaping and guiding the responses you receive from Open AI, as it dictates the model’s persona, role, tone, and context, among other attributes. In our case, the System Message contains the persona we want the model to take: You are an assistant that creates marketing slogans based on descriptions of companies.The Chat Log contains the history of messages that the model has access to before providing its response, which contains our prompt, A company that writes engaging mystery novels.Finally, the parameters contain more granular settings that you can change for the model, such as temperature. These significantly change the completion response from OpenAI. We will discuss temperature and other parameters in greater detail in Chapter 3.There’s more…It is worth noting that ChatGPT does not read and understand the meaning behind text – instead, the responses are based on statistical probabilities based on patterns it observed during training.The model does not understand the text in the same way that humans do; instead, the completions are generated based on statistical associations and patterns that have been trained in the model’s neural network from a large body of similar text. Now, you know how to run completion requests with the OpenAI Playground. You can try this feature out for your own prompts and see what completions you get. Try creative prompts such as write me a song about lightbulbs or more professional prompts such as explain Newton's first law.ConclusionIn conclusion, the OpenAI Playground offers a dynamic environment for exploring the capabilities of language models like ChatGPT. By setting up your account and navigating through its features, you can unlock endless possibilities for creativity, learning, and innovation. Experiment with prompts, adjust parameters, and observe real-time responses to gain insights into AI's potential. Whether you're a developer, researcher, or curious individual, the Playground provides a sandbox for unleashing your imagination and understanding AI's intricacies. With each completion request, you delve deeper into the world of artificial intelligence, discovering its nuances and expanding your horizons. Start your journey today and witness the power of AI in action.Author BioHenry Habib is a Manager at one of the world's top management consulting firms, advising F500 companies on analytics and operations, with a particular focus on building intelligent AI-driven solutions and tools to create impact. He is a passionate online instructor and educator, amassing a of more than 150K paid students and facilitating technical programs at large banks and governmental.A proponent in the no-code and LLM revolution, he believes that anyone can now create powerful and intelligent applications without any deep technical skills. Henry resides in Toronto, Canada with his wife, and enjoys reading AI research papers and playing tennis in his free time.

0
0
47001

article-image-setting-up-polars-for-data-analysis

Luca Zanna

23 Feb 2024

7 min read

Setting Up Polars for Data Analysis

Luca Zanna

23 Feb 2024

7 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, Data Analysis with Polars, by Luca Zanna. Leverage Polars, the lightning-fast dataframe library, to take your Python data analysis skills to the next levelIntroductionIn the ever-evolving landscape of data analysis, harnessing the right tools and methodologies can make all the difference. Welcome to a world where Polars, a powerful data manipulation library, takes center stage. This article is your gateway to unlocking the potential of Polars, and it begins by unraveling the essential components of the data analysis journey. From setting up virtual environments to simplifying data analysis in the cloud with Google Colab, we explore how Polars streamlines your path to insights. Whether you're a seasoned data analyst or just starting your journey, this guide will equip you with the knowledge and tools needed to make your data analysis endeavors efficient and rewarding. Join us as we delve into the fascinating realm of Polars and embrace a new era of data exploration.Installation and virtual environments We will not go through the installation of Python as that is outside the scope of the book. A visit to python.org will give all the information necessary to install Python. Now on to virtual environments. Understanding Virtual Environments and Their Benefits Imagine you have built a fantastic data analysis project using Polars. Your project uses: Python 3.8Polars version 0.15.1 Numpy 1.23.0 Now, you start a new project, and you want to use a newer Polars (0.16.14), along with Numpy and Arrow. So, the new project requires: Python 3.10 Polars 0.16.14 Numpy 1.24.0 Pyarrow 11.0.0 Upgrading Polars and Numpy libraries globally isn't a good idea. If Polars functions have changed between versions, your first project might stop working or give incorrect results with the new version. This is where virtual environments come in. Virtual environments create separate 'spaces' for each project: one for your first data analysis project and another for your new data pipeline project. You can set up a virtual environment manually or have your IDE set-up a virtual environment for you. If you decide to set it up manually, you can check out the guide at https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment. Installing and using Polars on a machine To install Polars, first make sure you are in a virtual environment. Then, type: pip install polars If you already have Polars installed and want to upgrade it, type: pip install polars --upgrade In the book we will use other libraries, including numpy, pandas, matplotlib. You can install them with the syntax above, and you can also install multiple libraries at the same tine: pip install numpy pandas matplotlib Let’s now get our development environment set-up. We will use Visual Studio code, but you are free to use any other IDE that you like. 1. Type code . in the command line to open Visual Studio Code. 2. Right-click on the left, choose New File, and create first_dataframe.ipynb. Figure – Creating a new file in Visual Studio Code Files with extension .ipynb are Jupyter Notebook files, which are great for data analysis. To work with these files you need to install the Jupyter extension on VS Code. You can do that by clicking on ‘Extensions’ on the left bar, searching for Jupyter, installing it, and activating it. Figure – Install Jupyter extension in Visual Studio Code 3. Now back to our file. The first thing to ensure is that we are using Python from our virtual environment. Click on Select Kernel at the top right, then click on the Python that starts with env/: that will be the Python for our virtual environment. Avoid the paths starting with /usr and /bin as those are the system Python instead of our virtual environment. Figure – Select the Python interpreter in Visual Studio code Now, we're ready for Polars. 4. Type import polars as pl in the first cell and press Shift + Enter to run it. 5. Create a dataframe in the next cell by typing: df = pl.DataFrame({ 'a': ['Hello', 'World!'] }) 6. Press Shift + Enter to run the cell. This creates a dataframe called df with one column named 'a' and two rows: 'Hello' and 'World!' To see the dataframe, type df in the next cell and run it. Figure – Visual Studio code with first Polars dataframe We created our first Polars dataframe. Using Polars on the cloud with Google Colab Instead of installing Polars on your computer, you can also use it in the cloud. One popular cloud service for running code is Google Colab. This way, you don't need to install anything on your machine. To access Google Colab, visit https://colab.research.google.com/ in your web browser. Click on "New Notebook," and you'll see a page that looks similar to VS Code. Now, let's create the same Polars dataframe example in Google Colab: 1. In the first cell, type the following command to ensure we have the latest version of Polars: %pip install polars --upgrade 2. Next, enter this code to import Polars and create a dataframe: import polars as pl df = pl.DataFrame({ 'a': ['Hello', 'World !'] }) Finally, display the dataframe by typing: df And that's it! You now have your first Polars dataframe in Google Colab. Figure – Google Colab with first Polars dataframe ConclusionIn closing, Polars offers a bridge to the future of data analysis. With the knowledge and hands-on experience gained from this article, you're well-prepared to conquer the intricacies of data manipulation and visualization. The ability to effortlessly create, manipulate, and analyze data using Polars is a powerful tool in your arsenal. Whether you're a data enthusiast or a seasoned analyst, embracing Polars sets you on a path toward efficiency, precision, and data-driven success. As the data landscape continues to evolve, you're now equipped to stay ahead, make informed decisions, and revolutionize your approach to data exploration.Author BioLuca Zanna is a Data Engineer and Data Analyst with over 15 years of experience. He started his career as a financial data analyst after a Master's in Management and passing the Certified Public Accountant (CPA) exam. Luca spent a decade working on financial analysis systems at L’Oréal: developing the systems and training financial analysts across Europe and Asia.Currently, Luca helps companies with building data infrastructure to better leverage their data. Luca is also a corporate teacher for topics such as data analysis, SQL, Python, and cloud data engineering.

0
0
43446

article-image-everything-you-need-to-know-about-agentgpt

Avinash Navlani

02 Jul 2023

4 min read

Everything You Need to Know about AgentGPT

Avinash Navlani

02 Jul 2023

4 min read

Advanced language models have been used in the last couple of years to create a variety of AI products, including conversational AI tools and AI assistants. A web-based platform called AgentGPT allows users to build and use AI agents right from their browsers. Making AgentGPT available to everyone and promoting community-based collaboration are its key goals.ChatGPT provides accurate, meaningful, in-depth specific answers and discussion to given input questions while AgentGPT, on the other hand, is an AI agent platform that takes an objective and achieves the goal by thinking, learning, and taking actions.AgentGPT can assist you with your goals without installing and downloading. You just need to create an account and get the power of AI-enabled Conversational AI. You have to provide a name and objective for your agent, and the agent will achieve the goal.What is AgentGPT?AgentGPT is an open-source platform that is developed by openAI and uses the GPT3.5 architecture. AgentGPT is an NLP-based technology that generates human-like text with accuracy and fluency. It can engage in conversations, question-answers, generative content, and problem-solving assistance.How does AgentGPT work?AgentGPT breaks down a given prompt into smaller tasks, and the agent completes these specific tasks in order to achieve the goal. Its core strength is engaging in real and contextual conversation. It generates dynamic discussions while learning from the large dataset. It recognizes the intentions and responds in a way that is human-like.How to Use Agent GPT?Let’s first create an account on reworkd.ai. After creating the account, deploy the agent by providing the agent's name and objective.In the snapshot below, you can see that we are deploying an agent for Fake News Detection. As a user, we just need to provide two inputs: Name and Goal. For example, in our case, we have provided Fake News Detection as the name and Build Classifier for detecting fake news articles as a goal.Image 1: AgentGPT pageOnce you click the deploy agent. It starts identifying the task and add all the task in the queue. After that one by one, it executes all the tasks.Image 2: Queue of tasksIn the below snapshot, you can see it has completed the 2 tasks and working on the third task(Extract Relevant features). In all the tasks, it has also provided the code samples to implement the task.Image 3: Code samplesOnce your goal is achieved, you can save the results by clicking on the save button in the top-right corner.You can also improve the performance by providing relevant examples, using the ReAct approach for improving the prompting, and upgrading the version from local to Pro version.You also set up AgentGPT on the local machine. For detailed instructions, you can follow this link.SummaryCurrently, AgentGPT is in the beta phase, and the developer community is actively working on its features and use cases. It is one of the most significant milestones in the era of advanced large-language models. Its ability to generate human-like responses opens up potential opportunities for industrial applications such as customer service, content generation, decision support systems, and personal assistance.Author BioAvinash Navlani has over 8 years of experience working in data science and AI. Currently, he is working as a senior data scientist, improving products and services for customers by using advanced analytics, deploying big data analytical tools, creating and maintaining models, and onboarding compelling new datasets. Previously, he was a university lecturer, where he trained and educated people in data science subjects such as Python for analytics, data mining, machine learning, database management, and NoSQL. Avinash has been involved in research activities in data science and has been a keynote speaker at many conferences in India.Link - LinkedIn Python Data Analysis, Third edition

0
0
35626

article-image-getting-started-with-openai-whisper

Vivekanandan Srinivasan

30 Oct 2023

9 min read

Getting Started with OpenAI Whisper

Vivekanandan Srinivasan

30 Oct 2023

9 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionIn the era of rapid technological advancements, speech recognition technology has emerged as a game-changer, revolutionizing how we interact with machines and devices.As we know, OpenAI has developed an exceptional Automatic Speech Recognition (ASR) system known as OpenAI Whisper.In this blog, we will deeply dive into Whisper, understanding its capabilities, applications, and how you can harness its power through the Whisper API.Understanding OpenAI WhisperWhat is OpenAI Whisper?Well, put-“You Speak…AI Writes” OpenAI Whisper is an advanced ASR system that converts spoken language into written text.Built on cutting-edge technology and trained on 680,000 hours of multilingual and multitask supervised data collected from the web, OpenAI Whisper excels in a wide range of speech recognition tasks, making it a valuable tool for developers and businesses.Why Does ASR (Automatic Speech Recognition) Matter?Automatic Speech Recognition (ASR) is not just a cutting-edge technology; it's a game-changer reshaping how we interact with our digital world. Imagine a world where your voice can unlock a wealth of possibilities. That's what ASR, with robust systems like Whisper leading the charge, has made possible.Let's dive deeper into the ASR universe.It's not just about making life more convenient; it's about leveling the playing field. ASR technology is like the magic wand that enhances accessibility for individuals with disabilities. It's the backbone of those voice assistants you chat with and the transcription services that make your voice immortal in text.But ASR doesn't stop there; it's a versatile tool taking over various industries. Picture this: ASR helps doctors transcribe patient records in healthcare with impeccable accuracy and speed. That means better care for you. And let's remember the trusty voice assistants like Siri and Google Assistant, always at your beck and call, answering questions and performing tasks, all thanks to ASR's natural language interaction wizardry. Setup and InstallationWhen embarking on your journey to harness the remarkable power of OpenAI Whisper for Automatic Speech Recognition (ASR), the first crucial step is to set up and install the necessary components.In this section, we will guide you through starting with OpenAI Whisper, ensuring you have everything in place to begin transcribing spoken words into text with astonishing accuracy.Prerequisites Before you dive into the installation process, it's essential to make sure you have the following prerequisites in order:OpenAI Account To access OpenAI Whisper, you must have an active OpenAI account. If you still need to sign up, visit the OpenAI website and create an account.API Key You will need an API key from OpenAI to make API requests. This key acts as your access token to use the Whisper ASR service. Ensure you have your API key ready; if you don't have one, you can obtain it from your OpenAI account.Development Environment It would help to have a functioning development environment for coding and running API requests. You can use your preferred programming language, Python, to interact with the Whisper API. Make sure you have the necessary libraries and tools installed.Installation Steps Now, let's walk through the steps to install and set up OpenAI Whisper for ASR:1. Install the OpenAI Python LibraryIf you haven't already, you must install the OpenAI Python library. This library simplifies the process of making API requests to OpenAI services, including Whisper. You can install it using pip, the Python package manager, by running the following command in your terminal: pip install openai2. Authenticate with Your API KeyYou must authenticate your requests with your API key to interact with the Whisper ASR service. You can do this by setting your API key as an environment variable in your development environment or by directly including it in your code. Here's how you can set the API key as an environment variable:import openai openai.api_key = "YOUR_API_KEY_HERE" Replace "YOUR_API_KEY_HERE" with your actual API key.3. Make API RequestsWith the OpenAI Python library installed and your API key adequately set, you can now start making API requests to Whisper. You can submit audio files or chunks of spoken content and receive transcriptions in response. import openai response = openai.Transcription.create( model="whisper", audio="YOUR_AUDIO_FILE_URL_OR_CONTENT", language="en-US" # Adjust language code as needed ) print(response['text']) Replace "YOUR_AUDIO_FILE_URL_OR_CONTENT" with the audio source you want to transcribe.Testing Your SetupAfter following these installation steps, testing your setup with a small audio file or sample content is a good practice.This will help you verify that everything functions correctly and that you can effectively convert spoken words into text.Use Cases And ApplicationsTranscription ServicesWhisper excels in transcribing spoken words into text. This makes it a valuable tool for content creators, journalists, and researchers whose work demands them to convert audio recordings into written documents.Voice Assistants Whisper powers voice assistants and chatbots, enabling natural language understanding and interaction. This is instrumental in creating seamless user experiences in applications ranging from smartphones to smart home devices.AccessibilityWhisper enhances accessibility for individuals with hearing impairments by providing real-time captioning services during live events, presentations, and video conferences.Market ResearchASR technology can analyze customer call recordings, providing businesses with valuable insights and improving customer service.Multilingual SupportWhisper supports multiple languages, making it a valuable asset for global companies looking to reach diverse audiences.Making Your First API CallNow that you have your Whisper API key, it's time to make your first API call. Let's walk through a simple example of transcribing spoken language into text using Python.pythonCopy code:import openai # Replace 'your_api_key' with your actual Whisper API key openai.api_key = 'your_api_key' response = openai.Transcription.create( audio="<https://your-audio-url.com/sample-audio.wav>", model="whisper", language="en-US" ) print(response['text'])In this example, we set up the API key, specify the audio source URL, select the Whisper model, and define the language. The response['text'] contains the transcribed text from the audio.Use CasesLanguage DetectionOne of the remarkable features of OpenAI Whisper is its ability to detect the language being spoken.This capability is invaluable for applications that require language-specific processing, such as language translation or sentiment analysis.Whisper's language detection feature simplifies identifying the language spoken in audio recordings, making it a powerful tool for multilingual applications.TranscriptionTranscription is one of the most common use cases for Whisper. Whether you need to transcribe interviews, podcasts, or customer service calls, Whisper's accuracy and speed make it an ideal choice.Developers can integrate Whisper to automate transcription, saving time and resources.Supported LanguagesOpenAI Whisper supports many languages, making it suitable for global applications. Some supported languages include English, Spanish, French, German, Chinese, and others.Open AI supports all these languages as of now-Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.Best Practices While Using WhisperWhen working with Whisper for longer-format audio or complex tasks, following best practices is essential. For instance, you can break longer audio files into shorter segments for improved accuracy. Additionally, you can experiment with different settings and parameters to fine-tune the ASR system according to your specific requirements.Here's a simple example of how to break down more extended audio for transcription:pythonCopy codeimport openai # Replace 'your_api_key' with your actual Whisper API key openai.api_key = 'your_api_key' # Divide the longer audio into segments audio_segments = [ "<https://your-audio-url.com/segment1.wav>", "<https://your-audio-url.com/segment2.wav>", # Add more segments as needed ] # Transcribe each segment separately for segment in audio_segments: response = openai.Transcription.create( audio=segment, model="whisper", language="en-US" ) print(response['text'])These best practices and tips ensure you get the most accurate results when using OpenAI Whisper.ConclusionIn this blog, we've explored the incredible potential of OpenAI Whisper, an advanced ASR system that can transform how you interact with audio data. We've covered its use cases, how to access the Whisper API, make your first API call, and implement language detection and transcription. With its support for multiple languages and best practices for optimizing performance, Whisper is a valuable tool for developers and businesses looking to harness the power of automatic speech recognition.In our next blog post, we will delve even deeper into OpenAI Whisper, exploring its advanced features and the latest developments in ASR technology. Stay tuned for "Advances in OpenAI Whisper: Unlocking the Future of Speech Recognition."For now, start your journey with OpenAI Whisper by requesting access to the API and experimenting with its capabilities. The possibilities are endless, and the power of spoken language recognition is at your fingertips.Author BioVivekanandan, a seasoned Data Specialist with over a decade of expertise in Data Science and Big Data, excels in intricate projects spanning diverse domains. Proficient in cloud analytics and data warehouses, he holds degrees in Industrial Engineering, Big Data Analytics from IIM Bangalore, and Data Science from Eastern University.As a Certified SAFe Product Manager and Practitioner, Vivekanandan ranks in the top 1 percentile on Kaggle globally. Beyond corporate excellence, he shares his knowledge as a Data Science guest faculty and advisor for educational institutes.

0
0
33968

Merlyn Shelley

26 Mar 2024

14 min read

Transforming Web Data with Browse AI

Merlyn Shelley

26 Mar 2024

14 min read

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!Partnering with Browse AI Turn Web Data into Your Business Superpower!👉 Train a robot in 2 minutes, no coding needed. 🤖 👉 Ideal for web scraping and data monitoring. 🌐 Here’s what you get: Monitor Websites for Changes ✅ Download Data from Any Website ✅ Turn Any Website into an API ✅ Product data extraction ✅ Also, extract data from news, stocks, jobs, social media, and more. Check out this 1-minute explainer video on how to extract data to Excel, Airtable, and connect to 5,000+ apps using Zapier! Start for free with up to 50 credits, and for a limited time, enjoy free setup and onboarding for Team and Company plans, saving up to 20% on Annual plans. Get Scraping Today!👋 Hello,Welcome to DataPro#85 – Your one-stop shop for the latest in Data Science and ML Algorithms! 🚀 In this issue:⚙️ Keeping Up with LLMs & GPTs Meet Devin: The pioneering AI software engineer. Google's Croissant: A fresh take on metadata for ML-ready datasets. INSTRUCTIR by Kaist AI: Setting new standards in instruction-following for information retrieval models. Spyx by Sussex AI: Turbocharging spiking neural networks with just-in-time compiled optimization. SynCode by VMware: Enhancing LLM code generation with a touch of grammar. Chatbot Arena: The ultimate battleground for evaluating LLMs by human preference. Apollo: Bringing medical AI to the masses with a multilingual medical LLM. ✨ On the RadarTop AI tools for code generation in 2024. Setting up a Pypi mirror in AWS with Terraform. Ensuring safer code changes with custom pre-commit hooks. Deciphering the AQLM Quantization Algorithm. AI's role in revolutionizing web browsing. Tackling tensors through three tricky errors. Running RStudio inside a container. Harnessing PyTorch and MLX for Apple Silicon. 🏭 Industry Highlights Google Research: Boosting LLMs with Cappy, evolving tables with Chain-of-table, and Scalable Instructable Multiworld Agent (SIMA). AWS: Streamlining code review with generative AI using Amazon Bedrock. OpenAI Updates: Leadership continuity and global news partnerships. 📚 New in Packt Library Practical Guide to Applied Conformal Prediction in Python by Valery Manokhin. DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."We appreciate your input and hope you enjoy the book!Share your Feedback!Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🔰 GitHub Finds: Any of These Repos in Your Toolbox?🛠️ deepseek-ai/DeepSeek-VL: Open-source Vision-Language (VL) model for real-world tasks, handling logical diagrams, web pages, formulas, scientific literature, and more. 🛠️ OpenGVLab/VideoMamba: VideoMamba enhances 3D CNNs and video transformers, excelling in long-term video understanding with scalability and modality compatibility. 🛠️ showlab/DragAnything: DragAnything uses entity representation for motion control in video generation, offering user-friendly interaction and outperforming existing methods. 🛠️ pkunlp-icler/FastV: FastV accelerates large vision language models by pruning redundant visual tokens, achieving 45% FLOPs reduction without performance loss. 🛠️ cnulab/RealNet: RealNet introduces SDAS for anomaly strength control, AFS for feature selection, and RRS for anomaly region identification. Partnering with SurfsharkSurfshark is allowing our readers to enjoy a full 2 years of their award-winning VPN protection for 79% off, plus 2 months free. With Surfshark One, you get: Unlimited devices and connections ✅ One account for the entire household ✅ Your online activity, made safe, secure, and invisible ✅ Plus, identity protection, ad blocking, antivirus, and data breach monitoring.Claim your VPN protection today! 📚 Expert Insights from Packt CommunityPractical Guide to Applied Conformal Prediction in Python - By Valery Manokhin Basic components of a conformal predictor We will now look at the basic components of a conformal predictor: Nonconformity measure: The nonconformity measure is a function that evaluates how much a new data point differs from the existing data points. It compares the new observation to either the entire dataset (in the full transductive version of conformal prediction) or the calibration set (in the most popular variant – ICP. The selection of the nonconformity measure is based on a particular machine learning task, such as classification, regression, or time series forecasting, as well as the underlying model. This will examine several nonconformity measures suitable for classification and regression tasks. Calibration set: The calibration set is a portion of the dataset used to calculate nonconformity scores for the known data points. These scores are a reference for establishing prediction intervals or regions for new test data points. The calibration set should be a representative sample of the entire data distribution and is typically randomly selected. The calibration set should contain a sufficient number of data points (at least 500). If the dataset is small and insufficient to reserve enough data for the calibration set, the user should consider other variants of conformal prediction – including TCP (see, for example, Mastering Classical Transductive Conformal Prediction in Action – https://medium.com/@valeman/how-to-use-full-transductive-conformal-prediction-7ed54dc6b72b). Test set: The test set contains new data points for generating predictions. For every data point in the test set, the conformal prediction model calculates a nonconformity score using the nonconformity measure and compares it to the scores from the calibration set. Using this comparison, the conformal predictor generates a prediction region that includes the target value with a user-defined confidence level. All these components work in tandem to create a conformal prediction framework that facilitates valid and efficient uncertainty quantification in a wide range of machine learning tasks. Discover more insights from 'Practical Guide to Applied Conformal Prediction in Python' by Valery Manokhin. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here!⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ML Made Easy 🌀 Enhance code review and approval efficiency with generative AI using Amazon Bedrock: This post discusses the challenges faced by managers in overseeing code review and approval processes in software development, such as lack of technical expertise, time constraints, volume of change requests, manual effort, and the need for documentation. It also introduces a solution that leverages generative artificial intelligence and integrates it with AWS deployment tools to streamline the review and approval process. The solution includes automated change analysis, summarization, and an approval workflow. Google Research 🌀 Cappy: Outperforming and boosting large multi-task language models with a small scorer. This blog discusses advancements in large language models (LLMs) and their use in natural language processing (NLP). It introduces the concept of multi-task LLMs, such as T0, FLAN, and OPT-IML, which excel at understanding and solving various tasks. It also presents a new approach called Cappy, a lightweight pre-trained scorer that enhances the performance and efficiency of multi-task LLMs. 🌀 Chain-of-table: Evolving tables in the reasoning chain for table understanding. This research focuses on improving how large language models (LLMs) reason over tabular data, which is challenging due to the structured nature of tables. The proposed framework, Chain-of-Table, trains LLMs to iteratively update tables, mimicking human reasoning, resulting in improved performance on table understanding tasks. 🌀 Talk like a graph: Encoding graphs for large language models. This research explores how to teach large language models (LLMs) to reason with graph information, crucial for understanding interconnected data. They introduce GraphQA, a benchmark to evaluate LLMs on graph problems, revealing insights into effective graph encoding methods and improving LLM performance on graph tasks by up to 60%. 🌀 Scalable Instructable Multiworld Agent (SIMA): A generalist AI agent for 3D virtual environments. Google DeepMind has developed SIMA, a versatile AI agent trained on multiple video games to follow natural-language instructions, akin to human behavior. Collaborating with game studios, SIMA navigates various environments, showcasing potential for AI to understand and execute diverse tasks. OpenAI Updates 🌀 Review completed & Altman, Brockman to continue to lead OpenAI: The OpenAI Board completed a review by WilmerHale, expressing full confidence in Sam Altman and Greg Brockman's leadership. They also elected new board members and adopted governance enhancements. WilmerHale's review found a breakdown in trust between the prior Board and Mr. Altman, leading to his removal, but concluded that his conduct did not mandate removal. Following the review, the Board endorsed the decision to rehire Mr. Altman and Mr. Brockman. 🌀 Global news partnerships: Le Monde and Prisa Media: OpenAI has partnered with Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT. This partnership aims to enhance user interaction with news content and contribute to the training of OpenAI's models. Through these partnerships, users will access summaries and links to original articles, expanding their news consumption experience. This collaboration supports the news industry and its role in providing reliable information globally. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🌀 Introducing Devin, the first AI software engineer: Meet Devin, the autonomous AI software engineer, skilled in long-term reasoning and planning. Devin can learn new technologies, build and deploy apps, find and fix bugs, train AI models, and contribute to open source. Devin excels in resolving real-world GitHub issues, outperforming previous models. Cognition, the AI lab behind Devin, aims to unlock new possibilities beyond coding. 🌀 Google’s Croissant: a metadata format for ML-ready datasets. Croissant is a new metadata format for ML datasets, aiming to simplify the use of existing datasets for training ML models. It standardizes dataset descriptions and organization, supporting responsible AI practices. Croissant builds upon schema.org and is supported by major tools and repositories like Kaggle, Hugging Face, and OpenML. It includes a specification, example datasets, a Python library, and a visual editor to facilitate dataset usage and publication. 🌀 Kaist AI’s INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models. This research focuses on enhancing search accuracy by improving retrievers to understand users' intentions, similar to language models. It introduces INSTRUCTIR, a benchmark for evaluating retrievers' ability to follow user-aligned instructions in retrieval tasks. The study addresses limitations in existing benchmarks and highlights potential overfitting issues in instruction-aware retrieval datasets. 🌀 Sussex AI’s Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks. Advancements in large neural architectures have led to powerful AI accelerators for training deep neural networks. However, these networks often incur high costs. Neuromorphic computing with Spiking Neural Networks (SNNs) offers energy-efficient alternatives, but training SNNs is challenging. Spyx, a new lightweight SNN simulation and optimization library designed in JAX, aims to facilitate SNN architecture investigation by bridging Python-based deep learning frameworks with custom compute kernels, achieving optimal hardware utilization. 🌀 VMware’s SynCode: Improving LLM Code Generation with Grammar Augmentation. SynCode is a novel framework for efficient syntactical decoding of code with large language models (LLMs). It leverages grammar of a programming language using an offline-constructed efficient lookup table called Deterministic Finite Automaton (DFA) mask store. SynCode seamlessly integrates with any context-free grammar (CFG) defined language, reducing syntax errors by 96.07% when combined with LLMs. 🌀 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Chatbot Arena is an open platform designed to evaluate Large Language Models (LLMs) by considering human preferences. Utilizing a pairwise comparison method and crowdsourced input, it assesses LLMs' alignment with user preferences. The platform, operational for months with over 240K votes, provides a credible and valuable resource for ranking LLMs. Check out the tool here. 🌀 Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People. The project aims to develop medical Large Language Models (LLMs) in the six most spoken languages, benefiting 6.1 billion people. This includes creating the ApolloCorpora multilingual medical dataset and the XMedBench benchmark, with Apollo models achieving top performance among models of similar sizes. The project will open-source training data, code, model weights, and evaluation benchmarks. You can check for the demo here. ✨ On the Radar: Catch Up on What's Fresh🌀 Top Artificial Intelligence (AI) Tools That Can Generate Code To Help Programmers (2024): The article discusses how AI is changing programming, with tools like OpenAI Codex and GitHub Copilot generating code. It explores AI's impact on code quality and development speed, showcasing various AI-powered tools like Tabnine, CodeT5, and Polycoder. Additionally, it mentions AI tools for code review, static code analysis, and AI-assisted coding in IDEs like PyCharm and Visual Studio. 🌀 Pypi mirror in a private AWS environment Terraform: This article explains how to install Python packages in an AWS Sagemaker Studio environment without internet access. It covers setting up Sagemaker in VPC Only mode, using VPC Endpoint interfaces for network communications, and accessing the Pypi package repository through AWS Codeartifact, which allows defining Pypi as an upstream repository. 🌀 Custom pre-commit hooks for safer code changes: This blog post explains the importance of using pre-commit hooks in software development, particularly with the git version control system. It discusses the challenges of maintaining coding standards in collaborative projects and provides a step-by-step tutorial on how to set up and use custom pre-commit hooks for a Python project, using the example of validating dataflow definitions for the Hamilton library. 🌀 AQLM Quantization Algorithm, explained: A new quantization algorithm, AQLM (Additive Quantization of Language Models), was recently released and integrated into HuggingFace Transformers and HuggingFace PEFT. AQLM sets a new state-of-the-art for 2-bit quantization while providing improvements for 3-bit and 4-bit ranges, pushing the boundaries of model accuracy and memory footprint. 🌀 Revolutionize Web Browsing with AI: This article explores creating an AI agent using the gpt-4-vision-preview model from OpenAI, enabling it to navigate the web like a human. It discusses the agent's browser control, content browsing, and decision-making processes, showcasing potential use cases such as aiding visually challenged users and automating web browsing tasks. 🌀 Understanding Tensors: Learning a Data Structure Through 3 Pesky Errors. This article discusses transitioning from managing tabular data to working with tensors in TensorFlow, offering debugging tips and code recipes. It covers visualizing TensorFlow datasets, understanding tensor specs, and augmenting model summaries, while addressing common errors related to tensor rank and shape. 🌀 Running RStudio Inside a Container: This tutorial focuses on setting up RStudio using Docker, particularly leveraging the Rocker RStudio image. It covers pulling the image, launching RStudio in a container, and ensuring persistence of data by using volume mapping. The tutorial provides step-by-step instructions and explanations for each stage. 🌀 PyTorch and MLX for Apple Silicon: The blog discusses Apple's MLX framework, which is optimized for Apple Silicon and serves as a bridge between PyTorch, NumPy, and Jax. It details a comparison between MLX and PyTorch through a custom convolutional neural network implementation for image classification tasks. The discussion includes insights into MLX's features, such as its array class, lazy computation, and compilation for performance optimization. The post also highlights the ease of converting PyTorch code to MLX, despite some differences in API compatibility and coding conventions. See you next time!Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. We only recommend products and services we think will benefit our readers. Thanks for your support!

0
0
33695

article-image-fabrics-code-first-automl-and-hyperparameter-tuning-google-cloud-cortex-framework-snowflakes-data-metric-functions-qliks-ai-accelerator

Merlyn Shelley

29 Apr 2024

12 min read

Fabric’s Code-First AutoML and Hyperparameter Tuning, Google Cloud Cortex Framework, Snowflake’s Data Metric Functions, Qlik's AI Accelerator

Merlyn Shelley

29 Apr 2024

12 min read

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,Welcome to BI-Pro #54: Your Premier Destination for Data and Business Intelligence Insights! 🌟 In this edition, we dive deep into the cutting-edge solutions of business intelligence, data modeling, and advanced analytics. Prepare to explore an array of transformative topics and industry insights that will redefine how you interact with technology and data. 🧩 Highlights of This Issue: Python Practice Platforms: The top 7 platforms where you can sharpen your Python skills. Innovative Experiments: Dive into hands-on experiments with MLFlow and Microsoft Fabric to enhance your project’s efficiency. SAP Expertise: Master the complex data models of SAP and leverage them for optimal performance. AI-Powered Business Management: Learn how to integrate AI to streamline and enhance business management functions. Snowflake’s Surveillance: Monitor your data pipelines effectively using Snowflake’s Data Metric Functions. 🧬 Stay Informed with Industry Highlights: Power BI: Learn about the significant deprecation of AutoML in Power BI using Dataflows V1. Microsoft Fabric: Get the scoop on the new code-first AutoML and hyperparameter tuning, now available in public preview. AWS BI: Discover how to build SAP Golden AMIs with EC2 Image Builder and Ansible and explore the transformative impact of Amazon Q on business experiences. Google Cloud Data: Catch up with the latest updates from the Google Cloud Cortex Framework. Tableau: Uncover how Einstein Copilot for Tableau is building the next generation of AI-driven analytics. From the Experts at Packt Community: Gain insights from industry leaders on the fundamentals of Analytics Engineering. 🧮 What’s the Latest from the BI Community? Explore real-time AI capabilities with Datorios’ new observability tool. Learn about Snowflake's launch of Arctic, an enterprise-grade LLM. Discover how Qlik's AI Accelerator is integrating generative AI to deliver customer outcomes. Witness the future of AI with Avant Technologies’ new supercomputing advancements. Join us as we unpack these topics to keep you at the forefront of the data and BI world. Stay curious, stay informed! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktPackt BI-Pro is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.Upgrade to paidSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos🧩 pixiedust/pixiedust: PixieDust is an open-source library enhancing Jupyter notebooks, improving data work experience, particularly for cloud-hosted notebooks without configuration access. 🧩 plotly/plotly.py: plotly.py is an interactive, open-source graphing library for Python, offering over 30 chart types, including scientific, 3D, statistical, and financial charts. 🧩 AykutSarac/jsoncrack.com: JSON Crack is a free, open-source data visualization app for JSON, YAML, XML, CSV, etc., offering interactive graphs for easy data exploration and analysis. 🧩 apexcharts/apexcharts.js: ApexCharts is a JavaScript charting library with a simple API, 100+ samples, and over a dozen chart types for beautiful, responsive visualizations in apps and dashboards. 🧩 antvis/G2: G2 is a visualization library inspired by "The Grammar of Graphics," offering an introduction, examples, tutorials, and API reference for learning and using its core concepts. 🧩 visgl/deck.gl: deck.gl simplifies high-performance, WebGL2/WebGPU-based visualization of large datasets. It offers pre-built layers for easy setup or customizable architecture for tailored needs. Email Forwarded? Join BI-Pro Here!🔮 Revolutionizing Analytics: New BI Tools🧬 7 Best Platforms to Practice Python: The article lists seven platforms—Practice Python, Edabit, CodeWars, Exercism, PYnative, LeetCode, and HackerRank—that offer various levels of programming challenges for learning and practicing Python, particularly for coding interviews and skill improvement. 🧬 Experimenting with MLFlow and Microsoft Fabric: The blog discusses the importance of systematic experimentation in machine learning (ML) to improve model performance, highlighting the use of MLFlow within Fabric for managing ML experiments. It covers setting up experiments, running them, logging results, and analyzing them, emphasizing the importance of tracking configurations and outcomes for iterative improvement in ML models. 🧬 Mastering SAP’s data models: The article discusses challenges faced in understanding SAP data models for analytics, focusing on integrating procurement data. It explains SAP's ERP software, data architecture basics, table types (master vs. transaction), and data mapping for procurement tables. 🧬 Building an AI-Powered Business Manager: The post explores the concept of consolidating business management into a single, chat-based platform powered by Large Language Models (LLMs). It discusses the advantages for small businesses, outlines project structure, sets up the database, and updates the Tool class to handle SQLModel instances. 🧬 Monitor Data Pipelines Using Snowflake’s Data Metric Functions: The author emphasizes the importance of data quality in gaining trust with stakeholders and focuses on using Google's Site Reliability Engineering principles to measure the health of data systems. It discusses defining service level indicators and objectives for data quality dimensions and provides a technical implementation example in Snowflake. ⚡Stay Informed with Industry HighlightsPower BI🧮 Deprecation of AutoML in Power BI using Dataflows V1: The update announces the deprecation of Power BI Automated Machine Learning (AutoML) models for Dataflows V1 in all regions as of the third week of April. Customers are encouraged to migrate to the AutoML solution based on Synapse Data Science in Microsoft Fabric, offering a more customizable AutoML experience with advanced tools and features. Microsoft Fabric🧮 Introducing Code-First AutoML and Hyperparameter Tuning: Now in Public Preview for Fabric Data Science: The update introduces code-first automated machine learning (AutoML) and hyperparameter tuning in Public Preview for Fabric Data Science. Users can access both AutoML and Tune capabilities seamlessly within the Fabric 1.2 runtime, enhancing machine learning model optimization and accessibility. 🧮 Fabric Change the Game: Embracing Azure Cosmos DB for NoSQL. The post explores setting up Azure Cosmos DB for NoSQL and leveraging Vector Search capabilities of AI Search Services through Microsoft Fabric's Lakehouse features. It also discusses integrating Cosmos DB Mirror and using Python coding facilitated through Lakehouse, highlighting Fabric's integration capabilities for search or data mirroring. 🧮 Microsoft Fabric April 2024 Update: The April 2024 update brings various enhancements and previews to Microsoft Fabric, including new visuals like the 100% Stacked Area Chart, improvements to reporting, data connectivity, administration features, analytics, real-time analytics, data factory, and data pipelines. Additionally, the update includes the availability of Exam DP-600 for Fabric Analytics Engineer certification and free learning sessions. AWS BI 🧮 Build SAP Golden AMIs with EC2 Image Builder and Ansible: This blog post guides users on building a reusable Amazon Machine Image (AMI) for deploying Amazon Elastic Compute Cloud (EC2) instances for SAP installations. It covers using Terraform and Ansible to automate the process and provides sample code. 🧮 Transforming Business Experiences: The Impact of Amazon Q and Generative BI for AWS Partners. This post highlights how advances in AI, particularly Amazon Q and generative BI, are transforming business operations. It showcases how AWS partners like ZS Associates, Tiger Analytics, and Compass UOL are leveraging these innovations for industry-specific solutions. Google Cloud Data 🧮 What’s new with Google Cloud Cortex Framework? The article discusses Google Cloud Cortex Framework, emphasizing its role in unifying enterprise data for AI-driven insights. It highlights new solutions for marketing, sustainability management, and finance, showcasing how Cortex Framework accelerates innovation, enhances decision-making, and drives business efficiency in the AI era. Tableau🧮 Einstein Copilot for Tableau: Building the Next Generation of AI-Driven Analytics. The post delves into the development of Einstein Copilot for Tableau, an AI-driven tool revolutionizing data analysis. It highlights the challenges and solutions in building its infrastructure, improving accuracy and efficiency, and enhancing AI and core capabilities through collaboration and continuous improvement. ✨ Expert Insights from Packt CommunityFundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more The role of dbt in analytics engineering dbt emerged as a solution to the challenges relating to data transformation faced in data analysis. Initially crafted as an open-source Python package, dbt aimed to bring software engineering best practices to the world of analytics. Over time, dbt matured beyond just a package, becoming a versatile cloud service. While the open-source package remains available and actively supported, dbt now offers a cloud-based version, packed with features such as an integrated development environment (IDE), scheduling tools, data lineage trackers, and hosted documentation. This is especially valuable for analysts who might not have a deep software engineering background. For more information on dbt’s history, read https://www.getdbt.com/blog/what-exactly-is-dbt. We will use dbt Cloud, which offers a free tier for a single developer: that’s you! You can learn more about its pricing here: https://www.getdbt.com/pricing. dbt seamlessly integrates into the ELT architecture. It does not store or process data but serves as a bridge between analysts and the data warehouse. dbt’s position in a data stack as an intermediary in the transformation layer. This is how it works: analysts draft SQL queries, enhanced with dbt’s unique capabilities. dbt then translates this specialized SQL into the native SQL of the data warehouse and dispatches it for execution. All the transformed data and results remain within the data warehouse, making dbt a lightweight yet powerful tool in the analytics toolkit. Because of dbt’s pivotal position in analytics engineering, we will spend more time discussing its features and zooming in on best practices. First, we will set up dbt for our use case. Setting up dbt Cloud The following steps are required for dbt: Creating a dbt Cloud account. Setting up a connection from dbt Cloud to BigQuery. Testing the connection by querying the data using dbt Cloud. Follow the step-by-step instructions here: https://github.com/PacktPublishing/Fundamentals-of-Analytics-Engineering/blob/main/chapter_8/guides/setting_up_dbt_cloud.md. Now, let’s focus on the various data layers in dbt. Data layers in dbt It is a widespread practice to separate the data we use for analytics into layers. This helps data practitioners communicate the distinct parts of the data transformation process. Broadly speaking, the process will fall into three layers in dbt, Raw, Preparation and Business. Let’s take a closer look: Raw layer: The source data is stored in the form it arrives in. Whenever you receive data, it should be stored as-is so that you have a backup in case something goes wrong during the transformations. When you copied the Excel sheets using Airbyte, they became part of the raw layer inside BigQuery. Preparation layer: In the second layer, the raw data is cleaned, deduplicated, and transformed to conform to naming conventions and other rules. For our data, this could mean renaming fields for readability and standardizing sales figures from cents to euros. Business layer: In the final layer, business rules are applied to the prepared data, and different data is joined and modeled into datasets that are ready for consumption by BI tools and stakeholders. In our case, we might add a business rule to disregard negative sales amounts when summing the total stroopwafels sold, as these are likely an error. The resulting data can then be served to the BI tool for dashboarding. Discover more insights from Fundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here💡 What's the Latest Scoop from the BI Community? 🧠 Datorios unleashes real-time AI with the first observability tool for streaming data: Datorios introduces the first observability tool for Apache Flink, offering deep insights into streaming data processing. It enables faster AI innovation and thorough auditability, providing developers with event visualization, event search, state monitoring, window analysis, and more. Datorios is now publicly available for free. 🧠 Snowflake Launches Arctic: The Most Open, Enterprise-Grade Large Language Model: Snowflake introduces Snowflake Arctic, an open, enterprise-grade large language model (LLM) with a Mixture-of-Experts architecture, optimized for complex enterprise workloads. Arctic sets new openness standards for AI technology, offering weights under an Apache 2.0 license and enhancing AI innovation. 🧠 Introducing Qlik's AI Accelerator - Delivering Tangible Customer Outcomes in Generative AI Integration: Qlik is at the forefront of integrating generative AI, particularly Large Language Models (LLMs), into data analysis and decision-making. They address key challenges like data privacy, technical complexity, and cost, offering seamless integration of popular LLMs and an AI Accelerator program to quickly prove the benefits of AI integration with minimal barriers to entry. 🧠 Avant Technologies Launches Advanced AI Supercomputing: Avant Technologies, an AI company, introduces a supercomputing network and licensable dataset with Wired4Tech, aiming to accelerate AI adoption. The offerings include a versatile AI dataset, dynamic resource scaling, accelerated AI processing, robust security measures, and seamless integration, designed to empower developers and drive innovation across industries. See you next time!

0
0
31717

article-image-elevate-your-bi-dashboards-with-figma

Merlyn Shelley

28 Mar 2024

12 min read

Elevate Your BI Dashboards with Figma

Merlyn Shelley

28 Mar 2024

12 min read

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!Partnering with Figma Want to take your BI dashboards to the next level? Figma is the way to go! It's all about ramping up the design, making things work better, and giving your Power BI projects a real boost. With Figma, you'll speed up your projects, get more creative, and see better performance. So, why not give your reports a makeover with Figma? It's where design and data come together to make a big impact! Here's what Figma offers: ✅ Figma Professional: An all-in-one tool for seamless team collaboration. ✅ FigJam: Enables real-time teamwork and brainstorming. ✅ FigJam AI: Integrates ChatGPT for smarter collaboration. Guess what? You also have the Power BI UI Kit from the Figma Community! Sign Up Now! 👋 Hello,Welcome to BI-Pro #48, your ultimate guide to data and BI insights! 🚀In this issue: 🔮 Python Data Viz Matplotlib Data Visualization Seaborn: Visualizing Data in Python Use pandas for CSV Data Visualization Guides on SQL, Python, Data Cleaning, and Analysis Build An AI App with Python in 10 Steps ⚡ Industry Highlights Power BI Hybrid Workforce Experience Report Lakeview Dashboards Overview Grouping and Binning in Power BI Desktop Dashboards in Operations Manager Microsoft Fabric Analyze Dataverse Tables Bridging Fabric Lakehouses AWS Big Data Multicloud Analytics with Amazon Athena Analyze Fastly CDN Logs with QuickSight Google Cloud Data Spark Procedures in BigQuery Gemini Pro 1.0 in BigQuery via Vertex AI ✨ Expert Insights from Packt Community Unlocking the Secrets of Prompt Engineering 💡 BI Community Scoop Creating Interactive Power BI Dashboards Using Report Templates in Power BI Desktop 10 Analytics Dashboard Examples for SaaS Future of Data Storytelling: Actionable Intelligence Power BI: Transforming Banking Data Power BI vs Tableau vs Qlik Sense | 2024 Winner Get ready to supercharge your skills with BI-Pro! 🌟 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos🌀 sdv-dev/SDV: The Synthetic Data Vault (SDV) is a Python library that creates tabular synthetic data by learning patterns from real data using machine learning algorithms. 🌀 hyperspy/hyperspy: HyperSpy is a Python library for analyzing multidimensional datasets, making it easy to apply analytical procedures and access tools. 🌀 hi-primus/optimus: Optimus is a Python library for loading, processing, plotting, and creating ML models that works with pandas, Dask, cuDF, dask-cuDF, Vaex, or Spark. It simplifies data processing and offers various functions for data quality, plotting, and cross-platform compatibility. 🌀 mingrammer/diagrams: Diagrams simplifies cloud system architecture design in Python, supporting major providers and tracking changes in version control. 🌀 kayak/pypika: PyPika simplifies building SQL queries in Python with a flexible, easy-to-use interface, leveraging the builder design pattern for clean, efficient queries. Email Forwarded? Join BI-Pro Here!Partnering with Webflow Transform your BI reporting with Webflow Enterprise. Create visually stunning, scalable websites without coding, using a visual canvas.Seamlessly integrate with popular BI platforms and let Webflow handle the code.Start building smarter, faster, and more reliable websites for your data-driven decisions today! Get Started for Free! 🔮 Data Viz with Python Libraries 🌀 Matplotlib Data Visualization in Python: This blog introduces Matplotlib, a Python library for 2D visualizations, covering its capabilities and plot types like line, scatter, bar, histograms, and pie charts. It highlights Matplotlib's versatility, customization, and integration with other libraries, making it essential for data science and research. 🌀 Visualizing Data in Python With Seaborn: This article introduces the seaborn library for statistical visualizations in Python. It covers creating various plots, such as bar, distribution, and relational plots, using seaborn's functional and objects interfaces. It emphasizes seaborn's clear and concise code for effective data visualization. 🌀 Use pandas to Visualize CSV Data in Python: This blog discusses using the CData Python Connector for CSV with pandas, Matplotlib, and SQLAlchemy to analyze and visualize live CSV data in Python. It highlights the ease of integration and superior performance of the connector, along with step-by-step instructions for connecting to CSV data, executing SQL queries, and visualizing the results in Python. 🌀 Collection of Guides on Mastering SQL, Python, Data Cleaning, Data Wrangling, and Exploratory Data Analysis: This guide is tailored for business intelligence professionals new to data science, offering step-by-step instructions on mastering SQL, Python, data cleaning, wrangling, and exploratory analysis. It emphasizes practical skills for extracting insights and showcases essential tools and techniques for effective data analysis. 🌀 Build An AI Application with Python in 10 Easy Steps: This blog outlines a 10-step guide to building and deploying AI applications with Python, covering objectives, data collection, model selection, training, evaluation, optimization, web app development, cloud deployment, and sharing the AI model, with practical advice for each step. ⚡Stay Informed with Industry HighlightsPower BI 🌀 Hybrid Workforce Experience Power BI report: This tutorial explains using the Power BI Hybrid Workforce Experience report to analyze the impact of hybrid work models on employees working onsite, remotely, or in a hybrid manner. It covers setup, key metrics analysis, and improving employee experience, with prerequisites outlined. 🌀 What are Lakeview dashboards? This article discusses Lakeview dashboards, designed for creating and sharing data visualizations within teams. It highlights their advanced features, comparison with Databricks SQL dashboards, and dataset optimizations for better performance, including handling various dataset sizes and query efficiency. 🌀 Use grouping and binning in Power BI Desktop: This article explains how to use grouping and binning in Power BI Desktop to refine data visualization. Grouping allows you to combine data points into larger categories for clearer analysis, while binning lets you define the size of data chunks for more meaningful visualization. The article provides step-by-step instructions for creating, editing, and applying groups and bins to numerical and time fields, enhancing the exploration of data and trends in visuals. 🌀 Dashboards in Operations Manager: This article covers dashboard templates and widgets in Operations Manager, outlining their layouts and functions. It highlights various dashboard types, such as Service Level, Summary, and Object State, each with specific widgets. Users can create, share, and view dashboards across different consoles. Microsoft Fabric🌀 Analyze Dataverse tables from Microsoft Fabric: The article announces new features for Dynamics 365 and Power Apps customers, allowing easy integration of insights into Fabric. Users can now create shortcuts to Dataverse environments in Fabric for quick data access and analysis across multiple environments, enhancing business insights. 🌀 Bridging Fabric Lakehouses: Delta Change Data Feed for Seamless ETL. This article explains using Delta Tables and the Delta Change Data Feed in Microsoft Fabric for efficient data synchronization across lakehouses. It highlights Delta Tables' features and demonstrates updating tables across Silver and Gold Lakehouses in a medallion architecture. AWS BI 🌀 Multicloud data lake analytics with Amazon Athena: This post discusses creating a unified query interface using Amazon Athena connectors to seamlessly query across multiple cloud data stores, simplifying analytics in organizations with data spread over different clouds. It also explores managing analytics costs using Athena workgroups and cost allocation tags. 🌀 How to Analyze Fastly Content Delivery Network Logs with Amazon QuickSight Powered by Generative BI? This post discusses using Fastly, a content delivery network (CDN), to enhance web performance and security. It highlights creating a dashboard with Amazon QuickSight for analyzing CDN logs, using AWS services like S3 and Glue for data storage and cataloging. Google Cloud Data 🌀 Apache Spark stored procedures in BigQuery are GA: BigQuery now supports Apache Spark stored procedures, enabling users to integrate Spark-based data processing with BigQuery's SQL capabilities. This simplifies using Spark within BigQuery, allowing seamless development, testing, and deployment of PySpark code, and installation of necessary packages in a unified environment. 🌀 Gemini Pro 1.0 available in BigQuery through Vertex AI: This post advocates for a unified platform to bridge data and AI teams, ensuring smooth workflows from data ingestion to ML training. It introduces BigQuery ML, enabling ML model creation, training, and execution in BigQuery using SQL. It supports various models, including Vertex AI-trained ones like PaLM 2 and Gemini Pro 1.0, and enables sharing trained models, promoting governed data usage and easy dataset discovery. Gemini Pro 1.0 integration into BigQuery via Vertex AI simplifies generative AI, enhancing collaboration, security, and governance in data workflows. ✨ Expert Insights from Packt CommunityUnlocking the Secrets of Prompt Engineering - By Gilbert Mizrahi Exploring LLM parameters LLMs such as OpenAI’s GPT-4 consist of several parameters that can be adjusted to control and fine-tune their behavior and performance. Understanding and manipulating these parameters can help users obtain more accurate, relevant, and contextually appropriate outputs. Some of the most important LLM parameters to consider are listed here: Model size: The size of an LLM typically refers to the number of neurons or parameters it has. Larger models can be more powerful and capable of generating more accurate and coherent responses. However, they might also require more computational resources and processing time. Users may need to balance the trade-off between model size and computational efficiency, depending on their specific requirements. Temperature: The temperature parameter controls the randomness of the output generated by the LLM. A higher temperature value (for example, 0.8) produces more diverse and creative responses, while a lower value (for example, 0.2) results in more focused and deterministic outputs. Adjusting the temperature can help users fine-tune the balance between creativity and consistency in the model’s responses. Top-k: The top-k parameter is another way to control the randomness and diversity of the LLM’s output. This parameter limits the model to consider only the top “k” most probable tokens for each step in generating the response. For example, if top-k is set to 5, the model will choose the next token from the five most likely options. By adjusting the top-k value, users can manage the trade-off between response diversity and coherence. A smaller top-k value generally results in more focused and deterministic outputs, while a larger top-k value allows for more diverse and creative responses. Max tokens: The max tokens parameter sets the maximum number of tokens (words or subwords) allowed in the generated output. By adjusting this parameter, users can control the length of the response provided by the LLM. Setting a lower max tokens value can help ensure concise answers, while a higher value allows for more detailed and elaborate responses. Prompt length: While not a direct parameter of the LLM, the length of the input prompt can influence the model’s performance. A longer, more detailed prompt can provide the LLM with more context and guidance, resulting in more accurate and relevant responses. However, users should be aware that very long prompts can consume a significant portion of the token limit, potentially truncating the model’s output. Discover more insights from 'Unlocking the Secrets of Prompt Engineering' by Gilbert Mizrahi. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here💡 What's the Latest Scoop from the BI Community? 🌀 Creating Interactive Power BI Dashboards That Engage Your Audience: This blog discusses the challenges faced by stakeholders and clients unfamiliar with using dashboards, preferring traditional tools like Excel. It emphasizes the importance of creating user-friendly and interactive dashboards to bridge this gap, offering techniques to enhance engagement and accessibility.🌀 Create and use report templates in Power BI Desktop: This tutorial explains how to create and use report templates in Power BI Desktop, enabling users to streamline report creation and standardize layouts, data models, and queries. Templates, saved with the .PBIT extension, help jump-start and share report creation processes across an organization. 🌀 10 Analytics Dashboard Examples to Gain Data Insights for SaaS: This article discusses the importance of analytics dashboards in simplifying the tracking of SaaS metrics and extracting insights. It provides 10 examples of analytics dashboards, including web, digital marketing, and user behavior, and highlights the top 5 analytics tools. The article emphasizes the need for clear, customizable, and intuitive dashboards for effective decision-making. 🌀 The Future of Data Storytelling: Actionable Intelligence [AI, Power BI, and Office]: This blog post discusses Zebra BI's solutions for reporting, planning, and presenting, emphasizing the importance of clarity, consistency, and actionability in data visualization. It introduces the concept of a reporting-planning-presenting cycle and highlights upcoming features and innovations, including the integration of AI. The post also mentions Zebra BI's adherence to the IBCS standard for clear and consistent business communication. 🌀 Power BI: Transforming Banking Data. This blog post discusses how Power BI can help banks analyze complex data for better decision-making. It covers challenges in banking, how Power BI integrates data sources, develops dashboards, and optimizes analytics. Benefits include improved operations, customer experience, risk management, and cost savings. 🌀 Power BI vs Tableau vs Qlik Sense | Which Wins In 2024? This blog compares Power BI, Tableau, and Qlik Sense for business intelligence (BI) and analytics. It highlights Power BI's advantages in data management, Tableau's strong visualization capabilities, and Qlik Sense's modern self-service platform. The article concludes with a comparison of features and recommendations for different needs. See you next time!Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. We only recommend products and services we think will benefit our readers. Thanks for your support!

0
0
30729

article-image-enhancing-image-search-with-vector-similarity

Bahaaldine Azarmi, Jeff Vestal

12 Mar 2024

12 min read

Enhancing Image Search with Vector Similarity

Bahaaldine Azarmi, Jeff Vestal

12 Mar 2024

12 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, Vector Search for Practitioners with Elastic, by Bahaaldine Azarmi and Jeff Vestal. Optimize your search capabilities in Elastic by operationalizing and fine-tuning vector search and enhance your search relevance while improving overall search performanceIntroductionVector similarity search plays a crucial role in image search. After images are transformed into vectors, a search query (also represented as a vector) is compared against the database of image vectors to find the most similar matches. This process is known as k-Nearest Neighbor (kNN) search, where “k” represents the number of similar items to retrieve.Several algorithms can be used for kNN search, including brute-force search and more efficient methods such as the Hierarchical Navigable Small World (HNSW) algorithm (see Chapter 7, Next Generation of Observability Powered, by Vectors for a more in-depth discussion on HNSW). Bruteforce search involves comparing the query vector with every vector in the database, which can be computationally expensive for large databases. On the other hand, HNSW is an optimized algorithm that can quickly find the nearest neighbors in a large-scale database, making it particularly useful for vector similarity search in image search systems.The tangible benefits of image search are observed across industries. Its flexibility and adaptability make it a tool of choice for enhancing user experiences, ensuring digital security, or even revolutionizing digital content interactions.Image search in practiceApplications of image search are varied and far-reaching. In e-commerce, for example, reverse image search allows customers to upload a photo of a product and find similar items for sale. In the field of digital forensics, image search can be used to find visually similar images across a database to detect illicit content. It is also used in the realm of social media for face recognition, image tagging, and content recommendation.As we continue to generate and share more visual content, the need for effective and efficient image search technology will only grow. The combination of artificial intelligence, machine learning, and vector similarity search provides a powerful toolkit to meet this demand, powering a new generation of image search capabilities that can analyze and understand visual content.Traditionally, image search engines use text-based metadata associated with images, such as the image’s filename, alt text, and surrounding text context, to understand the content of an image. This approach, however, is limited by the accuracy and completeness of the metadata, and it fails to analyze the actual visual content of the image itself.Over time, with advancements in artificial intelligence and machine learning, more sophisticated methods of image search have been developed that can analyze the visual content of images directly. This technique, known as content-based image retrieval (CBIR), involves extracting feature vectors from images and using these vectors to find visually similar images.Feature vectors are a numerical representation of an image’s visual content. They are generated by applying a feature extraction algorithm to the image. The specifics of the feature extraction process can vary, but in general, it involves analyzing the image’s colors, textures, and shapes. In recent years, CNNs have become a popular tool for feature extraction due to their ability to capture complex patterns in image data.Once feature vectors have been extracted from a set of images, these vectors can be indexed in a database. When a new query image is submitted, its feature vector is compared to the indexed vectors, and the images with the most similar vectors are returned as the search results. The similarity between vectors is typically measured using distance metrics such as Euclidean distance or cosine similarity.Despite the impressive capabilities of CBIR systems, there are several challenges in implementing them. For instance, interpreting and understanding the semantic meaning of images is a complex task due to the subjective nature of visual perception. Furthermore, the high dimensionality of image data can make the search process computationally expensive, particularly for large databases.To address these challenges, approximate nearest neighbor (ANN) search algorithms, such as the HNSW graph, are often used to optimize the search process. These algorithms sacrifice a small amount of accuracy for a significant increase in search speed, making them a practical choice for large-scale image search applications.With the advent of Elasticsearch’s dense vector field type, it is now possible to index and search highdimensional vectors directly within an Elasticsearch cluster. This functionality, combined with an appropriate feature extraction model, provides a powerful toolset for building efficient and scalable image search systems.In the following sections, we will delve into the details of image feature extraction, vector indexing, and search techniques. We will also demonstrate how to implement an image search system using Elasticsearch and a pre-trained CNN model for feature extraction. The overarching goal is to provide a comprehensive guide for building and optimizing image search systems using state-of-the-art technology.Vector search with imagesVector search is a transformative feature of Elasticsearch and other vector stores that enables a method for performing searches within complex data types such as images. Through this approach, images are converted into vectors that can be indexed, searched, and compared against each other, revolutionizing the way we can retrieve and analyze image data. This inherent characteristic of producing embeddings applies to other media types as well. This section provides an in-depth overview of the vector search process with images, including image vectorization, vector indexing in Elasticsearch, kNN search, vector similarity metrics, and fine-tuning the kNN algorithm.Image vectorizationThe first phase of the vector search process involves transforming the image data into a vector, a process known as image vectorization. Deep learning models, specifically CNNs, are typically employed for this task. CNNs are designed to understand and capture the intricate features of an image, such as color distribution, shapes, textures, and patterns. By processing an image through layers of convolutional, pooling, and fully connected nodes, a CNN can represent an image as a high-dimensional vector. This vector encapsulates the key features of the image, serving as its numerical representation.The output layer of a pre-trained CNN (often referred to as an embedding or feature vector) is often used for this purpose. Each dimension in this vector represents some learned feature from the image. For instance, one dimension might correspond to the presence of a particular color or texture pattern.The values in the vector quantify the extent to which these features are present in the image.Figure 1 : Layers of a CNN modelAs seen in the preceding diagram, these are the layers of a CNN model:1. Accepts raw pixel values of the image as input.2. Each layer extracts specific features such as edges, corners, textures, and so on.3. Introduces non-linearity, learns from errors, and approximates more complex functions.4. Reduces the dimensions of feature maps through down-sampling to decrease the computational complexity.5. Consists of the weights and biases from the previous layers for the classification process to take place.6. Outputs a probability distribution over classes.Indexing image vectors in ElasticsearchOnce the image vectors have been obtained, the next step is to index these vectors in Elasticsearch for future searching. Elasticsearch provides a special field type, the dense_vector field, to handle the storage of these high-dimensional vectors.A dense_vector field is defined as an array of numeric values, typically floating-point numbers, with a specified number of dimensions (dims). The maximum number of dimensions allowed for indexed vectors is currently 2,048, though this may be further increased in the future. It’s essential to note that each dense_vector field is single-valued, meaning that it is not possible to store multiple values in one such field.In the context of image search, each image (now represented as a vector) is indexed into an Elasticsearch document. This vector can be one per document or multiple vectors per document. The vector representing the image is stored in a dense_vector field within the document. Additionally, other relevant information or metadata about the image can be stored in other fields within the same document.The full example code can be found in the Jupyter Notebook available in the chapter 5 folder of this book’s GitHub repository at https://github.com/PacktPublishing/VectorSearch-for-Practitioners-with-Elastic/tree/main/chapter5, but we’ll discuss the relevant parts here.First, we will initialize a pre-trained model using the SentenceTransformer library.The clip-ViT-B-32-multilingual-v1 model is discussed in detail later in this chapter:model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')Next, we will prepare the image transformation function:transform = transforms.Compose([ transforms.Resize(224), transforms.CenterCrop(224), lambda image: image.convert("RGB"), transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), ])Transforms.Compose() combines all the following transformations:transforms.Resize(224): Resizes the shorter side of the image to 224 pixels while maintaining the aspect ratio.transforms.CenterCrop(224): Crops the center of the image so that the resultant image has dimensions of 224x224 pixels.lambda image: image.convert("RGB"): This is a transformation that converts the image to the RGB format. This is useful for grayscale images or images with an alpha channel, as deep learning models typically expect RGB inputs.transforms.ToTensor(): Converts the image (in the PIL image format) into a PyTorch tensor. This will change the data from a range of [0, 255] in the PIL image format to a float in a range [0.0, 1.0].transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)): Normalizes the tensor image with a given mean and standard deviation for each channel. In this case, the mean and standard deviation for all three channels (R, G, B) are 0.5. This normalization will transform the data range from [0.0, 1.0] to [-1.0, 1.0].We can use the following code to apply the transform to an image file and then generate an image vector using the model. See the Python notebook for this chapter to run against actual image files:from PIL import Image img = Image.open("image_file.jpg") image = transform(img).unsqueeze(0) image_vector = model.encode(image)The vector and other associated data can then be indexed into Elasticsearch for use with kNN search:# Create document document = {'_index': index_name, '_source': {"filename": filename, "image_vector": vector See the complete code in the chapter 5 folder of this book’s GitHub repository.With vectors generated and indexed into Elasticsearch, we can move on to searching for similar images.k-Nearest Neighbor (kNN) searchWith the vectors now indexed in Elasticsearch, the next step is to make use of kNN search. You can refer back to Chapter 2, Getting Started with Vector Search in Elastic, for a full discussion on kNN and HNSW search.As with text-based vector search, when performing vector search with images, we first need to convert our query image to a vector. The process is the same as we used to convert images to vectors at index time.We convert the image to a vector and include that vector in the query_vector parameter of the knn search function:knn = { "field": "image_vector", "query_vector": search_image_vector[0], "k": 1, "num_candidates": 10 }Here, we specify the following:field: The field in the index that contains vector representations of images we are searching againstquery_vector: The vector representation of our query imagek: We want only one closest imagenum_candidates: The number of approximate nearest neighbor candidates on each shard to search againstWith an understanding of how to convert an image to a vector representation and perform an approximate nearest neighbor search, let’s discuss some of the challenges.Challenges and limitations with image searchWhile vector search with images offers powerful capabilities for image retrieval, it also comes with certain challenges and limitations. One of the main challenges is the high dimensionality of image vectors, which can lead to computational inefficiencies and difficulties in visualizing and interpreting the data.Additionally, while pre-trained models for feature extraction can capture a wide range of features, they may not always align with the specific features that are relevant to a particular use case. This can lead to suboptimal search results. One potential solution, not limited to image search, is to use transfer learning to fine-tune the feature extraction model on a specific task, although this requires additional data and computational resources.ConclusionIn conclusion, vector similarity search revolutionizes image retrieval by harnessing advanced algorithms and machine learning. From e-commerce to digital forensics, its impact is profound, enhancing user experiences and content discovery. Leveraging techniques like k-Nearest Neighbor search and Elasticsearch's dense vector field, image search becomes more efficient and scalable. Despite challenges, such as high dimensionality and feature alignment, ongoing advancements promise even greater insights into visual data. As technology evolves, so does our ability to navigate and understand the vast landscape of images, ensuring a future of enhanced digital interactions and insights.Author BioBahaaldine Azarmi, Global VP Customer Engineering at Elastic, guides companies as they leverage data architecture, distributed systems, machine learning, and generative AI. He leads the customer engineering team, focusing on cloud consumption, and is passionate about sharing knowledge to build and inspire a community skilled in AI.Jeff Vestal has a rich background spanning over a decade in financial trading firms and extensive experience with Elasticsearch. He offers a unique blend of operational acumen, engineering skills, and machine learning expertise. As a Principal Customer Enterprise Architect, he excels at crafting innovative solutions, leveraging Elasticsearch's advanced search capabilities, machine learning features, and generative AI integrations, adeptly guiding users to transform complex data challenges into actionable insights.

0
0
30457

article-image-zapiers-ai-features-a-game-changer-for-automation

Kelly Goss

15 Feb 2024

8 min read

Zapier's AI Features: A Game-Changer for Automation

Kelly Goss

15 Feb 2024

8 min read

0
0
30034

How-To Tutorials - AI Tools

Getting Started with Med-PaLM 2

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

Practical AI in Excel: Create a Linear Regression Model

Top 100+ Essential Data Science Tools & Repos: Streamline Your Workflow Today!

Everything you need to know about Pinecone – A Vector Database

Microsoft AI’s Skeleton Key, AutoML with AutoGluon, MultiOn AI's Retrieve API, Narrative BI’s Hybrid AI, Python's Duck Typing, Gibbs Diffusion

Setting Up OpenAI Playground

Setting Up Polars for Data Analysis

Everything You Need to Know about AgentGPT

Getting Started with OpenAI Whisper

Trending Topics

Transforming Web Data with Browse AI

Fabric’s Code-First AutoML and Hyperparameter Tuning, Google Cloud Cortex Framework, Snowflake’s Data Metric Functions, Qlik's AI Accelerator

Elevate Your BI Dashboards with Figma

Enhancing Image Search with Vector Similarity

Zapier's AI Features: A Game-Changer for Automation