
How-To Tutorials


Configuring the ESP8266

Packt
14 Jun 2017
10 min read
In this article by Marco Schwartz, the author of the book ESP8266 Internet of Things Cookbook, we will cover the following recipes:

Setting up the Arduino development environment for the ESP8266
Choosing an ESP8266 board
Required additional components

Setting up the Arduino development environment for the ESP8266

To start us off, we will look at how to set up the Arduino IDE development environment so that we can use it to program the ESP8266. This involves installing the Arduino IDE and getting the board definitions for our ESP8266 module.

Getting ready

The first thing you should do is download the Arduino IDE if you do not already have it installed on your computer. You can do that from https://www.arduino.cc/en/Main/Software. The page features the latest version of the Arduino IDE. Select your operating system and download the latest version available when you access the link (it was 1.6.13 when this article was being written).

When the download is complete, install the Arduino IDE and run it on your computer. Now that the installation is complete, it is time to get the ESP8266 board definitions. Open the Preferences window in the Arduino IDE from File | Preferences or by pressing Ctrl + Comma. Copy this URL: http://arduino.esp8266.com/stable/package_esp8266com_index.json. Paste it into the field labelled Additional Board Manager URLs. If you are adding other URLs too, use a comma to separate them.

Open the Board Manager from the Tools | Board menu and install the ESP8266 platform. The Board Manager will download the board definition files from the link provided in the Preferences window and install them. When the installation is complete, the ESP8266 board definitions will appear in the board list, and you can select your ESP8266 board from the Tools | Board menu.

How it works…

The Arduino IDE is an open source development environment used for programming Arduino boards and Arduino-based boards. It is also used to upload sketches to other open source boards, such as the ESP8266. This makes it an important accessory when creating Internet of Things projects.

Choosing an ESP8266 board

The ESP8266 module is a self-contained System on Chip (SoC) with an integrated TCP/IP protocol stack that allows you to add Wi-Fi capability to your projects. The module is usually mounted on circuit boards that break out the pins of the ESP8266 chip, making it easy for you to program the chip and to interface with input and output devices. ESP8266 boards come in different forms depending on the company that manufactures them. All the boards use Espressif's ESP8266 chip as the main controller, but they have different additional components and different pin configurations, giving each board unique features. Therefore, before embarking on your IoT project, take some time to compare and contrast the different types of ESP8266 boards that are available. This way, you will be able to select the board whose features are best suited to your project.

Available options

The simple ESP8266-01 module is the most basic ESP8266 board available on the market. It has 8 pins, which include 4 General Purpose Input/Output (GPIO) pins, serial communication TX and RX pins, an enable pin, and the power pins VCC and GND. Since it only has 4 GPIO pins, you can only connect three inputs or outputs to it.
The 8-pin header on the ESP8266-01 module has a 2.0 mm pin spacing, which is not compatible with breadboards. Therefore, you have to find another way to connect the ESP8266-01 module to your setup when prototyping; you can use female-to-male jumper wires to do that.

The ESP8266-07 is an improved version of the ESP8266-01 module. It has 16 pins, which comprise 9 GPIO pins, serial communication TX and RX pins, a reset pin, an enable pin, and the power pins VCC and GND. One of the GPIO pins can be used as an analog input pin. The board also comes with a U.FL connector that you can use to plug in an external antenna in case you need to boost the Wi-Fi signal. Since the ESP8266-07 has more GPIO pins, you can have more inputs and outputs in your project. Moreover, it supports both the SPI and I2C interfaces, which can come in handy if you want to use sensors or actuators that communicate using either of those protocols. Programming the board requires an external FTDI breakout board based on a USB-to-serial converter such as the FT232RL chip. The pads/pinholes of the ESP8266-07 have a 2.0 mm spacing, which is not breadboard friendly. To solve this, you have to acquire a plate holder that breaks out the ESP8266-07 pins to a breadboard-compatible configuration with 2.54 mm spacing between the pins. This will make prototyping easier. The board has to be powered from a 3.3V source, which is the operating voltage for the ESP8266 chip.

The Olimex ESP8266 module is a breadboard-compatible board that features the ESP8266 chip. Just like the ESP8266-07 board, it has SPI, I2C, serial UART, and GPIO interface pins. In addition, it comes with a Secure Digital Input/Output (SDIO) interface, which is ideal for communicating with an SD card. This adds 6 extra pins to the configuration, bringing the total to 22 pins. Since the board does not have an on-board USB-to-serial converter, you have to program it using an FTDI breakout board or a similar USB-to-serial board or cable. Moreover, it has to be powered from a 3.3V source, which is the recommended voltage for the ESP8266 chip.

The SparkFun ESP8266 Thing is a development board for the ESP8266 Wi-Fi SoC. It has 20 breadboard-friendly pins, which makes prototyping easy. It features SPI, I2C, serial UART, and GPIO interface pins, enabling it to be interfaced with many input and output devices. There are 8 GPIO pins, including the I2C interface pins. The board has a 3.3V voltage regulator, which allows it to be powered from sources that provide more than 3.3V. It can be powered using a micro USB cable or a LiPo battery, and the USB cable also charges the attached LiPo battery, thanks to the battery charging circuit on the board. Programming has to be done via an external FTDI board.

The Adafruit Feather HUZZAH ESP8266 is a fully stand-alone ESP8266 board. It has a built-in USB-to-serial interface that eliminates the need for an external FTDI breakout board to program it. Moreover, it has an integrated battery charging circuit that charges any connected LiPo battery when the USB cable is connected. There is also a 3.3V voltage regulator on the board that allows it to be powered with more than 3.3V. Though there are 28 breadboard-friendly pins on the board, only 22 are usable. 10 of those pins are GPIO pins and can also be used for SPI and I2C interfacing; one of the GPIO pins is an analog pin.

What to choose?

All the ESP8266 boards will add Wi-Fi connectivity to your project. However, some of them lack important features or are difficult to work with.
So, the best option is to use the module that has the most features and is easy to work with. The Adafruit Feather HUZZAH ESP8266 fits the bill: it is completely stand-alone and easy to power, program, and configure thanks to its on-board features. Moreover, it offers many input/output pins that will let you add more features to your projects, and it is affordable and small enough to fit in projects with limited space.

There's more…

Wi-Fi isn't the only technology we can use to connect our projects to the internet. There are other options, such as Ethernet and 3G/LTE, and there are shields and breakout boards that can be used to add these features to open source projects. You can explore these other options and see which works for you.

Required additional components

To demonstrate how the ESP8266 works, we will use some additional components. These components will help us learn how to read sensor inputs and control actuators using the GPIO pins. Through this, you can post sensor data to the internet and control actuators from internet resources such as websites.

Required components

The components we will use include:

Sensors: DHT11, photocell, soil humidity sensor
Actuators: relay, power switch tail kit, water pump
Breadboard
Jumper wires
Micro USB cable

Sensors

Let us discuss the three sensors we will be using.

DHT11: The DHT11 is a digital temperature and humidity sensor. It uses a thermistor and a capacitive humidity sensor to monitor the humidity and temperature of the surrounding air and produces a digital signal on its data pin. A digital pin on the ESP8266 can be used to read the data from the sensor's data pin.

Photocell: A photocell is a light sensor whose resistance changes depending on the amount of incident light it is exposed to. It can be used in a voltage divider setup to detect the amount of light in the surroundings. In a setup where the photocell sits on the VCC side of the voltage divider, the output of the divider goes high when the light is bright and low when the light is dim. The output of the voltage divider is connected to an analog input pin, from which the voltage readings can be read (a short worked example of this divider calculation follows the sensor descriptions).

Soil humidity sensor: The soil humidity sensor is used for measuring the amount of moisture in soil and other similar materials. It has two large exposed pads that act as a variable resistor. If there is more moisture in the soil, the resistance between the pads decreases, leading to a higher output signal. The output signal is connected to an analog pin, from where its value is read.
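To make the photocell voltage divider behaviour concrete, here is a small Python calculation that is not part of the original article; the supply voltage, resistor values, and the 10-bit ADC range are illustrative assumptions rather than values taken from the text.

import sys

# Hypothetical values: a 10 kOhm fixed resistor on the GND side and a photocell
# whose resistance swings from roughly 1 kOhm (bright) to 100 kOhm (dark).
VCC = 3.3          # supply voltage in volts
R_FIXED = 10_000   # fixed divider resistor in ohms
ADC_MAX = 1023     # 10-bit ADC (0-1023); the usable input range varies by board

def divider_output(r_photocell: float) -> float:
    """Voltage at the divider junction when the photocell is on the VCC side."""
    return VCC * R_FIXED / (r_photocell + R_FIXED)

def adc_reading(voltage: float) -> int:
    """Approximate raw ADC value for a given junction voltage."""
    return round(voltage / VCC * ADC_MAX)

def main() -> int:
    for label, r in [("bright light", 1_000), ("room light", 10_000), ("dark", 100_000)]:
        v = divider_output(r)
        print(f"{label:12s}: R = {r:>7} ohm -> {v:.2f} V (ADC ~ {adc_reading(v)})")
    return 0

if __name__ == "__main__":
    sys.exit(main())

Running it shows the behaviour described above: low photocell resistance in bright light pulls the junction voltage (and the ADC reading) up, while high resistance in the dark pulls it down.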
Actuators

Let's discuss the actuators.

Relays: A relay is a switch that is operated electrically. It uses electromagnetism to switch large loads using small voltages. It comprises three parts: a coil, a spring, and contacts. When the coil is energized by a HIGH signal from a digital pin of the ESP8266, it attracts the contacts, forcing them closed. This completes the circuit and turns on the connected load. When the signal on the digital pin goes LOW, the coil is no longer energized and the spring pulls the contacts apart. This opens the circuit and turns off the connected load.

Power switch tail kit: A power switch tail kit is a device used to control standard wall-outlet devices with microcontrollers. It comes already packaged so that you do not have to mess around with high-voltage wiring. Using it, you can control appliances in your home with the ESP8266.

Water pump: A water pump is used to increase the pressure of fluids in a pipe. It uses a DC motor to rotate a fan and create a vacuum that sucks up the fluid. The sucked fluid is then forced onward by the fan, creating a vacuum again that draws up the fluid behind it. This, in effect, moves the fluid from one place to another.

Breadboard: A breadboard is used to temporarily connect components without soldering. This makes it an ideal prototyping accessory that comes in handy when building circuits.

Jumper wires: Jumper wires are flexible wires used to connect different parts of a circuit on a breadboard.

Micro USB cable: A micro USB cable will be used to connect the Adafruit ESP8266 board to the computer.

Summary

In this article, we learned how to set up the Arduino development environment for the ESP8266, how to choose an ESP8266 board, and which additional components are required.


Essential SQL for Data Engineers

Kedeisha Bryan, Taamir Ransome
31 Oct 2024
10 min read
This article is an excerpt from the book Cracking the Data Engineering Interview, by Kedeisha Bryan and Taamir Ransome. The book is a practical guide that will help you prepare to successfully break into the data engineering role. The chapters cover technical concepts as well as tips for resume, portfolio, and brand building to catch an employer's attention, while also focusing on case studies and real-world interview questions.

Introduction

In the world of data engineering, SQL is the unsung hero that empowers us to store, manipulate, transform, and migrate data easily. It is the language that enables data engineers to communicate with databases, extract valuable insights, and shape data to meet their needs. Regardless of the nature of the organization or the data infrastructure in use, a data engineer will invariably need to use SQL for creating, querying, updating, and managing databases. As such, proficiency in SQL can often be the difference between a good data engineer and a great one. Whether you are new to SQL or looking to brush up your skills, this chapter will serve as a comprehensive guide. By the end of this chapter, you will have a solid understanding of SQL as a data engineer and be prepared to showcase your knowledge and skills in an interview setting.

In this article, we will cover the following topics:

Must-know foundational SQL concepts
Must-know advanced SQL concepts
Technical interview questions

Must-know foundational SQL concepts

In this section, we will delve into the foundational SQL concepts that form the building blocks of data engineering. Mastering these fundamental concepts is crucial for acing SQL-related interviews and effectively working with databases. Let's explore the critical foundational SQL concepts every data engineer should be comfortable with, as follows (a short runnable sketch follows this list):

SQL syntax: SQL syntax is the set of rules governing how SQL statements should be written. As a data engineer, understanding SQL syntax is fundamental because you'll be writing and reviewing SQL queries regularly. These queries enable you to extract, manipulate, and analyze data stored in relational databases.

SQL order of operations: The order of operations dictates the sequence in which the clauses of a query are evaluated: FROM and JOIN, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, and LIMIT/OFFSET.

Data types: SQL supports a variety of data types, such as INT, VARCHAR, and DATE. Understanding these types is crucial because they determine the kind of data that can be stored in a column, impacting storage considerations, query performance, and data integrity. As a data engineer, you might also need to convert data types or handle mismatches.

SQL operators: SQL operators are used to perform operations on data. They include arithmetic operators (+, -, *, /), comparison operators (>, <, =, and so on), and logical operators (AND, OR, and NOT). Knowing these operators helps you construct complex queries to solve intricate data-related problems.

Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL) commands: DML commands such as SELECT, INSERT, UPDATE, and DELETE allow you to manipulate data stored in the database. DDL commands such as CREATE, ALTER, and DROP enable you to manage database schemas. DCL commands such as GRANT and REVOKE are used for managing permissions. As a data engineer, you will frequently use these commands to interact with databases.

Basic queries: Writing queries to select, filter, sort, and join data is an essential skill for any data engineer. These operations form the basis of data extraction and manipulation.

Aggregation functions: Functions such as COUNT, SUM, AVG, MAX, and MIN, typically combined with GROUP BY, perform calculations on multiple rows of data. They are essential for generating reports and deriving statistical insights, which are critical aspects of a data engineer's role.
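To tie these foundational pieces together, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The orders table and its sample rows are invented purely for illustration, and the same DDL, DML, and aggregation statements carry over to other relational databases with minor dialect changes.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define the schema.
cur.execute("""
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,
        customer VARCHAR(50),
        amount   REAL,
        placed   DATE
    )
""")

# DML: insert and update data.
cur.executemany(
    "INSERT INTO orders (customer, amount, placed) VALUES (?, ?, ?)",
    [("Ada", 120.0, "2024-01-05"),
     ("Ada", 80.0, "2024-02-10"),
     ("Grace", 200.0, "2024-01-20")],
)
cur.execute("UPDATE orders SET amount = amount * 1.1 WHERE placed < '2024-02-01'")

# A basic query with filtering, grouping, and ordering. Note the logical
# evaluation order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY.
for row in cur.execute("""
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    WHERE amount > 50
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY total DESC
"""):
    print(row)

conn.close()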
The following section dives deeper into must-know advanced SQL concepts, exploring advanced techniques to elevate your SQL proficiency. Get ready to level up your SQL game and unlock new possibilities in data engineering!

Must-know advanced SQL concepts

This section explores advanced SQL concepts that will take your data engineering skills to the next level. These concepts will empower you to tackle complex data analysis, perform advanced data transformations, and optimize your SQL queries. Let's delve into the must-know advanced SQL concepts, as follows:

Window functions: These perform a calculation across a group of rows that are related to the current row. They are needed for more complex analyses, such as computing running totals or moving averages, which are common tasks in data engineering.

Subqueries: Queries nested within other queries. They provide a powerful way to perform complex data extraction, transformation, and analysis, often making your code more efficient and readable.

Common Table Expressions (CTEs): CTEs can simplify complex queries and make your code more maintainable. They are also essential for recursive queries, which are sometimes necessary for problems involving hierarchical data.

Stored procedures and triggers: Stored procedures help encapsulate frequently performed tasks, improving efficiency and maintainability. Triggers can automate certain operations, improving data integrity. Both are important tools in a data engineer's toolkit.

Indexes and optimization: Indexes speed up query performance by enabling the database to locate data more quickly. Understanding how and when to use indexes is key for a data engineer, as it affects the efficiency and speed of data retrieval.

Views: Views simplify access to data by encapsulating complex queries. They can also enhance security by restricting access to certain columns. As a data engineer, you'll create and manage views to facilitate data access and manipulation.

By mastering these advanced SQL concepts, you will have the tools and knowledge to handle complex data scenarios, optimize your SQL queries, and derive meaningful insights from your datasets. A brief, runnable illustration of CTEs and window functions follows this list.
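The sketch below again uses Python's sqlite3 module and assumes an SQLite build of 3.25 or newer (required for window functions); the sales table and its figures are made up purely for demonstration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('east', '2024-01', 100), ('east', '2024-02', 150), ('east', '2024-03', 120),
        ('west', '2024-01',  90), ('west', '2024-02', 110), ('west', '2024-03', 130);
""")

# A CTE narrows the data to one year, then a window function computes a
# running total per region without collapsing rows the way GROUP BY would.
query = """
    WITH yearly AS (
        SELECT region, month, revenue
        FROM sales
        WHERE month LIKE '2024-%'
    )
    SELECT region,
           month,
           revenue,
           SUM(revenue) OVER (
               PARTITION BY region
               ORDER BY month
           ) AS running_total
    FROM yearly
    ORDER BY region, month
"""
for row in conn.execute(query):
    print(row)

conn.close()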
The following section will prepare you for technical interview questions on SQL. We will equip you with example answers and strategies to excel in SQL-related interview discussions. Let's further enhance your SQL expertise and be well prepared for the next phase of your data engineering journey.

Technical interview questions

This section addresses technical interview questions specifically focused on SQL for data engineers. These questions will help you demonstrate your SQL proficiency and problem-solving abilities. Let's explore a combination of basic and advanced SQL interview questions and the best ways to approach and answer them, as follows:

Question 1: What is the difference between the WHERE and HAVING clauses?
Answer: The WHERE clause filters data based on conditions applied to individual rows, while the HAVING clause filters data based on grouped results. Use WHERE for filtering before aggregating data and HAVING for filtering after aggregating data.

Question 2: How do you eliminate duplicate records from a result set?
Answer: Use the DISTINCT keyword in the SELECT statement to eliminate duplicate records and retrieve unique values from a column or combination of columns.

Question 3: What are primary keys and foreign keys in SQL?
Answer: A primary key uniquely identifies each record in a table and ensures data integrity. A foreign key establishes a link between two tables, referencing the primary key of another table to enforce referential integrity and maintain relationships.

Question 4: How can you sort data in SQL?
Answer: Use the ORDER BY clause in a SELECT statement to sort data based on one or more columns. The ASC (ascending) keyword sorts data in ascending order, while the DESC (descending) keyword sorts it in descending order.

Question 5: Explain the difference between UNION and UNION ALL in SQL.
Answer: UNION combines result sets and removes duplicate records, while UNION ALL combines all records without eliminating duplicates. UNION ALL is faster than UNION because it skips the duplicate-elimination step.

Question 6: Can you explain what a self join is in SQL?
Answer: A self join is a regular join in which a table is joined to itself. This is often useful when data is related within the same table. To perform a self join, we have to use table aliases to help SQL distinguish the left table from the right table.

Question 7: How do you optimize a slow-performing SQL query?
Answer: Analyze the query execution plan, identify bottlenecks, and consider strategies such as creating appropriate indexes, rewriting the query, or using query optimization techniques such as JOIN order optimization or subquery optimization.

Question 8: What are CTEs, and how do you use them?
Answer: CTEs are temporarily named result sets that can be referenced within a query. They enhance query readability, simplify complex queries, and enable recursive queries. Use the WITH keyword to define CTEs in SQL.

Question 9: Explain the ACID properties in the context of SQL databases.
Answer: ACID stands for Atomicity, Consistency, Isolation, and Durability. These are the fundamental properties that make database operations reliable and transactional. Atomicity ensures that a transaction is handled as a single unit: it either completes fully or not at all. Consistency ensures that a transaction moves the database from one valid state to another. Isolation ensures that concurrent transactions do not interfere with each other. Durability ensures that once a transaction is committed, its changes are permanent and can survive system failures.

Question 10: How can you handle NULL values in SQL?
Answer: Use the IS NULL or IS NOT NULL operator to check for NULL values. Additionally, you can use the COALESCE function to replace NULL values with alternative non-null values.

Question 11: What is the purpose of stored procedures and functions in SQL?
Answer: Stored procedures and functions are reusable pieces of SQL code encapsulating a set of SQL statements. They promote code modularity, improve performance, enhance security, and simplify database maintenance.

Question 12: Explain the difference between a clustered and a non-clustered index.
Answer: A clustered index determines the physical order of the data in a table, which means a table can have only one clustered index.
The data rows of the table are stored in the leaf nodes of the clustered index. A non-clustered index, on the other hand, does not change the physical order of the data in the table. It maintains a separate structure of sorted pointers that refer back to the original table rows, and a table can have more than one non-clustered index.

Prepare for these interview questions by understanding the underlying concepts, practicing SQL queries, and being able to explain your answers.

Conclusion

This article explored the foundational and advanced principles of SQL that empower data engineers to store, manipulate, transform, and migrate data confidently. Understanding these concepts unlocks the door to seamless data operations, optimized query performance, and insightful data analysis. SQL is the language that bridges the gap between raw data and valuable insights. With a solid grasp of SQL, you possess the skills to navigate databases, write powerful queries, and design efficient data models. Whether preparing for interviews or tackling real-world data engineering challenges, the knowledge you have gained here will propel you toward success. Remember to continue exploring and honing your SQL skills. Stay updated with emerging SQL technologies, best practices, and optimization techniques to stay at the forefront of the ever-evolving data engineering landscape. Embrace the power of SQL as a critical tool in your data engineering arsenal, and let it empower you to unlock the full potential of your data.

Author Bio

Kedeisha Bryan is a data professional with experience in data analytics, science, and engineering. She has prior experience combining both Six Sigma and analytics to provide data solutions that have impacted policy changes and leadership decisions. She is fluent in tools such as SQL, Python, and Tableau. She is the founder and leader of the Data in Motion Academy, providing personalized skill development, resources, and training at scale to aspiring data professionals across the globe. Her other works include another Packt book in the works and an SQL course for LinkedIn Learning.

Taamir Ransome is a data scientist and software engineer. He has experience in building machine learning and artificial intelligence solutions for the US Army. He is also the founder of the Vet Dev Institute, where he currently provides cloud-based data solutions for clients. He holds a master's degree in analytics from Western Governors University.


Implementing Software Engineering Best Practices and Techniques with Apache Maven

Packt
24 Aug 2011
10 min read
Apache Maven 3 Cookbook: Over 50 recipes towards optimal Java Software Engineering with Maven 3

These techniques have been around for more than a decade and are well known by practitioners of software engineering. The benefits, trade-offs, and pros and cons of these practices are well understood and need only a little mention here. The practices are not inter-dependent, but some of them are inter-related in the larger scheme of things. One such example is the relation between project modularization and dependency management: while nothing stops either from being implemented in isolation, they are more beneficial when implemented together. These techniques can be further supplemented by industry best practices such as continuous integration, maintaining centralized repositories, source code integration, and so on. Our focus here will be on steadily understanding these software engineering techniques within the context of Maven projects, and we will look at practical ways to implement and integrate them.

Build automation

Build automation is the scripting of tasks that software developers have to do on a day-to-day basis. These tasks include:

Compilation of source code to binary code
Packaging of binary code
Running tests
Deployment to remote systems
Creation of documentation and release notes

Build automation offers a range of benefits, including faster builds, elimination of bad builds, standardization in teams and organizations, increased efficiency, and improvements in product quality. Today, it is considered an absolute essential for software engineering practitioners.

Getting ready

You need to have a Maven project ready. If you don't have one, run the following in the command line to create a simple Java project:

$ mvn archetype:generate -DgroupId=net.srirangan.packt.maven -DartifactId=MySampleApp

How to do it...

The archetype:generate command generates a sample Apache Maven project for us. If we choose the maven-archetype-quickstart archetype from the list, our project structure will look similar to the following:

└───src
    ├───main
    │   └───java
    │       └───net
    │           └───srirangan
    │               └───packt
    │                   └───maven
    └───test
        └───java
            └───net
                └───srirangan
                    └───packt
                        └───maven

In every Apache Maven project, including the one we just generated, the build is pre-automated following the default build lifecycle. Follow the steps given next to validate this. Start the command-line terminal, navigate to the root of the Maven project, and try running the following commands in serial order:

$ mvn validate
$ mvn compile
$ mvn package
$ mvn test

You just triggered some of the phases of the build lifecycle through individual commands. Maven lets you automate the running of all the phases in the correct order: just execute mvn install, and it will encapsulate much of the default build lifecycle, including compiling, testing, packaging, and installing the artifact in the local repository.

How it works...

For every Apache Maven project, regardless of the packaging type, the default build lifecycle is applied and the build is automated. As we just witnessed, the default build lifecycle consists of phases that can be executed from the command-line terminal.
These phases are:

Validate: Validates that all project information is available and correct
Compile: Compiles the source code
Test: Runs unit tests within a suitable framework
Package: Packages the compiled code in its distribution format
Integration-test: Processes the package in the integration test environment
Verify: Runs checks to verify that the package is valid
Install: Installs the package in the local repository
Deploy: Installs the final package in a remote repository

Each of the build lifecycle phases is carried out by a Maven plugin. When you execute them for the first time, Apache Maven downloads the plugin from the default online Maven Central Repository (http://repo1.maven.org/maven2) and installs it in your local Apache Maven repository. This ensures that build automation is always set up in a consistent manner for everyone in the team, while the specifics and internals of the build are abstracted away. Maven build automation also pushes for standardization among different projects within an organization, as the commands to execute build phases remain the same.

Project modularization

Consider building a large enterprise application: it will need to interact with a legacy database, work with existing services, provide a modern web and device-capable user interface, and expose APIs for other applications to consume. It makes sense to split such a large project into subprojects, or modules. Apache Maven provides impeccable support for this kind of project organization through multi-modular projects. A multi-modular project consists of a "parent project" that contains "child projects", or "modules". The parent project's POM file contains references to all these sub-modules, and each module can be of a different type, with a different packaging value.

Getting ready

We begin by creating the parent project. Remember to set the value of packaging to pom, as highlighted in the following code:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>net.srirangan.packt.maven</groupId>
  <artifactId>TestModularApp</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>pom</packaging>
  <name>MyLargeModularApp</name>
</project>

This is the base parent POM file for our project MyLargeModularApp. It doesn't contain any sub-modules for now.

How to do it...

To create your first sub-module, start the command-line terminal, navigate to the parent POM directory, and run the following command:

$ mvn archetype:generate

This displays a list of archetypes to select from. You can pick archetype number 101, maven-archetype-quickstart, which generates a basic Java project. The archetype:generate command also requires you to fill in the Apache Maven project coordinates, including the project groupId, artifactId, package, and version. After project generation, inspect the POM file of the original parent project. You will find the following block added:

<modules>
  <module>moduleJar</module>
</modules>

The sub-module we created has been automatically added to the parent POM. It simply works, with no intervention required!
We now create another sub-module, this time a Maven web application, by running the following in the command line:

$ mvn archetype:generate -DarchetypeArtifactId=maven-archetype-webapp

Let's have another look at the parent POM file; we should see both sub-modules included:

<modules>
  <module>moduleJar</module>
  <module>moduleWar</module>
</modules>

Our overall project structure should look like this:

MyLargeModularApp
├───MyModuleJar
│   └───src
│       ├───main
│       │   └───java
│       │       └───net
│       │           └───srirangan
│       │               └───packt
│       │                   └───maven
│       └───test
│           └───java
│               └───net
│                   └───srirangan
│                       └───packt
│                           └───maven
└───MyModuleWar
    └───src
        └───main
            ├───resources
            └───webapp
                └───WEB-INF

How it works...

Compiling and installing both sub-modules (in the correct order, in case the sub-modules are interdependent) is essential. It can be done from the command line by navigating to the parent POM folder and running the following command:

$ mvn clean install

Thus, executing a build phase on the parent project automatically executes it for all of its child projects in the correct order. You should get an output similar to:

------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] MyLargeModularApp ........................ SUCCESS [0.439s]
[INFO] MyModuleJar .............................. SUCCESS [3.047s]
[INFO] MyModuleWar Maven Webapp ................. SUCCESS [0.947s]
------------------------------------------------------------------
[INFO] BUILD SUCCESS
------------------------------------------------------------------

Dependency management

Dependency management is universally acknowledged as one of the best features of Apache Maven. In multi-modular projects, where dependencies can run into the tens or even hundreds, Apache Maven excels at letting you retain a high degree of control and stability. Apache Maven dependencies are transitive, which means Maven will automatically discover the artifacts that your dependencies require. This feature has been available since Maven 2, and it especially comes in handy for the many open source project dependencies in today's enterprise projects.

Getting ready

Maven dependencies have six possible scopes:

Compile: This is the default scope. Compile dependencies are available in all classpaths.
Provided: This scope assumes that the JDK or the environment provides the dependency at runtime.
Runtime: Dependencies that are required at runtime and are specified in the runtime classpaths.
Test: Dependencies required for test compilation and execution.
System: The dependency is always available, but the JAR has to be provided explicitly.
Import: Imports dependencies specified in a POM included via the <dependencyManagement/> element.

How to do it...

Dependencies for Apache Maven projects are described in project POM files. While we take a closer look at these in the How it works... section of this recipe, here we will explore the Apache Maven dependency plugin. According to http://maven.apache.org/plugins/maven-dependency-plugin/:

"The dependency plugin provides the capability to manipulate artifacts. It can copy and/or unpack artifacts from local or remote repositories to a specified location."

It's a decent little plugin and provides us with a number of very useful goals.
They are as follows:

$ mvn dependency:analyze — analyzes dependencies (used, unused, declared, undeclared)
$ mvn dependency:analyze-duplicate — determines duplicate dependencies
$ mvn dependency:resolve — resolves all dependencies
$ mvn dependency:resolve-plugins — resolves all plugins
$ mvn dependency:tree — displays the dependency tree

How it works...

Most Apache Maven projects have dependencies on other artifacts (that is, other projects, libraries, and tools). Management of dependencies and their seamless integration is one of Apache Maven's strongest features. The dependencies for a Maven project are specified in the project's POM file:

<dependencies>
  <dependency>
    <groupId>...</groupId>
    <artifactId>...</artifactId>
    <version>...</version>
    <scope>...</scope>
  </dependency>
</dependencies>

In multi-modular projects, dependencies can be defined in the parent POM file and subsequently inherited by child POM files as and when required. Having a single source for all dependency definitions makes dependency versioning simpler, keeping a large project's dependencies organized and manageable over time. The following example shows a multi-modular project with a MySQL dependency. The parent POM contains the complete definition of the dependency:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>

All child modules that require MySQL include only a stub dependency definition:

<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
</dependency>

There will be no version conflicts between multiple child modules having the same dependencies. A dependency's scope and type default to compile and jar; however, they can be overridden as required:

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.8.2</version>
  <scope>test</scope>
</dependency>

<dependency>
  <groupId>...</groupId>
  <artifactId>...</artifactId>
  <version>...</version>
  <type>war</type>
</dependency>

There's more...

System dependencies are not looked for in the repository. For them, we need to specify the path to the JAR:

<dependencies>
  <dependency>
    <groupId>sun.jdk</groupId>
    <artifactId>tools</artifactId>
    <version>1.5.0</version>
    <scope>system</scope>
    <systemPath>${java.home}/../lib/tools.jar</systemPath>
  </dependency>
</dependencies>

However, avoiding system dependencies is strongly recommended, because they defeat the whole purpose of Apache Maven dependency management. Ideally, a developer should be able to clone code out of the SCM and run Apache Maven commands; after that, it should be Apache Maven's responsibility to take care of including all dependencies. System dependencies force the developer to take extra steps, and that dilutes the effectiveness of Apache Maven in your team environment.
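As a side note, because the POM is plain XML, it is easy to inspect programmatically. The following hypothetical Python sketch, which is not part of the original recipe, uses only the standard library to print the coordinates of every dependency declared in a pom.xml; this can be handy when reviewing the dependency definitions of a large multi-modular project.

import sys
import xml.etree.ElementTree as ET

# Maven POM files declare this default namespace, so lookups must include it.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def list_dependencies(pom_path: str) -> None:
    """Print groupId:artifactId:version(:scope) for every dependency in a POM,
    including those declared under dependencyManagement."""
    root = ET.parse(pom_path).getroot()
    for dep in root.findall(".//m:dependencies/m:dependency", POM_NS):
        coords = {
            tag: (dep.findtext(f"m:{tag}", default="", namespaces=POM_NS) or "").strip()
            for tag in ("groupId", "artifactId", "version", "scope")
        }
        line = f"{coords['groupId']}:{coords['artifactId']}:{coords['version'] or '(inherited)'}"
        if coords["scope"]:
            line += f" [scope: {coords['scope']}]"
        print(line)

if __name__ == "__main__":
    # Usage: python list_deps.py path/to/pom.xml
    list_dependencies(sys.argv[1] if len(sys.argv) > 1 else "pom.xml")

In a child module that relies on the parent's dependencyManagement block, the version prints as "(inherited)", which mirrors the stub-definition pattern shown above.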


Gemini 1.0 Pro Vision in BigQuery, Python UI Library, Feature Engineering with Fabric and PySpark, Power analytics with Redshift, Amazon RDS for MySQL

Merlyn Shelley
19 Apr 2024
14 min read
👋 Hello,

Welcome to BI-Pro #52: Your Premier Destination for Data and BI Insights! 🌟

In This Edition:

🔮 Data Viz with Python Libraries
Exploring causality with Python.
Meet NiceGUI: Your Soon-to-be Favorite Python UI Library.
Feature Engineering with Microsoft Fabric and PySpark.
10 GitHub Repositories to Master Python.

🔌 Power BI
On-premises data gateway April 2024 release.
Copilot in Power BI expansion.

🛠️ Microsoft Fabric
Introducing Optimistic Job Admission for Fabric Spark.
Introducing Job Queueing for Notebook in Microsoft Fabric.

☁️ AWS BI
Meet Amazon QuickSight expert Sanjeeb Mohapatra.
Handle tables without primary keys for Amazon Aurora MySQL and Amazon RDS for MySQL.
Power analytics with Amazon Redshift.

🌐 Google Cloud Data
Gemini 1.0 Pro Vision in BigQuery.
BigQuery data canvas.
Gemini in Looker AI-powered BI.
Memorystore for Redis Cluster updates.
Firestore launch updates.

📊 Tableau
Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools.
Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever.

✨ Expert Insights from Packt Community
ChatGPT for Cybersecurity Cookbook by Clint Bodungen.

💡 What's the Latest Scoop from the BI Community?
Geospatial Data Analysis with Geemap.
Microsoft Fabric Table Maintenance - Checkpoint and Statistics.
Identifying Customer Buying Pattern in Power BI - Part 1.
Full vs. Incremental Loads – Data Engineering with Fabric.
Joining Queries in Azure Data Factory on Cosmos DB Sources.
Feature Engineering with Microsoft Fabric and Dataflow Gen2.

Stay ahead in the ever-evolving landscape of business intelligence with BI-Pro. Unleash the full potential of your data today!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

🚀 GitHub's Most Sought-After Repos

🐾 altair - Vega-Altair is a Python library for statistical visualization, offering simplicity, friendliness, and consistency for creating beautiful and effective visualizations.

🐾 bokeh - Bokeh is a Python library for creating interactive plots and data applications in web browsers, offering elegant and versatile graphics.

🐾 bqplot - bqplot is a 2-D visualization system for Jupyter, based on the Grammar of Graphics, enabling interactive plots with other Jupyter widgets.

🐾 cartopy - Cartopy simplifies map drawing in Python, offering easy projection definitions, point transformations, and integration with Matplotlib for advanced mapping.

🐾 diagrams - Diagrams simplifies cloud system architecture design in Python, supporting major providers and frameworks, allowing prototyping and visualization of existing architectures.
🔮 Data Viz with Python Libraries

🐍 Exploring causality with Python. Difference-in-differences: The series dives into causal inference, crucial in modern analytics, explaining tools like difference-in-differences. It explores how events impact outcomes, using examples such as minimum wage effects on employment. The setup involves treatment and control groups to establish cause-and-effect relationships in diverse real-world scenarios.

🐍 Meet NiceGUI: Your Soon-to-be Favorite Python UI Library. NiceGUI is a Python UI framework for web and desktop apps, offering a simple interface for small projects, dashboards, and robotics. It simplifies state management and interaction, boasting features like easy layout, visualization tools, and integration with popular libraries.

🐍 Feature Engineering with Microsoft Fabric and PySpark: The post delves into feature engineering in Microsoft Fabric, emphasizing its importance in ML development. It explores PySpark's role in handling large datasets and provides a basic overview and example of using PySpark for feature engineering.

🐍 10 GitHub Repositories to Master Python: The blog explores 10 essential GitHub repositories for mastering Python, emphasizing hands-on experience and real-world projects to enhance skills. It covers a range of topics, from beginner to advanced, including machine learning, web development, and data analysis: Asabeneh/30-Days-Of-Python, trekhleb/learn-python, Avik-Jain/100-Days-Of-ML-Code, realpython/python-guide, zhiwehu/Python-programming-exercises, geekcomputers/Python, practical-tutorials/project-based-learning, avinashkranjan/Amazing-Python-Scripts, TheAlgorithms/Python, and vinta/awesome-python.

⚡ Stay Informed with Industry Highlights

Power BI

📊 On-premises data gateway April 2024 release: This update to the on-premises data gateway aligns it with the April 2024 release of Power BI Desktop, ensuring consistency in query execution. Additionally, the gateway now supports refreshes longer than one hour, allowing tokens to be refreshed mid-stream for continuous operation.

📊 Copilot in Power BI: Soon available to more users in your organization. The update introduces changes to Copilot in Power BI, including enabling Copilot by default for all tenants starting May 20th, 2024. It also addresses features reported by customers and the community, updates abuse monitoring to not store prompts, and improves geo mapping for EU data boundary customers.

Microsoft Fabric

📊 Introducing Optimistic Job Admission for Fabric Spark: The post introduces Optimistic Job Admission for Spark in Microsoft Fabric, a new feature aimed at improving concurrency and the job admission experience. It explains how this feature optimizes resource allocation and increases the number of concurrent jobs that can be admitted to the cluster.

📊 Introducing Job Queueing for Notebook in Microsoft Fabric: Microsoft Fabric introduces job queueing for notebook jobs to streamline data engineering and data science processes. This feature automatically queues notebook jobs when Fabric capacity is maxed out, eliminating manual retries and improving the user experience. Jobs are retried when resources become available, enhancing efficiency for enterprise users.

AWS BI

📊 Meet one of Amazon QuickSight's Top Community Experts: Sanjeeb Mohapatra. The Amazon QuickSight Community, launched in 2022, is a hub for BI authors and developers to collaborate, ask and answer questions, and learn about QuickSight.
Sanjeeb Mohapatra, the top Community Expert for 2023, exemplifies the community's spirit by providing over 1,700 replies and 235 solutions in one year.

📊 Handle tables without primary keys while creating Amazon Aurora MySQL or Amazon RDS for MySQL zero-ETL integrations with Amazon Redshift: AWS is advancing its zero-ETL vision with Amazon Aurora zero-ETL integration to Amazon Redshift, combining transactional data with analytics capabilities. This integration, along with four new ones announced at re:Invent 2023, empowers customers to implement near real-time analytics for various use cases.

📊 Power analytics as a service capabilities using Amazon Redshift: Analytics as a service (AaaS) leverages cloud-based analytic capabilities to enable cost-effective, scalable solutions for organizations. Amazon Redshift, a cloud data warehouse service, facilitates real-time insights and predictive analytics, empowering AaaS providers to embed rich data analytics capabilities. Delivery models include managed, bring-your-own-Redshift (BYOR), and hybrid options, offering flexibility to meet customer needs.

Google Cloud Data

📊 How to use Gemini 1.0 Pro Vision in BigQuery? BigQuery integrates with Vertex AI to leverage Gemini 1.0 Pro, PaLM, Vision AI, Speech AI, Doc AI, and Natural Language AI, enabling analysis of unstructured data such as images, audio, and documents. New integrations support multimodal generative AI, enhancing capabilities for object recognition, information seeking, captioning, digital content understanding, and structured content generation, allowing structured data output for deeper analysis.

📊 Get to know BigQuery data canvas: BigQuery data canvas simplifies the data-to-insights journey by offering a natural language-driven experience. It centralizes data tasks, accelerates analysis, and fosters collaboration, all within a unified workspace, enabling faster and more efficient data analytics.

📊 Gemini in Looker to bring intelligent AI-powered BI to everyone: Gemini in Looker introduces Conversational Analytics, transforming how businesses engage with data. It offers a natural language-driven experience, simplifying data analytics and fostering collaboration, all within a unified workspace.

📊 Memorystore for Redis Cluster updates at Next '24: The article elaborates on the rapid adoption and recent enhancements of Google Cloud's Memorystore for Redis Cluster. It features customer testimonials from companies like Statsig, Character.AI, and AXON Networks, showcasing the service's performance, scalability, and cost-effectiveness. It also highlights new features such as data persistence, new node types, and ultra-fast vector search.

📊 Firestore launches at Next '24: Firestore is beloved by developers for its speed in app development. Updates include improved developer productivity, AI-enabled app building, richer queries, and enterprise-level scalability. Gemini Code Assist now supports Firestore, allowing natural language queries and data model definitions, enhancing the development experience. Firestore also supports AI applications and integrations with LangChain and LlamaIndex for generative AI.

Tableau

📊 Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools. The comparison delves into the unique strengths of Tableau and Power BI, showcasing how each excels in different areas of data visualization and analytics.
It outlines Tableau's robust visualization and analytics capabilities, especially for large datasets, contrasting with Power BI's integration with Microsoft services and affordability for small to medium-sized businesses.

📊 Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever: Salesforce is reportedly in advanced talks to acquire Informatica, a data-management software provider, for $11 billion. This aligns with Salesforce's strategy to expand beyond CRM and become a comprehensive data journey platform, bolstered by recent AI advancements like Einstein Copilot and complemented by Informatica's data integration expertise and potential synergy with Tableau and MuleSoft.

✨ Expert Insights from Packt Community

ChatGPT for Cybersecurity Cookbook - By Clint Bodungen

Sending API Requests and Handling Responses with Python

In this recipe, we will explore how to send requests to the OpenAI GPT API and handle the responses using Python. We'll walk through the process of constructing API requests, sending them, and processing the responses using the openai module.

Getting ready

Ensure you have Python installed on your system. Install the OpenAI Python module by running the following command in your terminal or command prompt:

pip install openai

How to do it…

The importance of using the API lies in its ability to communicate with and get valuable insights from ChatGPT in real time. By sending API requests and handling responses, you can harness the power of GPT to answer questions, generate content, or solve problems in a dynamic and customizable way. In the following steps, we'll demonstrate how to construct API requests, send them, and process the responses, enabling you to effectively integrate ChatGPT into your projects or applications.

Start by importing the required modules:

import openai
from openai import OpenAI
import os

Set up your API key by retrieving it from an environment variable, as we did in the Setting the OpenAI API key as an Environment Variable recipe:

openai.api_key = os.getenv("OPENAI_API_KEY")

Define a function to send a prompt to the OpenAI API and receive a response:

client = OpenAI()

def get_chat_gpt_response(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2048,
        temperature=0.7
    )
    return response.choices[0].message.content.strip()

Call the function with a prompt to send a request and receive a response:

prompt = "Explain the difference between symmetric and asymmetric encryption."
response_text = get_chat_gpt_response(prompt)
print(response_text)

How it works…

First, we import the required modules. The openai module is the OpenAI API library, and the os module helps us retrieve the API key from an environment variable. We set up the API key by retrieving it from an environment variable using the os module. Next, we define a function called get_chat_gpt_response() that takes a single argument: the prompt. This function sends a request to the OpenAI API using the client.chat.completions.create() method. This method has several parameters:

model: The model used to generate the response (in this case, gpt-3.5-turbo).
messages: The input for the model to respond to, expressed as a list of chat messages.
max_tokens: The maximum number of tokens in the generated response. A token can be as short as one character or as long as one word.
n: The number of generated responses you want to receive from the model.
Here we rely on the default of 1 to receive a single response.
stop: A sequence of tokens that, if encountered by the model, will stop the generation process. This can be useful for limiting the response's length or stopping at specific points, such as the end of a sentence or paragraph.
temperature: A value that controls the randomness of the generated response. A higher temperature (for example, 1.0) will result in more random responses, while a lower temperature (for example, 0.1) will make the responses more focused and deterministic.

Discover more insights from ChatGPT for Cybersecurity Cookbook by Clint Bodungen. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

💡 What's the Latest Scoop from the BI Community?

🧠 Geospatial Data Analysis with Geemap: This article introduces geospatial data analysis, focusing on raster data from Google Earth Engine, accessed and analyzed using the Geemap Python library. Earth Engine offers a vast catalog of geospatial datasets, and Geemap simplifies access and analysis, making it easier to work with such data in Python.

🧠 Microsoft Fabric Table Maintenance - Checkpoint and Statistics: This article discusses the maintenance requirements for warehouse tables in Microsoft Fabric, particularly focusing on tasks like updating statistics, removing fragmentation, and managing log files. While some maintenance tasks, such as data compaction and log file checkpointing, are automated, others, like managing statistics, may require manual intervention.

🧠 Identifying Customer Buying Pattern in Power BI - Part 1: This article is part 1 of a retail analytics analysis in Power BI, focusing on customer purchasing frequency for various products over the years. It includes identifying data elements, creating calculated columns, and analyzing trends to aid in business decision-making.

🧠 Full vs. Incremental Loads – Data Engineering with Fabric: This article discusses using Apache Spark in Microsoft Fabric to achieve data quality zones (bronze and silver) in a data lake. It explores loading weather data, transforming it with Spark SQL and DataFrames, and implementing full and incremental load patterns.

🧠 Joining Queries in Azure Data Factory on Cosmos DB Sources: This article provides a detailed guide on joining two queries in Azure Data Factory (ADF). It covers prerequisites, creation of data sources, defining queries for each dataset, and using the "Join" transformation in ADF to merge data. Different join types, such as inner, left outer, right outer, and full outer joins, are explained.

🧠 Feature Engineering with Microsoft Fabric and Dataflow Gen2: This article introduces Dataflow Gen2 as a low-code data transformation and integration engine for creating data pipelines in Microsoft Fabric. It focuses on using Dataflow Gen2 to create features needed for training a machine learning model with college basketball game data, offering different approaches from no code to all code.

See you next time!


How to start using AWS

Packt
23 Feb 2017
12 min read
In this article by Lucas Chan and Rowan Udell, authors of the book AWS Administration Cookbook, we will cover the following:

Infrastructure as Code
AWS CloudFormation

What is AWS?

Amazon Web Services (AWS) is a public cloud provider. It provides infrastructure and platform services at a pay-per-use rate. This means you get on-demand access to resources that you used to have to buy outright. You can get access to enterprise-grade services while only paying for what you need, usually down to the hour. AWS prides itself on providing the primitives to developers so that they can build and scale the solutions that they require.

How to create an AWS account

In order to follow along, you will need an AWS account. Create one on the AWS website by clicking on the Sign Up button and entering your details.

Regions and Availability Zones on AWS

A fundamental concept of AWS is that its services, and the solutions built on top of them, are architected for failure. This means that a failure of the underlying resources is a scenario actively planned for, rather than avoided until it cannot be ignored. Because of this, all the services and resources available are divided up into geographically diverse regions. Using specific regions means you can provide services to your users that are optimized for speed and performance.

Within a region, there are always multiple Availability Zones (a.k.a. AZs). Each AZ represents a geographically distinct, but still close, physical data center. AZs have their own facilities and power source, so an event that might take a single AZ offline is unlikely to affect the other AZs in the region. The smaller regions have at least two AZs, and the largest has five. At the time of writing, the following regions are active:

Code             Name            Availability Zones
us-east-1        N. Virginia     5
us-east-2        Ohio            3
us-west-1        N. California   3
us-west-2        Oregon          3
ca-central-1     Canada          2
eu-west-1        Ireland         3
eu-west-2        London          2
eu-central-1     Frankfurt       2
ap-northeast-1   Tokyo           3
ap-northeast-2   Seoul           2
ap-southeast-1   Singapore       2
ap-southeast-2   Sydney          3
ap-south-1       Mumbai          2
sa-east-1        Sao Paulo       3

The AWS web console

The web-based console is the first thing you will see after creating your AWS account, and you will often refer to it when viewing and confirming your configuration. The console provides an overview of all the services available as well as associated billing and cost information. Each service has its own section, and the information displayed depends on the service being viewed. As new features and services are released, the console will change and improve; don't be surprised if you log in and things have changed from one day to the next.

Keep in mind that the console always shows your resources by region. If you cannot see a resource that you created, make sure you have the right region selected. Choose the region closest to your physical location for the fastest response times. Note that not all regions have the same services available. The larger, older regions generally have the most services available, while some of the newer or smaller regions (which might be closest to you) might not have all services enabled yet. While services are continually being released to regions, you may have to use another region if you simply must use a newer service. The us-east-1 (a.k.a. North Virginia) region is special, given its status as the first region: all services are available there, and new services are always released there.
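If you prefer to explore regions and Availability Zones programmatically rather than through the console, the short sketch below uses the boto3 SDK. It assumes boto3 is installed and AWS credentials are configured locally, and the AZ counts it prints will reflect the current state of AWS rather than the table above.

import boto3

# Any region works as the entry point for describe_regions.
ec2 = boto3.client("ec2", region_name="us-east-1")

# List every region enabled for this account.
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in sorted(regions):
    regional_ec2 = boto3.client("ec2", region_name=region)
    azs = regional_ec2.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
    print(f"{region}: {len(azs)} Availability Zones")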
As you get more advanced with your usage of AWS, you will spend less time in the console and more time controlling your services programmatically via the AWS CLI tool and CloudFormation, which we will go into in more detail in the next few topics.

CloudFormation templates

CloudFormation is the Infrastructure as Code service from AWS. Where CloudFormation was not applicable, we have used the AWS CLI to make the process repeatable and automatable. Since the recipes are based on CloudFormation templates, you can easily combine different templates to achieve your desired outcomes. By editing the templates or joining them, you can create more useful and customized configurations with minimal effort.

What is Infrastructure as Code?

Infrastructure as Code (IaC) is the practice of managing infrastructure through code definitions. On an Infrastructure-as-a-Service (IaaS) platform such as AWS, IaC is needed to get the most utility and value. IaC differs primarily from traditional interactive methods of managing infrastructure because it is machine processable. This enables a number of benefits:

Improved visibility of resources
Higher levels of consistency between deployments and environments
Easier troubleshooting of issues
The ability to scale more with less effort
Better control over costs

On a less tangible level, all of these factors contribute to other improvements for your developers: you can now leverage tried-and-tested software development practices for your infrastructure and enable DevOps practices in your teams.

Visibility

As your infrastructure is represented in machine-readable files, you can treat it like you do your application code. You can take the best-practice approaches to software development and apply them to your infrastructure. This means you can store it in version control (for example, Git and SVN) just like you do your code, along with the benefits that it brings:

All changes to infrastructure are recorded in commit history
You can review changes before accepting/merging them
You can easily compare different configurations
You can pick and use specific point-in-time configurations

Consistency

Consistent configuration across your environments (for example, dev, test, and prod) means that you can more confidently deploy your infrastructure. When you know what configuration is in use, you can easily test changes in other environments due to a common baseline.

IaC is not the same as just writing scripts for your infrastructure. Most tools and services will leverage higher-order languages and DSLs to allow you to focus on your higher-level requirements. It enables you to use advanced software development techniques, such as static analysis, automated testing, and optimization.

Troubleshooting

IaC makes replicating and troubleshooting issues easier: since you can duplicate your environments, you can accurately reproduce your production environment for testing purposes. In the past, test environments rarely had exactly the same infrastructure due to the prohibitive cost of hardware. Now that it can be created and destroyed on demand, you are able to duplicate your environments only when they are needed. You only need to pay for the time that they are running for, usually down to the hour. Once you have finished testing, simply turn your environments off and stop paying for them.

Even better than troubleshooting is fixing issues before they cause errors. As you refine your IaC in multiple environments, you will gain confidence that is difficult to obtain without it.
By the time you deploy your infrastructure into production, you have done it multiple times already.

Scale

Configuring infrastructure by hand can be a tedious and error-prone process. By automating it, you remove the potential variability of a manual implementation: computers are good at boring, repetitive tasks, so use them for it! Once automated, the labor cost of provisioning more resources is effectively zero—you have already done the work. Whether you need to spin up one server or a thousand, it requires no additional work. From a practical perspective, resources in AWS are effectively unconstrained. If you are willing to pay for it, AWS will let you use it.

Costs

AWS has a vested (commercial) interest in making it as easy as possible for you to provision infrastructure. The benefit to you as the customer is that you can create and destroy these resources on demand. Obviously, destroying infrastructure on demand in a traditional, physical hardware environment is simply not possible. You would be hard-pressed to find a data center that will allow you to stop paying for servers and space simply because you are not currently using them.

Another use case where on-demand infrastructure can deliver large cost savings is your development environment. It only makes sense to have a development environment while you have developers to use it. When your developers go home at the end of the day, you can switch off your development environments so that you no longer pay for them. Before your developers come in in the morning, simply schedule their environments to be created.

DevOps

DevOps and IaC go hand in hand. The practice of storing your infrastructure (traditionally the concern of operations) as code (traditionally the concern of development) encourages a sharing of responsibilities that facilitates collaboration.

Image courtesy: Wikipedia

By automating the PACKAGE, RELEASE, and CONFIGURE activities in the software development life cycle (as pictured), you increase the speed of your releases while also increasing the confidence.

Cloud-based IaC encourages architecture for failure: as your resources are virtualized, you must plan for the chance of physical (host) hardware failure, however unlikely. Being able to recreate your entire environment in minutes is the ultimate recovery solution. Unlike physical hardware, you can easily simulate and test failure in your software architecture by deleting key components—they are all virtual anyway!

Server configuration

Server-side examples of IaC are configuration-management tools such as Ansible, Chef, and Puppet. While important, these configuration-management tools are not specific to AWS, so we will not be covering them in detail here.

IaC on AWS

CloudFormation is the IaC service from AWS. Templates written in a specific format and language define the AWS resources that should be provisioned. CloudFormation is declarative and can not only provision resources, but also update them. We will go into CloudFormation in greater detail in the following topic.

CloudFormation

We'll use CloudFormation extensively throughout, so it's important that you have an understanding of what it is and how it fits into the AWS ecosystem. There should easily be enough information here to get you started, but where necessary, we'll refer you to AWS' own documentation.

What is CloudFormation?

The CloudFormation service allows you to provision and manage a collection of AWS resources in an automated and repeatable fashion.
In AWS terminology, these collections are referred to as stacks. Note however that a stack can be as large or as small as you like. It might consist of a single S3 bucket, or it might contain everything needed to host your three-tier web app.

In this article, we'll show you how to define the resources to be included in your CloudFormation stack. We'll talk a bit more about the composition of these stacks and why and when it's preferable to divvy up resources between a number of stacks. Finally, we'll share a few of the tips and tricks we've learned over years of building countless CloudFormation stacks. Be warned: pretty much everyone incurs at least one or two flesh wounds along their journey with CloudFormation. It is all very much worth it, though.

Why is CloudFormation important?

By now, the benefits of automation should be starting to become apparent to you. But don't fall into the trap of thinking CloudFormation will be useful only for large collections of resources. Even performing the simplest task of, say, creating an S3 bucket can get very repetitive if you need to do it in every region.

We work with a lot of customers who have very tight controls and governance around their infrastructure, especially in the finance sector, and especially in the network layer (think VPCs, NACLs, and security groups). Being able to express one's cloud footprint in YAML (or JSON), store it in a source code repository, and funnel it through a high-visibility pipeline gives these customers confidence that their infrastructure changes are peer-reviewed and will work as expected in production. Discipline and commitment to IaC SDLC practices are of course a big factor in this, but CloudFormation helps bring us out of the era of following 20-page run-sheets for manual changes, navigating untracked or unexplained configuration drift, and unexpected downtime caused by fat fingers.

The layer cake

Now is a good time to start thinking about your AWS deployments in terms of layers. Your layers will sit atop one another, and you will have well-defined relationships between them. Here's a bottom-up example of how your layer cake might look:

VPC
Subnets, routes, and NACLs
NAT gateways, VPN or bastion hosts, and associated security groups
App stack 1: security groups, S3 buckets
App stack 1: cross-zone RDS and read replica
App stack 1: app and web server auto scaling groups and ELBs
App stack 1: CloudFront and WAF config

In this example, you may have many occurrences of the app stack layers inside your VPC, assuming you have enough IP addresses in your subnets! This is often the case with VPCs living inside development environments. So immediately, you have the benefit of multi-tenancy capability with application isolation.

One advantage of this approach is that while you are developing your CloudFormation template, if you mess up the configuration of your app server, you don't have to wind back all the work CFN did on your behalf. You can just turf that particular layer (and the layers that depend on it) and restart from there. This is not the case if you have everything contained in a single template.

We commonly work with customers for whom ownership and management of each layer in the cake reflects the structure of the technological divisions within a company. The traditional infrastructure, network, and cyber security folk are often really interested in creating a safe place for digital teams to deploy their apps, so they like to heavily govern the foundational layers of the cake.
Conway's Law, coined by Melvin Conway, starts to come into play here: "Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization's communication structure." Finally, even if you are a single-person infrastructure coder working in a small team, you will benefit from this approach. For example, you'll find that it dramatically reduces your exposure to things such as AWS limits, timeouts, and circular dependencies.
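To make the stack idea concrete, here is a minimal sketch (not from the book) that creates a one-resource CloudFormation stack from Python with boto3. The stack name and the logical resource name are illustrative assumptions, and using boto3 rather than the AWS CLI is simply a convenient choice; it requires working AWS credentials with permission to create stacks and S3 buckets.

# Illustrative sketch: create a tiny CloudFormation stack containing one S3 bucket.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal example stack with a single S3 bucket",
    "Resources": {
        "ExampleBucket": {              # logical resource name (illustrative)
            "Type": "AWS::S3::Bucket"
        }
    },
}

cfn = boto3.client("cloudformation", region_name="us-east-1")  # region choice is arbitrary

# Create the stack and wait until CloudFormation reports completion.
cfn.create_stack(
    StackName="example-layer-cake-demo",   # hypothetical stack name
    TemplateBody=json.dumps(template),
)
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="example-layer-cake-demo")
print("Stack created")

Deleting the stack later (cfn.delete_stack(StackName="example-layer-cake-demo")) removes everything it created, which is exactly the turf-a-layer-and-restart workflow described above.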

Build and train an RNN chatbot using TensorFlow [Tutorial]

Sunith Shetty
28 Jun 2018
21 min read
Chatbots are increasingly used as a way to provide assistance to users. Many companies, including banks, mobile/landline companies, and large e-sellers now use chatbots for customer assistance and for helping users in pre- and post-sales queries. They are a great tool for companies that don't want to dedicate additional customer service capacity to trivial questions: it really looks like a win-win situation!

In today's tutorial, we will understand how to train an automatic chatbot that will be able to answer simple and generic questions, and how to create an endpoint over HTTP for providing the answers via an API. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects.

There are mainly two types of chatbot. The first is a simple one, which tries to understand the topic, always providing the same answer for all questions about the same topic. For example, on a train website, the questions Where can I find the timetable of the City_A to City_B service? and What's the next train departing from City_A? will likely get the same answer, which could read Hi! The timetable on our network is available on this page: <link>. This type of chatbot uses classification algorithms to understand the topic (in the example, both questions are about the timetable topic). Given the topic, it always provides the same answer. Usually, such chatbots have a list of N topics and N answers; also, if the probability of the classified topic is low (the question is too vague, or it's on a topic not included in the list), they usually ask the user to be more specific and repeat the question, eventually pointing out other ways to ask the question (send an email or call the customer service number, for example).

The second type of chatbot is more advanced and smarter, but also more complex. For these, the answers are built using an RNN, in the same way that machine translation is performed. Such chatbots are able to provide more personalized answers, and they may provide a more specific reply. They don't just guess the topic: with an RNN engine, they're able to understand more about the user's question and provide the best possible answer. In fact, it's very unlikely you'll get the same answer with two different questions using these types of chatbots.

The input corpus

Unfortunately, we haven't found any consumer-oriented dataset that is open source and freely available on the Internet. Therefore, we will train the chatbot with a more generic dataset, not really focused on customer service. Specifically, we will use the Cornell Movie Dialogs Corpus, from Cornell University. The corpus contains the collection of conversations extracted from raw movie scripts, therefore the chatbot will be better at answering fictional questions than real-world ones.

The Cornell corpus contains more than 200,000 conversational exchanges between more than 10,000 movie characters, extracted from 617 movies. The dataset is available here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We would like to thank the authors for having released the corpus: that makes experimentation, reproducibility, and knowledge sharing easier.

The dataset comes as a .zip archive file.
After decompressing it, you'll find several files in it:

README.txt contains the description of the dataset, the format of the corpora files, the details on the collection procedure, and the author's contact.

Chameleons.pdf is the original paper for which the corpus has been released. Although the goal of the paper is not strictly about chatbots, it studies the language used in dialogues, and it's a good source of information for understanding more.

movie_conversations.txt contains the structure of all the dialogues. For each conversation, it includes the IDs of the two characters involved in the discussion, the ID of the movie, and the list of sentence IDs (or utterances, to be more precise) in chronological order. For example, the first line of the file is:

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

That means that user u0 had a conversation with user u2 in the movie m0 and the conversation had 4 utterances: 'L194', 'L195', 'L196' and 'L197'.

movie_lines.txt contains the actual text of each utterance ID and the person who produced it. For example, the utterance L195 is listed here as:

L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you.

So, the text of the utterance L195 is Well, I thought we'd start with pronunciation, if that's okay with you. And it was pronounced by the character u2 whose name is CAMERON in the movie m0.

movie_titles_metadata.txt contains information about the movies, including the title, year, IMDB rating, the number of votes in IMDB, and the genres. For example, the movie m0 here is described as:

m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']

So, the title of the movie whose ID is m0 is 10 things i hate about you, it's from 1999, it's a comedy with romance, and it received almost 63 thousand votes on IMDB with an average score of 6.9 (over 10.0).

movie_characters_metadata.txt contains information about the movie characters, including the name, the title of the movie where he/she appears, the gender (if known), and the position in the credits (if known). For example, the character u2 appears in this file with this description:

u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3

The character u2 is named CAMERON, he appears in the movie m0 whose title is 10 things i hate about you, his gender is male, and he's the third person appearing in the credits.

raw_script_urls.txt contains the source URL where the dialogues of each movie can be retrieved. For example, for the movie m0 that's:

m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html

As you will have noticed, most files use the token +++$+++ to separate the fields. Beyond that, the format looks pretty straightforward to parse. Please take particular care while parsing the files: their format is not UTF-8 but ISO-8859-1.

Creating the training dataset

Let's now create the training set for the chatbot. We'd need all the conversations between the characters in the correct order: fortunately, the corpora contain more than what we actually need. For creating the dataset, we will start by downloading the zip archive, if it's not already on disk.
We'll then decompress the archive in a temporary folder (if you're using Windows, that should be C:\Temp), and we will read just the movie_lines.txt and the movie_conversations.txt files, the ones we really need to create a dataset of consecutive utterances.

Let's now go step by step, creating multiple functions, one for each step, in the file corpora_downloader.py. The first function we need is to retrieve the file from the Internet, if not available on disk.

def download_and_decompress(url, storage_path, storage_dir):
    import os.path
    directory = storage_path + "/" + storage_dir
    zip_file = directory + ".zip"
    a_file = directory + "/cornell movie-dialogs corpus/README.txt"
    if not os.path.isfile(a_file):
        import urllib.request
        import zipfile
        urllib.request.urlretrieve(url, zip_file)
        with zipfile.ZipFile(zip_file, "r") as zfh:
            zfh.extractall(directory)
    return

This function does exactly that: it checks whether the README.txt file is available locally; if not, it downloads the file (thanks to the urlretrieve function in the urllib.request module) and it decompresses the zip (using the zipfile module).

The next step is to read the conversation file and extract the list of utterance IDs. As a reminder, its format is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'], therefore what we're looking for is the fourth element of the list after we split it on the token +++$+++ . Also, we'd need to clean up the square brackets and the apostrophes to have a clean list of IDs. For doing that, we shall import the re module, and the function will look like this.

import re

def read_conversations(storage_path, storage_dir):
    filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_conversations.txt"
    with open(filename, "r", encoding="ISO-8859-1") as fh:
        conversations_chunks = [line.split(" +++$+++ ") for line in fh]
    return [re.sub('[\[\]\']', '', el[3].strip()).split(", ") for el in conversations_chunks]

As previously said, remember to read the file with the right encoding, otherwise you'll get an error. The output of this function is a list of lists, each of them containing the sequence of utterance IDs in a conversation between characters.

The next step is to read and parse the movie_lines.txt file, to extract the actual utterance texts. As a reminder, the file looks like this line:

L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you.

Here, what we're looking for are the first and the last chunks.

def read_lines(storage_path, storage_dir):
    filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_lines.txt"
    with open(filename, "r", encoding="ISO-8859-1") as fh:
        lines_chunks = [line.split(" +++$+++ ") for line in fh]
    return {line[0]: line[-1].strip() for line in lines_chunks}

The very last bit is about tokenization and alignment. We'd like to have a set whose observations have two sequential utterances. In this way, we will train the chatbot, given the first utterance, to provide the next one. Hopefully, this will lead to a smart chatbot, able to reply to multiple questions. Here's the function:

def get_tokenized_sequencial_sentences(list_of_lines, line_text):
    for line in list_of_lines:
        for i in range(len(line) - 1):
            yield (line_text[line[i]].split(" "), line_text[line[i+1]].split(" "))

Its output is a generator containing a tuple of the two utterances (the one on the right temporally follows the one on the left). Also, utterances are tokenized on the space character.
Finally, we can wrap up everything into a function, which downloads the file and unzips it (if not cached), parses the conversations and the lines, and formats the dataset as a generator. As a default, we will store the files in the /tmp directory:

def retrieve_cornell_corpora(storage_path="/tmp", storage_dir="cornell_movie_dialogs_corpus"):
    download_and_decompress("http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip",
                            storage_path, storage_dir)
    conversations = read_conversations(storage_path, storage_dir)
    lines = read_lines(storage_path, storage_dir)
    return tuple(zip(*list(get_tokenized_sequencial_sentences(conversations, lines))))

At this point, our training set looks very similar to the training set used in the translation project. We can, therefore, use some pieces of code we've developed in the machine learning translation article. For example, the corpora_tools.py file can be used here without any change (also, it requires data_utils.py).

Given that file, we can dig more into the corpora, with a script to check the chatbot input. To inspect the corpora, we can use corpora_tools.py and the file we've previously created. Let's retrieve the Cornell Movie Dialogs Corpus, format the corpora, and print an example and its length:

from corpora_tools import *
from corpora_downloader import retrieve_cornell_corpora

sen_l1, sen_l2 = retrieve_cornell_corpora()
print("# Two consecutive sentences in a conversation")
print("Q:", sen_l1[0])
print("A:", sen_l2[0])
print("# Corpora length (i.e. number of sentences)")
print(len(sen_l1))
assert len(sen_l1) == len(sen_l2)

This code prints an example of two tokenized consecutive utterances, and the number of examples in the dataset, which is more than 220,000:

# Two consecutive sentences in a conversation
Q: ['Can', 'we', 'make', 'this', 'quick?', '', 'Roxanne', 'Korrine', 'and', 'Andrew', 'Barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad.', '', 'Again.']
A: ['Well,', 'I', 'thought', "we'd", 'start', 'with', 'pronunciation,', 'if', "that's", 'okay', 'with', 'you.']
# Corpora length (i.e. number of sentences)
221616

Let's now clean the punctuation in the sentences, lowercase them, and limit their size to 20 words maximum (that is, examples where at least one of the sentences is longer than 20 words are discarded). This is needed to standardize the tokens:

clean_sen_l1 = [clean_sentence(s) for s in sen_l1]
clean_sen_l2 = [clean_sentence(s) for s in sen_l2]
filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2)
print("# Filtered Corpora length (i.e. number of sentences)")
print(len(filt_clean_sen_l1))
assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2)

This leads us to almost 140,000 examples:

# Filtered Corpora length (i.e. number of sentences)
140261

Then, let's create the dictionaries for the two sets of sentences. Practically, they should look the same (since the same sentence appears once on the left side and once on the right side) except there might be some changes introduced by the first and last sentences of a conversation (they appear only once).
To make the best out of our corpora, let's build two dictionaries of words and then encode all the words in the corpora with their dictionary indexes:

dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p")
dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=15000, storage_path="/tmp/l2_dict.p")
idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1)
idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2)
print("# Same sentences as before, with their dictionary ID")
print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0])))
print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0])))

That prints the following output. We also notice that a dictionary of 15 thousand entries doesn't contain all the words, and more than 16 thousand (less popular) words don't fit into it:

[sentences_to_indexes] Did not find 16823 words
[sentences_to_indexes] Did not find 16649 words
# Same sentences as before, with their dictionary ID
Q: [('well', 68), (',', 8), ('i', 9), ('thought', 141), ('we', 23), ("'", 5), ('d', 83), ('start', 370), ('with', 46), ('pronunciation', 3), (',', 8), ('if', 78), ('that', 18), ("'", 5), ('s', 12), ('okay', 92), ('with', 46), ('you', 7), ('.', 4)]
A: [('not', 31), ('the', 10), ('hacking', 7309), ('and', 23), ('gagging', 8761), ('and', 23), ('spitting', 6354), ('part', 437), ('.', 4), ('please', 145), ('.', 4)]

As the final step, let's add paddings and markings to the sentences:

data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2)
print("# Prepared minibatch with paddings and extra stuff")
print("Q:", data_set[0][0])
print("A:", data_set[0][1])
print("# The sentence pass from X to Y tokens")
print("Q:", len(idx_sentences_l1[0]), "->", len(data_set[0][0]))
print("A:", len(idx_sentences_l2[0]), "->", len(data_set[0][1]))

And that, as expected, prints:

# Prepared minibatch with paddings and extra stuff
Q: [0, 68, 8, 9, 141, 23, 5, 83, 370, 46, 3, 8, 78, 18, 5, 12, 92, 46, 7, 4]
A: [1, 31, 10, 7309, 23, 8761, 23, 6354, 437, 4, 145, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# The sentence pass from X to Y tokens
Q: 19 -> 20
A: 11 -> 22

Training the chatbot

After we're done with the corpora, it's now time to work on the model. This project again requires a sequence-to-sequence model, therefore we can use an RNN. Even more, we can reuse part of the code from the previous project: we'd just need to change how the dataset is built, and the parameters of the model. We can then copy the training script and modify the build_dataset function to use the Cornell dataset.

Mind that the dataset used in this article is bigger than the one used in the machine learning translation article, therefore you may need to limit the corpora to a few dozen thousand lines. On a four-year-old laptop with 8 GB RAM, we had to select only the first 30 thousand lines, otherwise the program ran out of memory and kept swapping. As a side effect of having fewer examples, even the dictionaries are smaller, resulting in fewer than 10 thousand words each.
def build_dataset(use_stored_dictionary=False):
    sen_l1, sen_l2 = retrieve_cornell_corpora()
    clean_sen_l1 = [clean_sentence(s) for s in sen_l1][:30000]  ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP
    clean_sen_l2 = [clean_sentence(s) for s in sen_l2][:30000]  ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP
    filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2, max_len=10)
    if not use_stored_dictionary:
        dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=10000, storage_path=path_l1_dict)
        dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict)
    else:
        dict_l1 = pickle.load(open(path_l1_dict, "rb"))
        dict_l2 = pickle.load(open(path_l2_dict, "rb"))
    dict_l1_length = len(dict_l1)
    dict_l2_length = len(dict_l2)
    idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1)
    idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2)
    max_length_l1 = extract_max_length(idx_sentences_l1)
    max_length_l2 = extract_max_length(idx_sentences_l2)
    data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2)
    return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length)

By inserting this function into the train_translator.py file and renaming the file train_chatbot.py, we can run the training of the chatbot.

After a few iterations, you can stop the program and you'll see something similar to this output:

[sentences_to_indexes] Did not find 0 words
[sentences_to_indexes] Did not find 0 words
global step 100 learning rate 1.0 step-time 7.708967611789704 perplexity 444.90090078460474
eval: perplexity 57.442316329639176
global step 200 learning rate 0.990234375 step-time 7.700247814655302 perplexity 48.8545568311572
eval: perplexity 42.190180314697045
global step 300 learning rate 0.98046875 step-time 7.69800933599472 perplexity 41.620538109894945
eval: perplexity 31.291903031786116
...
...
...
global step 2400 learning rate 0.79833984375 step-time 7.686293318271639 perplexity 3.7086356605442767
eval: perplexity 2.8348589631663046
global step 2500 learning rate 0.79052734375 step-time 7.689657487869262 perplexity 3.211876894960698
eval: perplexity 2.973809378544393
global step 2600 learning rate 0.78271484375 step-time 7.690396382808681 perplexity 2.878854805600354
eval: perplexity 2.563583924617356

Again, if you change the settings, you may end up with a different perplexity. To obtain these results, we set the RNN size to 256 with 2 layers, the batch size to 128 samples, and the learning rate to 1.0.

At this point, the chatbot is ready to be tested. Although you can test the chatbot with the same code as in test_translator.py, here we would like to do a more elaborate solution, which allows exposing the chatbot as a service with APIs.

Chatbot API

First of all, we need a web framework to expose the API. In this project, we've chosen Bottle, a lightweight and simple framework that is very easy to use. To install the package, run pip install bottle from the command line. To gather further information and dig into the code, take a look at the project webpage, https://bottlepy.org.

Let's now create a function to parse an arbitrary sentence provided by the user as an argument. All the following code should live in the test_chatbot_aas.py file.
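If you have never used Bottle before, a minimal standalone sketch (not part of the project files; the route path and port are arbitrary choices) shows the route/run pattern that the chatbot endpoint below follows:

# Minimal Bottle sketch, separate from the chatbot code.
from bottle import route, run, request

# Each @route maps a URL path to a plain Python function.
@route('/hello')
def hello():
    # Query-string parameters are available on request.query.
    name = request.query.name or "world"
    return {"greeting": "Hello, " + name}   # Bottle serializes dicts to JSON

run(host='127.0.0.1', port=8081)

Running it and visiting http://127.0.0.1:8081/hello?name=chatbot returns a small JSON document, which is the same pattern the api() function below uses once the decoder is plugged in.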
Let's start with some imports and the function to clean, tokenize, and prepare the sentence using the dictionary:

import pickle
import sys
import numpy as np
import tensorflow as tf
import data_utils
from corpora_tools import clean_sentence, sentences_to_indexes, prepare_sentences
from train_chatbot import get_seq2seq_model, path_l1_dict, path_l2_dict

model_dir = "/home/abc/chat/chatbot_model"

def prepare_sentence(sentence, dict_l1, max_length):
    sents = [sentence.split(" ")]
    clean_sen_l1 = [clean_sentence(s) for s in sents]
    idx_sentences_l1 = sentences_to_indexes(clean_sen_l1, dict_l1)
    data_set = prepare_sentences(idx_sentences_l1, [[]], max_length, max_length)
    sentences = (clean_sen_l1, [[]])
    return sentences, data_set

The function prepare_sentence does the following:

Tokenizes the input sentence
Cleans it (lowercase and punctuation cleanup)
Converts tokens to dictionary IDs
Adds markers and paddings to reach the default length

Next, we will need a function to convert the predicted sequence of numbers to an actual sentence composed of words. This is done by the function decode, which runs the prediction given the input sentence and, with softmax, predicts the most likely output. Finally, it returns the sentence without paddings and markers:

def decode(data_set):
    with tf.Session() as sess:
        model = get_seq2seq_model(sess, True, dict_lengths, max_sentence_lengths, model_dir)
        model.batch_size = 1
        bucket = 0
        encoder_inputs, decoder_inputs, target_weights = model.get_batch(
            {bucket: [(data_set[0][0], [])]}, bucket)
        _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                         target_weights, bucket, True)
        outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
        if data_utils.EOS_ID in outputs:
            outputs = outputs[1:outputs.index(data_utils.EOS_ID)]
        tf.reset_default_graph()
    return " ".join([tf.compat.as_str(inv_dict_l2[output]) for output in outputs])

Finally, the main function, that is, the function to run in the script:

if __name__ == "__main__":
    dict_l1 = pickle.load(open(path_l1_dict, "rb"))
    dict_l1_length = len(dict_l1)
    dict_l2 = pickle.load(open(path_l2_dict, "rb"))
    dict_l2_length = len(dict_l2)
    inv_dict_l2 = {v: k for k, v in dict_l2.items()}
    max_lengths = 10
    dict_lengths = (dict_l1_length, dict_l2_length)
    max_sentence_lengths = (max_lengths, max_lengths)

    from bottle import route, run, request

    @route('/api')
    def api():
        in_sentence = request.query.sentence
        _, data_set = prepare_sentence(in_sentence, dict_l1, max_lengths)
        resp = [{"in": in_sentence, "out": decode(data_set)}]
        return dict(data=resp)

    run(host='127.0.0.1', port=8080, reloader=True, debug=True)

Initially, it loads the dictionary and prepares the inverse dictionary. Then, it uses the Bottle API to create an HTTP GET endpoint (under the /api URL). The route decorator sets and enriches the function to run when the endpoint is contacted via HTTP GET. In this case, the api() function is run, which first reads the sentence passed as an HTTP parameter, then calls the prepare_sentence function, described above, and finally runs the decoding step. What's returned is a dictionary containing both the input sentence provided by the user and the reply of the chatbot.

Finally, the webserver is turned on, on the localhost at port 8080. Isn't it easy to have a chatbot as a service with Bottle? It's now time to run it and check the outputs.
To run it, run from the command line:

$> python3 -u test_chatbot_aas.py

Then, let's start querying the chatbot with some generic questions. To do so, we can use curl, a simple command-line tool; any browser is also ok, just remember that the URL should be encoded, for example, the space character should be replaced with its encoding, that is, %20. Curl makes things easier, having a simple way to encode the URL request. Here are a couple of examples:

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?"
{"data": [{"out": "i ' m here with you .", "in": "where are you?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you here?"
{"data": [{"out": "yes .", "in": "are you here?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you a chatbot?"
{"data": [{"out": "you ' for the stuff to be right .", "in": "are you a chatbot?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=what is your name ?"
{"data": [{"out": "we don ' t know .", "in": "what is your name ?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?"
{"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

If the system doesn't work with your browser, try encoding the URL, for example:

$> curl -X GET http://127.0.0.1:8080/api?sentence=how%20are%20you?
{"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

Replies are quite funny; always remember that we trained the chatbot on movies, therefore the style of the replies follows that of the corpus. To turn off the webserver, use Ctrl + C.

To summarize, we've learned to implement a chatbot which is able to respond to questions through an HTTP endpoint and a GET API. To learn more about how to design deep learning systems for a variety of real-world scenarios using TensorFlow, do check out the book TensorFlow Deep Learning Projects.

Facebook's Wit.ai: Why we need yet another chatbot development framework?
How to build a chatbot with Microsoft Bot framework
Top 4 chatbot development frameworks for developers

The seven deadly sins of web design

Guest Contributor
13 Mar 2019
7 min read
Just 30 days before the debut of "Captain Marvel," the latest cinematic offering by the successful and prolific Marvel Studios, a delightful and nostalgia-filled website was unveiled to promote the movie. Since the story of "Captain Marvel" is set in the 1990s, the brilliant minds at the marketing department of Marvel Studios decided to design a website with the right look and feel, which in this case meant using FrontPage and hosting on Angelfire.

The "Captain Marvel" promo website is filled with the typography, iconography, glitter, and crudely animated GIFs you would expect from a 1990s creation, including a guestbook, hidden easter eggs, flaming borders, a hit counter, and even headers made with Microsoft WordArt.

(Image courtesy of Marvel)

The site is delightful not just for the dead-on nostalgia trip it provides to visitors, but also because it is very well developed. This is a site with a lot to explore, and it is clearly evident that the website developers met client demands while at the same time thinking about users. This site may look and feel like it was made during the GeoCities era, but it does not make any of the following seven mistakes:

Sin #1: Non-Responsiveness

In 2019, it is simply inconceivable to think of a web development firm that neglects to make a responsive site. Since 2016, internet traffic flowing through mobile devices has been higher than the traffic originating from desktops and laptops. Current rates are about 53 percent smartphones and tablets versus 47 percent desktops, laptops, kiosks, and smart TVs. Failure to develop responsive websites means potentially alienating more than 50 percent of prospective visitors.

As for the "Captain Marvel" website, it is amazingly responsive when considering that internet users in the 1990s barely dreamed about the day when they would be able to access the web from handheld devices (mobile phones were yet to be mass distributed back then).

Sin #2: Way too much Jargon

(Image courtesy of the Botanical Linguist)

Not all website developers have a good sense of readability, and this often shows up when completed projects leave visitors struggling to comprehend what they are reading. We're talking about jargon. There's a lot of it online, not only in the usual places like the privacy policy and terms of service sections but sometimes in content too.

Regardless of how jargon creeps onto your website, it should be rooted out. The "Captain Marvel" website features legal notices written by The Walt Disney Company, and they are very reader-friendly with minimal jargon. The best way to handle jargon is to avoid it as much as possible unless the business developer has good reasons to include it.

Sin #3: A noticeable lack of content

No content means no message, and this is the reason 46 percent of visitors who land on B2B websites end up leaving without further exploration or interaction. Quality content that is relevant to the intention of a website is crucial in terms of establishing credibility, and this goes beyond B2B websites. In the case of "Captain Marvel," the amount of content is reduced to match the retro sensibility, but there are enough photos, film trailers, character bios, and games to keep visitors entertained.

Modern website development firms that provide full-service solutions can either provide or advise clients on the content they need to get started. Furthermore, they can also offer lessons on how to operate content management systems.
Sin #4: Making essential information hard to find

There was a time when the "mystery meat navigation" issue of website development was thought to have been eradicated through the judicious application of recommended practices, but then mobile apps came around. Even technology giant Google fell victim to mystery meat navigation with its 2016 release of Material Design, which introduced bottom navigation bars intended to offer a more clarifying alternative to hamburger menus.

Unless there is a clever purpose for prompting visitors to click or tap on a button, link, or page element that does not explain next steps, mystery meat navigation should be avoided, particularly when it comes to essential information. When the 1990s "Captain Marvel" page loads, visitors can click or tap on labeled links to get information about the film, enjoy multimedia content, play games, interact with the guestbook, or get tickets. There is a mysterious old woman that pops up every now and then from the edges of the screen, but the reason behind this mysterious element is explained in the information section.

Sin #5: Website loads too slow

(Image courtesy of Horton Marketing Solutions)

There is an anachronism related to the "Captain Marvel" website that users who actually used Netscape in the 1990s will notice: all pages load very fast. This is one retro aspect that Marvel Studios decided not to include on this site, and it makes perfect sense.

For a fast-loading site, a web design rule of thumb is to simplify, and this responsibility lies squarely with the developer. It stands to reason that the more "stuff" you have on a page (images, forms, videos, widgets, shiny things), the longer it takes the server to send over the site files and the longer it takes the browser to render them. Here are a few design best practices to keep in mind:

1. Make the site light - get rid of non-essential elements, especially if they are bandwidth-sucking images or video.
2. Compress your pages - it's easy with Gzip.
3. Split long pages into several shorter ones.
4. Write clean code that doesn't rely on external sources.
5. Optimize images.

For more web design tips that help your site load in the sub-three-second range, like Google expects in 2019, check out our article on current design trends.

Once you have design issues under control, investigate your web host. They aren't all created equal. Cheap, entry-level shared packages are notoriously slow and unpredictable, especially as your traffic increases. But even beyond that, the reality is that some companies spend money buying better, faster servers and don't overload them with too many clients. Some do. Recent testing from review site HostingCanada.org checked load times across the leading providers and found variances from a 'meh' 2,850 ms all the way down to a speedy 226 ms. With pricing amongst credible competitors roughly equal, web developers should know which hosts are the fastest and point clients in that direction.

Sin #6: Outdated information

Functional and accurate information will always triumph over form. The "Captain Marvel" website is garish to look at by 2019 standards, but all the information is current. The film's theater release date is clearly displayed, and should something happen that would require this date to change, you can be sure that Marvel Studios will fire up FrontPage to promptly make the adjustment.

Sin #7: No clear call to action

Every website should compel visitors to do something.
Even if the purpose is to provide information, the call-to-action (CTA) should encourage visitors to remember it and return for updates. The CTA should be as clear as the navigation elements; otherwise, the purpose of the visit is lost. Creating enticements is acceptable, but the CTA message should be explained nonetheless. In the case of "Captain Marvel," visitors can click on the "Get Tickets" link to be taken to a Fandango.com page with geolocation redirection for their region.

The Bottom Line

In the end, the seven mistakes listed herein are easy to avoid. Whenever developers run into clients whose instructions may result in one of these mistakes, proper explanations should be given.

Author Bio

Gary Stevens is a front-end developer. He's a full-time blockchain geek and a volunteer working for the Ethereum foundation as well as an active Github contributor.

7 Web design trends and predictions for 2019
How to create a web designer resume that lands you a Job
Will Grant's 10 commandments for effective UX Design

How To Get Started with Redux in React Native

Emilio Rodriguez
04 Apr 2016
5 min read
In mobile development there is a need for architectural frameworks, but complex frameworks designed to be used in web environments may end up damaging the development process or even the performance of our app. Because of this, some time ago I decided to introduce in all of my React Native projects the leanest framework I ever worked with: Redux.

Redux is basically a state container for JavaScript apps. It is 100 percent library-agnostic, so you can use it with React, Backbone, or any other view library. Moreover, it is really small and has no dependencies, which makes it an awesome tool for React Native projects.

Step 1: Install Redux in your React Native project.

Redux can be added as an npm dependency into your project. Just navigate to your project's main folder and type:

npm install --save react-redux

By the time this article was written, React Native still depended on React Redux 3.1.0, since later versions depended on React 0.14, which is not 100 percent compatible with React Native. Because of this, you will need to force version 3.1.0 as the one to be dependent on in your project.

Step 2: Set up a Redux-friendly folder structure.

Of course, setting up the folder structure for your project is totally up to every developer, but you need to take into account that you will need to maintain a number of actions, reducers, and components. Besides, it's also useful to keep a separate folder for your API and utility functions so these won't be mixing with your app's core functionality. Having this in mind, this is my preferred folder structure under the src folder in any React Native project:

Step 3: Create your first action.

In this article we will be implementing a simple login functionality to illustrate how to integrate Redux inside React Native. A good point to start this implementation is the action, a basic function called from the component whenever we want the whole state of the app to be changed (i.e. changing from the logged-out state into the logged-in state). To keep this example as concise as possible we won't be doing any API calls to a backend – only the pure Redux integration will be explained.

Our action creator is a simple function returning an object (the action itself) with a type attribute expressing what happened with the app. No business logic should be placed here; our action creators should be really plain and descriptive.

Step 4: Create your first reducer.

Reducers are the ones in charge of updating the state of the app. Unlike in Flux, Redux only has one store for the whole app, but it will be conveniently name-spaced automatically by Redux once the reducers have been applied.

In our example, the user reducer needs to be aware of when the user is logged in. Because of that, it needs to import the LOGIN_SUCCESS constant we defined in our actions before and export a default function, which will be called by Redux every time an action occurs in the app. Redux will automatically pass the current state of the app and the action that occurred. It's up to the reducer to realize if it needs to modify the state or not based on the action.type. That's why almost every time our reducer will be a function containing a switch statement, which modifies and returns the state based on what action occurred.

It's important to state that Redux works with object references to identify when the state is changed. Because of this, the state should be cloned before any modification.
It's also interesting to know that the action passed to the reducers can contain other attributes apart from type. For example, when doing a more complex login, the user's first name and last name can be added to the action by the action creator and used by the reducer to update the state of the app.

Step 5: Create your component.

This step is almost pure React Native coding. We need a component to trigger the action and to respond to the change of state in the app. In our case it will be a simple View containing a button that disappears when logged in. This is a normal React Native component except for some pieces of the Redux boilerplate:

The three import lines at the top will require everything we need from Redux.
'mapStateToProps' and 'mapDispatchToProps' are two functions bound with 'connect' to the component: this makes Redux know that this component needs to be passed a piece of the state (everything under 'userReducers') and all the actions available in the app.

Just by doing this, we will have access to the login action (as it is used in onLoginButtonPress) and to the state of the app (as it is used in the !this.props.user.loggedIn statement).

Step 6: Glue it all from your index.ios.js.

For Redux to apply its magic, some initialization should be done in the main file of your React Native project (index.ios.js). This is pure boilerplate and only done once: Redux needs to inject a store holding the app state into the app. To do so, it requires a 'Provider' wrapping the whole app. This store is basically a combination of reducers. For this article we only need one reducer, but a full app will include many others, and each of them should be passed into the combineReducers function to be taken into account by Redux whenever an action is triggered.

About the Author

Emilio Rodriguez started working as a software engineer for Sun Microsystems in 2006. Since then, he has focused his efforts on building a number of mobile apps with React Native while contributing to the React Native project. These contributions helped him understand how deep and powerful this framework is.

5 developers explain why they use Visual Studio Code [Sponsored by Microsoft]

Richard Gall
22 May 2019
7 min read
Visual Studio Code has quickly become one of the most popular text editors on the planet. While debate will continue to rage about the relative merits of every text editor, it's nevertheless true that Visual Studio Code is unique in that it is incredibly customizable: it can be as lightweight as a text editor or as feature-rich as an IDE.

This post is part of a series brought to you in conjunction with Microsoft. Download Learning Node.js Development for free from Microsoft here. Try Visual Studio Code yourself. Learn more here.

This means the range of developers using Visual Studio Code is incredibly diverse. Each one faces a unique set of challenges alongside their personal preferences. I spoke to a few of them about why they use Visual Studio Code and how they make it work for them.

"Visual Studio Code is streamlined and flexible"

Ben Sibley is the Founder of Complete Themes. He likes Visual Studio Code because it is relatively lightweight while also offering considerable flexibility.

"I love how streamlined and flexible Visual Studio Code is. Personally, I don't need a ton of functionality from my IDE, so I appreciate how simple the default configuration is. There's a very concise set of features built-in like the Git integration.

"I was using PHPStorm previously and while it was really feature-rich, it was also overwhelming at times. VSC is faster, lighter, and with the extension market you can pick and choose which additional tools you need. And it's a popular enough editor that you can usually find a reliable and well-reviewed extension."

Read next: How Visual Studio Code can help bridge the gap between full-stack development and DevOps [Sponsored by Microsoft]

"Visual Studio Code is the best in terms of extension ecosystem, language support and configuration"

Libby Horacek is a developer at Position Development. She has worked with several different code editors but struggled to find one that allowed her to effectively move between languages. For Libby, Visual Studio Code offered the right level of flexibility. She also explained how the team at Position Development have used VSC's Live Share feature, which allows developers to directly share and collaborate on code inside their editor.

"I currently use Visual Studio Code. I've tried a LOT of different editors. I'm a polyglot developer, so I need an editor that isn't just for one language. RubyMine is great for Ruby, and PyCharm is good for Python, but I don't want to switch editors every time I switch languages (sometimes multiple times a day). My main constraint is Haskell language support — there are plugins for most IDEs now, but some are better than others.

"For a long time I used Emacs just because I was able to steal a great configuration setup for it from a coworker, but a few months back it stopped working due to updates and I didn't want to acquire the Emacs expertise to fix it. So I tried IntelliJ, Visual Studio, Atom, Sublime Text, even Vim… but in the end I liked Visual Studio the best in terms of extension ecosystem, language support, and ease of use and configuration.

"My team also uses Visual Studio's Live Share for pairing. I haven't tried it personally but it looks like a great option for remote pairing. The only thing my coworkers have cautioned is that they encountered a bug with the "undo" functionality that wiped out most of a file they were working on.
Maybe that bug has been fixed by now, but as always, commit early and commit often!"

"As a JavaScript dev shop, we love that VSC is written in JavaScript"

Cody Swann is the CEO of Gunner Technology, a software development company that builds using JavaScript on AWS for both the public and private sector.

"All our developers here [at Gunner Technology] use VSC.

"We switched from Sublime about two years ago because Sublime started to feel slow and neglected.

"Before that, we used TextMate and abandoned that for the same reasons.

"As a JavaScript dev shop, we love that VSC is written in JavaScript. It makes it easier for us to write in-house extensions and such.

"Additionally, we love that Microsoft releases monthly updates and keeps improving performance."

Read next: Microsoft Build 2019: Microsoft showcases new updates to MS 365 platform with focus on AI and developer productivity

"The Visual Studio Code team pay close attention to the problems developers face"

Ajeet Dhaliwal is a software developer at Tesults. He explains that he has used several different IDEs and editors but came to Visual Studio Code after spending some time using Node.js and React on Brackets.

"I have used Visual Studio Code almost exclusively for the last couple of years.

"In years prior to making this switch, the nature of the development work that I did meant that I was broadly limited to using specific IDEs such as Visual Studio and Xcode. Then in 2014 I started to get into Node.js and was looking for a code editor that would be more suitable. I tried out a few and ultimately settled on Brackets.

"I used Brackets for a while but wasn't always happy with it. The most annoying issue was the way text was rendered on my Mac.

"Over time I started doing React work too, and every time I revisited VSC the improvements were impressive. It seemed to me that the developers were closely paying attention to the problems developers face; they were creating features I had never even thought I would need, and the extensions added highly useful features for Node.js and React dev work. The font rendering was not an issue either, so it became an inevitable switch."

"I have to context switch regularly - I expect my brain to be the slowest element, not the IDE"

Kyle Balnave is Senior Developer and Squad Manager at High Speed Training. Despite working with numerous editors and IDEs, he likes Visual Studio Code because it allows him to move between different contexts incredibly quickly. Put simply, it allows him to work faster than other IDEs do.

"I've used several different editors over the years. They generally fall under two categories:

Monolithic (I can do anything you'll ever want to do out of the box).
Modular (I do the basics but allow extensions to be added to do most of the rest).

"The former are IDEs like Netbeans, IntelliJ and Visual Studio. In my experience they are slow to load and need a more powerful development machine to keep responsive. They have a huge range of functionality, but in everyday development I just need it to be an intelligent code editor.

"The latter are IDEs like Eclipse, Visual Studio Code, Atom. They load quickly, respond fast and have a wide range of extensions that allow me to develop what I need. They sometimes fall short in their functionality, but I generally find this to be infrequent.

"Why do I use VSCode? Because it doesn't slow me down when I code. I have to context switch regularly so I expect my own brain to be the slowest element, not the IDE."
Learn how to develop with Node.js on Azure by downloading Learning Node.js with Azure for free, courtesy of Microsoft.

Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Bhagyashree R
06 Sep 2019
6 min read
Last year, Dr. Johnny Ryan, the Chief Policy & Industry Relations Officer at Brave, filed a complaint against Google’s DoubleClick/Authorized Buyers ad business with the Irish Data Protection Commission (DPC). New evidence produced by Brave reveals that Google is circumventing GDPR and also undermining its own data protection measures. Brave calls Google’s Push Pages a GDPR workaround Brave’s new evidence rebuts some of Google’s claims regarding its DoubleClick/Authorized Buyers system, the world’s largest real-time advertising auction house. Google says that it prohibits companies that use its real-time bidding (RTB) ad system “from joining data they receive from the Cookie Matching Service.” In September last year, Google announced that it has removed encrypted cookie IDs and list names from bid requests with buyers in its Authorized Buyers marketplace. Brave’s research, however, found otherwise, “Brave’s new evidence reveals that Google allowed not only one additional party, but many, to match with Google identifiers. The evidence further reveals that Google allowed multiple parties to match their identifiers for the data subject with each other.” When you visit a website that has Google ads embedded on its web pages, Google will run a real-time bidding ad auction to determine which advertiser will get to display its ads. For this, it uses Push Pages, which is the mechanism in question here. Brave hired Zach Edwards, the co-founder of digital analytics startup Victory Medium, and MetaX, a company that audits data supply chains, to investigate and analyze a log of Dr. Ryan’s web browsing. The research revealed that Google's Push Pages can essentially be used as a workaround for user IDs. Google shares a ‘google_push’ identifier with the participating companies to identify a user. Brave says that the problem here is that the identifier that was shared was common to multiple companies. This means that these companies could have cross-referenced what they learned about the user from Google with each other. Used by more than 8.4 million websites, Google's DoubleClick/Authorized Buyers broadcasts personal data of users to 2000+ companies. This data includes the category of what a user is reading, which can reveal their political views, sexual orientation, religious beliefs, as well as their locations. There are also unique ID codes that are specific to a user that can let companies uniquely identify a user. All this information can give these companies a way to keep tabs on what users are “reading, watching, and listening to online.” Brave calls Google’s RTB data protection policies “weak” as they ask these companies to self-regulate. Google does not have much control over what these companies do with the data once broadcast. “Its policy requires only that the thousands of companies that Google shares peoples’ sensitive data with monitor their own compliance, and judge for themselves what they should do,” Brave wrote. A Google spokesperson, as a response to this news, told Forbes, “We do not serve personalised ads or send bid requests to bidders without user consent. The Irish DPC — as Google's lead DPA — and the UK ICO are already looking into real-time bidding in order to assess its compliance with GDPR. We welcome that work and are co-operating in full." 
Users recommend starting an “information campaign” instead of a penalty that will hardly affect the big tech

This news triggered a discussion on Hacker News where users talked about the implications of RTB and what strict actions the EU can take to protect user privacy. A user explained, "So, let's say you're an online retailer, and you have Google IDs for your customers. You probably have some useful and sensitive customer information, like names, emails, addresses, and purchase histories. In order to better target your ads, you could participate in one of these exchanges, so that you can use the information you receive to suggest products that are as relevant as possible to each customer. To participate, you send all this sensitive information, along with a Google ID, and receive similar information from other retailers, online services, video games, banks, credit card providers, insurers, mortgage brokers, service providers, and more! And now you know what sort of vehicles your customers drive, how much they make, whether they're married, how many kids they have, which websites they browse, etc. So useful! And not only do you get all these juicy private details, but you've also shared your customers' sensitive purchase history with anyone else who is connected to the exchange."

Others said that a penalty is not going to deter Google. "The whole penalty system is quite silly. The fines destroy small companies who are the ones struggling to comply, and do little more than offer extremely gentle pokes on the wrist for megacorps that have relatively unlimited resources available for complete compliance, if they actually wanted to comply."

Users suggested that the EU should instead start an information campaign. "EU should ignore the fines this time and start an "information campaign" regarding behavior of Google and others. I bet that hurts Google 10 times more."

Some also said that not just Google but the RTB participants should also be held responsible. "Because what Google is doing is not dissimilar to how any other RTB participant is acting, saying this is a Google workaround seems disingenuous."

With this case, Brave has launched a full-fledged campaign spanning sixteen EU countries that aims to reform the multi-billion dollar RTB industry. To achieve this goal, it has collaborated with several privacy NGOs and academics, including the Open Rights Group and Dr. Michael Veale of the Turing Institute, among others.

In other news, a Bloomberg report reveals that Google and other internet companies have recently asked for an amendment to the California Consumer Privacy Act, which will be enacted in 2020. The law currently limits how digital advertising companies collect and make money from user data. The proposed amendments include approval for collecting user data for targeted advertising, using the data collected from websites for their own analysis, and many others. Read the Bloomberg report to know more in detail.

Other news in Data

Facebook content moderators work in filthy, stressful conditions and experience emotional trauma daily, reports The Verge
GDPR complaint in EU claim billions of personal data leaked via online advertising bids
European Union fined Google 1.49 billion euros for antitrust violations in online advertising

Detecting & Addressing LLM 'Hallucinations' in Finance

James Bryant, Alok Mukherjee
04 Jan 2024
9 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

This article is an excerpt from the book The Future of Finance with ChatGPT and Power BI, by James Bryant and Alok Mukherjee. Enhance decision-making, transform your market approach, and find investment opportunities by exploring AI, finance, and data visualization with ChatGPT's analytics and Power BI's visuals.

Introduction

LLMs, such as OpenAI's GPT series, can sometimes generate responses that are referred to as "hallucinations." These are instances where the output from the model is factually incorrect, presents information that it could not possibly know (given that it doesn't have access to real-time or personalized data), or is nonsensical or highly improbable.

Let's take a deeper look at what hallucinations are, how to identify them, and what steps can be taken to mitigate their impact, especially in contexts where accurate and reliable information is crucial, such as financial analysis, trading, or visual data presentations.

Understanding hallucinations

Let's look at some examples:

Factual inaccuracies: Suppose an LLM provides information stating that Apple Inc. was founded in 1985. This is a clear factual inaccuracy because Apple was founded in 1976.

Speculative statements: If an LLM were to suggest that "As of 2023, Tesla's share price has hit $3,000," this is a hallucination. The model doesn't know real-time data, and any post-2021 prediction or speculation it makes about specific stock prices is unfounded.

Confident misinformation: For instance, if an LLM confidently states that "Amazon has declared bankruptcy in late 2022," this is a hallucination and can have serious consequences if it's acted upon without verification.

How can we spot hallucinations?

Here are some useful ways to spot hallucinations:

Cross-verification: If an LLM suggests an unusual trading strategy, such as shorting a typically stable blue-chip stock based on some supposed insider information, always cross-verify this advice with other reliable sources or consult a financial advisor.

Questioning the source: If an LLM claims that "our internal data shows a bullish trend for cryptocurrency X," this is likely a hallucination. The model doesn't have access to proprietary internal data.

Time awareness: If the model provides information or trends post-September 2021 without the user explicitly asking for a hypothetical or simulated scenario, consider this a red flag. For example, GPT-4 giving specific "real-time" market cap values for companies in 2023 would be a hallucination.

What can we do about hallucinations?

Here are some ideas:

Promote awareness: If you are developing an AI-assisted trading app that uses an LLM, ensure users are aware of potential hallucinations, perhaps with a disclaimer or notification upon usage.

Implement checks: You might integrate a news API that could help validate major financial events or claims made by the model.

Minimizing hallucinations in the future

There are various ways we can minimize hallucinations. Here are some examples:

Training improvements: Imagine developing a better model that understands context and sticks to the known data more closely, avoiding speculative or incorrect financial statements. Future versions of the model could be specifically trained on financial data, news, and reports to understand the context and semantics of financial trading and investment better.
We could do this to ensure that it understands a short squeeze scenario accurately, or is aware that penny stocks typically come with higher risks.

Better evaluation metrics: For instance, develop a specific metric that calculates the percentage of the model's outputs that were flagged as hallucinations during testing. In the development phase, the models could be evaluated on more focused tasks, such as generating valid trading strategies or predicting the impact of certain macroeconomic events on stock prices. The better the model performs on these tasks, the lower the chance of hallucinations occurring.

Post-processing methods: Develop an algorithm that cross-references model outputs against reliable financial data sources and flags potential inaccuracies. After the model generates a potential trading strategy or investment suggestion, this output could be cross-verified using a rules-based system. For instance, if the model suggests shorting a stock that has consistently performed well without any recent negative news or poor earnings reports, the system might flag this as a potential hallucination.

As an example, you can use libraries such as yfinance or pandas_datareader to access real-time or historical financial data:

!pip install yfinance pandas_datareader

import yfinance as yf

def get_stock_data(ticker, start, end):
    # Download historical price data for the given ticker and date range
    stock = yf.Ticker(ticker)
    data = stock.history(start=start, end=end)
    return data

# Example usage:
data = get_stock_data("AAPL", "2021-01-01", "2023-01-01")

You could also develop a cross-verification algorithm and compare the model's outputs with the collected financial data to flag potential inaccuracies.

Integration with real-time data: While creating Power BI visualizations, data that's been pulled from the LLM could be cross-verified with real-time data from financial databases or APIs. Any discrepancies, such as inconsistent market share percentages or revenue growth rates, could be flagged. This reduces the risk of presenting hallucinated data in visualizations. Let's look at some examples:

Extracting real-time data: You can continue to use yfinance or pandas_datareader to extract real-time data.

Cross-verifying with real-time data: You can compare the model's output with real-time data to identify discrepancies:

def real_time_cross_verify(output, real_time_data):
    # output is a dict with keys 'market_share', 'revenue_growth', and 'ticker'
    # real_time_data holds the same keys, fetched from a trusted real-time source
    # Compare the model's output with real-time data
    if abs(output['market_share'] - real_time_data['market_share']) > 0.05 or \
       abs(output['revenue_growth'] - real_time_data['revenue_growth']) > 0.05:
        return True  # Flagged as a potential hallucination
    return False  # Not flagged

# Example usage:
output = {'market_share': 0.25, 'revenue_growth': 0.08, 'ticker': 'AAPL'}
real_time_data = {'market_share': 0.24, 'revenue_growth': 0.07, 'ticker': 'AAPL'}
flagged = real_time_cross_verify(output, real_time_data)

User feedback loop: A mechanism can be incorporated to allow users to report potential hallucinations. For instance, if a user spots an error in the LLM's output during a Power BI data analysis session, they can report this.
Over time, these reports can be used to further train the model and reduce hallucinations.

OpenAI is on the case

To tackle the chatbot's missteps, OpenAI engineers are working on ways for its AI models to reward themselves for outputting correct data when moving toward an answer, instead of rewarding themselves only at the point of conclusion. The system could lead to better outcomes as it incorporates more of a human-like chain-of-thought procedure, according to the engineers.

These examples should help in illustrating the concept and risks of LLM hallucinations, particularly in high-stakes contexts such as finance. As always, these models should be seen as powerful tools for assistance, but not as a final authority.

Trading examples

Hallucination scenario: Let's assume you've asked an LLM for a prediction on the future performance of a specific stock, let's say Tesla. The LLM might generate a response that appears confident and factual, such as "Based on the latest earnings report, Tesla has declared bankruptcy." If you acted on this hallucinated information, you might rush to sell Tesla shares only to find out that Tesla is not bankrupt at all. This is an example of a potentially disastrous hallucination.

Action: Before making any trading decision based on the LLM's output, always cross-verify the information from a reliable financial news source or the company's official communications.

Power BI visualization examples

Hallucination scenario: Suppose you're using an LLM to generate text descriptions for a Power BI dashboard that tracks the market share of different automakers in the EV market. The LLM might hallucinate and produce a statement such as "Rivian has surpassed Tesla in terms of global EV market share." This statement might be completely inaccurate, as Tesla had a significantly larger market share than Rivian.

Action: When using LLMs to generate text descriptions or insights for your Power BI dashboards, it's crucial to cross-verify any assertions that are made by the model. You can do this by cross-referencing the underlying data in your Power BI dashboard or by referring to reliable external sources of information.

To minimize hallucinations in the future, the model can be fine-tuned with a dataset that's been specifically curated to cover the relevant domain. The use of a structured validation set can help spot and rectify hallucinations during the model training process. Also, employing a robust fact-checking mechanism on the output of the model before acting on its suggestions or insights can help catch and rectify any hallucinations.

Remember, while LLMs can provide valuable insights and suggestions, their output should always be used as one of many inputs in your decision-making process, particularly in high-stakes environments such as financial trading and analysis.

Conclusion

In the dynamic world of financial analysis and data visualization, the presence of LLM 'hallucinations' poses a challenge. Awareness, verification, and ongoing improvement strategies stand as pillars against these inaccuracies. While LLMs offer invaluable support, their outputs must be scrutinized, verified, and used as one among many tools in decision-making.
As we navigate this landscape, vigilance, continuous refinement, and a critical eye will fortify our ability to harness the power of LLMs while mitigating the risks they present in high-stakes financial contexts.

Author Bio

James Bryant, a finance and technology expert, excels at identifying untapped opportunities and leveraging cutting-edge tools to optimize financial processes. With expertise in finance automation, risk management, investments, trading, and banking, he's known for staying ahead of trends and driving innovation in the financial industry. James has built corporate treasuries at companies like Salesforce and transformed companies like Stanford Health Care through digital innovation. He is passionate about sharing his knowledge and empowering others to excel in finance. Outside of work, James enjoys skiing with his family in Lake Tahoe, running half marathons, and exploring new destinations and culinary experiences with his wife and daughter.

Aloke Mukherjee is a seasoned technologist with over a decade of experience in business architecture, digital transformation, and solutions architecture. He excels at applying data-driven solutions to real-world problems and has proficiency in data analytics and planning. Aloke worked at EMC Corp and Genentech and currently spearheads the digital transformation of Finance Business Intelligence at Stanford Health Care. In addition to his work, Aloke is a Certified Personal Trainer and is passionate about helping his clients stay fit. Aloke also has a passion for wine and exploring new vineyards.

ChatGPT for Exploratory Data Analysis (EDA)

Rama Kattunga
08 Sep 2023
9 min read
Introduction

Exploratory data analysis (EDA) refers to the initial investigation of data to discover patterns, identify outliers and anomalies, test hypotheses, and check assumptions, with the goal of informing future analysis and model building. It is an iterative, exploratory process of questioning, analyzing, and visualizing data.

Some key aspects of exploratory data analysis include:

Getting to know the data - Examining individual variables, their values, distributions, and relationships between variables.
Data cleaning - Checking and handling missing values, outliers, formatting inconsistencies, etc., before further analysis.
Univariate analysis - Looking at one variable at a time to understand its distribution, central tendency, spread, outliers, etc.
Bivariate analysis - Examining relationships between two variables using graphs, charts, and statistical tests. This helps find correlations.
Multivariate analysis - Analyzing patterns between three or more variables simultaneously using techniques like cluster analysis.
Hypothesis generation - Coming up with potential explanations or hypotheses about relationships in the data based on initial findings.
Data visualization - Creating graphs, plots, and charts to summarize findings and detect patterns and anomalies more easily.

The goals of EDA are to understand the dataset, detect useful patterns, formulate hypotheses, and make decisions on how to prepare/preprocess the data for subsequent modeling and analysis.

Why ChatGPT for EDA?

Exploratory data analysis (EDA) is an important but often tedious process with challenges and pitfalls. ChatGPT saves hours on repetitive tasks: it handles preparatory data wrangling, exploration, and documentation, freeing you to focus on insights. Its capabilities will only grow through continued learning. Soon, it may autonomously profile datasets and propose multiple exploratory avenues. ChatGPT is the perfect on-demand assistant for solo data scientists and teams seeking an effortless boost to the EDA process. The main drawback of ChatGPT is that it can only handle small datasets directly. There are workarounds, such as working with smaller samples of the data or having ChatGPT generate Python code that you run yourself to perform the analysis.

The following table details common challenges and pitfalls during EDA:

Challenge/Pitfall | Details
Getting lost in the weeds | Spending too much time on minor details without focusing on the big picture. This leads to analysis paralysis.
Premature conclusions | Drawing conclusions without considering all possible factors or testing different hypotheses thoroughly.
Bias | Personal biases, preconceptions, or domain expertise can skew analysis in a particular direction.
Multiple comparisons | Testing many hypotheses without adjusting for Type 1 errors, leading to false discoveries.
Documentation | Failing to properly document methods, assumptions, and thought processes along the way.
Lack of focus | Jumping around randomly without a clear understanding of the business objective.
Ignoring outliers | Not handling outliers appropriately, which can distort analysis and patterns.
Correlation vs causation | Incorrectly inferring causation based only on observed correlations.
Overfitting | Finding patterns in sample data that may not generalize to new data.
Publication bias | Only focusing on publishable, significant, or "interesting" findings.
Multiple roles | Wearing both data analyst and subject expert hats, mixing subjective and objective analysis.
With ChatGPT, you get an AI assistant to be your co-pilot on the journey of discovery. ChatGPT can support EDA at various stages of your data analysis, within the limits we discussed earlier. The following table lists different stages of data analysis with prompts (these prompts either generate the output directly or produce Python code for you to execute separately):

Type of EDA | Prompt
Summary Statistics | Describe the structure and summary statistics of this dataset. Check for any anomalies in variable distributions or outliers.
Univariate Analysis | Create histograms and density plots of each numeric variable to visualize their distributions and identify any unusual shapes or concentrations of outliers.
Bivariate Analysis | Generate a correlation matrix and heatmap to examine relationships between variables. Flag any extremely high correlations that could indicate multicollinearity issues.
Dimensionality Reduction | Use PCA to reduce the dimensions of this high-dimensional dataset and project it into 2D. Do any clusters or groupings emerge that provide new insights?
Clustering | Apply K-Means clustering on the standardized dataset with different values of k. Interpret the resulting clusters and check if they reveal any meaningful segments or categories.
Text Analysis | Summarize the topics and sentiments discussed in this text column using topic modeling algorithms like LDA. Do any dominant themes or opinions stand out?
Anomaly Detection | Implement an isolation forest algorithm on the dataset to detect outliers independently in each variable. Flag and analyze any suspicious or influential data points.
Model Prototyping | Quickly prototype different supervised learning algorithms like logistic regression, decision trees, and random forest on this classification dataset. Compare their performance and feature importance.
Model Evaluation | Generate a correlation matrix between predicted vs actual values from different models. Any low correlations potentially indicate nonlinear patterns worth exploring further.
Report Generation | Autogenerate a Jupyter notebook report with key visualizations, findings, conclusions, and recommendations for the next steps based on the exploratory analyses performed.

How do we feed data to ChatGPT for EDA?

Describe your dataset through natural language prompts, and ChatGPT instantly runs analyses to find hidden insights. No need to write code - let the AI do the heavy lifting! For this article, let's use the CSV file available at https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-1000.csv (http://tinyurl.com/mphebj4k).

Here are some examples of how ChatGPT can be used for exploratory data analysis.

Prompts:

Describe the structure and summary statistics of this CSV file: [Pasted URL or file contents]
What variable types are in this DataFrame? Import Pandas and show column data types.
Generate a correlation matrix and heatmap for these variables.
Check for missing values in each column and calculate the percentage missing.
Create a histogram to visualize the distribution of this numeric variable.
Compare the means of this categorical variable across groups using a boxplot.

In the example below, I pasted two hundred rows of the CSV along with the prompt I gave. ChatGPT's response included a detailed breakdown of the columns it identified in the pasted data.
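If you want to prepare such a sample yourself before pasting it into ChatGPT, a minimal sketch along these lines can help. This snippet is illustrative rather than from the book; the 200-row cutoff and the prompt wording are assumptions based on the organizations CSV used above:

import pandas as pd

# Load the sample organizations CSV used in this article
url = ("https://media.githubusercontent.com/media/datablist/"
       "sample-csv-files/main/files/organizations/organizations-1000.csv")
df = pd.read_csv(url)

# Take the first 200 rows so the data fits comfortably in a single prompt
sample = df.head(200)

# Serialize the sample back to CSV text that can be pasted into ChatGPT
sample_text = sample.to_csv(index=False)

# Wrap it in an EDA prompt similar to the ones shown in the table above
prompt = (
    "Describe the structure and summary statistics of this CSV data. "
    "Check for any anomalies in variable distributions or outliers.\n\n"
    + sample_text
)

print(prompt[:500])  # preview the beginning of the prompt before pasting it

For files that are too large to paste, the same idea applies: sample or aggregate first, then send only the reduced text to the model.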
Asking for the list of categorical and numerical variables in table format produces a breakdown of the columns by type, and asking for a statistical summary of the numerical columns produces descriptive statistics for each of them.

Python code generation

For certain aspects of EDA, ChatGPT produces Python code (with the pandas library imported) and the result will be shown like this:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the CSV file from the URL into a DataFrame
url = "https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-1000.csv"
df = pd.read_csv(url)

# Display basic information about the DataFrame
print("Basic Info About the DataFrame:")
print(df.info())

# Display the first few rows of the DataFrame
print("\nFirst Few Rows of the DataFrame:")
print(df.head())

# Summary statistics of numerical columns
print("\nSummary Statistics of Numerical Columns:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Visualize data
# Example: Histogram of a numerical column (replace 'col_name' with the column name you want to plot)
# plt.hist(df['col_name'], bins=20)
# plt.xlabel('X-axis Label')
# plt.ylabel('Y-axis Label')
# plt.title('Histogram of col_name')
# plt.show()

# You can create more visualizations and explore relationships between columns as needed.

# Correlation matrix heatmap (for numerical columns)
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix Heatmap")
plt.show()

Running this in Spyder (the Anaconda UI) produces the output without a single error.

EDA on large datasets with millions of records

As mentioned earlier in this article, ChatGPT is very powerful for its size, but there are still limitations since it runs on general consumer hardware rather than massive server clusters. Here are a few things to keep in mind regarding its capabilities with large datasets:

ChatGPT works best for datasets under 50-100 MB in size. It can handle some operations on larger files up to 1 GB, but performance will degrade.
For initial exploration of very large datasets, ChatGPT is still useful. It can quickly summarize dimensions, types, distributions, outliers, etc., to help shape hypotheses.
Advanced analytics like complex multi-variable modeling may not be feasible on the largest datasets directly in ChatGPT.
However, it can help with the data prep - filtering, aggregations, feature engineering, etc. - to reduce a large dataset into a more manageable sample for detailed analysis.
Integration with tools that can load large datasets directly (e.g., BigQuery, Spark, Redshift) allows ChatGPT to provide insights on files too big to import wholesale.
As AI capabilities continue advancing, future versions powered by more computing may be able to handle larger files for a broader set of analytics tasks.

Conclusion

ChatGPT revolutionizes exploratory data analysis (EDA) by streamlining the process and making it accessible to a wider audience. EDA is crucial for understanding data, and ChatGPT automates tasks like generating statistics, visualizations, and even code, simplifying the process.

ChatGPT's natural language interface enables users to interact with data using plain language, eliminating the need for extensive coding skills. While it excels in initial exploration and data preparation, it may have limitations with large datasets or complex modeling tasks.
ChatGPT is a valuable EDA companion, empowering data professionals to uncover insights and make data-driven decisions efficiently. ChatGPT's role in data analytics is expected to expand as AI technology evolves, offering even more support for data-driven decision-making.

Author Bio

Rama Kattunga has been working with data for over 15 years at tech giants like Microsoft, Intel, and Samsung. As a geek and a business wonk with degrees from Kellogg and two technology degrees from India, Rama uses his engineering know-how and strategy savvy to get stuff done with analytics, AI, and unlocking insights from massive datasets. When he is not analyzing data, you can find Rama sharing his thoughts as an author, speaker, and digital transformation specialist. Moreover, Rama also finds joy in experimenting with cooking, using videos as his guide to create delicious dishes that he can share with others. This diverse range of interests and skills highlights his well-rounded and dynamic character. LinkedIn

10 tech startups for 2020 that will help the world build more resilient, secure, and observable software

Richard Gall
30 Dec 2019
10 min read
The Datadog IPO in September marked an important moment for the tech industry. This wasn’t just because the company was the fourth tech startup to reach a $10 billion market cap in 2019, but also because it announced something that many people, particularly those in and around Silicon Valley, have been aware of for some time: the most valuable software products in the world aren’t just those that offer speed, and efficiency, they’re those that provide visibility, and security across our software systems. It shouldn’t come as a surprise. As software infrastructure becomes more complex, constantly shifting and changing according to the needs of users and businesses, the ability to assume some degree of control emerges as particularly precious. Indeed, the idea of control and stability might feel at odds with a decade or so that has prized innovation at speed. The mantra ‘move fast and break things’ is arguably one of the defining ones of the last decade. And while that lust for change might never disappear, it’s nevertheless the case that we’re starting to see a mindset shift in how business leaders think about technology. If everyone really is a tech company now, there’s now a growing acceptance that software needs to be treated with more respect and care. The Datadog IPO, then, is just the tip of an iceberg in which monitoring, observability, security, and resiliency tools have started to capture the imagination of technology leaders. While what follows is far from exhaustive, it does underline some of the key players in a growing field. Whether you're an investor or technology decision maker, here are ten tech startups you should watch out for in 2020 from across the cloud and DevOps space. Honeycomb Honeycomb has been at the center of the growing conversation around observability. Designed to help you “own production in hi-res,” what makes it unique in the market is that it allows you to understand and visualize your systems through high-cardinality dimensions (eg. at a user by user level, rather than, say, browser type or continent). The driving force behind Honeycomb is Charity Majors, its co-founder and former CEO. I was lucky enough to speak to her at the start of the year, and it was clear that she has an acute understanding of the challenges facing engineering teams. What was particularly striking in our conversation is how she sees Honeycomb as a tool for empowering developers. It gives them ownership over the code they write and the systems they build. “Ownership gives you the power to fix the thing you know you need to fix and the power to do a good job…” she told me. “People who find ownership is something to be avoided – that’s a terrible sign of a toxic culture.” Honeycomb’s investment status At the time of writing, Honeycomb has received $26.9 million in funding, with $11.4 million series A back in September. Firehydrant “You just got paged. Now what?” That’s the first line that greets you on the FireHydrant website. We think it sums up many of the companies on this list pretty well; many of the best tools in the DevOps space are designed to help tackle the challenges on-call developers face. FireHydrant isn't a tech startup with the profile of Honeycomb. However, as an incident management tool that integrates very neatly into a massive range of workflow tools, we’re likely to see it gain traction in 2020. 
We particularly like the one-click post mortem feature - it’s clear the product has been built in a way that allows developers to focus on the hard stuff and minimize the things that can just suck up time. FireHydrant’s investment status FireHydrant has raised $1.5 million in seed funding. NS1 Managing application traffic can be business-critical. That’s why NS1 exists; with DNS, DHCP and IP address management capabilities, it’s arguably one of the leading tools on the planet for dealing with the diverse and extensive challenges that come with managing massive amounts of traffic across complex interlocking software applications and systems. The company boasts an impressive roster of clients, including DropBox, The Guardian and LinkedIn, which makes it hard to bet against NS1 going from strength to strength in 2020. Like all software adoption, it might take some time to move beyond the realms of the largest and most technically forward-thinking organizations, but it’s surely only a matter of time until it the importance of smarter and more efficient becomes clear to even the smallest businesses. NS1’s investment status NS1 has raised an impressive $78.4 million in funding from investors (although it’s important to note that it’s one of the oldest companies on this list, founded all the way back in 2013). It received $33 million in series C funding at the beginning of October. Rookout “It’s time to liberate your data” Rookout implores us. For too long, the startup’s argument goes, data has been buried inside our applications where it’s useless for developers and engineers. Once it has been freed, it can help inform how we go about debugging and monitoring our systems. Designed to work for modern architectural and deployment patterns such as Kubernetes and serverless, Rookout is a tool that not only brings simplicity in the midst of complexity, it can also save engineering teams a serious amount of time when it comes to debugging and logging - the company claims by 80%. Like FireHydrant, this means engineers can focus on other areas of application performance and resilience. Rookout’s investment status Back in August, Rookout raised $8 million in Series A funding, taking its total funding amount to $12.2 million dollars. LaunchDarkly Feature flags or toggles are a concept that have started to gain traction in engineering teams in the last couple of years or so. They allow engineering teams to “modify system behavior without changing code” (thank you Martin Fowler). LaunchDarkly is a platform specifically built to allow engineers to use feature flags. At a fundamental level, the product allows DevOps teams to deploy code (ie. change features) quickly and with minimal risk. This allows for testing in production and experimentation on a large scale. With support for just about every programming language, it’s not surprising to see LaunchDarkly boast a wealth of global enterprises on its list of customers. This includes IBM and NBC. LaunchDarkly’s investment status LaunchDarkly raised $44 million in series C funding early in 2019. To date, it has raised $76.3 million. It’s certainly one to watch closely in 2020; it's ability to help teams walk the delicate line between innovation and instability is well-suited to the reality of engineering today. Gremlin Gremlin is a chaos engineering platform designed to help engineers to ‘stress test’ their software systems. This is important in today’s technology landscape. 
With system complexity making unpredictability a day-to-day reality, Gremlin lets you identify weaknesses before they impact customers and revenue. Gremlin’s mission is to “help build a more reliable internet.” That’s not just a noble aim, it’s an urgent one too. What’s more, you can see that the business is really living out its mission. With Gremlin Free launching at the start of 2019, and the second ChaosConf taking place in the fall, it’s clear that the company is thinking beyond the core product: they want to make chaos engineering more accessible to a world where resilience can feel impossible in the face of increasing complexity. Gremlin’s investment status Since being founded back in 2016 by CTO Matt Fornaciari and CEO Kolton Andrus, Gremlin has raised $26.8Million in funding from Redpoint Ventures, Index Ventures, and Amplify Partners. Cockroach Labs Cockroach Labs is the organization behind CockroachDB, the cloud-native distributed SQL database. CockroachDB’s popularity comes from two things: it’s ability to scale from a single instance to thousands, and it’s impressive resilience. Indeed, its resilience is where it takes its name from. Like a cockroach, CockroachDB is built to keep going even after everything else has burned to the ground. It’s been an interesting year for CockroachLabs and CockroachDB - in June the company changed the CockroachDB core licence from open source Apache license to the Business Source License (BSL), developed by the MariaDB team. The reason for this was ultimately to protect the product as it seeks to grow. The BSL still means the source code is accessible for any use other than for a DBaaS (you’ll need an enterprise license for that). A few months later, the company took another step in pushing forward in the market with $55 million series C funding. Both stories were evidence that CockroachLabs is setting itself up for a big 2020. Although the database market will always be seriously competitive, with resilience as a core USP it’s hard to bet against Cockroach Labs. Cockroaches find a way, right? CockroachLabs investment status CockroachLabs total investment, following on from that impressive round of series C funding is now $108.5 million. Logz.io Logz.io is another platform in the observability space that you really need to watch out for in 2020. Built on the ELK stack (ElasticSearch, Logstash, and Kibana), what makes Logz.io really stand out is the use of machine learning to help identify issues across thousands and thousands of logs. Logz.io has been on ‘ones to watch’ lists for a number of years now. This was, we think, largely down to the rising wave of AI hype. And while we wouldn’t want to underplay its machine learning capabilities, it’s perhaps with the increasing awareness of the need for more observable software systems that we’ll see it really pack a punch across the tech industry. Logz.io’s investment status To date, Logz.io has raised $98.9 million. FaunaDB Fauna is the organization behind FaunaDB. It describes itself as “a global serverless database that gives you ubiquitous, low latency access to app data, without sacrificing data correctness and scale.” The database could be big in 2020. With serverless likely to go from strength to strength, and JAMstack increasing as a dominant approach for web developers, everything the Fauna team have been doing looks as though it will be a great fit for the shape of the engineering landscape in the future. 
Fauna’s investment status In total, Fauna has raised $32.6 million in funding from investors. Clubhouse One thing that gets overlooked when talking about DevOps and other issues in software development processes is simple project management. That’s why Clubhouse is such a welcome entry on this list. Of course, there are a massive range of project management tools available at the moment. But one of the reasons Clubhouse is such an interesting product is that it’s very deliberately built with engineers in mind. And more importantly, it appears it’s been built with an acute sense of the importance of enjoyment in a project management product. Clubhouse’s investment status Clubhouse has, to date, raised $16 million. As we see a continuing emphasis on developer experience, the tool is definitely one to watch in a tough marketplace. Conclusion: embrace the unpredictable The tech industry feels as unpredictable as the software systems we're building and managing. But while there will undoubtedly be some surprises in 2020, the need for greater security and resilience are themes that no one should overlook. Similarly, the need to gain more transparency and build for observability are critical. Whether you're an investor, business leader, or even an engineer, then, exploring the products that are shaping and defining the space is vital.

Design a RESTful web API with Java [Tutorial]

Pavan Ramchandani
12 Jun 2018
12 min read
In today's tutorial, you will learn to design REST services. We will break down the key design considerations you need to make when building RESTful web APIs. In particular, we will focus on the core elements of the REST architecture style:

Resources and their identifiers
Interaction semantics for RESTful APIs (HTTP methods)
Representation of resources
Hypermedia controls

This article is an excerpt from a book written by Balachandar Bogunuva Mohanram, titled RESTful Java Web Services, Second Edition. This book will help you build robust, scalable and secure RESTful web services, making use of the JAX-RS and Jersey framework extensions. Let's start by discussing the guidelines for identifying resources in a problem domain.

Richardson Maturity Model: Leonard Richardson has developed a model to help with assessing the compliance of a service to the REST architecture style. The model defines four levels of maturity, starting from level-0 to level-3 as the highest maturity level. The maturity levels are decided considering the aforementioned principle elements of the REST architecture.

Identifying resources in the problem domain

The basic steps that you need to take while building a RESTful web API for a specific problem domain are:

Identify all possible objects in the problem domain. This can be done by identifying all the key nouns in the problem domain. For example, if you are building an application to manage employees in a department, the obvious nouns are department and employee.
The next step is to identify the objects that can be manipulated using CRUD operations. These objects can be classified as resources. Note that you should be careful while choosing resources. Based on the usage pattern, you can classify resources as top-level and nested resources (which are the children of a top-level resource). Also, there is no need to expose all resources for use by the client; expose only those resources that are required for implementing the business use case.

Transforming operations to HTTP methods

Once you have identified all resources, as the next step, you may want to map the operations defined on the resources to the appropriate HTTP methods. The most commonly used HTTP methods (verbs) in RESTful web APIs are POST, GET, PUT, and DELETE. Note that there is no one-to-one mapping between the CRUD operations defined on the resources and the HTTP methods. Understanding the idempotent and safe operation concepts will help with using the correct HTTP method.

An operation is called idempotent if multiple identical requests produce the same result. Similarly, an idempotent RESTful web API will always produce the same result on the server irrespective of how many times the request is executed with the same parameters; however, the response may change between requests. An operation is called safe if it does not modify the state of the resources. Check out the following table:

Method | Idempotent | Safe
GET | YES | YES
OPTIONS | YES | YES
HEAD | YES | YES
POST | NO | NO
PATCH | NO | NO
PUT | YES | NO
DELETE | YES | NO

Here are some tips for identifying the most appropriate HTTP method for the operations that you want to perform on the resources:

GET: You can use this method for reading a representation of a resource from the server. According to the HTTP specification, GET is a safe operation, which means that it is only intended for retrieving data, not for making any state changes. As this is an idempotent operation, multiple identical GET requests will behave in the same manner.
A GET method can return the 200 OK HTTP response code on the successful retrieval of resources. If there is any error, it can return an appropriate status code such as 404 NOT FOUND or 400 BAD REQUEST.

DELETE: You can use this method for deleting resources. On successful deletion, DELETE can return the 200 OK status code. According to the HTTP specification, DELETE is an idempotent operation. Note that when you call DELETE on the same resource for the second time, the server may return the 404 NOT FOUND status code since it was already deleted, which is different from the response for the first request. The change in response for the second call is perfectly valid here. However, multiple DELETE calls on the same resource produce the same result (state) on the server.

PUT: According to the HTTP specification, this method is idempotent. When a client invokes the PUT method on a resource, the resource available at the given URL is completely replaced with the resource representation sent by the client. When a client uses the PUT request on a resource, it has to send all the available properties of the resource to the server, not just the partial data that was modified within the request. You can use PUT to create or update a resource if all attributes of the resource are available with the client. This makes sure that the server state does not change with multiple PUT requests. On the other hand, if you send partial resource content in a PUT request multiple times, there is a chance that some other clients might have updated some attributes that are not present in your request. In such cases, the server cannot guarantee that the state of the resource on the server will remain identical when the same request is repeated, which breaks the idempotency rule.

POST: This method is not idempotent. You can use the POST method to create or update resources when you do not know all the available attributes of a resource. For example, consider a scenario where the identifier field for an entity resource is generated at the server when the entity is persisted in the data store. You can use the POST method for creating such resources as the client does not have an identifier attribute while issuing the request. Here is a simplified example that illustrates this scenario. In this example, the employeeID attribute is generated on the server:

POST hrapp/api/employees HTTP/1.1
Host: packtpub.com

{employee entity resource in JSON}

On the successful creation of a resource, it is recommended to return the status of 201 Created and the location of the newly created resource. This allows the client to access the newly created resource later (with server-generated attributes). The sample response for the preceding example will look as follows:

201 Created
Location: hrapp/api/employees/1001

Best practice: Use caching only for idempotent and safe HTTP methods, as others have an impact on the state of the resources.

Understanding the difference between PUT and POST

A common question that you will encounter while designing a RESTful web API is when you should use the PUT and POST methods. Here's the simplified answer:

You can use PUT for creating or updating a resource when the client has the full resource content available. In this case, all values are with the client and the server does not generate a value for any of the fields.
You will use POST for creating or updating a resource if the client has only partial resource content available.
Note that you are losing the idempotency support with POST. An idempotent method means that you can call the same API multiple times without changing the state. This is not true for the POST method; each POST method call may result in a server state change. PUT is idempotent, and POST is not. If you have strong customer demands, you can support both methods and let the client choose the suitable one on the basis of the use case.

Naming RESTful web resources

Resources are a fundamental concept in RESTful web services. A resource represents an entity that is accessible via the URI that you provide. The URI, which refers to a resource (which is known as a RESTful web API), should have a logically meaningful name. Having meaningful names improves the intuitiveness of the APIs and, thereby, their usability. Some of the widely followed recommendations for naming resources are shown here:

It is recommended you use nouns to name both resources and path segments that will appear in the resource URI. You should avoid using verbs for naming resources and resource path segments. Using nouns to name a resource improves the readability of the corresponding RESTful web API, particularly when you are planning to release the API over the internet for the general public.

You should always use plural nouns to refer to a collection of resources. Make sure that you are not mixing up singular and plural nouns while forming the REST URIs. For instance, to get all departments, the resource URI must look like /departments. If you want to read a specific department from the collection, the URI becomes /departments/{id}. Following the convention, the URI for reading the details of the HR department identified by id=10 should look like /departments/10.

The following table illustrates how you can map the HTTP methods (verbs) to the operations defined for the departments' resources:

Resource | GET | POST | PUT | DELETE
/departments | Get all departments | Create a new department | Bulk update on departments | Delete all departments
/departments/10 | Get the HR department with id=10 | Not allowed | Update the HR department | Delete the HR department

While naming resources, use specific names over generic names. For instance, to read all programmers' details of a software firm, it is preferable to have a resource URI of the form /programmers (which tells about the type of resource), over the much more generic form /employees. This improves the intuitiveness of the APIs by clearly communicating the type of resources that it deals with.

Keep the resource names that appear in the URI in lowercase to improve the readability of the resulting resource URI. Resource names may include hyphens; avoid using underscores and other punctuation.

If the entity resource is represented in the JSON format, field names used in the resource must conform to the following guidelines:

Use meaningful names for the properties.
Follow the camel case naming convention: the first letter of the name is in lowercase, for example, departmentName.
The first character must be a letter, an underscore (_), or a dollar sign ($), and the subsequent characters can be letters, digits, underscores, and/or dollar signs.
Avoid using the reserved JavaScript keywords.

If a resource is related to another resource(s), use a subresource to refer to the child resource. You can use the path parameter in the URI to connect a subresource to its base resource. For instance, the resource URI path to get all employees belonging to the HR department (with id=10) will look like /departments/10/employees.
To get the details of the employee with id=200 in the HR department, you can use the following URI: /departments/10/employees/200. The resource path URI may contain plural nouns representing a collection of resources, followed by a singular resource identifier to return a specific resource item from the collection. This pattern can repeat in the URI, allowing you to drill down a collection for reading a specific item. For instance, the following URI represents an employee resource identified by id=200 within the HR department: /departments/hr/employees/200.

Although the HTTP protocol does not place any limit on the length of the resource URI, it is recommended not to exceed 2,000 characters because of the restriction set by many popular browsers.

Best practice: Avoid using actions or verbs in the URI as it refers to a resource.

Using HATEOAS in response representation

Hypertext as the Engine of Application State (HATEOAS) refers to the use of hypermedia links in the resource representations. This architectural style lets the clients dynamically navigate to the desired resource by traversing the hypermedia links present in the response body. There is no universally accepted single format for representing links between two resources in JSON.

Hypertext Application Language

The Hypertext Application Language (HAL) is a promising proposal that sets the conventions for expressing hypermedia controls (such as links) with JSON or XML. Currently, this proposal is in the draft stage. It mainly describes two concepts for linking resources:

Embedded resources: This concept provides a way to embed another resource within the current one. In the JSON format, you will use the _embedded attribute to indicate the embedded resource.
Links: This concept provides links to associated resources. In the JSON format, you will use the _links attribute to link resources.

Here is the link to this proposal: http://tools.ietf.org/html/draft-kelly-json-hal-06. It defines the following properties for each resource link:

href: This property indicates the URI to the target resource representation.
templated: This property would be true if the URI value for href has any PATH variable inside it (template).
title: This property is used for labeling the URI.
hreflang: This property specifies the language for the target resource.
title: This property is used for documentation purposes.
name: This property is used for uniquely identifying a link.

The following example demonstrates how you can use the HAL format for describing the department resource containing hyperlinks to the associated employee resources. This example uses JSON HAL for representing resources, which is represented using the application/hal+json media type:

GET /departments/10 HTTP/1.1
Host: packtpub.com
Accept: application/hal+json

HTTP/1.1 200 OK
Content-Type: application/hal+json

{
    "_links": {
        "self": { "href": "/departments/10" },
        "employees": { "href": "/departments/10/employees" },
        "employee": {
            "href": "/employees/{id}",
            "templated": true
        }
    },
    "_embedded": {
        "manager": {
            "_links": { "self": { "href": "/employees/1700" } },
            "firstName": "Chinmay",
            "lastName": "Jobinesh",
            "employeeId": "1700"
        }
    },
    "departmentId": 10,
    "departmentName": "Administration"
}

To summarize, we discussed the details of designing RESTful web APIs, including identifying the resources, using HTTP methods, and naming the web resources. Additionally, we were introduced to the Hypertext Application Language.
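To see these conventions from the client's side, here is a small, hedged sketch using Python's requests library against the hypothetical hrapp API from the examples above. The base URL, endpoint paths, and payload fields are illustrative assumptions, not code from the book:

import requests

BASE = "https://packtpub.com/hrapp/api"  # hypothetical base URL for the examples above

# GET: safe and idempotent - read the collection and a single resource
departments = requests.get(f"{BASE}/departments")
hr_department = requests.get(f"{BASE}/departments/10")

# POST: not idempotent - create an employee whose identifier is generated on the server
new_employee = {"firstName": "Chinmay", "lastName": "Jobinesh", "departmentId": 10}
created = requests.post(f"{BASE}/departments/10/employees", json=new_employee)
if created.status_code == 201:
    # A 201 Created response should carry the location of the new resource
    print("Created employee at:", created.headers.get("Location"))

# PUT: idempotent - replace the full representation of an existing resource
full_update = {"firstName": "Chinmay", "lastName": "Jobinesh", "departmentId": 10}
requests.put(f"{BASE}/employees/200", json=full_update)

# DELETE: idempotent - repeated calls leave the server in the same state,
# even though the second call may answer 404 NOT FOUND
requests.delete(f"{BASE}/employees/200")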
Read More:

Getting started with Django RESTful Web Services
Testing RESTful Web Services with Postman
Documenting RESTful Java web services using Swagger

article-image-clean-social-media-data-analysis-python
Amey Varangaonkar
26 Dec 2017
10 min read

How to effectively clean social media data for analysis

This article is a book extract from Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk.

Data cleaning and preprocessing is an essential, and often crucial, part of any analytical process. In this excerpt, we explain the different techniques and mechanisms for effective analysis of your social media data.

Social media contains different types of data: information about user profiles, statistics (number of likes or number of followers), verbatims, and other media content. Quantitative data is very convenient for analysis using statistical and numerical methods, but unstructured data such as user comments is much more challenging. To get meaningful information, one has to perform the whole process of information retrieval. It starts with the definition of the data type and data structure. On social media, unstructured data is related to text, images, videos, and sound, and we will mostly deal with textual data. Then, the data has to be cleaned and normalized. Only after all these steps can we delve into the analysis.

Social media data type and encoding

Comments and conversations are textual data that we retrieve as strings. In brief, a string is a sequence of characters represented by code points. Every string in Python is Unicode, covering code points from 0 through 0x10FFFF (1,114,111 decimal). The sequence then has to be represented as a set of bytes (values from 0 to 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.

Encoding plays a very important role in natural language processing because people use more and more characters such as emojis or emoticons, which replace whole words and express emotions. Moreover, many languages have accents that go beyond the regular English alphabet. In order to deal with all the processing problems that these might cause, we have to use the right encoding, because comparing two strings with different encodings is like comparing apples and oranges. The most common encoding is UTF-8, used by default in Python 3, which can handle any type of character. As a rule of thumb, always normalize your data to Unicode UTF-8.

Structure of social media data

Another question we'll encounter is: what is the right structure for our data? The most natural choice is a list that can store a sequence of data points (verbatims, numbers, and so on). However, lists are not efficient on large datasets and constrain us to sequential processing of the data. That is why a much better solution is to store the data in tabular format in a pandas DataFrame, which has multiple advantages for further processing. First of all, rows are indexed, so search operations become much faster. There are also many optimized methods for different kinds of processing and, above all, it allows you to optimize your own processing by using functional programming. Moreover, a row can contain multiple fields with metadata about verbatims, which are very often used in our analysis. It is worth remembering that a dataset in pandas must fit into RAM. For bigger datasets, we suggest the use of SFrames.

Pre-processing and text normalization

Preprocessing is one of the most important parts of the analysis process. It reformats the unstructured data into a uniform, standardized form. The characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages.
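Before moving on to the cleaning steps, here is a minimal sketch (not from the book extract) of the two points above: decoding raw comment bytes to UTF-8 Unicode strings and storing the verbatims, together with some metadata, in a pandas DataFrame. The sample comments and column names are made up for illustration.

import pandas as pd

raw_comments = [b'I love this phone \xf0\x9f\x98\x8d',
                b'tr\xc3\xa8s bon rapport qualit\xc3\xa9/prix']
# Decode every byte sequence to a Unicode string, assuming UTF-8 encoded input
verbatims = [comment.decode('utf-8', errors='replace') for comment in raw_comments]

# Store the verbatims in a tabular structure together with metadata
df = pd.DataFrame({'verbatim': verbatims,
                   'likes': [12, 3],
                   'source': ['twitter', 'facebook']})
print(df.head())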
The quality of the preprocessing has a big impact on the final result of the whole process. There are several stages to it: from simple text cleaning, by removing white spaces, punctuation, HTML tags, and special characters, up to more sophisticated normalization techniques such as tokenization, stemming, or lemmatization. In general, the main aim is to keep all the characters and words that are important for the analysis and, at the same time, get rid of all the others, while maintaining the text corpus in one uniform format.

We import all the necessary libraries (note that the NLTK corpora used below may have to be fetched once with nltk.download('stopwords') and nltk.download('punkt')):

import re, itertools
import nltk
from nltk.corpus import stopwords

When dealing with raw text, we usually have a set of words including many details we are not interested in, such as whitespace, line breaks, and blank lines. Moreover, many words contain capital letters, so a program treats, for example, "go" and "Go" as two different words. In order to handle such distinctions, we clean and normalize the text with the following steps:

1. Perform basic text mining cleaning.

2. Remove surrounding whitespace:

verbatim = verbatim.strip()

Many text processing tasks can be done via pattern matching. We can find words containing a character and replace it with another one, or just remove it. Regular expressions give us a powerful and flexible method for describing the character patterns we are interested in. They are commonly used for cleaning punctuation, HTML tags, and URL paths.

3. Remove punctuation (keep only word characters and whitespace):

verbatim = re.sub(r'[^\w\s]', '', verbatim)

4. Remove HTML tags:

verbatim = re.sub('<[^<]+?>', '', verbatim)

5. Remove URLs:

verbatim = re.sub(r'^https?://.*[\r\n]*', '', verbatim, flags=re.MULTILINE)

Depending on the quality of the text corpus, sometimes there is a need to implement some corrections. This applies to text sources such as Twitter or forums, where emotions play a role and comments contain words with repeated letters, for example, "happpppy" instead of "happy".

6. Standardize words (reduce runs of repeated letters to at most two):

verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim))

After the removal of punctuation or white spaces, words can end up attached to each other. This happens especially when deleting the periods at the end of sentences. The corpus might look like: "the brown dog is lostEverybody is looking for him". So there is a need to split "lostEverybody" into two separate words.

7. Split attached words:

verbatim = " ".join(re.findall('[A-Z][^A-Z]*', verbatim))

Stop words are basically a set of commonly used words in any language: mainly determiners, prepositions, and coordinating conjunctions. By removing the words that are very commonly used in a given language, we can focus on the important words instead and improve the accuracy of the text processing.

8. Convert the text to lowercase with lower():

verbatim = verbatim.lower()

9. Remove stop words:

verbatim = ' '.join([word for word in verbatim.split() if word not in stopwords.words('english')])

10. Stemming and lemmatization: the main aim of stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Stemming reduces word forms to so-called stems, whereas lemmatization reduces word forms to linguistically valid lemmas.
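The extract does not show code for this last step here; as a rough sketch, and assuming you stay with NLTK (an assumption on our part, since the rest of the excerpt relies on it), stemming and lemmatization could look like this. The WordNet lemmatizer requires nltk.download('wordnet') once.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['cars', 'men', 'went']
# Stemming strips suffixes, so irregular forms such as 'men' or 'went' are left untouched
print([stemmer.stem(word) for word in words])          # ['car', 'men', 'went']
# Lemmatization maps words to dictionary lemmas; the part of speech matters
print([lemmatizer.lemmatize(word) for word in words])  # ['car', 'man', 'went']
print(lemmatizer.lemmatize('went', pos='v'))           # 'go'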
Some examples of these reductions are cars -> car, men -> man, and went -> go. Such text processing can give added value in some domains and may improve the accuracy of practical information extraction tasks.

Tokenization: Tokenization is the process of breaking a text corpus up into words (most commonly), phrases, or other meaningful elements, which are then called tokens. The tokens become the basic units for further text processing.

tokens = nltk.word_tokenize(verbatim)

Other techniques are spelling correction, domain knowledge, and grammar checking.

Duplicate removal

Depending on the data source, we might notice multiple duplicates in our dataset. The decision to remove duplicates should be based on an understanding of the domain. In most cases, duplicates come from errors in the data collection process, and it is recommended to remove them in order to reduce bias in our analysis:

df = df.drop_duplicates(subset=['column_name'])

Knowing the basic text cleaning techniques, we can now learn how to store the data in an efficient way. For this purpose, we will explain how to use one of the most convenient NoSQL databases: MongoDB.

Capture: Once you have made a connection to your API, you need to make a request and receive the data at your end. This step requires you to go through the data to be able to understand it. Often the data is received in a special format called JavaScript Object Notation (JSON). JSON was created to enable lightweight data interchange between programs. It resembles the old XML format and consists of key-value pairs.

Normalization: The data received from platforms is rarely in an ideal format for analysis. With textual data there are many different approaches to normalization: one can be stripping the whitespace surrounding verbatims, another converting all verbatims to lowercase, or changing the encoding to UTF-8. The point is that if we do not maintain a standard protocol for normalization, we will introduce many unintended errors. The goal of normalization is to transform all your data in a consistent manner that ensures uniform standardization. It is recommended that you create wrapper functions for your normalization techniques, and then apply these wrappers to all your data input points, so as to ensure that all the data in your analysis goes through exactly the same normalization process. In general, one should always perform the following cleaning steps:

1. Normalize the textual content. Normalization generally contains at least the following steps:
   Stripping surrounding whitespaces.
   Lowercasing the verbatim.
   Universal encoding (UTF-8).
2. Remove special characters (for example, punctuation).
3. Remove stop words: irrespective of the language, stop words add no additional informative value to the analysis, except in the case of deep parsing, where stop words can be bridge connectors between targeted words.
4. Split attached words.
5. Remove URLs and hyperlinks: URLs and hyperlinks can be studied separately but, due to their lack of grammatical structure, they are by convention removed from verbatims.
6. Slang lookups: this is a relatively difficult task, because it requires a predefined vocabulary of slang words and their proper reference words, for example, luv maps to love. Such dictionaries are available on the open web, but there is always a risk of them being outdated. A minimal lookup sketch follows this list.
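As promised above, here is a minimal slang lookup sketch (not from the book); the tiny dictionary is a made-up placeholder for a proper, maintained slang vocabulary.

# Hypothetical slang dictionary; in practice you would load a much larger one
slang_lookup = {'luv': 'love', 'u': 'you', 'gr8': 'great'}

def replace_slang(verbatim):
    # Replace each token by its reference word when it appears in the dictionary
    return ' '.join(slang_lookup.get(token, token) for token in verbatim.split())

print(replace_slang('i luv u'))  # i love you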
In the case of studying words and not phrases (or n-grams), it is very important to do the following:

Tokenize the verbatim.
Stemming and lemmatization (optional): use these where different written forms of the same word do not hold additional meaning for your study.

Some advanced cleaning procedures are:

Grammar checking: Grammar checking is mostly learning-based; a huge amount of proper text data is learned and models are created for the purpose of grammar correction. There are many online tools available for grammar correction. This is a very tricky cleaning technique, because language style and structure can change from source to source (for example, the language on Twitter will not correspond with the language of published books). Wrongly correcting grammar can have negative effects on the analysis.

Spelling correction: In natural language, misspelled words are common. Companies such as Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms such as Levenshtein distance, dictionary lookup, and so on, or other modules and packages, to fix these errors. Again, take spell correction with a grain of salt, because false positives can affect the results.

Storing: Once the data is received, normalized, and/or cleaned, we need to store it in an efficient storage database. In this book we have chosen MongoDB as the database, as it is modern, scalable, and relatively easy to get started with. However, other databases such as Cassandra or HBase could also be used, depending on expertise and objectives. A minimal storage sketch closes this extract.

Data cleaning and preprocessing, although tedious, can simplify your data analysis work. With effective Python packages like NumPy, SciPy, and pandas, these tasks become much easier and save a lot of your time. If you found this piece of information useful, make sure to check out our book Python Social Media Analytics, which will help you draw actionable insights from mining social media portals such as GitHub, Twitter, YouTube, and more!
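As a closing illustration (not part of the original extract), here is a minimal pymongo sketch for the storing step; the host, port, database name, and collection name are assumptions for illustration.

from pymongo import MongoClient

# Connect to a local MongoDB instance (placeholder host and port)
client = MongoClient('localhost', 27017)
collection = client['social_media']['comments']

cleaned_comments = [
    {'verbatim': 'love this phone', 'source': 'twitter'},
    {'verbatim': 'battery too weak', 'source': 'forum'},
]
# Insert the cleaned verbatims as documents and check the collection size
collection.insert_many(cleaned_comments)
print(collection.count_documents({}))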