How-To Tutorials


Getting to know Generative Models and their types

Sunith Shetty
20 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled Neural Network Programming with Tensorflow. In this book, you will use TensorFlow to build and train neural networks of varying complexities, without any hassle.[/box] In today’s tutorial, we will learn about generative models, and their types. We will also look into how discriminative models differs from generative models. Introduction to Generative models Generative models are the family of machine learning models that are used to describe how data is generated. To train a generative model we first accumulate a vast amount of data in any domain and later train a model to create or generate data like it. In other words, these are the models that can learn to create data that is similar to data that we give them. One such approach is using Generative Adversarial Networks (GANs). There are two kinds of machine learning models: generative models and discriminative models. Let's examine the following list of classifiers: decision trees, neural networks, random forests, generalized boosted models, logistic regression, naive bayes, and Support Vector Machine (SVM). Most of these are classifiers and ensemble models. The odd one out here is Naive Bayes. It's the only generative model in the list. The others are examples of discriminative models. The fundamental difference between generative and discriminative models lies in the underlying probability inference structure. Let's go through some of the key differences between generative and discriminative models. Discriminative versus generative models Discriminative models learn P(Y|X), which is the conditional relationship between the target variable Y and features X. This is how least squares regression works, and it is the kind of inference pattern that gets used. It is an approach to sort out the relationship among variables. Generative models aim for a complete probabilistic description of the dataset. With generative models, the goal is to develop the joint probability distribution P(X, Y), either directly or by computing P(Y | X) and P(X) and then inferring the conditional probabilities required to classify newer data. This method requires more solid probabilistic thought than regression demands, but it provides a complete model of the probabilistic structure of the data. Knowing the joint distribution enables you to generate the data; hence, Naive Bayes is a generative model. Suppose we have a supervised learning task, where xi is the given features of the data points and yi is the corresponding labels. One way to predict y on future x is to learn a function f() from (xi,yi) that takes in x and outputs the most likely y. Such models fall in the category of discriminative models, as you are learning how to discriminate between x's from different classes. Methods like SVMs and neural networks fall into this category. Even if you're able to classify the data very accurately, you have no notion of how the data might have been generated. The second approach is to model how the data might have been generated and learn a function f(x,y) that gives a score to the configuration determined by x and y together. Then you can predict y for a new x by finding the y for which the score f(x,y) is maximum. A canonical example of this is Gaussian mixture models. Another example of this is: you can imagine x to be an image and y to be a kind of object like a dog, namely in the image. 
The probability written as p(y|x) tells us how much the model believes that there is a dog, given an input image, compared to all the possibilities it knows about. Algorithms that try to model this probability map directly are called discriminative models. Generative models, on the other hand, try to learn a function called the joint probability p(y, x). We can read this as how much the model believes that x is an image and there is a dog y in it at the same time. These two probabilities are related, which can be written as p(y, x) = p(x) p(y|x), with p(x) being how likely it is that the input x is an image. The probability p(x) is usually called a density function in the literature.

The main reason to call these models generative ultimately connects to the fact that the model has access to the probability of both input and output at the same time. Using this, we can generate images of animals by sampling animal kinds y and new images x from p(y, x). We can also learn the density function p(x) alone, which depends only on the input space. Both kinds of models are useful; however, generative models have an interesting advantage over discriminative models, namely, they have the potential to understand and explain the underlying structure of the input data even when there are no labels available. This is very desirable when working in the real world.

Types of generative models

Discriminative models have been at the forefront of the recent success in the field of machine learning. These models make predictions that depend on a given input, but they are not able to generate new samples or data. The idea behind the recent progress of generative modeling is to convert the generation problem into a prediction one and use deep learning algorithms to learn such a problem.

Autoencoders

One way to convert a generative problem into a discriminative one is to learn the mapping from the input space to itself. For example, we want to learn an identity map that, for each image x, would ideally predict the same image, namely, x = f(x), where f is the predictive model. This model may not be of use in its current form, but from it we can create a generative model. Here, we create a model formed of two main components: an encoder model q(h|x) that maps the input to another space, referred to as the hidden or latent space and represented by h, and a decoder model q(x|h) that learns the opposite mapping from the hidden space back to the input space. These components, the encoder and the decoder, are connected together to create an end-to-end trainable model. Both the encoder and decoder models are neural networks, possibly of different architectures, for example, RNNs and attention networks, to get the desired outcomes. Once the model is learned, we can detach the decoder from the encoder and use them separately. To generate a new data sample, we first generate a sample from the latent space and then feed it to the decoder to create a new sample in the output space.

GAN

As seen with autoencoders, we can think of a general concept of creating networks that work together in a relationship, where training them helps us learn latent spaces that allow us to generate new data samples. Another type of generative network is the GAN, where we have a generator model q(x|h) that maps the small-dimensional latent space h (usually represented as noise samples from a simple distribution) to the input space x. This is quite similar to the role of the decoder in an autoencoder.
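As a rough illustration of the encoder/decoder idea above (which the GAN generator mirrors), here is a minimal fully connected autoencoder written in the TensorFlow 1.x style used later in this article; the layer sizes, random training data, and optimizer settings are assumptions for the sketch, not the book's architecture.

```python
import numpy as np
import tensorflow as tf

input_dim, hidden_dim = 784, 32          # e.g. flattened 28x28 images (assumed)

x = tf.placeholder(tf.float32, [None, input_dim])

# Encoder q(h|x): maps the input to the latent space h
W_enc = tf.Variable(tf.random_normal([input_dim, hidden_dim], stddev=0.01))
b_enc = tf.Variable(tf.zeros([hidden_dim]))
h = tf.nn.sigmoid(tf.matmul(x, W_enc) + b_enc)

# Decoder q(x|h): maps the latent code back to the input space
W_dec = tf.Variable(tf.random_normal([hidden_dim, input_dim], stddev=0.01))
b_dec = tf.Variable(tf.zeros([input_dim]))
x_rec = tf.nn.sigmoid(tf.matmul(h, W_dec) + b_dec)

# Train end to end so the model reproduces its input: x ~= f(x)
loss = tf.reduce_mean(tf.square(x - x_rec))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data = np.random.rand(256, input_dim)          # stand-in for real images
    for step in range(100):
        _, loss_val = sess.run([train_op, loss], feed_dict={x: data})
    # After training, the decoder alone can turn latent samples into new data:
    new_h = np.random.rand(1, hidden_dim)
    sample = sess.run(x_rec, feed_dict={h: new_h})
```

Feeding the latent tensor h directly at generation time is what "detaching the decoder from the encoder" looks like in practice: only the decoder half of the graph is exercised.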
The idea now is to introduce a discriminative model p(y|x), which tries to map an input instance x to a yes/no binary answer y: whether the generator model produced the input or it was a genuine sample from the dataset we are training on. Let's use the image example from before. Assume that the generator model creates a new image, and we also have a real image from our actual dataset. If the generator model is good, the discriminator model will not be able to distinguish between the two images easily. If the generator model is poor, it will be very easy to tell which one is a fake and which one is real. When these two models are coupled, we can train them end to end by ensuring that the generator model gets better over time at fooling the discriminator model, while the discriminator model is trained on the harder problem of detecting frauds. Finally, we desire a generator model whose outputs are indistinguishable from the real data used for training. Through the initial part of training, the discriminator model can easily detect the samples coming from the actual dataset versus the ones generated synthetically by the generator model, which is just beginning to learn. As the generator gets better at modeling the dataset, we begin to see more and more generated samples that look similar to the dataset. The following example depicts the generated images of a GAN model learning over time:

Sequence models

If the data is temporal in nature, then we can use specialized algorithms called sequence models. These models can learn a probability of the form p(y | x_n, ..., x_1), where i is an index signifying the location in the sequence and x_i is the ith input sample. As an example, we can consider each word as a series of characters, each sentence as a series of words, and each paragraph as a series of sentences. The output y could be the sentiment of the sentence. Using a similar trick from autoencoders, we can replace y with the next item in the series or sequence, namely y = x_{n+1}, allowing the model to learn.

To summarize, we learned that generative models are a fast-advancing area of study and research. As we proceed to advance these models and grow the training and the datasets, we can expect to generate data examples that depict completely believable images. This can be used in several applications such as image denoising, in-painting, structured prediction, and exploration in reinforcement learning. To know more about how to build and optimize neural networks using TensorFlow, do check out the book Neural Network Programming with Tensorflow.
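Before leaving this excerpt, here is a compact, hedged sketch of the generator/discriminator coupling described in the GAN section above, in the same TensorFlow 1.x style as the rest of this article; the layer sizes, optimizer settings, and loss formulation are illustrative assumptions, not the book's implementation.

```python
import tensorflow as tf

noise_dim, data_dim, hidden = 16, 64, 128   # assumed sizes for the sketch

def generator(h):
    # Maps latent noise h to a fake sample in the input space of x
    with tf.variable_scope('gen', reuse=tf.AUTO_REUSE):
        out = tf.layers.dense(h, hidden, activation=tf.nn.relu)
        return tf.layers.dense(out, data_dim)

def discriminator(x):
    # p(y|x): a logit scoring whether x looks real
    with tf.variable_scope('disc', reuse=tf.AUTO_REUSE):
        out = tf.layers.dense(x, hidden, activation=tf.nn.relu)
        return tf.layers.dense(out, 1)

real_x = tf.placeholder(tf.float32, [None, data_dim])
h = tf.random_normal([tf.shape(real_x)[0], noise_dim])
fake_x = generator(h)

real_logit = discriminator(real_x)
fake_logit = discriminator(fake_x)

# Discriminator learns to answer "real" for data and "fake" for generated samples
d_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(real_logit), logits=real_logit) +
    tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(fake_logit), logits=fake_logit))
# Generator learns to make the discriminator answer "real" for its samples
g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(fake_logit), logits=fake_logit))

d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'disc')
g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'gen')
d_train = tf.train.AdamOptimizer(1e-4).minimize(d_loss, var_list=d_vars)
g_train = tf.train.AdamOptimizer(1e-4).minimize(g_loss, var_list=g_vars)
# During training, d_train and g_train are run alternately on each minibatch.
```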

Consuming Diagnostic Analyzers in .NET projects

Packt
20 Feb 2018
6 min read
We know how to write diagnostic analyzers to analyze and report issues about .NET source code and contribute them to the .NET developer community. In this article by Manish Vasani, the author of the book Roslyn Cookbook, we will show you how to search, install, view, and configure the analyzers that have already been published by various analyzer authors on NuGet and the VS Extension gallery. We will cover the following recipes:

Searching and installing analyzers through the NuGet package manager.
Searching and installing VSIX analyzers through the VS extension gallery.
Viewing and configuring analyzers in solution explorer in Visual Studio.
Using a ruleset file and the ruleset editor to configure analyzers.

Diagnostic analyzers are extensions to the Roslyn C# compiler and the Visual Studio IDE that analyze user code and report diagnostics. Users will see these diagnostics in the error list after building the project from Visual Studio, and even when building the project on the command line. They will also see the diagnostics live while editing the source code in the Visual Studio IDE. Analyzers can report diagnostics to enforce specific code styles, improve code quality and maintenance, recommend design guidelines, or even report very domain-specific issues which cannot be covered by the core compiler. Analyzers can be installed in a .NET project either as a NuGet package or as a VSIX; the recipes below will give you a better understanding of these packaging schemes and the differences in the analyzer experience when installed as a NuGet package versus a VSIX. Analyzers are supported on various flavors of .NET Standard, .NET Core, and .NET Framework projects, for example, class library, console app, and so on.

Searching and installing analyzers through the NuGet package manager

In this recipe we will show you how to search and install analyzer NuGet packages in the NuGet package manager in Visual Studio and see how the analyzer diagnostics from an installed NuGet package light up in the project build and as live diagnostics during code editing in Visual Studio.

Getting ready

You will need to have Visual Studio 2017 installed on your machine to follow this recipe. You can install a free community version of Visual Studio 2017 from https://www.visualstudio.com/thank-you-downloading-visual-studio/?sku=Community&rel=15.

How to do it…

1. Create a C# class library project, say ClassLibrary, in Visual Studio 2017.
2. In solution explorer, right-click on the solution or project node and execute the Manage NuGet Packages command. This brings up the NuGet Package Manager, which can be used to search and install NuGet packages to the solution or project.
3. In the search bar, type the following text to find NuGet packages tagged as analyzers: Tags:"analyzers". Note that some of the well-known packages are tagged as analyzer, so you may also want to search: Tags:"analyzer".
4. Check or uncheck the Include prerelease checkbox to the right of the search bar to show or hide prerelease analyzer packages respectively. The packages are listed based on the number of downloads, with the highest downloaded package at the top.
5. Select a package to install, say System.Runtime.Analyzers, pick a specific version, say 1.1.0, and click Install.
6. Click the I Accept button on the License Acceptance dialog to install the NuGet package.
7. Verify that the installed analyzer(s) show up under the Analyzers node in the solution explorer.
8. Verify that the project file has a new ItemGroup with the following analyzer references from the installed analyzer package:

<ItemGroup>
  <Analyzer Include="..\packages\System.Runtime.Analyzers.1.1.0\analyzers\dotnet\cs\System.Runtime.Analyzers.dll" />
  <Analyzer Include="..\packages\System.Runtime.Analyzers.1.1.0\analyzers\dotnet\cs\System.Runtime.CSharp.Analyzers.dll" />
</ItemGroup>

9. Add the following code to your C# project:

namespace ClassLibrary
{
    public class MyAttribute : System.Attribute
    {
    }
}

10. Verify that the analyzer diagnostic from the installed analyzer is shown in the error list.
11. Open a Visual Studio 2017 Developer Command Prompt and build the project to verify that the analyzer is executed on the command line build and the analyzer diagnostic is reported.
12. Create a new C# project in VS2017, add the same code to it as in step 9, and verify that no analyzer diagnostic shows up in the error list or on the command line, confirming that the analyzer package was only installed to the selected project in steps 1-6.

Note that CA1018 (Custom attribute should have AttributeUsage defined) has been moved to a separate analyzer assembly in later versions of the FxCop/System.Runtime.Analyzers package. It is recommended that you install the Microsoft.CodeAnalysis.FxCopAnalyzers NuGet package to get the latest group of Microsoft recommended analyzers.

Searching and installing VSIX analyzers through the VS extension gallery

In this recipe we will show you how to search and install analyzer VSIX packages in the Visual Studio Extension manager and see how the analyzer diagnostics from an installed VSIX light up as live diagnostics during code editing in Visual Studio.

Getting ready

You will need to have Visual Studio 2017 installed on your machine to follow this recipe. You can install a free community version of Visual Studio 2017 from https://www.visualstudio.com/thank-you-downloading-visual-studio/?sku=Community&rel=15.

How to do it…

1. Create a C# class library project, say ClassLibrary, in Visual Studio 2017.
2. From the top-level menu, execute Tools | Extensions and Updates.
3. Navigate to Online | Visual Studio Marketplace on the left tab of the dialog to view the available VSIXes in the Visual Studio extension gallery/marketplace.
4. Search for analyzers in the search text box in the upper-right corner of the dialog and download an analyzer VSIX, say Refactoring Essentials for Visual Studio.
5. Once the download completes, you will get a message at the bottom of the dialog that the install will be scheduled to execute once Visual Studio and related windows are closed. Close the dialog and then close the Visual Studio instance to start the install.
6. In the VSIX Installer dialog, click Modify to start the installation.
7. The subsequent message prompts you to kill all the active Visual Studio and satellite processes. Save all your relevant work in all the open Visual Studio instances, and click End Tasks to kill these processes and install the VSIX.
8. After installation, restart VS, click Tools | Extensions and Updates, and verify that the Refactoring Essentials VSIX is installed.
9. Create a new C# project with the following source code and verify the analyzer diagnostic RECS0085 (Redundant array creation expression) in the error list:

namespace ClassLibrary
{
    public class Class1
    {
        void Method()
        {
            int[] values = new int[] { 1, 2, 3 };
        }
    }
}

10. Build the project from Visual Studio 2017 or the command line and confirm that no analyzer diagnostic shows up in the Output Window or on the command line respectively, confirming that the VSIX analyzer did not execute as part of the build.
Resources for Article:
Further resources on this subject:
C++, SFML, Visual Studio, and Starting the first game [article]
Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [article]
Creating efficient reports with Visual Studio [article]

Introduction to Device Management

Packt
20 Feb 2018
10 min read
In this article by Yatish Patil, the author of the book Microsoft Azure IoT Development Cookbook, we will look at device management using different techniques with Azure IoT Hub. We will see the following recipes:

Device registry operations
Device twins
Device direct methods
Device jobs

Azure IoT Hub has capabilities that can be used by a developer to build robust device management. There could be different use cases or scenarios across multiple industries, but these device management capabilities, their patterns, and the SDK code remain the same, saving significant time in developing, managing, and maintaining millions of devices. Device management will be a central part of any IoT solution. The IoT solution is going to help users manage devices remotely and take actions from a cloud-based application, such as disabling a device, updating data, running a command, or updating firmware. In this article, we are going to perform all these device management tasks, starting with creating the device.

Device registry operations

This sample application is focused on device registry operations and how they work. We will create a console application as our first IoT solution and look at the various device management techniques.

Getting ready

Let's create a console application to start with IoT:

1. Create a new project in Visual Studio: create a Console Application.
2. Add the IoT Hub connectivity extension in Visual Studio: add the extension for IoT Hub connectivity.
3. Now right-click on the solution and go to Add a Connected Service.
4. Select Azure IoT Hub and click Add.
5. Now select the Azure subscription and the IoT Hub created: select the IoT Hub for our application.
6. Next it will ask you to add a device, or you can skip this step and click Complete the configuration.

How to do it...

1. Create device identity: initialize the Azure IoT Hub registry connection:

registryManager = RegistryManager.CreateFromConnectionString(connectionString);
Device device = new Device();
try
{
    device = await registryManager.AddDeviceAsync(new Device(deviceId));
    success = true;
}
catch (DeviceAlreadyExistsException)
{
    success = false;
}

2. Retrieve device identity by ID:

Device device = new Device();
try
{
    device = await registryManager.GetDeviceAsync(deviceId);
}
catch (DeviceAlreadyExistsException)
{
    return device;
}

3. Delete device identity:

Device device = new Device();
try
{
    device = GetDevice(deviceId);
    await registryManager.RemoveDeviceAsync(device);
    success = true;
}
catch (Exception ex)
{
    success = false;
}

4. List up to 1000 identities:

try
{
    var devicelist = registryManager.GetDevicesAsync(1000);
    return devicelist.Result;
}
catch (Exception ex)
{
}

5. Export all identities to Azure blob storage:

var blobClient = storageAccount.CreateCloudBlobClient();
string Containername = "iothubdevices";
//Get a reference to a container
var container = blobClient.GetContainerReference(Containername);
container.CreateIfNotExists();
//Generate a SAS token
var storageUri = GetContainerSasUri(container);
await registryManager.ExportDevicesAsync(storageUri, "devices1.txt", false);

6. Import all identities from Azure blob storage:

await registryManager.ImportDevicesAsync(storageUri, OutputStorageUri);

How it works...

Let's now understand the steps we performed. We started by creating a console application and configured it for the Azure IoT Hub solution. The idea behind this is to see the simple operations for device management.
In this article, we started with simple operations to provision a device by adding it to IoT Hub. We need to create a connection to the IoT Hub and then create the registry manager object, which is part of the Devices namespace. Once we are connected, we can perform operations like add device, delete device, and get device; these methods are asynchronous. IoT Hub also provides a way to connect with Azure blob storage for bulk operations, like exporting or importing all devices. This works on JSON format only; the entire set of IoT devices gets exported in this way.

There's more...

Device identities are represented as JSON documents. They consist of properties like:

deviceId: It represents the unique identification of the IoT device.
ETag: A string representing a weak ETag for the device identity.
symkey: A composite object containing a primary and a secondary key, stored in base64 format.
status: If enabled, the device can connect. If disabled, this device cannot access any device-facing endpoint.
statusReason: A string that can be used to store the reason for the status changes.
connectionState: It can be connected or disconnected.

Device twins

First we need to understand what a device twin is and where we can use it in an IoT solution. The device twin is a JSON-formatted document that describes the metadata and properties of any device created within IoT Hub. It describes the individual device-specific information. The device twin is made up of tags, desired properties, and reported properties. The operations that can be done by an IoT solution are basically updating this data and querying for any IoT device. Tags hold the device metadata that can be accessed from the IoT solution only. Desired properties are set from the IoT solution and can be accessed on the device, whereas the reported properties are set on the device and retrieved at the IoT solution end.

How to do it...

1. Store device metadata:

var patch = new
{
    properties = new
    {
        desired = new
        {
            deviceConfig = new
            {
                configId = Guid.NewGuid().ToString(),
                DeviceOwner = "yatish",
                latitude = "17.5122560",
                longitude = "70.7760470"
            }
        },
        reported = new
        {
            deviceConfig = new
            {
                configId = Guid.NewGuid().ToString(),
                DeviceOwner = "yatish",
                latitude = "17.5122560",
                longitude = "70.7760470"
            }
        }
    },
    tags = new
    {
        location = new
        {
            region = "US",
            plant = "Redmond43"
        }
    }
};
await registryManager.UpdateTwinAsync(deviceTwin.DeviceId, JsonConvert.SerializeObject(patch), deviceTwin.ETag);

2. Query device metadata:

var query = registryManager.CreateQuery("SELECT * FROM devices WHERE deviceId = '" + deviceTwin.DeviceId + "'");

3. Report the current state of the device:

var results = await query.GetNextAsTwinAsync();

How it works...

In this sample, we retrieved the current information of the device twin and updated the desired properties, which will be accessible on the device side. In the code, we set the coordinates of the device with latitude and longitude values, as well as the device owner name and so on. These same values will be accessible on the device side. In a similar manner, we can set some properties on the device side which will become part of the reported properties. While using the device twin we must always consider:

Tags can be set, read, and accessed only by the backend.
Reported properties are set by the device and can be read by the backend.
Desired properties are set by the backend and can be read by the device.
Use version and last-updated properties to detect updates when necessary.
Each device twin's size is limited to 8 KB per device by default by IoT Hub.

There's more...

Device twin metadata always maintains the last-updated timestamp for any modifications. This is a UTC timestamp maintained in the metadata. The device twin format is JSON, in which the tags, desired, and reported properties are stored. Here is a sample JSON document with different nodes showing how it is stored:

"tags": {
    "$etag": "1234321",
    "location": {
        "country": "India",
        "city": "Mumbai",
        "zipCode": "400001"
    }
},
"properties": {
    "desired": {
        "latitude": 18.75,
        "longitude": -75.75,
        "status": 1,
        "$version": 4
    },
    "reported": {
        "latitude": 18.75,
        "longitude": -75.75,
        "status": 1,
        "$version": 4
    }
}

Device direct methods

Azure IoT Hub provides fully managed bi-directional communication between the IoT solution on the backend and the IoT devices in the field. When an immediate communication result is needed, a direct method best suits the scenario. Let's take an example from a home automation system: one needs to control the AC temperature or turn the faucet or shower on/off.

1. Invoke the method from the application:

public async Task<CloudToDeviceMethodResult> InvokeDirectMethodOnDevice(string deviceId, ServiceClient serviceClient)
{
    var methodInvocation = new CloudToDeviceMethod("WriteToMessage") { ResponseTimeout = TimeSpan.FromSeconds(300) };
    methodInvocation.SetPayloadJson("'1234567890'");
    var response = await serviceClient.InvokeDeviceMethodAsync(deviceId, methodInvocation);
    return response;
}

2. Method execution on the device:

deviceClient = DeviceClient.CreateFromConnectionString("", TransportType.Mqtt);
deviceClient.SetMethodHandlerAsync("WriteToMessage", new DeviceSimulator().WriteToMessage, null).Wait();
deviceClient.SetMethodHandlerAsync("GetDeviceName", new DeviceSimulator().GetDeviceName, new DeviceData("DeviceClientMethodMqttSample")).Wait();

How it works...

A direct method works on a request-response interaction between the IoT device and the backend solution. It works on a timeout basis: if there is no reply within the timeout, it fails. These synchronous requests have a 30-second timeout by default; one can modify the timeout and increase it up to 3600 seconds depending on the IoT scenario. The device needs to connect using the MQTT protocol, whereas the backend solution can use HTTP. Direct methods can work with JSON payloads of up to 8 KB.

Device jobs

In a typical scenario, device administrators or operators are required to manage devices in bulk. We have looked at the device twin, which maintains the properties and tags. Conceptually, a job is nothing but a wrapper on the possible actions which can be done in bulk. Suppose we have a scenario in which we need to update the properties for multiple devices; in that case one can schedule a job and track its progress. For example, I would like to set the frequency of sending data to every 1 hour instead of every 30 minutes for 1000 IoT devices. Another example could be rebooting multiple devices at the same time. Device administrators can also perform device registration in bulk using the export and import methods.

How to do it...

1. Job to update twin properties:

var twin = new Twin();
twin.Properties.Desired["HighTemperature"] = "44";
twin.Properties.Desired["City"] = "Mumbai";
twin.ETag = "*";
return await jobClient.ScheduleTwinUpdateAsync(jobId, "deviceId='" + deviceId + "'", twin, DateTime.Now, 10);

2. Job status: the status of a scheduled job can then be retrieved from the job client using its unique job ID.
How it works...

In this example, we looked at a job updating the device twin information, and we can follow up on the job's status to find out whether it completed or failed. In this case, instead of making single API calls, a job can be created to execute against multiple IoT devices. The job client object provides the jobs available with the IoT Hub using the connection to it. Once we locate the job using its unique ID, we can retrieve its status. The code snippet in the preceding How to do it... recipe uses the temperature properties and updates the data. The job is scheduled to start execution immediately, with a 10-second execution timeout set.

There's more...

For a job, the life cycle begins with initiation from the IoT solution. If any job is in execution, we can query it and see the status of its execution. Other common scenarios where this could be useful are firmware updates, reboots, configuration updates, and so on, apart from device property reads or writes. Each device job has properties that help us work with them. The useful properties are the start and end date time, the status, and lastly the device job statistics, which give the job execution statistics.

Summary

We have learned about device management with Azure IoT Hub using different techniques in detail. We have explained how an IoT solution helps users manage devices remotely and take actions from a cloud-based application, such as disabling a device, updating data, running a command, or updating firmware. We also performed different tasks for device management.

Resources for Article:
Further resources on this subject:
Device Management in Zenoss Core Network and System Monitoring: Part 1 [article]
Device Management in Zenoss Core Network and System Monitoring: Part 2 [article]
Managing Network Devices [article]

What makes Hadoop so revolutionary?

Packt
20 Feb 2018
17 min read
In this article by Sourav Gulati and Sumit Kumar, authors of the book Apache Spark 2.x for Java Developers, we explain that in the classical sense, Hadoop comprises two components: a storage layer called HDFS and a processing layer called MapReduce. Prior to Hadoop 2.x, the resource management task was done by the MapReduce framework of Hadoop itself; however, that changed with the introduction of YARN. In Hadoop 2.0, YARN was introduced as the third component of Hadoop to manage the resources of a Hadoop cluster and make it more MapReduce agnostic.

HDFS

The Hadoop Distributed File System, as the name suggests, is a distributed file system written in Java along the lines of the Google File System. In practice, HDFS closely resembles any other UNIX file system, with support for common file operations like ls, cp, rm, du, cat, and so on. What makes HDFS stand out, despite its simplicity, is its mechanism for handling node failure in a Hadoop cluster without effectively changing the seek time for accessing stored files. An HDFS cluster consists of two major components: Data Nodes and a Name Node. HDFS has a unique way of storing data on HDFS clusters (cheap, networked commodity computers). It splits a regular file into smaller chunks called blocks and then makes a number of copies of each chunk depending on the replication factor for that file. After that, it copies the chunks to different Data Nodes of the cluster.

Name Node

The Name Node is responsible for managing the metadata of the HDFS cluster, such as the list of files and folders that exist in the cluster, the number of splits each file is divided into, and their replication and storage at different Data Nodes. It also maintains and manages the namespace and file permissions of all the files available in the HDFS cluster. Apart from bookkeeping, the Name Node also has a supervisory role: it keeps a watch on the replication factor of all the files and, if some block goes missing, issues commands to replicate the missing block of data. It also generates reports to ascertain cluster health. It is important to note that all the communication for the supervisory tasks happens from Data Node to Name Node; that is, a Data Node sends reports (also known as block reports) to the Name Node, and it is then that the Name Node responds by issuing different commands or instructions as needed.

HDFS I/O

An HDFS read operation from a client involves:

1. The client requests the NameNode to determine where the actual data blocks are stored for a given file.
2. The Name Node obliges by providing the block IDs and the locations of the hosts (Data Nodes) where the data can be found.
3. The client contacts the Data Nodes with the respective block IDs to fetch the data, while preserving the order of the block files.

An HDFS write operation from a client involves:

1. The client contacts the Name Node to update the namespace with the file name and verify the necessary permissions.
2. If the file exists, the Name Node throws an error; otherwise, it returns the client an FSDataOutputStream, which points to the data queue.
3. The data queue negotiates with the NameNode to allocate new blocks on suitable DataNodes.
4. The data is then copied to that DataNode, and, as per the replication strategy, the data is further copied from that DataNode to the rest of the DataNodes.

It's important to note that the data is never moved through the NameNode, as that would have caused a performance bottleneck.
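As a quick back-of-the-envelope illustration of this block-and-replica scheme (not from the article), the short Python sketch below computes how many blocks and replicas a file would occupy; the 128 MB block size and replication factor of 3 are common HDFS defaults, assumed here purely for the example.

```python
# Illustrative only: block size and replication factor are assumed defaults.
BLOCK_SIZE = 128 * 1024 * 1024      # 128 MB per block
REPLICATION_FACTOR = 3

def hdfs_storage_plan(file_size_bytes):
    # Number of fixed-size chunks (the last one may be smaller)
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    total_replicas = num_blocks * REPLICATION_FACTOR
    raw_bytes_stored = file_size_bytes * REPLICATION_FACTOR
    return num_blocks, total_replicas, raw_bytes_stored

blocks, replicas, raw = hdfs_storage_plan(1 * 1024 ** 3)   # a 1 GB file
print(blocks, "blocks,", replicas, "block replicas,", raw, "bytes of raw storage")
```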
YARN

The simplest way to understand YARN (Yet Another Resource Negotiator) is to think of it as an operating system for a cluster: provisioning resources, scheduling jobs, and maintaining nodes. With Hadoop 2.x, the MapReduce model of processing the data and managing the cluster (job tracker/task tracker) was divided. While data processing was still left to MapReduce, the cluster's resource allocation (or rather, scheduling) task was assigned to a new component called YARN. Another objective that YARN met was that it made MapReduce one of several techniques to process the data, rather than the only technology to process data on HDFS, as was the case in Hadoop 1.x systems. This paradigm shift opened the floodgates for the development of interesting applications around Hadoop, and a new ecosystem evolved that was not limited to the classical MapReduce processing system. It didn't take much time after that for Apache Spark to break the hegemony of classical MapReduce and become arguably the most popular processing framework for parallel computing, as far as active development and adoption are concerned. In order to provide multi-tenancy, fault tolerance, and resource isolation, YARN uses the following components to manage the cluster seamlessly.

ResourceManager: It negotiates resources for different compute programs on a Hadoop cluster while guaranteeing the following: resource isolation, data locality, fault tolerance, task prioritization, and effective cluster capacity utilization. A configurable scheduler gives the Resource Manager the flexibility to schedule and prioritize different applications as needed.

Tasks served by the RM while serving clients: Using a client or APIs, a user can submit or terminate an application. The user can also gather statistics on submitted applications and on cluster and queue information. The RM also prioritizes admin tasks over any other task, to perform clean-up or maintenance activities on a cluster, such as refreshing the node list or the queue configuration.

Tasks served by the RM while serving cluster nodes: Provisioning and de-provisioning of new nodes form an important task of the RM. Each node sends a heartbeat at a configured interval, the default being 10 minutes. Any node failing to do so is treated as a dead node. As a clean-up activity, all the supposedly running processes, including containers, are marked dead too.

Tasks served by the RM while serving the Application Master: The RM registers new AMs and terminates the successfully executed ones. Just as with cluster nodes, if the heartbeat of an AM is not received within a preconfigured duration, the default value being 10 minutes, then the AM is marked dead and all the associated containers are marked dead too. But since YARN is reliable as far as application execution is concerned, a new AM is rescheduled to try another execution on a new container until it reaches the configurable retry count, which defaults to 4.

Scheduling and other miscellaneous tasks served by the RM: The RM maintains a list of running, submitted, and executed applications along with their statistics, such as execution time, status, and so on. The privileges of the user as well as of applications are maintained and compared while serving various user requests over the application life cycle. The RM scheduler oversees resource allocation for applications, such as memory allocation. Two common scheduling algorithms used in YARN are the fair scheduling and capacity scheduling algorithms.

NodeManager: An NM exists on each node of the cluster, in a fashion somewhat similar to slave nodes in a master-slave architecture.
When an NM starts, it sends information to the RM about its availability to share its resources for upcoming jobs. From then on, the NM sends periodic signals, also called heartbeats, to the RM, informing it of its status as being alive in the cluster. Primarily, the NM is responsible for launching containers that have been requested by an AM with certain resource requirements, such as memory, disk, and so on. Once the containers are up and running, the NM keeps a watch not on the status of the container's task but on the resource utilization of the container, and kills it if the container starts utilizing more resources than it has been provisioned for. Apart from managing the life cycle of the containers, the NM also keeps the RM informed about the node's health.

ApplicationMaster: An AM gets launched per submitted application and manages the life cycle of that application. The first and foremost task the AM does is to negotiate resources from the RM to launch task-specific containers on different nodes. Once the containers are launched, the AM keeps track of the status of all the containers' tasks. If any node goes down or a container gets killed, because of using excess resources or otherwise, the AM renegotiates resources from the RM and launches those pending tasks again. The AM also keeps reporting the status of the submitted application directly to the user, and other statistics to the RM. The ApplicationMaster implementation is framework specific, and it is for this reason that application/framework-specific code is transferred to the AM; it is the AM that distributes it further. This important feature also makes YARN technology agnostic, as any framework can implement its own ApplicationMaster and then utilize the resources of the YARN cluster seamlessly.

Container: A container, in an abstract sense, is a set of minimal resources, such as CPU, RAM, disk I/O, disk space, and so on, that are required to run a task independently on a node. The first container after submitting the job is launched by the RM to host the ApplicationMaster. It is the AM which then negotiates resources from the RM in the form of containers, which then get hosted on different nodes across the Hadoop cluster.

Process flow of application submission in YARN:

Step 1: Using a client or APIs, the user submits the application, let's say a Spark job JAR. The Resource Manager, whose primary task is to gather and report all the applications running on the entire Hadoop cluster and the available resources on the respective Hadoop nodes, accepts the newly submitted task, depending on the privileges of the user submitting the job.

Step 2: After this, the RM delegates the task to the scheduler. The scheduler then searches for a container which can host the application-specific Application Master. While the scheduler does take into consideration parameters like availability of resources, task priority, data locality, and so on before scheduling or launching an Application Master, it has no role in monitoring or restarting a failed job. It is the responsibility of the RM to keep track of the AM and restart it in a new container if it fails.

Step 3: Once the Application Master gets launched, it becomes the prerogative of the AM to oversee the negotiation of resources with the RM for launching task-specific containers. Negotiations with the RM are typically over:

The priority of the tasks at hand.
The number of containers to be launched to complete the tasks.
The resources needed to execute the tasks, that is, RAM and CPU (since Hadoop 3.x).
The available nodes where job containers can be launched with the required resources.

Depending on the priority and availability of resources, the RM grants containers, represented by a container ID and the hostname of the node on which they can be launched.

Step 4: The AM then requests the NM of the respective hosts to launch the containers with the specific IDs and resource configuration. The NM then launches the containers but keeps a watch on the resource usage of each task. If, for example, a container starts utilizing more resources than it has been provisioned for, that container is killed by the NM. This greatly improves the job isolation and the fair sharing of resources that YARN guarantees, as otherwise it would have impacted the execution of other containers. However, it is important to note that the job status and the application status as a whole are managed by the AM. It falls in the domain of the AM to continuously monitor any delayed or dead containers, simultaneously negotiating with the RM to launch new containers to reassign the tasks of dead containers.

Step 5: The containers executing on different nodes send application-specific statistics to the AM at specific intervals.

Step 6: The AM also reports the status of the application directly to the client that submitted the specific application, in our case a Spark job.

Step 7: The NM monitors the resources being utilized by all the containers on the respective nodes and keeps sending periodic updates to the RM.

Step 8: The AM sends periodic statistics, such as application status, task failures, and log information, to the RM.

Overview of MapReduce

Before delving deep into the MapReduce implementation in Hadoop, let's first understand MapReduce as a concept in parallel computing and why it is a preferred way of computing. MapReduce comprises two mutually exclusive but dependent phases, each capable of running on two different machines or nodes:

Map: In the Map phase, the transformation of the data takes place. It splits the data into key-value pairs by splitting it on a keyword. Suppose we have a text file and we want to do an analysis such as counting the total number of words, or the frequency with which each word has occurred in the text file. This is the classical word count problem of MapReduce. To address this problem, we first have to identify the splitting keyword so that the data can be split and converted into key-value pairs. Let's begin with John Lennon's song Imagine.

Sample text:

Imagine there's no heaven
It's easy if you try
No hell below us
Above us only sky
Imagine all the people living for today

After running the Map phase on the sampled text and splitting it over <space>, it will get converted into key-value pairs as follows:

[<imagine, 1> <there's, 1> <no, 1> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <no, 1> <hell, 1> <below, 1> <us, 1> <above, 1> <us, 1> <only, 1> <sky, 1> <imagine, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

The key here represents the word and the value represents the count; also, it should be noted that we have converted all the keys to lowercase to reduce any further complexity arising out of matching case-sensitive keys.

Reduce: The Reduce phase deals with the aggregation of the Map phase's result, and hence all the key-value pairs are aggregated over the key.
So the Map output of the text would get aggregated as follows:

[<imagine, 2> <there's, 1> <no, 2> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <hell, 1> <below, 1> <us, 2> <above, 1> <only, 1> <sky, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

As we can see, the Map and Reduce phases can each be run exclusively and hence can use independent nodes in the cluster to process the data. This approach of separating tasks into smaller units called Map and Reduce has revolutionized general-purpose distributed/parallel computing, which we now know as MapReduce.

Apache Hadoop's MapReduce has been implemented in pretty much the same way as discussed, except for extra features governing how the data from the Map phase of each node gets transferred to its designated Reduce phase node. Hadoop's implementation of MapReduce enriches the Map and Reduce phases by adding a few more concrete steps in between to make it fault tolerant and truly distributed. We can describe MR jobs on YARN in five stages.

Job Submission Stage: When a client submits an MR job, the following things happen:

The RM is requested for an application ID.
The input data location is checked and, if present, the file split size is computed.
The job's output location must not already exist.

If all three conditions are met, then the MR job JAR, along with its configuration and the details of the input splits, are copied to HDFS in a directory named after the application ID provided by the RM, and then the job is submitted to the RM to launch a job-specific Application Master, MRAppMaster.

Map Stage: Once the RM receives the client's request for launching MRAppMaster, a call is made to the YARN scheduler to assign a container. As per resource availability, the container is granted and the MRAppMaster is launched at the designated node with the provisioned resources. After this, MRAppMaster fetches the input split information from the HDFS path that was submitted by the client and computes the number of mapper tasks to be launched based on the splits. Depending on the number of mappers, it also calculates the required number of reducers as per the configuration. If MRAppMaster finds the number of mappers and reducers and the size of the input files to be small enough to run in the same JVM, it goes ahead and does so; such a task is called an Uber task. However, in other scenarios MRAppMaster negotiates container resources from the RM for running these tasks, albeit with mapper tasks having higher order and priority. This is because mapper tasks must finish before the sorting phase can start. Data locality is another concern for containers hosting mappers, as data-local nodes are preferred over rack-local ones, with the least preference given to remotely hosted data. But when it comes to the Reduce phase, no such data-locality preference exists for containers. Containers hosting the mapper function first copy the MapReduce JAR and configuration files locally and then launch a class called YarnChild in the JVM. The mapper then starts reading the input files, processes them by making key-value pairs, and writes them to a circular buffer.

Shuffle and Sort Phase: Considering that the circular buffer has a size constraint, after a certain percentage, the default being 80, a thread gets spawned which spills the data from the buffer. But before copying the spilled data to disk, it is first partitioned with respect to its reducer; the background thread also sorts the partitioned data on the key and, if a combiner is specified, combines the data too. This process optimizes the data once it is copied to its respective partitioned folder.
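For readers who want to see the word-count idea end to end, here is a tiny single-process Python sketch of the Map and Reduce phases described above; it is illustrative only and is plain Python, not Hadoop or Spark code.

```python
from collections import defaultdict

text = ("Imagine there's no heaven It's easy if you try No hell below us "
        "Above us only sky Imagine all the people living for today")

def map_phase(line):
    # Split on whitespace and emit <word, 1> pairs, lowercasing the keys
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Aggregate the values of all pairs sharing the same key
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

mapped = map_phase(text)
reduced = reduce_phase(mapped)
print(reduced['imagine'], reduced['us'], reduced['no'])   # 2 2 2
```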
This process continues until all the data from the circular buffer gets written to disk. A background thread then checks whether the number of spilled files in each partition is within the range of a configurable parameter; otherwise, the files are merged and the combiner is run over them until the count falls within the limit of the parameter. The map task keeps updating its status to the ApplicationMaster throughout its entire life cycle; it is only when 5 percent of the map tasks have completed that the reduce tasks start. An auxiliary service in the NodeManager serving the reduce task starts a Netty web server that makes a request to MRAppMaster for the mapper hosts holding the specific mapper-partitioned files. All the partitioned files that pertain to a reducer are copied to their respective nodes in a similar fashion. Since multiple files get copied as the data from the various nodes destined for that reducer gets collected, a background thread merges the sorted map files, sorts them again, and, if a combiner is configured, combines the result too.

Reduce Stage: It is important to note that at this stage every input file of each reducer should have been sorted by key; this is the presumption with which the reducer starts processing these records and converts the key-value pairs into an aggregated list. Once the reducer processes the data, it writes the output to the output folder that was specified during job submission.

Clean up stage: Each reducer sends a periodic update to MRAppMaster about task completion. Once the reduce task is over, the application master starts the clean-up activity. The submitted job's status is changed from running to successful, all the temporary and intermediate files and folders are deleted, and the application statistics are archived to the job history server.

Summary

In this article we saw what HDFS and YARN are, along with MapReduce, and we learned about the different functions of MapReduce and HDFS I/O.

Resources for Article:
Further resources on this subject:
Getting Started with Apache Spark DataFrames [article]
Five common questions for .NET/Java developers learning JavaScript and Node.js [article]
Getting Started with Apache Hadoop and Apache Spark [article]

How to develop a stock price predictive model using Reinforcement Learning and TensorFlow

Aaron Lazar
20 Feb 2018
12 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. This book helps you build, tune, and deploy predictive models with TensorFlow.[/box] In this article we’ll show you how to create a predictive model to predict stock prices, using TensorFlow and Reinforcement Learning. An emerging area for applying Reinforcement Learning is the stock market trading, where a trader acts like a reinforcement agent since buying and selling (that is, action) particular stock changes the state of the trader by generating profit or loss, that is, reward. The following figure shows some of the most active stocks on July 15, 2017 (for an example): Now, we want to develop an intelligent agent that will predict stock prices such that a trader will buy at a low price and sell at a high price. However, this type of prediction is not so easy and is dependent on several parameters such as the current number of stocks, recent historical prices, and most importantly, on the available budget to be invested for buying and selling. The states in this situation are a vector containing information about the current budget, current number of stocks, and a recent history of stock prices (the last 200 stock prices). So each state is a 202-dimensional vector. For simplicity, there are only three actions to be performed by a stock market agent: buy, sell, and hold. So, we have the state and action, what else do you need? Policy, right? Yes, we should have a good policy, so based on that an action will be performed in a state. A simple policy can consist of the following rules: Buying (that is, action) a stock at the current stock price (that is, state) decreases the budget while incrementing the current stock count Selling a stock trades it in for money at the current share price Holding does neither, and performing this action simply waits for a particular time period and yields no reward To find the stock prices, we can use the yahoo_finance library in Python. A general warning you might experience is "HTTPError: HTTP Error 400: Bad Request". But keep trying. Now, let's try to get familiar with this module: >>> from yahoo_finance import Share >>> msoft = Share('MSFT') >>> print(msoft.get_open()) 72.24= >>> print(msoft.get_price()) 72.78 >>> print(msoft.get_trade_datetime()) 2017-07-14 20:00:00 UTC+0000 >>> So as of July 14, 2017, the stock price of Microsoft Inc. went higher, from 72.24 to 72.78, which means about a 7.5% increase. However, this small and just one-day data doesn't give us any significant information. But, at least we got to know the present state for this particular stock or instrument. To install yahoo_finance, issue the following command: $ sudo pip3 install yahoo_finance Now it would be worth looking at the historical data. The following function helps us get the historical data for Microsoft Inc: def get_prices(share_symbol, start_date, end_date, cache_filename): try: stock_prices = np.load(cache_filename) except IOError: share = Share(share_symbol) stock_hist = share.get_historical(start_date, end_date) stock_prices = [stock_price['Open'] for stock_price in stock_ hist] np.save(cache_filename, stock_prices) return stock_prices The get_prices() method takes several parameters such as the share symbol of an instrument in the stock market, the opening date, and the end date. You will also like to specify and cache the historical data to avoid repeated downloading. 
Once you have downloaded the data, it's time to plot it to get some insight. The following function helps us plot the prices:

def plot_prices(prices):
    plt.title('Opening stock prices')
    plt.xlabel('day')
    plt.ylabel('price ($)')
    plt.plot(prices)
    plt.savefig('prices.png')

Now we can call these two functions with real arguments as follows:

if __name__ == '__main__':
    prices = get_prices('MSFT', '2000-07-01', '2017-07-01', 'historical_stock_prices.npy')
    plot_prices(prices)

Here I have chosen a wide range of historical data, 17 years, to get a better insight. Now, let's take a look at the output of this data:

The goal is to learn a policy that gains the maximum net worth from trading in the stock market. So what will a trading agent be achieving in the end? Figure 8 gives you some clue. Well, figure 8 shows that if the agent buys a certain instrument at a price of $20 and sells at a peak price, say at $180, it will be able to make a $160 reward, that is, profit. So, implementing such an intelligent agent using RL algorithms is a cool idea, right?

From the previous example, we have seen that for a successful RL agent, we need two operations well defined, which are as follows:

How to select an action
How to improve the utility Q-function

To be more specific, given a state, the decision policy will calculate the next action to take. On the other hand, improving the Q-function comes from a new experience of taking an action. Also, most reinforcement learning algorithms boil down to just three main steps: infer, perform, and learn. During the first step, the algorithm selects the best action (a) given a state (s) using the knowledge it has so far. Next, it performs the action to find out the reward (r) as well as the next state (s'). Then, it improves its understanding of the world using the newly acquired knowledge (s, r, a, s'), as shown in the following figure:

Now, let's start implementing the decision policy, based on which an action will be taken for buying, selling, or holding a stock item. Again, we will do it in an incremental way. At first, we will create a random decision policy and evaluate the agent's performance. But before that, let's create an abstract class so that we can implement it accordingly:

class DecisionPolicy:
    def select_action(self, current_state, step):
        pass

    def update_q(self, state, action, reward, next_state):
        pass

The next task is to inherit from this superclass to implement a random decision policy:

class RandomDecisionPolicy(DecisionPolicy):
    def __init__(self, actions):
        self.actions = actions

    def select_action(self, current_state, step):
        action = self.actions[random.randint(0, len(self.actions) - 1)]
        return action

The previous class does nothing except define a function named select_action(), which will randomly pick an action without even looking at the state. Now, if you would like to use this policy, you can run it on the real-world stock price data. This function takes care of exploration and exploitation at each interval of time, as shown in the following figure that forms states S1, S2, and S3. The policy suggests an action to be taken, which we may either choose to exploit or otherwise randomly explore another action. As we get rewards for performing an action, we can update the policy function over time:

Fantastic, so we have the policy, and now it's time to utilize this policy to make decisions and return the performance.
Now, imagine a real scenario—suppose you're trading on Forex or ForTrade platform, then you can recall that you also need to compute the portfolio and the current profit or loss, that is, reward. Typically, these can be calculated as follows: portfolio = budget + number of stocks * share value reward = new_portfolio - current_portfolio At first, we can initialize values that depend on computing the net worth of a portfolio, where the state is a hist+2 dimensional vector. In our case, it would be 202 dimensional. Then we define the range of tuning the range up to: Length of the prices selected by the user query – (history + 1), since we start from 0, we subtract 1 instead. Then, we should calculate the updated value of the portfolio and from the portfolio, we can calculate the value of the reward, that is, profit. Also, we have already defined our random policy, so we can then select an action from the current policy. Then, we repeatedly update the portfolio values based on the action in each iteration and the new portfolio value after taking the action can be calculated. Then, we need to compute the reward from taking an action at a state. Nevertheless, we also need to update the policy after experiencing a new action. Finally, we compute the final portfolio worth: def run_simulation(policy, initial_budget, initial_num_stocks, prices, hist, debug=False): budget = initial_budget num_stocks = initial_num_stocks share_value = 0 transitions = list() for i in range(len(prices) - hist - 1): if i % 100 == 0: print('progress {:.2f}%'.format(float(100*i) / (len(prices) - hist - 1))) current_state = np.asmatrix(np.hstack((prices[i:i+hist], budget, num_stocks))) current_portfolio = budget + num_stocks * share_value action = policy.select_action(current_state, i) share_value = float(prices[i + hist + 1]) if action == 'Buy' and budget >= share_value: budget -= share_value num_stocks += 1 elif action == 'Sell' and num_stocks > 0: budget += share_value num_stocks -= 1 else: action = 'Hold' new_portfolio = budget + num_stocks * share_value reward = new_portfolio - current_portfolio next_state = np.asmatrix(np.hstack((prices[i+1:i+hist+1], budget, num_stocks))) transitions.append((current_state, action, reward, next_ state)) policy.update_q(current_state, action, reward, next_state) portfolio = budget + num_stocks * share_value if debug: print('${}t{} shares'.format(budget, num_stocks)) return portfolio The previous simulation predicts a somewhat good result; however, it produces random results too often. Thus, to obtain a more robust measurement of success, let's run the simulation a couple of times and average the results. Doing so may take a while to complete, say 100 times, but the results will be more reliable: def run_simulations(policy, budget, num_stocks, prices, hist): num_tries = 100 final_portfolios = list() for i in range(num_tries): final_portfolio = run_simulation(policy, budget, num_stocks, prices, hist) final_portfolios.append(final_portfolio) avg, std = np.mean(final_portfolios), np.std(final_portfolios) return avg, std The previous function computes the average portfolio and the standard deviation by iterating the previous simulation function 100 times. Now, it's time to evaluate the previous agent. As already stated, there will be three possible actions to be taken by the stock trading agent such as buy, sell, and hold. We have a state vector of 202 dimension and budget only $1000. 
Then, the evaluation goes as follows: actions = ['Buy', 'Sell', 'Hold'] hist = 200 policy = RandomDecisionPolicy(actions) budget = 1000.0 num_stocks = 0 avg,std=run_simulations(policy,budget,num_stocks,prices, hist) print(avg, std) >>> 1512.87102405 682.427384814 The first one is the mean and the second one is the standard deviation of the final portfolio. So, our stock prediction agent predicts that as a trader you/we could make a profit about $513. Not bad. However, the problem is that since we have utilized a random decision policy, the result is not so reliable. To be more specific, the second execution will definitely produce a different result: >>> 1518.12039077 603.15350649 Therefore, we should develop a more robust decision policy. Here comes the use of neural network-based QLearning for decision policy. Next, we will see a new hyperparameter epsilon to keep the solution from getting stuck when applying the same action over and over. The lesser its value, the more often it will randomly explore new actions: Next, I am going to write a class containing their functions: Constructor: This helps to set the hyperparameters from the Q-function. It also helps to set the number of hidden nodes in the neural networks. Once we have these two, it helps to define the input and output tensors. It then defines the structure of the neural network. Further, it defines the operations to compute the utility. Then, it uses an optimizer to update model parameters to minimize the loss and sets up the session and initializes variables. select_action: This function exploits the best option with probability 1-epsilon. update_q: This updates the Q-function by updating its model parameters. Refer to the following code: class QLearningDecisionPolicy(DecisionPolicy): def __init__(self, actions, input_dim): self.epsilon = 0.9 self.gamma = 0.001 self.actions = actions output_dim = len(actions) h1_dim = 200 self.x = tf.placeholder(tf.float32, [None, input_dim]) self.y = tf.placeholder(tf.float32, [output_dim]) W1 = tf.Variable(tf.random_normal([input_dim, h1_dim])) b1 = tf.Variable(tf.constant(0.1, shape=[h1_dim])) h1 = tf.nn.relu(tf.matmul(self.x, W1) + b1) W2 = tf.Variable(tf.random_normal([h1_dim, output_dim])) b2 = tf.Variable(tf.constant(0.1, shape=[output_dim])) self.q = tf.nn.relu(tf.matmul(h1, W2) + b2) loss = tf.square(self.y - self.q) self.train_op = tf.train.GradientDescentOptimizer(0.01). minimize(loss) self.sess = tf.Session() self.sess.run(tf.initialize_all_variables()) def select_action(self, current_state, step): threshold = min(self.epsilon, step / 1000.) if random.random() < threshold: # Exploit best option with probability epsilon action_q_vals = self.sess.run(self.q, feed_dict={self.x: current_state}) action_idx = np.argmax(action_q_vals) action = self.actions[action_idx] else: # Random option with probability 1 - epsilon action = self.actions[random.randint(0, len(self.actions) - 1)] return action def update_q(self, state, action, reward, next_state): action_q_vals = self.sess.run(self.q, feed_dict={self.x: state}) next_action_q_vals = self.sess.run(self.q, feed_dict={self.x: next_state}) next_action_idx = np.argmax(next_action_q_vals) action_q_vals[0, next_action_idx] = reward + self.gamma * next_action_q_vals[0, next_action_idx] action_q_vals = np.squeeze(np.asarray(action_q_vals)) self.sess.run(self.train_op, feed_dict={self.x: state, self.y: action_q_vals}) There you go! We have a stock price predictive model running and we’ve built it using Reinforcement Learning and TensorFlow. 
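If you want to swap this neural network policy in for the random one, a call along the following lines should work (this is our own sketch based on the definitions above; input_dim is hist + 2 because each state holds 200 recent prices plus the budget and the stock count). Note that training the network inside 100 simulation runs can take a while, so you may want to lower num_tries in run_simulations while experimenting:
actions = ['Buy', 'Sell', 'Hold']
hist = 200
budget = 1000.0
num_stocks = 0
# 200 recent prices + budget + number of stocks = 202 inputs
policy = QLearningDecisionPolicy(actions, input_dim=hist + 2)
avg, std = run_simulations(policy, budget, num_stocks, prices, hist)
print(avg, std)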
If you found this tutorial interesting and would like to learn more, head over to grab this book, Predictive Analytics with TensorFlow, by Md. Rezaul Karim.    
Introduction to Performance Testing and JMeter

Packt
20 Feb 2018
11 min read
In this article by Bayo Erinle, the author of the book Performance Testing with JMeter 3, will explore some of the options that make JMeter a great tool of choice for performance testing.  (For more resources related to this topic, see here.) Performance testing and tuning There is a strong relationship between performance testing and tuning, in the sense that one often leads to the other. Often, end-to-end testing unveils system or application bottlenecks that are regarded unacceptable with project target goals. Once those bottlenecks are discovered, the next step for most teams is a series of tuning efforts to make the application perform adequately. Such efforts normally include, but are not limited to, the following: Configuring changes in system resources Optimizing database queries Reducing round trips in application calls, sometimes leading to redesigning and re-architecting problematic modules Scaling out application and database server capacity Reducing application resource footprint Optimizing and refactoring code, including eliminating redundancy and reducing execution time Tuning efforts may also commence if the application has reached acceptable performance but the team wants to reduce the amount of system resources being used, decrease the volume of hardware needed, or further increase system performance. After each change (or series of changes), the test is re-executed to see whether the performance has improved or declined due to the changes. The process will be continued with the performance results having reached acceptable goals. The outcome of these test-tuning circles normally produces a baseline. Baselines Baseline is a process of capturing performance metric data for the sole purpose of evaluating the efficacy of successive changes to the system or application. It is important that all characteristics and configurations, except those specifically being varied for comparison, remain the same in order to make effective comparisons as to which change (or series of changes) is driving results toward the targeted goal. Armed with such baseline results, subsequent changes can be made to the system configuration or application and testing results can be compared to see whether such changes were relevant or not. Some considerations when generating baselines include the following: They are application-specific They can be created for system, application, or modules They are metrics/results They should not be over generalized They evolve and may need to be redefined from time to time They act as a shared frame of reference They are reusable They help identify changes in performance Load and stress testing Load testing is the process of putting demand on a system and measuring its response, that is, determining how much volume the system can handle. Stress testing is the process of subjecting the system to unusually high loads far beyond its normal usage pattern to determine its responsiveness. These are different from performance testing, whose sole purpose is to determine the response and effectiveness of a system, that is, how fast the system is. Since load ultimately affects how a system responds, performance testing is always done in conjunction with stress testing. JMeter to the rescue One of the areas performance testing covers is testing tools. Which testing tool do you use to put the system and application under load? There are numerous testing tools available to perform this operation, from free to commercial solutions. 
However, our focus will be on Apache JMeter, a free, open source, cross-platform desktop application from the Apache Software foundation. JMeter has been around since 1998 according to historic change logs on its official site, making it a mature, robust, and reliable testing tool. Cost may also have played a role in its wide adoption. Small companies usually may not want to foot the bill for commercial end testing tools, which often place restrictions, for example, on how many concurrent users one can spin off. My first encounter with JMeter was exactly a result of this. I worked in a small shop that had paid for a commercial testing tool, but during the course of testing, we had outrun the licensing limits of how many concurrent users we needed to simulate for realistic test plans. Since JMeter was free, we explored it and were quite delighted with the offerings and the share amount of features we got for free. Here are some of its features: Performance tests of different server types, including web (HTTP and HTTPS), SOAP, database, LDAP, JMS, mail, and native commands or shell scripts Complete portability across various operating systems Full multithreading framework allowing concurrent sampling by many threads and simultaneous sampling of different functions by separate thread groups Full featured Test IDE that allows fast Test Plan recording, building, and debugging Dashboard Report for detailed analysis of application performance indexes and key transactions In-built integration with real-time reporting and analysis tools, such as Graphite, InfluxDB, and Grafana, to name a few Complete dynamic HTML reports Graphical User Interface (GUI) HTTP proxy recording server Caching and offline analysis/replaying of test results High extensibility Live view of results as testing is being conducted JMeter allows multiple concurrent users to be simulated on the application, allowing you to work toward most of the target goals obtained earlier, such as attaining baseline and identifying bottlenecks. It will help answer questions, such as the following: Will the application still be responsive if 50 users are accessing it concurrently? How reliable will it be under a load of 200 users? How much of the system resources will be consumed under a load of 250 users? What will the throughput look like with 1000 users active in the system? What will be the response time for the various components in the application under load? JMeter, however, should not be confused with a browser. It doesn't perform all the operations supported by browsers; in particular, JMeter does not execute JavaScript found in HTML pages, nor does it render HTML pages the way a browser does. However, it does give you the ability to view request responses as HTML through many of its listeners, but the timings are not included in any samples. Furthermore, there are limitations to how many users can be spun on a single machine. These vary depending on the machine specifications (for example, memory, processor speed, and so on) and the test scenarios being executed. In our experience, we have mostly been able to successfully spin off 250-450 users on a single machine with a 2.2 GHz processor and 8 GB of RAM. Up and running with JMeter Now, let's get up and running with JMeter, beginning with its installation. Installation JMeter comes as a bundled archive, so it is super easy to get started with it. Those working in corporate environments behind a firewall or machines with non-admin privileges appreciate this more. 
To get started, grab the latest binary release by pointing your browser to http://jmeter.apache.org/download_jmeter.cgi. At the time of writing this, the current release version is 3.1. The download site offers the bundle as both a .zip file and a .tgz file. We go with the .zip file option, but feel free to download the .tgz file if that's your preferred way of grabbing archives. Once downloaded, extract the archive to a location of your choice. The location you extracted the archive to will be referred to as JMETER_HOME. Provided you have a JDK/JRE correctly installed and a JAVA_HOME environment variable set, you are all set and ready to run! The following screenshot shows a trimmed-down directory structure of a vanilla JMeter install:
JMETER_HOME folder structure
The following are some of the folders in Apache-JMeter-3.2, as shown in the preceding screenshot:
bin: This folder contains executable scripts to run and perform other operations in JMeter
docs: This folder contains a well-documented user guide
extras: This folder contains miscellaneous items, including samples illustrating the usage of the Apache Ant build tool (http://ant.apache.org/) with JMeter, and BeanShell scripting
lib: This folder contains utility JAR files needed by JMeter (you may add additional JARs here to use from within JMeter; we will cover this in detail later)
printable_docs: This is the printable documentation
Installing Java JDK
Follow these steps to install the Java JDK:
1. Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html.
2. Download the Java JDK (not the JRE) compatible with the system that you will use to test. At the time of writing, JDK 1.8 (update 131) was the latest.
3. Double-click on the executable and follow the onscreen instructions.
On Windows systems, the default location for the JDK is under Program Files. While there is nothing wrong with this, the issue is that the folder name contains a space, which can sometimes be problematic when attempting to set PATH and run programs, such as JMeter, that depend on the JDK from the command line. With this in mind, it is advisable to change the default location to something like C:\tools\jdk.
Setting up JAVA_HOME
Here are the steps to set up the JAVA_HOME environment variable on Windows and Unix operating systems.
On Windows
For illustrative purposes, assume that you have installed the Java JDK at C:\tools\jdk:
1. Go to Control Panel.
2. Click on System.
3. Click on Advanced system settings.
4. Add a new environment variable with the name JAVA_HOME and the value C:\tools\jdk.
5. Locate Path (under system variables, in the bottom half of the screen).
6. Click on Edit.
7. Append %JAVA_HOME%/bin to the end of the existing path value (if any).
On Unix
For illustrative purposes, assume that you have installed the Java JDK at /opt/tools/jdk:
1. Open up a Terminal window.
2. Run export JAVA_HOME=/opt/tools/jdk.
3. Run export PATH=$PATH:$JAVA_HOME/bin.
It is advisable to set this in your shell profile settings, such as .bash_profile (for bash users) or .zshrc (for zsh users), so that you won't have to set it for each new Terminal window you open.
Running JMeter
Once installed, the bin folder under the JMETER_HOME folder contains all the executable scripts that can be run. Based on the operating system that you installed JMeter on, you either execute the shell scripts (.sh files) on operating systems that are Unix/Linux flavored, or their batch (.bat file) counterparts on operating systems that are Windows flavored.
JMeter files are saved as XML files with a .jmx extension. We refer to them as test scripts or JMX files. These scripts include the following: jmeter.sh: This script launches JMeter GUI (the default) jmeter-n.sh: This script launches JMeter in non-GUI mode (takes a JMX file as input) jmeter-n-r.sh: This script launches JMeter in non-GUI mode remotely jmeter-t.sh: This opens a JMX file in the GUI jmeter-server.sh: This script starts JMeter in server mode (this will be kicked off on the master node when testing with multiple machines remotely) mirror-server.sh: This script runs the mirror server for JMeter shutdown.sh: This script gracefully shuts down a running non-GUI instance stoptest.sh: This script abruptly shuts down a running non-GUI instance   To start JMeter, open a Terminal shell, change to the JMETER_HOME/bin folder, and run the following command on Unix/Linux: ./jmeter.sh Alternatively, run the following command on Windows: jmeter.bat Take a moment to explore the GUI. Hover over each icon to see a short description of what it does. The Apache JMeter team has done an excellent job with the GUI. Most icons are very similar to what you are used to, which helps ease the learning curve for new adapters. Some of the icons, for example, stop and shutdown, are disabled for now till a scenario/test is being conducted. The JVM_ARGS environment variable can be used to override JVM settings in the jmeter.bat or jmeter.sh script. Consider the following example: export JVM_ARGS="-Xms1024m -Xmx1024m -Dpropname=propvalue". Command-line options To see all the options available to start JMeter, run the JMeter executable with the -? command. The options provided are as follows: . ./jmeter.sh -? -? print command line options and exit -h, --help print usage information and exit -v, --version print the version information and exit -p, --propfile <argument> the jmeter property file to use -q, --addprop <argument> additional JMeter property file(s) -t, --testfile <argument> the jmeter test(.jmx) file to run -l, --logfile <argument> the file to log samples to -j, --jmeterlogfile <argument> jmeter run log file (jmeter.log) -n, --nongui run JMeter in nongui mode ... -J, --jmeterproperty <argument>=<value> Define additional JMeter properties -G, --globalproperty <argument>=<value> Define Global properties (sent to servers) e.g. -Gport=123 or -Gglobal.properties -D, --systemproperty <argument>=<value> Define additional system properties -S, --systemPropertyFile <argument> additional system property file(s) This is a snippet (non-exhaustive list) of what you might see if you did the same. Summary In this article we have learnt relationship between performance testing and tuning, and how to install and run JMeter.   Resources for Article: Further resources on this subject: Functional Testing with JMeter [article] Creating an Apache JMeter™ test workbench [article] Getting Started with Apache Spark DataFrames [article]
K Nearest Neighbors

Packt
20 Feb 2018
10 min read
In this article by Gavin Hackeling, author of book Mastering Machine Learning with scikit-learn - Second Edition, we will start with K Nearest Neighbors (KNN) which is a simple model for regression and classification tasks. It is so simple that its name describes most of its learning algorithm. The titular neighbors are representations of training instances in a metric space. A metric space is a feature space in which the distances between all members of a set are defined. (For more resources related to this topic, see here.) For classification tasks, a set of tuples of feature vectors and class labels comprise the training set. KNN is a capable of binary, multi-class, and multi-label classification. We will focus on binary classification in this article. The simplest KNN classifiers use the mode of the KNN labels to classify test instances, but other strategies can be used. k is often set to an odd number to prevent ties. In regression tasks, the features vectors are each associated with a response variable that takes a real-valued scalar instead of a label. The prediction is the mean or weighted mean of the k nearest neighbors’ response variables. Lazy learning and non-parametric models KNN is a lazy learner. Also known as instance-based learners, lazy learners simply store the training data set with little or no processing. In contrast to eager learners, such as simple linear regression, KNN does not estimate the parameters of a model that generalizes the training data during a training phase. Lazy learning has advantages and disadvantages. Training an eager learner is often computationally costly, but prediction with the resulting model is often inexpensive. For simple linear regression, prediction consists only of multiplying the learned coefficient by the feature, and adding the learned intercept parameter. A lazy learner can predict almost immediately, but making predictions can be costly. In the simplest implementation of KNN, prediction requires calculating the distances between a test instance and all of the training instances. In contrast to most of the other models we will discuss, KNN is a non-parametric model. A parametric model uses a fixed number of parameters, or coefficients, to define the model that summarizes the data. The number of parameters is independent of the number of training instances. Non-parametric may seem to be a misnomer, as it does not mean that the model has no parameters; rather, non-parametric means that the number of parameters of the model is not fixed, and may grow with the number of training instances. Non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and explanatory variables. KNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable. The flexibility provided by non-parametric models is not always desirable; a model that makes assumptions about the relationship can be useful if training data is scarce or if you already know about the relationship. Classification with KNN The goal of classification tasks is to use one or more features to predict the value of a discrete response variable. Let’s work through a toy classification problem. Assume that you must use a person’s height and weight to predict his or her sex. This problem is called binary classification because the response variable can take one of two labels. The following table records nine training instances. 
height weight label 158 cm 64 kg male 170 cm 66 kg male 183 cm 84 kg male 191 cm 80 kg male 155 cm 49 kg female 163 cm 59 kg female 180 cm 67 kg female 158 cm 54 kg female 178 cm 77 kg female We are now using features from two explanatory variables to predict the value of the response variable. KNN is not limited to two features; the algorithm can use an arbitrary number of features, but more than three features cannot be visualized. Let’s visualize the data by creating a scatter plot with matplotlib. # In[1]: import numpy as np import matplotlib.pyplot as plt X_train = np.array([ [158, 64], [170, 86], [183, 84], [191, 80], [155, 49], [163, 59], [180, 67], [158, 54], [170, 67] ]) y_train = ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female', 'female'] plt.figure() plt.title('Human Heights and Weights by Sex') plt.xlabel('Height in cm') plt.ylabel('Weight in kg') for i, x in enumerate(X_train): # Use 'x' markers for instances that are male and diamond markers for instances that are female plt.scatter(x[0], x[1], c='k', marker='x' if y_train[i] == 'male' else 'D') plt.grid(True) plt.show() From the plot we can see that men, denoted by the x markers, tend to be taller and weigh more than women. This observation is probably consistent with your experience. Now let’s use KNN to predict whether a person with a given height and weight is a man or a woman. Let’s assume that we want to predict the sex of a person who is 155 cm tall and who weighs 70 kg. First, we must define our distance measure. In this case we will use Euclidean distance, the straight line distance between points in a Euclidean space. Euclidean distance in a two-dimensional space is given by the following: Next we must calculate the distances between the query instance and all of the training instances. height weight label Distance from test instance 158 cm 64 kg male 170 cm 66 kg male 183 cm 84 kg male 191 cm 80 kg male 155 cm 49 kg female 163 cm 59 kg female 180 cm 67 kg female 158 cm 54 kg female 178 cm 77 kg female We will set k to 3, and select the three nearest training instances. The following script calculates the distances between the test instance and the training instances, and identifies the most common sex of the nearest neighbors. # In[2]: x = np.array([[155, 70]]) distances = np.sqrt(np.sum((X_train - x)**2, axis=1)) distances # Out[2]: array([ 6.70820393, 21.9317122 , 31.30495168, 37.36308338, 21. , 13.60147051, 25.17935662, 16.2788206 , 15.29705854]) # In[3]: nearest_neighbor_indices = distances.argsort()[:3] nearest_neighbor_genders = np.take(y_train, nearest_neighbor_indices) nearest_neighbor_genders # Out[3]: array(['male', 'female', 'female'], dtype='|S6') # In[4]: from collections import Counter b = Counter(np.take(y_train, distances.argsort()[:3])) b.most_common(1)[0][0] # Out[4]: 'female' The following plots the query instance, indicated by the circle, and its three nearest neighbors, indicated by the enlarged markers: Two of the neighbors are female, and one is male. We therefore predict that the test instance is female. Now let’s implement a KNN classifier using scikit-learn. 
# In[5]: from sklearn.preprocessing import LabelBinarizer from sklearn.neighbors import KNeighborsClassifier lb = LabelBinarizer() y_train_binarized = lb.fit_transform(y_train) y_train_binarized # Out[5]: array([[1], [1], [1], [1], [0], [0], [0], [0], [0]]) # In[6]: K = 3 clf = KNeighborsClassifier(n_neighbors=K) clf.fit(X_train, y_train_binarized.reshape(-1)) prediction_binarized = clf.predict(np.array([155, 70]).reshape(1, -1))[0] predicted_label = lb.inverse_transform(prediction_binarized) predicted_label # Out[6]: array(['female'], dtype='|S6') Our labels are strings; we first use LabelBinarizer to convert them to integers. LabelBinarizer implements the transformer interface, which consists of the methods fit, transform, and fit_transform. fit prepares the transformer; in this case, it creates a mapping from label strings to integers. transform applies the mapping to input labels. fit_transform is a convenience method that calls fit and transform. A transformer should be fit only on the training set. Independently fitting and transforming the training and testing sets could result in inconsistent mappings from labels to integers; in this case, male might be mapped to 1 in the training set and 0 in the testing set. Fitting on the entire dataset should also be avoided, as for some transformers it will leak information about the testing set in to the model. This advantage won't be available in production, so performance measures on the test set may be optimistic. We wil discuss this pitfall more when we extract features from text. Next, we initialize a KNeighborsClassifier. Even through KNN is a lazy learner, it still implements the estimator interface. We call fit and predict just as we did with our simple linear regression object. Finally, we can use our fit LabelBinarizer to reverse the transformation and return a string label. Now let’s use our classifier to make predictions for a test set, and evaluate the performance of our classifier. height weight label 168 cm 65 kg male 170 cm 61 kg male 160 cm 52 kg female 169 cm 67 kg female # In[7]: X_test = np.array([ [168, 65], [180, 96], [160, 52], [169, 67] ]) y_test = ['male', 'male', 'female', 'female'] y_test_binarized = lb.transform(y_test) print('Binarized labels: %s' % y_test_binarized.T[0]) predictions_binarized = clf.predict(X_test) print('Binarized predictions: %s' % predictions_binarized) print('Predicted labels: %s' % lb.inverse_transform(predictions_binarized)) # Out[7]: Binarized labels: [1 1 0 0] Binarized predictions: [0 1 0 0] Predicted labels: ['female' 'male' 'female' 'female'] By comparing our test labels to our classifier's predictions, we find that it incorrectly predicted that one of the male test instances was female. There are two types of errors in binary classification tasks: false positives and false negatives. There are many performance measures for classifiers; some measures may be more appropriate than others depending on the consequences of the types of errors in your application. We will assess our classifier using several common performance measures, including accuracy, precision, and recall. Accuracy is the proportion of test instances that were classified correctly. Our model classified one of the four instances incorrectly, so the accuracy is 75%. 
# In[8]: from sklearn.metrics import accuracy_score print('Accuracy: %s' % accuracy_score(y_test_binarized, predictions_binarized)) # Out[8]: Accuracy: 0.75 Precision is the proportion of test instances that were predicted to be positive that are truly positive. In this example the positive class is male. The assignment of male and “female” to the positive and negative classes is arbitrary, and could be reversed. Our classifier predicted that one of the test instances is the positive class. This instance is truly the positive class, so the classifier’s precision is 100%. # In[9]: from sklearn.metrics import precision_score print('Precision: %s' % precision_score(y_test_binarized, predictions_binarized)) # Out[9]: Precision: 1.0 Recall is the proportion of truly positive test instances that were predicted to be positive. Our classifier predicted that one of the two truly positive test instances is positive. Its recall is therefore 50%. # In[10]: from sklearn.metrics import recall_score print('Recall: %s' % recall_score(y_test_binarized, predictions_binarized)) # Out[10]: Recall: 0.5 Sometimes it is useful to summarize precision and recall with a single statistic, called the F1-score or F1-measure. The F1-score is the harmonic mean of precision and recall. # In[11]: from sklearn.metrics import f1_score print('F1 score: %s' % f1_score(y_test_binarized, predictions_binarized)) # Out[11]: F1 score: 0.666666666667 Note that the arithmetic mean of the precision and recall scores is the upper bound of the F1 score. The F1 score penalizes classifiers more as the difference between their precision and recall scores increases. Finally, the Matthews correlation coefficient is an alternative to the F1 score for measuring the performance of binary classifiers. A perfect classifier’s MCC is 1. A trivial classifier that predicts randomly will score 0, and a perfectly wrong classifier will score -1. MCC is useful even when the proportions of the classes in the test set is severely imbalanced. # In[12]: from sklearn.metrics import matthews_corrcoef print('Matthews correlation coefficient: %s' % matthews_corrcoef(y_test_binarized, predictions_binarized)) # Out[12]: Matthews correlation coefficient: 0.57735026919 scikit-learn also provides a convenience function, classification_report, that reports the precision, recall and F1 score. # In[13]: from sklearn.metrics import classification_report print(classification_report(y_test_binarized, predictions_binarized, target_names=['male'], labels=[1])) # Out[13]: precision recall f1-score support male 1.00 0.50 0.67 2 avg / total 1.00 0.50 0.67 2 Summary In this article we learned about K Nearest Neighbors in which we saw that KNN is lazy learner as well as non-parametric model. We also saw about the classification of KNN. Resources for Article: Further resources on this subject: Introduction to Scikit-Learn [article] Machine Learning in IPython with scikit-learn [article] Machine Learning Models [article]
The great unbundling of the tech stack and developer learning

Dave Maclean
19 Feb 2018
5 min read
The nineties: the large software vendors dominate When I started my first tech publisher, Wrox, back in 1994, tech was dominated by a few large IT vendors: IBM, Oracle, SAP, DEC [!], CA. The insurgent company was Microsoft, and the disruptive technology was Unix. Microsoft grew out from their ownership of the desktop back into the enterprise, and had a uniquely open strategy to encourage a third party developer ecosystem. Unix was an open ecosystem from the start. Both of these created a window for start-up tech publishers like Wrox and O’Reilly to get started. No-one had ever published a profitable book on an IBM tool, say DB2.  IBM had that whole world locked down with a vertically integrated model across hardware, software, services and education. The research firm IDC has at various times modelled the total value of IT software and services spending in the US at $100bn - $200bn, and IT training to be about 2-5% of this total, usually bundled in with the total package. [An old version of IDC analysis] The “hidden” market for IT skills training, both for users and for developers, bundled in with closed vendor ecosystems was an estimated $5bn - $10bn in the US alone. SAP Education claims to train 500,000 customers and implementers each year. The impact of the internet Then the internet arrived. The internet is a machine for unbundling everything. By joining everyone and everything for zero marginal cost the internet relentlessly breaks companies, industries, products and services into independent functional elements. All media business models have been unravelling for years, well-documented by the excellent Ben Thompson in his Stratechery Blog. More on the next 10 years of unbundling is mapped on the incomparable CBinsights Trends report. In parallel, unbundling is also happening to IT vendor stacks. We’re moving relentlessly away from one vendor, and all projects toward a dynamic stack mix, project by project.  The internet is both the driver and enabler of this great unbundling. As every aspect of our lives moves online, every organisation is becoming a network of software systems. Nobody nails this concept better than Marc Andreesen in his classicSoftware is Eating the World. The only way to meet this demand is to break free of a vertical vendor stack. SAP might make great accounting software but who wants customer facing apps from SAP? From the point of view of software development, the internet has been a story of relentless atomisation. The internet itself grew out of the granddaddy of all open platforms, Unix. Open source exploded in the 2000s as the global community of developers built shared tools and code together.  We started Packt in 2003 to publish books for open source developers. We saw that in a friction-free world, tools would become more fragmented and specialised, and would need a new business model for niche technical content. Open source tools try and do one job well, leaving developers to assemble specific solutions, case-by-case.  Cloud and micro-services follow the same logic. There’s a nice timeline here from CapGemini. This podcast from a16z is a good take on the emerging world of micro services, and has a great quote from Chris Dixon, saying how many cloud and micro-services start-ups are offering one old school Unix shell command as a whole company. The impact on the way developers learn Clearly where the tech stack unbundles, so does developer learning. If your project has a dozen elements from different vendors, you’ve got two learning problems. 
Firstly, you have to navigate the developer support and training centres from each project and vendor. Secondly, you have to work out how the tech fits together. It’s the logic of unbundling that is creating the market for developer eLearning that big players like Pluralsight and Lynda have surged into over the last 5-8 years. Get all your learning across the stack from one place. Pluralsight are probably growing at 50% YonY and cover all vendors, all stacks in one place. Consistently our top titles at Packt show developers how to combine tools and technologies together, ensuring developer learning matches the way software is actually created. At Packt, we think that the market for developer eLearning globally is around $1-$3bn today and will double over the next 5 years as learning peels away from the closed vendor worlds. That’s the market we're playing in, along with more and more innovative competitors. But in the same way as CIOs now have to curate an open ecosystem of technology, so each developer has to curate their own ecosystem of learning. And if software is unbundling, there is a need for analysis, insight and curation to help developers and CIOs make sense of it. Packt, and its online learning platform Mapt, aim to help make sense of this fragmented landscape. There’s a big question lurking behind this: is the emergence of public clouds like AWS with a rich aggregation of micro-services a return to a closed vendor ecosystem? But that’s for another discussion...
CRUD (Create, Read, Update and Delete) Operations with Elasticsearch

Pravin Dhandre
19 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.[/box] In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs. These APIs fall under the category of Document APIs that deal with documents: Index API Get API Update API Delete API Index API In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document: Indexing a document by providing an ID Indexing a document without providing an ID Indexing a document by providing an ID We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request: PUT /catalog/product/1 { "sku": "SP000001", "title": "Elasticsearch for Hadoop", "description": "Elasticsearch for Hadoop", "author": "Vishal Shukla", "ISBN": "1785288997", "price": 26.99 } Indexing a document without providing an ID If you don't want to control the ID generation for the documents, you can use the POST method. The format of this request is POST /<index>/<type>, with the JSON document as the body of the request: POST /catalog/product { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } The ID in this case will be generated by Elasticsearch. It is a hash string, as highlighted in the response: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true } As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying I know the ID that I want to assign, so use this ID while indexing this document. Get API The Get API is useful for retrieving a document when you already know the ID of the document. It is essentially a get by primary key operation: GET /catalog/product/AVrASKqgaBGmnAMj1SBe The format of this request is GET /<index>/<type>/<id>. The response would be as Expected: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "found": true, "_source": { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } } Update API The Update API is useful for updating the existing document by ID. The format of an update request is POST <index>/<type>/<id>/_update with a JSON request as the body: POST /catalog/product/1/_update { "doc": { "price": "28.99" } } The properties specified under the "doc" element are merged into the existing document. 
The previous version of this document with ID 1 had price of 26.99. This update operation just updates the price and leaves the other fields of the document unchanged. This type of update means "doc" is specified and used as a partial document to merge with an existing document; there are other types of updates supported. The response of the update request is as follows: { "_index": "catalog", "_type": "product", "_id": "1", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 1, "failed": 0 } } Internally, Elasticsearch maintains the version of each document. Whenever a document is updated, the version number is incremented. The partial update that we have seen above will work only if the document existed beforehand. If the document with the given id did not exist, Elasticsearch will return an error saying that document is missing. Let us understand how do we do an upsert operation using the Update API. The term upsert loosely means update or insert, i.e. update the document if it exists otherwise insert new document. The parameter doc_as_upsert checks if the document with the given id already exists and merges the provided doc with the existing document. If the document with the given id doesn't exist, it inserts a new document with the given document contents. The following example uses doc_as_upsert to merge into the document with id 3 or insert a new document if it doesn't exist. POST /catalog/product/3/_update { "doc": { "author": "Albert Paro", "title": "Elasticsearch 5.0 Cookbook", "description": "Elasticsearch 5.0 Cookbook Third Edition", "price": "54.99" }, "doc_as_upsert": true } We can update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product: POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update { "script": { "inline": "ctx._source.price += params.increment", "lang": "painless", "params": { "increment": 2 } } } Scripting support allows for the reading of the existing value, incrementing the value by a variable, and storing it back in a single operation. The inline script used here is Elasticsearch's own painless scripting language. The syntax for incrementing an existing variable is similar to most other programming languages. Delete API The Delete API lets you delete a document by ID:  DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe  The response of the delete operations is as follows: { "found": true, "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 4, "result": "deleted", "_shards": { "total": 2, "successful": 1, "failed": 0 } } This is how basic CRUD operations are performed with Elasticsearch using simple document APIs from any data source in any format securely and reliably. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0  and start building end-to-end real-time data processing solutions for your enterprise analytics applications.
How to Classify Digits using Keras and TensorFlow

Sugandha Lahoti
19 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book provides a practical approach to building efficient machine learning models using ensemble techniques with real-world use cases.[/box] Today we will look at how we can create, train, and test a neural network to perform digit classification using Keras and TensorFlow. This article uses MNIST dataset with images of handwritten digits.It contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. There have been a number of scientific papers on attempts to achieve the lowest error rate. One paper, by using a hierarchical system of CNNs, manages to get an error rate on the MNIST database of 0.23 percent. The original creators of the database keep a list of some of the methods tested on it. In their original paper, they used a support vector machine to get an error rate of 0.8 percent. Images in the dataset look like this: So let's not waste our time and start implementing our very first neural network in Python. Let’s start the code by importing the supporting projects. # Imports for array-handling and plotting import numpy as np import matplotlib import matplotlib.pyplot as plt Keras already has the MNIST dataset as a sample dataset, so we can import it as it is. Generally, it downloads the data over the internet and stores it into the database. So, if your system does not have the dataset, Internet will be required to download it: # Keras imports for the dataset and building our neural network from keras.datasets import mnist Now, we will import the Sequential and load_model classes from the keras.model class. We are working with sequential networks as all layers will be in forward sequence only. We are not using any split in the layers. The Sequential class will create a sequential model by combining the layers sequentially. The load_model class will help us to load the trained model for testing and evaluation purposes: #Import Sequential and Load model for creating and loading model from keras.models import Sequential, load_model In the next line, we will call three types of layers from the keras library. Dense layer means a fully connected layer; that is, each neuron of current layer will have a connection to the each neuron of the previous as well as next layer. The dropout layer is for reducing overfitting in our model. It randomly selects some neurons and does not use them for training for that iteration. So there are less chances that two different neurons of the same layer learn the same features from the input. By doing this, it prevents redundancy and correlation between neurons in the network, which eventually helps prevent overfitting in the network. The activation layer applies the activation function to the output of the neuron. We will use rectified linear units (ReLU) and the softmax function as the activation layer. We will discuss their operation when we use them in network creation: #We will use Dense, Drop out and Activation layers from keras.layers.core import Dense, Dropout, Activation from keras.utils import np_utils So we will start with loading our dataset by mnist.load. It will give us training and testing input and output instances. 
Then, we will visualize some instances so that we know what kind of data we are dealing with. We will use matplotlib to plot them. As the images have gray values, we can easily plot a histogram of the images, which can give us the pixel intensity distribution: #Let's Start by loading our dataset (X_train, y_train), (X_test, y_test) = mnist.load_data() #Plot the digits to verify plt.figure() for i in range(9): plt.subplot(3,3,i+1) plt.tight_layout() plt.imshow(X_train[i], cmap='gray', interpolation='none') plt.title("Digit: {}".format(y_train[i])) plt.xticks([]) plt.yticks([]) plt.show() When we execute  our code for the preceding code block, we will get the output as: #Lets analyze histogram of the image plt.figure() plt.subplot(2,1,1) plt.imshow(X_train[0], cmap='gray', interpolation='none') plt.title("Digit: {}".format(y_train[0])) plt.xticks([]) plt.yticks([]) plt.subplot(2,1,2) plt.hist(X_train[0].reshape(784)) plt.title("Pixel Value Distribution") plt.show() The histogram of an image will look like this: # Print the shape before we reshape and normalize print("X_train shape", X_train.shape) print("y_train shape", y_train.shape) print("X_test shape", X_test.shape) print("y_test shape", y_test.shape) Currently, this is shape of the dataset we have: X_train shape (60000, 28, 28) y_train shape (60000,) X_test shape (10000, 28, 28) y_test shape (10000,) As we are working with 2D images, we cannot train them as with our neural network. For training 2D images, there are different types of neural networks available; we will discuss those in the future. To remove this data compatibility issue, we will reshape the input images into 1D vectors of 784 values (as images have size 28X28). We have 60000 such images in training data and 10000 in testing: # As we have data in image form convert it to row vectors X_train = X_train.reshape(60000, 784) X_test = X_test.reshape(10000, 784) X_train = X_train.astype('float32') X_test = X_test.astype('float32') Normalize the input data into the range of 0 to 1 so that it leads to a faster convergence of the network. The purpose of normalizing data is to transform our dataset into a bounded range; it also involves relativity between the pixel values. There are various kinds of normalizing techniques available such as mean normalization, min-max normalization, and so on: # Normalizing the data to between 0 and 1 to help with the training X_train /= 255 X_test /= 255 # Print the final input shape ready for training print("Train matrix shape", X_train.shape) print("Test matrix shape", X_test.shape) Let's print the shape of the data: Train matrix shape (60000, 784) Test matrix shape (10000, 784) Now, our training set contains output variables as discrete class values; say, for an image of number eight, the output class value is eight. But our output neurons will be able to give an output only in the range of zero to one. So, we need to convert discrete output values to categorical values so that eight can be represented as a vector of zero and one with the length equal to the number of classes. 
For example, for the number eight, the output class vector should be:
8 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
# One-hot encoding using keras' numpy-related utilities
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)
After one-hot encoding of our output, the variable’s shape will be modified as:
Shape before one-hot encoding:  (60000,)
Shape after one-hot encoding:   (60000, 10)
So, you can see that now we have an output variable of 10 dimensions instead of 1. Now, we are ready to define our network parameters and layer architecture. We will start creating our network by creating a Sequential class object, model. We can add different layers to this model as we have done in the following code block. We will create a network with an input layer, two hidden layers, and one output layer. As the input layer is always our data layer, it doesn't have any learning parameters. For the hidden layers, we will use 512 neurons in each. At the end, for the 10-dimensional output, we will use 10 neurons in the final layer:
# Here, we will create the model of our ANN
# Create a linear stack of layers with the sequential model
model = Sequential()
# Input layer with 512 weights
model.add(Dense(512, input_shape=(784,)))
# We will use relu as activation
model.add(Activation('relu'))
# Add dropout to prevent over-fitting
model.add(Dropout(0.2))
# Add a hidden layer with 512 neurons with relu activation
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
# This is our output layer with 10 neurons
model.add(Dense(10))
model.add(Activation('softmax'))
After defining the preceding structure, our neural network will look something like this: the Shape field in each layer shows the shape of the data matrix in that layer, and it is quite intuitive. As the input of length 784 is first multiplied into 512 neurons, the weight matrix at Hidden-1 will be 784 x 512. It will be calculated similarly for the other two layers. We have used two different kinds of activation functions here. The first one is ReLU and the second one is softmax, which produces class probabilities. We will give some time to discuss these two. ReLU prevents the output of the neuron from becoming negative. The expression for the relu function is f(x) = max(0, x); in conditional form, f(x) = x if x > 0, and f(x) = 0 otherwise. So if any neuron produces an output less than 0, it converts it to 0. You just need to know that ReLU is a slightly better activation function than sigmoid. If we plot a sigmoid function, we can see that it starts getting saturated before reaching its minimum (0) or maximum (1) values. So at the time of gradient calculation, values in the saturated region result in a very small gradient. That causes a very small change in the weight values, which is not sufficient to optimize the cost function. Now, as we move further backward during backpropagation, that small change becomes smaller and almost reaches zero. This problem is known as the problem of vanishing gradients. So, in practical cases, we avoid sigmoid activation when our network has many stacked layers. The ReLU activation, by contrast, is just a straight line for positive inputs, so its gradient will always be non-zero unless the output itself is zero. Thus, it prevents the problem of vanishing gradients.
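To make the saturation argument concrete, here is a short sketch of our own (plain NumPy, not from the book) that compares the gradients of sigmoid and ReLU at a few input values; notice how the sigmoid gradient collapses for large positive or negative inputs while the ReLU gradient stays at 1 for any positive input:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at only 0.25 when x = 0

def relu_grad(x):
    return (x > 0).astype(np.float64)  # 1 for positive inputs, 0 otherwise

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print('sigmoid gradients:', sigmoid_grad(xs))  # about 4.5e-05 at +/-10
print('relu gradients   :', relu_grad(xs))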
We have discussed the significance of the dropout layer earlier and I don’t think that it is further required. We are using 20% neuron dropout during the training time. We will not use the dropout layer during the testing time. Now, we are all set to train our very first ANN, but before starting training, we have to define the values of the network hyperparameters. We will use SGD using adaptive momentum. There are many algorithms to optimize the performance of the SGD algorithm. You just need to know that adaptive momentum is a better choice than simple gradient descent because it modifies the learning rate using previous errors created by the network. So, there are less chances of getting trapped at the local minima or missing the global minima conditions. We are using SGD with ADAM, using its default parameters. Here, we use  batch_size of 128 samples. That means we will update the weights after calculating the error on these 128 samples. It is a sufficient batch size for our total data population. We are going to train our network for 20 epochs for the time being. Here, one epoch means one complete training cycle of all mini-batches. Now, let's start training our network: #Here we will be compiling the sequential model model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') # Start training the model and saving metrics in history history = model.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=2, validation_data=(X_test, Y_test)) We will save our trained model on disk so that we can use it for further fine-tuning whenever required. We will store the model in the HDF5 file format: # Saving the model on disk path2save = 'E:/PyDevWorkSpaceTest/Ensembles/Chapter_10/keras_mnist.h5' model.save(path2save) print('Saved trained model at %s ' % path2save) # Plotting the metrics fig = plt.figure() plt.subplot(2,1,1) plt.plot(history.history['acc']) plt.plot(history.history['val_acc']) plt.title('model accuracy') plt.ylabel('accuracy') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='lower right') plt.subplot(2,1,2) plt.plot(history.history['loss']) plt.plot(history.history['val_loss']) plt.title('model loss') plt.ylabel('loss') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='upper right') plt.tight_layout() plt.show() Let's analyze the loss with each iteration during the training of our neural network; we will also plot the accuracies for validation and test set. You should always monitor validation and training loss as it can help you know whether your model is underfitting or overfitting: Test Loss 0.0824991761778 Test Accuracy 0.9813 As you can see, we are getting almost similar performance for our training and validation sets in terms of loss and accuracy. You can see how accuracy is increasing as the number of epochs increases. This shows that our network is learning. Now, we have trained and stored our model. 
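Since we are already watching the validation loss by hand, Keras callbacks can automate part of that monitoring. The following is a sketch of an optional refinement rather than part of the original recipe: EarlyStopping halts training when the validation loss stops improving, and ModelCheckpoint keeps the best weights seen so far on disk (the filename used here is our own choice).
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop if validation loss has not improved for 3 consecutive epochs
    EarlyStopping(monitor='val_loss', patience=3),
    # Keep the weights of the best epoch seen so far
    ModelCheckpoint('keras_mnist_best.h5', monitor='val_loss', save_best_only=True)
]

history = model.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=2,
                    validation_data=(X_test, Y_test), callbacks=callbacks)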
It's time to reload it and test it with the 10,000 test instances:

# Let's load the model for testing data
path2save = 'D:/PyDevWorkspace/EnsembleMachineLearning/Chapter_10/keras_mnist.h5'
mnist_model = load_model(path2save)

# We will use the evaluate function
loss_and_metrics = mnist_model.evaluate(X_test, Y_test, verbose=2)
print("Test Loss", loss_and_metrics[0])
print("Test Accuracy", loss_and_metrics[1])

# Create predictions on the test set
predicted_classes = mnist_model.predict_classes(X_test)

# See which we predicted correctly and which we did not
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]
print(len(correct_indices), " classified correctly")
print(len(incorrect_indices), " classified incorrectly")

So, here is the performance of our model on the test set:

9813  classified correctly
187  classified incorrectly

As you can see, we have misclassified 187 instances out of 10,000, which is a very good accuracy on such a complex dataset. In the next code block, we will analyze the cases where the predicted labels are wrong:

# Adapt figure size to accommodate 18 subplots
plt.rcParams['figure.figsize'] = (7,14)
plt.figure()

# Plot 9 correct predictions
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(6,3,i+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted: {}, Truth: {}".format(predicted_classes[correct], y_test[correct]))
    plt.xticks([])
    plt.yticks([])

# Plot 9 incorrect predictions
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(6,3,i+10)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Truth: {}".format(predicted_classes[incorrect], y_test[incorrect]))
    plt.xticks([])
    plt.yticks([])

plt.show()

If you look closely, our network fails on cases that are very difficult for a human to identify, too. So, we can say that we are getting quite good accuracy from a very simple model. We saw how to create, train, and test a neural network to perform digit classification using Keras and TensorFlow. If you found our post useful, do check out the book Ensemble Machine Learning to build ensemble models using TensorFlow and Python libraries such as scikit-learn and NumPy.

How to implement In-Memory OLTP on SQL Server in Linux

Fatema Patrawala
17 Feb 2018
11 min read
[box type="note" align="" class="" width=""]Below given article is an excerpt from the book SQL Server on Linux, authored by Jasmin Azemović. This book is a handy guide to setting up and implementing your SQL Server solution on the open source Linux platform.[/box] Today we will learn about the basics of In-Memory OLTP and how to implement it on SQL Server on Linux through the following topics: Elements of performance What is In-Memory OLTP Implementation Elements of performance How do you know if you have a performance issue in your database environment? Well, let's put it in these terms. You notice it (the good), users start calling technical support and complaining about how everything is slow (the bad) or you don't know about your performance issues (the ugly). Try to never get in to the last category. The good Achieving best performance is an iterative process where you need to define a set of tasks that you will execute on a regular basics and monitor their results. Here is a list that will give you an idea and guide you through this process: Establish the baseline Define the problem Fix one thing at a time Test and re-establish the baseline Repeat everything Establishing the baseline is the critical part. In most case scenarios, it is not possible without real stress testing. Example: How many users' systems can you handle on the current configuration? The next step is to measure the processing time. Do your queries or stored procedures require milliseconds, seconds, or minutes to execute? Now you need to monitor your database server using a set of tools and correct methodologies. During that process, you notice that some queries show elements of performance degradation. This is the point that defines the problem. Let's say that frequent UPDATE and DELETE operations are resulting in index fragmentation. The next step is to fix this issue with REORGANIZE or REBUILD index operations. Test your solution in the control environment and then in the production. Results can be better, same, or worse. It depends and there is no magic answer here. Maybe now something else is creating the problem: disk, memory, CPU, network, and so on. In this step, you should re-establish the old or a new baseline. Measuring performance process is something that never ends. You should keep monitoring the system and stay alert. The bad If you are in this category, then you probably have an issue with establishing the baseline and alerting the system. So, users are becoming your alerts and that is a bad thing. The rest of the steps are the same except re-establishing the baseline. But this can be your wake-up call to move yourself in the good category. The ugly This means that you don't know or you don't want to know about performance issues. The best case scenario is a headline on some news portal, but that is the ugly thing. Every decent DBA should try to be light years away from this category. What do you need to start working with performance measuring, monitoring, and fixing? Here are some tips that can help you: Know the data and the app Know your server and its capacity Use dynamic management views—DMVs: Sys.dm_os_wait_stats Sys.dm_exec_query_stats sys.dm_db_index_operational_stats Look for top queries by reads, writes, CPU, execution count Put everything in to LibreOffice Calc or another spreadsheet application and do some basic comparative math Fortunately, there is something in the field that can make your life really easy. 
It can boost your environment to the scale of warp speed (I am a Star Trek fan). What is In-Memory OLTP? SQL Server In-Memory feature is unique in the database world. The reason is very simple; because it is built-in to the databases' engine itself. It is not a separate database solution and there are some major benefits of this. One of these benefits is that in most cases you don't have to rewrite entire SQL Server applications to see performance benefits. On average, you will see 10x more speed while you are testing the new In-Memory capabilities. Sometimes you will even see up to 50x improvement, but it all depends on the amount of business logic that is done in the database via stored procedures. The greater the logic in the database, the greater the performance increase. The more the business logic sits in the app, the less opportunity there is for performance increase. This is one of the reasons for always separating database world from the rest of the application layer. It has built-in compatibility with other non-memory tables. This way you can optimize thememory you have for the most heavily used tables and leave others on the disk. This also means you won't have to go out and buy expensive new hardware to make large InMemory databases work; you can optimize In-Memory to fit your existing hardware. In-Memory was started in SQL Server 2014. One of the first companies that has started to use this feature during the development of the 2014 version was Bwin. This is an online gaming company. With In-Memory OLTP they improved their transaction speed by 16x, without investing in new expensive hardware. The same company has achieved 1.2 Million requests/second on SQL Server 2016 with a single machine using In-Memory OLTP: https://blogs.msdn.microsoft.com/sqlcat/2016/10/26/how-bwin-is-using-sql-server-2016-in-memory-oltp-to-achieve-unprecedented-performance-and-scale/ Not every application will benefit from In-Memory OLTP. If an application is not suffering from performance problems related to concurrency, IO pressure, or blocking, it's probably not a good candidate. If the application has long-running transactions that consume large amounts of buffer space, such as ETL processing, it's probably not a good candidate either. The best applications for consideration would be those that run high volumes of small fast transactions, with repeatable query plans such as order processing, reservation systems, stock trading, and ticket processing. The biggest benefits will be seen on systems that suffer performance penalties from tables that are having concurrency issues related to a large number of users and locking/blocking. Applications that heavily use the tempdb for temporary tables could benefit from In-Memory OLTP by creating the table as memory optimized, and performing the expensive sorts, and groups, and selective queries on the tables that are memory optimized. In-Memory OLTP quick start An important thing to remember is that the databases that will contain memory-optimized tables must have a MEMORY_OPTIMIZED_DATA filegroup. This filegroup is used for storing the checkpoint needed by SQL Server to recover the memory-optimized tables. 
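A quick way to check whether a database already has such a filegroup is to query the sys.filegroups catalog view; this short example is my own illustration rather than part of the book's walkthrough:

-- Returns a row if the current database already has a memory-optimized filegroup
SELECT name, type, type_desc
FROM sys.filegroups
WHERE type = 'FX';  -- 'FX' denotes a MEMORY_OPTIMIZED_DATA filegroup

If this returns no rows, the filegroup still has to be added, which is exactly what the following statements do for a new database and for an existing one.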
Here is a simple DDL SQL statement to create a database that is prepared for In-Memory tables: 1> USE master 2> GO 1> CREATE DATABASE InMemorySandbox 2> ON 3> PRIMARY (NAME = InMemorySandbox_data, 4> FILENAME = 5> '/var/opt/mssql/data/InMemorySandbox_data_data.mdf', 6> size=500MB), 7> FILEGROUP InMemorySandbox_fg 8> CONTAINS MEMORY_OPTIMIZED_DATA 9> (NAME = InMemorySandbox_dir, 10> FILENAME = 11> '/var/opt/mssql/data/InMemorySandbox_dir') 12> LOG ON (name = InMemorySandbox_log, 13> Filename= 14>'/var/opt/mssql/data/InMemorySandbox_data_data.ldf', 15> size=500MB) 16 GO   The next step is to alter the existing database and configure it to access memory-optimized tables. This part is helpful when you need to test and/or migrate current business solutions: --First, we need to check compatibility level of database. -- Minimum is 130 1> USE AdventureWorks 2> GO 3> SELECT T.compatibility_level 4> FROM sys.databases as T 5> WHERE T.name = Db_Name(); 6> GO compatibility_level ------------------- 120 (1 row(s) affected) --Change the compatibility level 1> ALTER DATABASE CURRENT 2> SET COMPATIBILITY_LEVEL = 130; 3> GO --Modify the transaction isolation level 1> ALTER DATABASE CURRENT SET 2> MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON 3> GO --Finlay create memory optimized filegroup 1> ALTER DATABASE AdventureWorks 2> ADD FILEGROUP AdventureWorks_fg CONTAINS 3> MEMORY_OPTIMIZED_DATA 4> GO 1> ALTER DATABASE AdventureWorks ADD FILE 2> (NAME='AdventureWorks_mem', 3> FILENAME='/var/opt/mssql/data/AdventureWorks_mem') 4> TO FILEGROUP AdventureWorks_fg 5> GO   How to create memory-optimized table? The syntax for creating memory-optimized tables is almost the same as the syntax for creating classic disk-based tables. You will need to specify that the table is a memory-optimized table, which is done using the MEMORY_OPTIMIZED = ON clause. A memory-optimized table can be created with two DURABILITY values: SCHEMA_AND_DATA (default) SCHEMA_ONLY If you defined a memory-optimized table with DURABILITY=SCHEMA_ONLY, it means that changes to the table's data are not logged and the data is not persisted on disk. However, the schema is persisted as part of the database metadata. A side effect is that an empty table will be available after the database is recovered during a restart of SQL Server on Linux service.   The following table is a summary of key differences between those two DURABILITY Options. When you create a memory-optimized table, the database engine will generate DML routines just for accessing that table, and load them as DLLs files. SQL Server itself does not perform data manipulation, instead it calls the appropriate DLL: Now let's add some memory-optimized tables to our sample database:     1> USE InMemorySandbox 2> GO -- Create a durable memory-optimized table 1> CREATE TABLE Basket( 2> BasketID INT IDENTITY(1,1) 3> PRIMARY KEY NONCLUSTERED, 4> UserID INT NOT NULL INDEX ix_UserID 5> NONCLUSTERED HASH WITH (BUCKET_COUNT=1000000), 6> CreatedDate DATETIME2 NOT NULL,   7> TotalPrice MONEY) WITH (MEMORY_OPTIMIZED=ON) 8> GO -- Create a non-durable table. 
1> CREATE TABLE UserLogs ( 2> SessionID INT IDENTITY(1,1) 3> PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=400000), 4> UserID int NOT NULL, 5> CreatedDate DATETIME2 NOT NULL, 6> BasketID INT, 7> INDEX ix_UserID 8> NONCLUSTERED HASH (UserID) WITH (BUCKET_COUNT=400000)) 9> WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY) 10> GO -- Add some sample records 1> INSERT INTO UserLogs VALUES 2> (432, SYSDATETIME(), 1), 3> (231, SYSDATETIME(), 7), 4> (256, SYSDATETIME(), 7), 5> (134, SYSDATETIME(), NULL), 6> (858, SYSDATETIME(), 2), 7> (965, SYSDATETIME(), NULL) 8> GO 1> INSERT INTO Basket VALUES 2> (231, SYSDATETIME(), 536), 3> (256, SYSDATETIME(), 6547), 4> (432, SYSDATETIME(), 23.6), 5> (134, SYSDATETIME(), NULL) 6> GO -- Checking the content of the tables 1> SELECT SessionID, UserID, BasketID 2> FROM UserLogs 3> GO 1> SELECT BasketID, UserID 2> FROM Basket 3> GO   What is natively compiled stored procedure? This is another great feature that comes comes within In-Memory package. In a nutshell, it is a classic SQL stored procedure, but it is compiled into machine code for blazing fast performance. They are stored as native DLLs, enabling faster data access and more efficient query execution than traditional T-SQL. Now you will create a natively compiled stored procedure to insert 1,000,000 rows into Basket: 1> USE InMemorySandbox 2> GO 1> CREATE PROCEDURE dbo.usp_BasketInsert @InsertCount int 2> WITH NATIVE_COMPILATION, SCHEMABINDING AS 3> BEGIN ATOMIC 4> WITH 5> (TRANSACTION ISOLATION LEVEL = SNAPSHOT, 6> LANGUAGE = N'us_english') 7> DECLARE @i int = 0 8> WHILE @i < @InsertCount 9> BEGIN 10> INSERT INTO dbo.Basket VALUES (1, SYSDATETIME() , NULL) 11> SET @i += 1 12> END 13> END 14> GO --Add 1000000 records 1> EXEC dbo.usp_BasketInsert 1000000 2> GO   The insert part should be blazing fast. Again, it depends on your environment (CPU, RAM, disk, and virtualization). My insert was done in less than three seconds, on an average machine. But significant improvement should be visible now. Execute the following SELECT statement and count the number of records:   1> SELECT COUNT(*) 2> FROM dbo.Basket 3> GO ----------- 1000004 (1 row(s) affected)   In my case, counting of one million records was less than one second. It is really hard to achieve this performance on any kind of disk. Let's try another query. We want to know how much time it will take to find the top 10 records where the insert time was longer than 10 microseconds:   1> SELECT TOP 10 BasketID, CreatedDate 2> FROM dbo.Basket 3> WHERE DATEDIFF 4> (MICROSECOND,'2017-05-30 15:17:20.9308732', CreatedDate) 5> >10 6> GO   Again, query execution time was less than a second. Even if you remove TOP and try to get all the records it will take less than a second (in my case scenario). Advantages of InMemory tables are more than obvious.   We learnt about the basic concepts of In-Memory OLTP and how to implement it on new and existing database. We also got to know that a memory-optimized table can be created with two DURABILITY values and finally, we created an In-Memory table. If you found this article useful, check out the book SQL Server on Linux, which covers advanced SQL Server topics, demonstrating the process of setting up SQL Server database solution in the Linux environment.        
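One more hedged sketch to complement the examples above: once memory-optimized tables are loaded, you may want to see how much memory they actually consume. The query below uses the sys.dm_db_xtp_table_memory_stats DMV; it is not from the book, so treat the column list as indicative and check it against your SQL Server version:

-- Approximate memory footprint of memory-optimized tables in the current database
SELECT OBJECT_NAME(t.object_id)            AS table_name,
       t.memory_allocated_for_table_kb,
       t.memory_used_by_table_kb,
       t.memory_allocated_for_indexes_kb,
       t.memory_used_by_indexes_kb
FROM sys.dm_db_xtp_table_memory_stats AS t
ORDER BY t.memory_used_by_table_kb DESC;

Watching these numbers while you load data is a simple way to confirm that your In-Memory tables fit comfortably within the memory you have budgeted for them.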

How to use LabVIEW for data acquisition

Fatema Patrawala
17 Feb 2018
14 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Data Acquisition Using LabVIEW written by Behzad Ehsani. In this book you will learn to transform physical phenomena into computer-acceptable data using an object-oriented language.[/box] Today we will discuss basics of LabVIEW, focus on its installation with an example of a LabVIEW program which is generally known as Virtual Instrument (VI). Introduction to LabVIEW LabVIEW is a graphical developing and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood. However, the ease of use and power of LabVIEW is somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the concept of drag and drop used in LabVIEW appears to do away with the required basics of programming concepts and classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that, in many higher-level development and testing environments, especially when using complicated test equipment and complex mathematical calculations or even creating embedded software, LabVIEW's approach will be a much more time-efficient and bug-free environment which otherwise would require several lines of code in a traditional text based programming environment, one must be aware of LabVIEW's strengths and possible weaknesses.   LabVIEW does not completely replace the need for traditional text based languages and, depending on the entire nature of a project, LabVIEW or another traditional text based language such as C may be the most suitable programming or test environment. Installing LabVIEW Installation of LabVIEW is very simple and it is just as routine as any modern-day program installation; that is, insert the DVD 1 and follow the onscreen guided installation steps. LabVIEW comes in one DVD for the Mac and Linux versions but in four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this book, we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this book, we assume the user is fully capable of installing the program. Installation is also well documented by National Instruments (NI) and the mandatory 1-year support purchase with each copy of LabVIEW is a valuable source of live and e-mail help. Also, the NI website (www.ni.com) has many user support groups that are also a great source of support, example codes, discussion groups, local group events and meetings of fellow LabVIEW developers, and so on. It's worth noting for those who are new to the installation of LabVIEW that the installation DVDs include much more than what an average user would need and pay for. 
We do strongly suggest that you install additional software (beyond what has been purchased and licensed or immediately needed!). This additional software is fully functional in demo mode for 7 days, which may be extended for about a month with online registration. This is a very good opportunity to have hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing the other software available on the DVDs may help in further development of a given project. Just imagine, if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software may be possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all the software on all the DVDs is selected: When installing a fresh version of LabVIEW, if you do decide to observe the given advice, make sure to click on the + sign next to each package you decide to install and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to NI, is an ANSI C integrated development environment. Also note that, by default, NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition and appropriate drivers for communications and instrument(s) control must be installed before LabVIEW can interact with external equipment. Also, note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors are packaging their product with LabVIEW drivers and example codes. If a driver is not readily available, NI has programmers that would do just that. But this would come at a cost to the user. VI Package Manager, now installed as a part of standard installation, is also a must these days. NI distributes third-party software and drivers and public domain packages via VI Package Manager. We are going to use examples using Arduino (http://www.arduino.cc) microcontrollers in later chapters of this book. Appropriate software and drivers for these microcontrollers are installed via VI Package Manager. You can install many public domain packages that further install many useful LabVIEW toolkits to a LabVIEW installation and can be used just as those that are delivered professionally by NI. Finally, note that the more modules, packages, and software that are selected to be installed, the longer it will take to complete the installation. This may sound like making an obvious point but, surprisingly enough, installation of all software on the three DVDs (for Windows) takes up over 5 hours! On a standard laptop or PC we used. Obviously, a more powerful PC (such as one with a solid state hard drive) may not take such long time. 
Basic LabVIEW VI Once the LabVIEW application is launched, by default two blank windows open simultaneously–a Front Panel and a Block Diagram window–and a VI is created: VIs are the heart and soul of LabVIEW. They are what separate LabVIEW from all other text-based development environments. In LabVIEW, everything is an object which is represented graphically. A VI may only consist of a few objects or hundreds of objects embedded in many subVIs. These graphical representations of a thing, be it a simple while loop, a complex mathematical concept such as Polynomial Interpolation, or simply a Boolean constant, are all graphically represented. To use an object, right-click inside the Block Diagram or Front Panel window, a pallet list appears. Follow the arrow and pick an object from the list of objects from subsequent pallet and place it on the appropriate window. The selected object can now be dragged and placed on different locations on the appropriate window and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit objects onto the appropriate window: Each object has one (or several) wire connections going into it as input(s) and coming out as its output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the input and output of one or more objects. Example 1 – counter with a gauge This is a fairly simple program with simple user interaction. Once the program has been launched, it uses a while loop to wait for the user input. This is a typical behavior of almost any user-friendly program. For example, if the user launches Microsoft Office, the program launches and waits for the user to pick a menu item, click on a button, or perform any other action that the program may provide. Similarly, this program starts execution but waits in a loop for the user to choose a command. In this case only a simple Start or Stop is available. If the Start button is clicked, the program uses a for loop function to simply count from 0 to 10 in intervals of 200 milliseconds. After each count is completed, the gauge on the Front Panel, the GUI part of the program, is updated to show the current count. The counter is then set to the zero location of the gauge and the program awaits subsequent user input. If the Start button is clicked again, this action is repeated, and, obviously, if the Stop button is clicked, the program exits. Although very simple, in this example, you can find many of the concepts that are often used in a much more elaborate program. Let's walk through the code and point out some of these concepts. The following steps not only walk the reader through the example code but are also a brief tutorial on how to use LabVIEW, how to utilize each working window, and how to wire objects. Launch LabVIEW and from the File menu, choose New VI and follow the steps:    Right-click on the Block Diagram window.    From Programming Functions, choose Structures and select While Loop.    Click (and hold) and drag the cursor to create a (resizable) rectangle. On the bottom-left corner, right-click on the wire to the stop loop and choose Create a control. 
Note that a Stop button appears on both the Block Diagram and Front panel windows. Inside the while loop box, right-click on the Block Diagram window and from Programming Function, choose Structures and select Case Structures. Click and (and hold) and drag the cursor to create a (resizable) rectangle. On the Front Panel window, next to the Stop button created, right-click and from Modern Controls, choose Boolean and select an OK button. Double-click on the text label of the OK button and replace the OK button text with Start. Note that an OK button is also created on the Block Diagram window and the text label on that button also changed when you changed the text label on the Front Panel window. On the Front Panel window, drag-and-drop the newly created Start button next to the tiny green question mark on the left-hand side of the Case Structure box, outside of the case structure but inside the while loop. Wire the Start button to the Case Structure. Inside the Case Structure box, right-click on the Block Diagram window and from Programming Function, choose Structures and select For Loop. Click and (and hold) and drag the cursor to create a (resizable) rectangle. Inside the Case Structure box, right-click on N on the top-left side of the Case Structure and choose Create Constant. An integer blue box with a value of 0 will be connected to the For Loop. This is the number of irritations the for loop is going to have. Change 0 to 11. Inside the For Loop box, right click on the Block Diagram widow and from Programming Function, choose Timing and select Wait(ms). Right-click on the Wait function created in step 10 and connect a integer value of 200 similar to step 9. On the Front Panel window, right-click and from Modern functions, choose Gauge. Note that a Gauge function will appear on the Block Diagram window too. If the function is not inside the For Loop, drag and drop it inside the For Loop. Inside the For loop, on the Block Diagram widow, connect the iteration count i to the Gauge. On the Block Diagram, right-click on the Gauge, and under the Create submenu, choose Local variable. If it is not already inside the while loop, drag and drop it inside the while loop but outside of the case structure. Right-click on the local variable created in step 15 and connect a Zero to the input of the local variable. Click on the Clean Up icon on the main menu bar on the Block Diagram window and drag and move items on the Front Panel window so that both windows look similar to the following screenshots: Creating a project is a must When LabVIEW is launched, a default screen such as in the following screenshot appears on the screen: The most common way of using LabVIEW, at least in the beginning of a small project or test program, is to create a new VI. A common rule of programming is that each function, or in this case VI, should not be larger than a page. Keep in mind that, by nature, LabVIEW will have two windows to begin with and, being a graphical programming environment only, each VI may require more screen space than the similar text based development environment. To start off development and in order to set up all devices and connections required for tasks such as data acquisition, a developer may get the job done by simply creating one, and, more likely several VIs. 
Speaking from experience among engineers and other developers (in other words, in situations where R&D looms more heavily on the project than collecting raw data), quick VIs are more efficient initially, but almost all projects that start in this fashion end up growing very quickly and other people and other departments will need be involved and/or be fed the gathered data. In most cases, within a short time from the beginning of the project, technicians from the same department or related teams may be need to be trained to use the software in development. This is why it is best to develop the habit of creating a new project from the very beginning. Note the center button on the left-hand window in the preceding screenshot. Creating a new project (as opposed to creating VIs and sub-VIs) has many advantages and it is a must if the program created will have to run as an executable on computers that do not have LabVIEW installed on them. Later versions of LabVIEW have streamlined the creation of a project and have added many templates and starting points to them. Although, for the sake of simplicity, we created our first example with the creation of a simple VI, one could almost as easily create a project and choose from many starting points, templates, and other concepts (such as architecture) in LabVIEW. The most useful starting point for a complete and user-friendly application for data acquisition would be a state machine. Throughout the book, we will create simple VIs as a quick and simple way to illustrate a point but, by the end of the book, we will collect all of the VIs, icons, drivers, and sub-VIs in one complete state machine, all collected in one complete project. From the project created, we will create a standalone application that will not need the LabVIEW environment to execute, which could run on any computer that has LabVIEW runtime engine installed on it. To summarize, we went through the basics of LabVIEW and the main functionality of each of its icons by way of an actual user-interactive example. LabVIEW is capable of developing embedded systems, fuzzy logic, and almost everything in between! If you are interested to know more about LabVIEW, check out this book Data Acquisition Using LabVIEW.    

How to perform predictive forecasting in SAP Analytics Cloud

Kunal Chaudhari
17 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Riaz Ahmed titled Learning SAP Analytics Cloud. This book involves features of the SAP Analytics Cloud which will help you collaborate, predict and solve business intelligence problems with cloud computing.[/box] In this article we will learn how to use predictive forecasting with the help of a trend time series chart to see revenue trends in a range of a year. Time series forecasting is only supported for planning models in SAP Analytics Cloud. So, you need planning rights and a planning license to run a predictive time-series forecast. However, you can add predictive forecast by creating a trend time series chart based on an analytical model to estimate future values. In this article, you will use a trend time series chart to view net revenue trends throughout the range of a year. A predictive time-series forecast runs an algorithm on historical data to predict future values for specific measures. For this type of chart, you can forecast a maximum of three different measures, and you have to specify the time for the prediction and the past time periods to use as historical data. Add a blank chart from the Insert toolbar. Set Data Source to the BestRun_Demo model. Select the Time Series chart from the Trend category. In the Measures section, click on the Add Measure link, and select Net Revenue. Finally, click on the Add Dimension link in the Time section, and select Date as the chart’s dimension: The output of your selections is depicted in the first view in the following screenshot. Every chart you create on your story page has its own unique elements that let you navigate and drill into details. The trend time series chart also allows you to zoom in to different time periods and scroll across the entire timeline. For example, the first figure in the following illustration provides a one-year view (A) of net revenue trends, that is from January to December 2015. Click on the six months link (B) to see the corresponding output, as illustrated in the second view. Drag the rectangle box (C) to the left or right to scroll across the entire timeline: Adding a forecast Click on the last data point representing December 2015, and select Add Forecast from the More Actions menu (D) to add a forecast: You see the Predictive Forecast panel on the right side, which displays the maximum number of forecast periods. Using the slider (E) in this section, you can reduce the number of forecast periods. By default, you see the maximum number (in the current scenario, it is seven) in the slider, which is determined by the amount of historical data you have. In the Forecast On section, you see the measure (F) you selected for the chart. If required, you can forecast a maximum of three different measures in this type of chart that you can add in the Builder panel. For the time being, click on OK to accept the default values for the forecast, as illustrated in the following screenshot: The forecast will be added to the chart. It is indicated by a highlighted area (G) and a dotted line (H). Click on the 1 year link (I) to see an output similar to the one illustrated in the following screenshot under the Modifying forecast section. As you can see, there are several data points that represent forecast. The top and bottom of the highlighted area indicate the upper and lower bounds of the prediction range, and the data points fall in the middle (on the dotted line) of the forecast range for each time period. 
Select a data point to see the Upper Confidence Bound (J) and Lower Confidence Bound (K) values. Modifying forecast You can modify a forecast using the link provided in the Forecast section at the bottom of the Builder panel. Select the chart, and scroll to the bottom of the Builder panel. Click on the Edit icon (L) to see the Predictive Forecast panel again. Review your settings, and make the required changes in this panel. For example, drag the slider toward the left to set the Forecast Periods value to 3 (M). Click on OK to save your settings. The chart should now display the forecast for three months--January, February, and March 2016 (N): Adding a time calculation If you want to display values such as year-over-year sales trends or year-to-date totals in your chart, then you can utilize the time calculation feature of SAP Analytics Cloud. The time calculation feature provides you with several calculation options. In order to use this feature, your chart must contain a time dimension with the appropriate level of granularity. For example, if you want to see quarter-over-quarter results, the time dimension must include quarterly or even monthly results. The space constraint prevents us from going through all these options. However, we will utilize the year-over-year option to compare yearly results in this article to get an idea about this feature. Execute the following instructions to first create a bar chart that shows the sold quantities of the four product categories. Then, add a time calculation to the chart to reveal the year-over-year changes in quantity sold for each category. As usual, add a blank chart to the page using the chart option on the Insert toolbar. Select the Best Run model as Data Source for the chart. Select the Bar/Column chart from the Comparison category. In the Measures section, click on the Add Measure link, and select Quantity Sold. Click on the Add Dimension link in the Dimensions section, and select Product as the chart’s dimension, as shown here: The chart appears on the page. At this stage, if you click on the More icon representing Quantity sold, you will see that the Add Time Calculation option (A) is grayed out. This is because time calculations require a time dimension to the chart, which we will add next. Click on the Add Dimension link in the Dimensions section, and select Date to add this time dimension to the chart. The chart transforms, as illustrated in the following screenshot: To display the results in the chart at the year level, you need to apply a filter as follows: Click on the filter icon in the Date dimension, and select Filter by Member. In the Set Members for Date dialog box, expand the all node, and select 2014, 2015, and 2016, individually. Once again, the chart changes to reflect the application of filter, as illustrated in the following screenshot: Now that a time dimension has been added to the chart, we can add a time calculation to it as follows: Click on the More icon in the Quantity sold measure. Select Add Time Calculation from the menu. Choose Year Over Year. New bars (A) and a corresponding legend (B) will be added to the chart, which help you compare yearly results, as shown in the following screenshot: To summarize, we provided hands-on exposure on predictive forecasting in SAP Analytics Cloud, where you learned about how to use a trend time series chart to view net revenue trends throughout the range of a year. 
If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud, to get an understanding of SAP Analytics Cloud platform and how to create better BI solutions.  

How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Savia Lobo
16 Feb 2018
6 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled as Natural Language Processing with Python Cookbook. This book will give unique recipes to know various aspects of performing Natural Language Processing with NLTK—a leading Python platform for NLP.[/box] Today we will learn to use deep recurrent neural networks (RNN) to predict the next character based on the given length of a sentence. This way of training a model is able to generate automated text continuously, which can imitate the writing style of the original writer with enough training on the number of epochs and so on. Getting ready... The Project Gutenberg eBook of the complete works of William Shakespeare's dataset is used to train the network for automated text generation. Data can be downloaded from http:// www.gutenberg.org/ for the raw file used for training: >>> from  future import print_function >>> import numpy as np >>> import random >>> import sys The following code is used to create a dictionary of characters to indices and vice-versa mapping, which we will be using to convert text into indices at later stages. This is because deep learning models cannot understand English and everything needs to be mapped into indices to train these models: >>> path = 'C:UsersprataDocumentsbook_codes NLP_DL shakespeare_final.txt' >>> text = open(path).read().lower() >>> characters = sorted(list(set(text))) >>> print('corpus length:', len(text)) >>> print('total chars:', len(characters)) >>> char2indices = dict((c, i) for i, c in enumerate(characters)) >>> indices2char = dict((i, c) for i, c in enumerate(characters)) How to do it… Before training the model, various preprocessing steps are involved to make it work. The following are the major steps involved: Preprocessing: Prepare X and Y data from the given entire story text file and converting them into indices vectorized format. Deep learning model training and validation: Train and validate the deep learning model. Text generation: Generate the text with the trained model. How it works... The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here we have chosen character length. This needs to be considered as 40 to determine the next best single character, which seems to be very fair to consider. Also, this extraction process jumps by three steps to avoid any overlapping between two consecutive extractions, to create a dataset more fairly: # cut the text in semi-redundant sequences of maxlen characters >>> maxlen = 40 >>> step = 3 >>> sentences = [] >>> next_chars = [] >>> for i in range(0, len(text) - maxlen, step): ... sentences.append(text[i: i + maxlen]) ... next_chars.append(text[i + maxlen]) ... print('nb sequences:', len(sentences)) The following screenshot depicts the total number of sentences considered, 193798, which is enough data for text generation: The next code block is used to convert the data into a vectorized format for feeding into deep learning models, as the models cannot understand anything about text, words, sentences and so on. Initially, total dimensions are created with all zeros in the NumPy array and filled with relevant places with dictionary mappings: # Converting indices into vectorized format >>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool) >>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool) >>> for i, sentence in enumerate(sentences): ... 
for t, char in enumerate(sentence): ... X[i, t, char2indices[char]] = 1 ... y[i, char2indices[next_chars[i]]] = 1 >>> from keras.models import Sequential >>> from keras.layers import Dense, LSTM,Activation,Dropout >>> from keras.optimizers import RMSprop The deep learning model is created with RNN, more specifically Long Short-Term Memory networks with 128 hidden neurons, and the output is in the dimensions of the characters. The number of columns in the array is the number of characters. Finally, the softmax function is used with the RMSprop optimizer. We encourage readers to try with other various parameters to check out how results vary: #Model Building >>> model = Sequential() >>> model.add(LSTM(128, input_shape=(maxlen, len(characters)))) >>> model.add(Dense(len(characters))) >>> model.add(Activation('softmax')) >>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01)) >>> print (model.summary()) As mentioned earlier, deep learning models train on number indices to map input to output (given a length of 40 characters, the model will predict the next best character). The following code is used to convert the predicted indices back to the relevant character by determining the maximum index of the character: # Function to convert prediction into index >>> def pred_indices(preds, metric=1.0): ... preds = np.asarray(preds).astype('float64') ... preds = np.log(preds) / metric ... exp_preds = np.exp(preds) ... preds = exp_preds/np.sum(exp_preds) ... probs = np.random.multinomial(1, preds, 1) ... return np.argmax(probs) The model will be trained over 30 iterations with a batch size of 128. And also, the diversity has been changed to see the impact on the predictions: # Train and Evaluate the Model >>> for iteration in range(1, 30): ... print('-' * 40) ... print('Iteration', iteration) ... model.fit(X, y,batch_size=128,epochs=1).. ... start_index = random.randint(0, len(text) - maxlen - 1) ... for diversity in [0.2, 0.7,1.2]: ... print('n----- diversity:', diversity) ... generated = '' ... sentence = text[start_index: start_index + maxlen] ... generated += sentence ... print('----- Generating with seed: "' + sentence + '"') ... sys.stdout.write(generated) ... for i in range(400): ... x = np.zeros((1, maxlen, len(characters))) ... for t, char in enumerate(sentence): ... x[0, t, char2indices[char]] = 1. ... preds = model.predict(x, verbose=0)[0] ... next_index = pred_indices(preds, diversity) ... pred_char = indices2char[next_index] ... generated += pred_char ... sentence = sentence[1:] + pred_char ... sys.stdout.write(pred_char) ... sys.stdout.flush() ... print("nOne combination completed n") The results are shown in the next screenshot to compare the first iteration (Iteration 1) and final iteration (Iteration 29). It is apparent that with enough training, the text generation seems to be much better than with Iteration 1: Text generation after Iteration 29 is shown in this image: Though the text generation seems to be magical, we have generated text using Shakespeare's writings, proving that with the right training and handling, we can imitate any style of writing of a particular writer. If you found this post useful, you may check out this book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.  
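As an aside on the diversity value passed to pred_indices above: it acts as a sampling temperature. The short, self-contained sketch below is an illustration of the idea rather than code from the book; it shows how rescaling the log-probabilities changes the distribution that the next character is drawn from:

import numpy as np

def reweight(preds, temperature):
    # Same transformation as pred_indices: rescale log-probabilities, then renormalize
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    return exp_preds / np.sum(exp_preds)

probs = np.array([0.5, 0.3, 0.15, 0.05])   # toy next-character distribution
for t in [0.2, 0.7, 1.2]:
    print(t, np.round(reweight(probs, t), 3))

A low diversity such as 0.2 sharpens the distribution, so the model plays it safe and repeats common patterns, while a value above 1 flattens it and produces more surprising (and more error-prone) text—which matches the behaviour visible in the iteration screenshots.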

How to install Elasticsearch in Ubuntu and Windows

Fatema Patrawala
16 Feb 2018
3 min read
[box type="note" align="" class="" width=""]This article is an extract from the book, Mastering Elastic Stack  co-authored by Ravi Kumar Gupta and Yuvraj Gupta.This book will brush you up with basic knowledge on implementing the Elastic Stack and then dives deep into complex and advanced implementations. [/box] In today’s tutorial we aim to learn Elasticsearch v5.1.1 installation for Ubuntu and Windows. Installation of Elasticsearch on Ubuntu 14.04 In order to install Elasticsearch on Ubuntu, refer to the following steps: Download Elasticsearch 5.1.1 as a debian package using terminal: wget https://artifacts.elastic.co /downloads/elasticsearch/elasticsearch-5.1.1.deb 2. Install the debian package using following command: sudo dpkg -i elasticsearch-5.1.1.deb Elasticsearch will be installed in /usr/share/elasticsearch directory. The configuration files will be present at /etc/elasticsearch. The init script will be present at /etc/init.d/elasticsearch. The log files will be present within /var/log/elasticsearch directory. 3. Configure Elasticsearch to run automatically on bootup . If you are using SysV init distribution, then run the following command: sudo update-rc.d elasticsearch defaults 95 10 The preceding command will print on screen: Adding system startup for, /etc/init.d/elasticsearch Check status of Elasticsearch using following command: sudo service elasticsearch status Run Elasticsearch as a service using following command: sudo service elasticsearch start Elasticsearch may not start if you have any plugin installed which is not supported in ES-5.0.x version onwards. As plugins have been deprecated, it is required to uninstall any plugin if exists in prior version of ES. Remove a plugin after going to ES Home using following command: bin/elasticsearch-plugin remove head Usage of Elasticsearch command: sudo service elasticsearch {start|stop|restart|force- reload|status} If you are using systemd distribution, then run following command: sudo /bin/systemctl daemon-reload sudo /bin/systemctl enable elasticsearch.service To verify elasticsearch installation open open http://localhost:9200 in browser or run the following command from command line: curl -X GET http://localhost:9200 Installation of Elasticsearch on Windows In order to install Elasticsearch on Windows, refer to the following steps: Download Elasticsearch 5.1.1 version from its site using the following link: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch -5.1.1.zip Upon opening the link, click on it and it will download the ZIP package. 2. Extract the downloaded ZIP package by unzipping it using WinRAR, 7-Zip, and other such extracting softwares (if you don't have one of these then download it). This will extract the files and folders in the directory. 3. Then click on the extracted folder and navigate the folder to reach inside the bin folder. 4. Click on the elasticsearch.bat file to run Elasticsearch. If this window is closed Elasticsearch will stop running, as the node will shut down. 5. 
To verify Elasticsearch installation, open http://localhost:9200 in the browser: Installation of Elasticsearch as a service After installing Elasticsearch as previously mentioned, open Command Prompt after navigating to the bin folder and use the following command: elasticsearch-service.bat install Usage: elasticsearch-service.bat install | remove | start | stop | manager To summarize, we learnt installation of Elasticsearch on Ubuntu and Windows. If you are keen to know more about how to work with the Elastic Stack in a production environment, you can grab our comprehensive guide Mastering Elastic Stack.  
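On either platform, once the node is up you can go one step beyond the basic root-endpoint check. The following hedged example is not part of the book's steps; it simply asks the cluster health API for a quick status summary:

curl -X GET "http://localhost:9200/_cluster/health?pretty"

A green or yellow status means the node is serving requests; yellow simply indicates that replica shards are unassigned, which is expected on a single-node installation.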