How-To Tutorials


Making a simple Web based SSH client using Node.js and Socket.io

Jakub Mandula
28 Oct 2015
7 min read
If you are reading this post, you probably know what SSH stands for. But just for the sake of formality, here we go: SSH stands for Secure Shell. It is a network protocol for secure access to the shell on a remote computer. You can do much more over SSH besides commanding your computer. You can find further information here: http://en.wikipedia.org/wiki/Secure_Shell.

In this post, we are going to create a very simple web terminal. And when I say simple, I mean it! However much you like colors, it will not support them, because the parsing is just beyond the scope of this post. If you want a good client-side terminal library, use term.js. It is made by the same guy who wrote pty.js, which we will be using. It is able to handle pretty much all key events and COLORS!

Installation

I am going to assume you already have node and npm installed. First we will install all of the npm packages we will be using:

npm install express pty.js socket.io

Express is a super cool web framework for Node. We are going to use it to serve our static files. I know it is a bit overkill, but I like Express. pty.js is where the magic will be happening. It forks processes into virtual pseudo terminals and provides bindings for communication. Socket.io is what we will use to transmit the data from the web browser to the server and back. It uses modern WebSockets, but provides fallbacks for backward compatibility. Anytime you want to create a real-time application, Socket.io is the way to go.

Planning

First things first, we need to think about what we want the program to do. We want the program to create an instance of a shell on the server (remote machine) and send all of the text to the browser. Back in the browser, we want to capture any user events and send them back to the server shell.

The WebSSH server

This is the code that will power the terminal forwarding. Open a new file named server.js and start by importing all of the libraries:

var express = require('express');
var https = require('https');
var http = require('http');
var fs = require('fs');
var pty = require('pty.js');

Set up express:

// Setup the express app
var app = express();
// Static file serving
app.use("/", express.static("./"));

Next we are going to create the server.

// Creating an HTTP server
var server = http.createServer(app).listen(8080)

If you want to use HTTPS, which you probably will, you need to generate a key and certificate and import them as shown.

var options = {
  key: fs.readFileSync('keys/key.pem'),
  cert: fs.readFileSync('keys/cert.pem')
};

Then use the options object to create the actual server. Notice that this time we are using the https package.

// Create an HTTPS server
var server = https.createServer(options, app).listen(8080)

CAUTION: Even if you use HTTPS, do not use this example program on the Internet. You are not authenticating the client in any way and are thus providing a free open gate to your computer. Please make sure you only use this on your private network, protected by a firewall!

Now bind the socket.io instance to the server:

var io = require('socket.io')(server);

After this, we can set up the place where the magic happens.
// When a new socket connects
io.on('connection', function(socket){
  // Create terminal
  var term = pty.spawn('sh', [], {
    name: 'xterm-color',
    cols: 80,
    rows: 30,
    cwd: process.env.HOME,
    env: process.env
  });
  // Listen on the terminal for output and send it to the client
  term.on('data', function(data){
    socket.emit('output', data);
  });
  // Listen on the client and send any input to the terminal
  socket.on('input', function(data){
    term.write(data);
  });
  // When socket disconnects, destroy the terminal
  socket.on("disconnect", function(){
    term.destroy();
    console.log("bye");
  });
});

In this block, all we do is wait for new connections. When we get one, we spawn a new virtual terminal and start to pump the data from the terminal to the socket and vice versa. After the socket disconnects, we make sure to destroy the terminal.

As you may have noticed, I am using the simple sh shell. I did this mainly because I don't have a fancy prompt on it. Because we are not adding any parsing logic, my bash prompt would show up like this: ]0;piman@mothership: ~ _[01;32m✓ [33mpiman_[0m ↣ _[1;34m[~]_[37m$[0m - Eww! But you may use any shell you like. This is all that we need on the server side. Save the file and close it.

Client side

The client side is going to be just a very simple HTML file. Start with a very simple HTML markup:

<!doctype html>
<html>
<head>
<title>SSH Client</title>
<script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/socket.io/1.3.5/socket.io.min.js"></script>
<script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
<style>
body {
  margin: 0;
  padding: 0;
}
.terminal {
  font-family: monospace;
  color: white;
  background: black;
}
</style>
</head>
<body>
<h1>SSH</h1>
<div class="terminal">
</div>
<script>
</script>
</body>
</html>

I am downloading the client-side libraries jquery and socket.io from cdnjs. All of the client code will be written in the script tag below the terminal div. Surprisingly, the code is very simple:

// Connect to the socket.io server
var socket = io.connect('http://localhost:8080');
// Wait for data from the server
socket.on('output', function (data) {
  // Insert some line breaks where they belong
  data = data.replace("\n", "<br>");
  data = data.replace("\r", "<br>");
  // Append the data to our terminal
  $('.terminal').append(data);
});
// Listen for user input and pass it to the server
$(document).on("keypress", function(e){
  var char = String.fromCharCode(e.which);
  socket.emit("input", char);
});

Notice that we do not have to explicitly append the text the client types to the terminal, mainly because the server echoes it back anyway.

Now we are done! Run the server and open up the URL in your browser.

node server.js

You should see a small prompt and be able to start typing commands. You can now explore your machine from the browser! Remember that our web terminal does not support the Tab, Ctrl, Backspace, or Esc characters. Implementing this is your homework.

Conclusion

I hope you found this tutorial useful. You can apply the knowledge in any real-time application where communication with the server is critical. All the code is available here. Please note that if you'd like to use a browser terminal, I strongly recommend term.js. It supports colors and styles and all the basic keys, including Tab, Backspace, and so on. I use it in my PiDashboard project. It is much cleaner and less tedious than the example I have here. I can't wait to see what amazing apps you will invent based on this.
About the Author

Jakub Mandula is a student interested in anything to do with technology, computers, mathematics, or science.


How to handle categorical data for machine learning algorithms

Packt Editorial Staff
20 Sep 2019
9 min read
The quality of data and the amount of useful information are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly before we feed data into a machine learning algorithm. In this article, with simple yet effective examples, we will explain how to deal with categorical data in machine learning algorithms and how to map ordinal and nominal feature values to integer representations.

The article is an excerpt from the book Python Machine Learning - Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python. It acts as both a clear step-by-step tutorial and a reference you'll keep coming back to as you build your machine learning systems.

It is not uncommon for real-world datasets to contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature, since it typically doesn't make sense to say that, for example, red is larger than blue.

Categorical data encoding with pandas

Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem:

>>> import pandas as pd
>>> df = pd.DataFrame([
...            ['green', 'M', 10.1, 'class1'],
...            ['red', 'L', 13.5, 'class2'],
...            ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
    color size  price classlabel
0   green    M   10.1     class1
1     red    L   13.5     class2
2    blue   XL   15.3     class1

As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column.

Mapping ordinal features

To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2:

>>> size_mapping = {
...                 'XL': 3,
...                 'L': 2,
...                 'M': 1}
>>> df['size'] = df['size'].map(size_mapping)
>>> df
    color  size  price classlabel
0   green     1   10.1     class1
1     red     2   13.5     class2
2    blue     3   15.3     class1

If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary, inv_size_mapping = {v: k for k, v in size_mapping.items()}, which can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously.
We can use it as follows:

>>> inv_size_mapping = {v: k for k, v in size_mapping.items()}
>>> df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object

Encoding class labels

Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0:

>>> import numpy as np
>>> class_mapping = {label: idx for idx, label in
...                  enumerate(np.unique(df['classlabel']))}
>>> class_mapping
{'class1': 0, 'class2': 1}

Next, we can use the mapping dictionary to transform the class labels into integers:

>>> df['classlabel'] = df['classlabel'].map(class_mapping)
>>> df
    color  size  price  classlabel
0   green     1   10.1           0
1     red     2   13.5           1
2    blue     3   15.3           0

We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation:

>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
    color  size  price classlabel
0   green     1   10.1     class1
1     red     2   13.5     class2
2    blue     3   15.3     class1

Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this:

>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])

Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation:

>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)

Performing one-hot encoding on nominal features

In the Mapping ordinal features section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows:

>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows: blue = 0, green = 1, red = 2.

If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green.
Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal.

A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module:

>>> from sklearn.preprocessing import OneHotEncoder
>>> X = df[['color', 'size', 'price']].values
>>> color_ohe = OneHotEncoder()
>>> color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

Note that we applied the OneHotEncoder to a single column (X[:, 0].reshape(-1, 1)) only, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, column(s)) tuples, as follows:

>>> from sklearn.compose import ColumnTransformer
>>> X = df[['color', 'size', 'price']].values
>>> c_transf = ColumnTransformer([
...     ('onehot', OneHotEncoder(), [0]),
...     ('nothing', 'passthrough', [1, 2])
... ])
>>> c_transf.fit_transform(X).astype(float)
array([[0.0, 1.0, 0.0, 1, 10.1],
       [0.0, 0.0, 1.0, 2, 13.5],
       [1.0, 0.0, 0.0, 3, 15.3]])

In the preceding code example, we specified that we only want to modify the first column and leave the other two columns untouched via the 'passthrough' argument.

An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies method implemented in pandas. Applied to a DataFrame, the get_dummies method will only convert string columns and leave all other columns unchanged:

>>> pd.get_dummies(df[['price', 'color', 'size']])
    price  size  color_blue  color_green  color_red
0    10.1     1           0            1          0
1    13.5     2           0            0          1
2    15.3     3           1            0          0

When we are using one-hot encoding on datasets, we have to keep in mind that it introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved, since observing color_green=0 and color_red=0 implies that the observation must be blue.

If we use the get_dummies function, we can drop the first column by passing a True argument to the drop_first parameter, as shown in the following code example:

>>> pd.get_dummies(df[['price', 'color', 'size']],
...                drop_first=True)
    price  size  color_green  color_red
0    10.1     1            1          0
1    13.5     2            0          1
2    15.3     3            0          0

In order to drop a redundant column via the OneHotEncoder, we need to set drop='first' and set categories='auto' as follows:

>>> color_ohe = OneHotEncoder(categories='auto', drop='first')
>>> c_transf = ColumnTransformer([
...            ('onehot', color_ohe, [0]),
...            ('nothing', 'passthrough', [1, 2])
... ])
>>> c_transf.fit_transform(X).astype(float)
array([[ 1. ,  0. ,  1. , 10.1],
       [ 0. ,  1. ,  2. , 13.5],
       [ 0. ,  0. ,  3. , 15.3]])

In this article, we have gone through some of the methods to deal with categorical data in datasets. We distinguished between nominal and ordinal features, and with examples we explained how they can be handled. To harness the power of the latest Python open source libraries in machine learning, check out the book Python Machine Learning - Third Edition, written by Sebastian Raschka and Vahid Mirjalili.

Other interesting reads in data:

The best business intelligence tools 2019: when to use them and how much they cost
Introducing Microsoft's AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine
Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report


What is LSTM?

Richard Gall
11 Apr 2018
3 min read
What does LSTM stand for?

LSTM stands for long short-term memory. It is a model, or architecture, that extends the memory of recurrent neural networks. Typically, recurrent neural networks have 'short-term memory' in that they use persistent previous information in the current neural network. Essentially, the previous information is used in the present task. That means we do not have a list of all of the previous information available for the neural node.

Find out how LSTM works alongside recurrent neural networks. Watch this short video tutorial.

How does LSTM work?

LSTM introduces long-term memory into recurrent neural networks. It mitigates the vanishing gradient problem, which is where the neural network stops learning because the updates to the various weights within a given neural network become smaller and smaller. It does this by using a series of 'gates'. These are contained in memory blocks which are connected through layers.

There are three types of gates within a unit:

Input Gate: Scales input to cell (write)
Output Gate: Scales output from cell (read)
Forget Gate: Scales old cell value (reset)

Each gate is like a switch that controls the read/write, thus incorporating the long-term memory function into the model.

Applications of LSTM

There is a huge range of ways that LSTM can be used, including:

Handwriting recognition
Time series anomaly detection
Speech recognition
Learning grammar
Composing music

The difference between LSTM and GRU

There are many similarities between LSTM and GRU (Gated Recurrent Units). However, there are some important differences that are worth remembering:

A GRU has two gates, whereas an LSTM has three gates.
GRUs don't possess any internal memory that is different from the exposed hidden state. They don't have the output gate, which is present in LSTMs.
There is no second nonlinearity applied when computing the output in a GRU.

GRU, as a concept, is a little newer than LSTM. It is generally more efficient, as it trains models at a quicker rate than LSTM. It is also easier to use: any modifications you need to make to a model can be done fairly easily. However, LSTM should perform better than GRU where longer-term memory is required. Ultimately, comparing performance is going to depend on the dataset you are using.

4 ways to enable Continual learning into Neural Networks
Build a generative chatbot using recurrent neural networks (LSTM RNNs)
How Deep Neural Networks can improve Speech Recognition and generation
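Returning to the gate structure and the LSTM versus GRU comparison above: both cell types are available as drop-in recurrent layers in most deep learning frameworks. The following sketch is not part of the original article; it assumes TensorFlow's bundled Keras is installed, and the build_model helper, layer size, sequence length, and random dummy data are arbitrary choices made purely for illustration.

# Minimal sketch: the same tiny sequence model built once with an LSTM layer
# and once with a GRU layer, to compare their parameter counts.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(cell="lstm", timesteps=20, features=8):
    # Identical model apart from the recurrent cell type
    recurrent = layers.LSTM(32) if cell == "lstm" else layers.GRU(32)
    model = keras.Sequential([
        keras.Input(shape=(timesteps, features)),
        recurrent,
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Random dummy sequences purely to make the example runnable
x = np.random.rand(64, 20, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")

lstm_model = build_model("lstm")
gru_model = build_model("gru")
lstm_model.fit(x, y, epochs=1, verbose=0)
gru_model.fit(x, y, epochs=1, verbose=0)

# The GRU layer has fewer parameters than the LSTM layer (two gates instead of
# three), which is one reason it tends to train faster
print(lstm_model.count_params(), gru_model.count_params())

Swapping the cell type is a one-line change, which makes it easy to compare the two architectures on your own dataset.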


How to deploy Serverless Applications in Go using AWS Lambda [Tutorial]

Savia Lobo
03 Oct 2018
12 min read
Building a serverless application allows you to focus on your application code instead of managing and operating infrastructure. If you choose AWS for this purpose, you do not have to think about provisioning or configuring servers, since AWS will handle all of this for you. This reduces your infrastructure management burden and helps you get faster time-to-market.

This tutorial is an excerpt taken from the book Hands-On Serverless Applications with Go, written by Mohamed Labouardy. In this book, you will learn how to design and build a production-ready application in Go using AWS serverless services with zero upfront infrastructure investment.

This article will cover the following points:

Build, deploy, and manage our Lambda functions going through some advanced AWS CLI commands
Publish multiple versions of the API
Learn how to separate multiple deployment environments (sandbox, staging, and production) with aliases
Cover the usage of the API Gateway stage variables to change the method endpoint's behavior

Lambda CLI commands

In this section, we will go through the various AWS Lambda commands that you might use while building your Lambda functions. We will also learn how you can use them to automate your deployment process.

The list-functions command

As its name implies, it lists all Lambda functions in the AWS region you provided. The following command will return all Lambda functions in the North Virginia region:

aws lambda list-functions --region us-east-1

For each function, the response includes the function's configuration information (FunctionName, resource usage, environment variables, IAM role, runtime environment, and so on), as shown in the following screenshot:

To list only some attributes, such as the function name, you can use the query filter option, as follows:

aws lambda list-functions --query Functions[].FunctionName[]

The create-function command

You should be familiar with this command, as it has been used multiple times to create a new Lambda function from scratch. In addition to the function's configuration, you can use the command to provide the deployment package (ZIP) in two ways:

ZIP file: It provides the path to the ZIP file of the code you are uploading with the --zip-file option:

aws lambda create-function --function-name UpdateMovie \
 --description "Update an existing movie" \
 --runtime go1.x \
 --role arn:aws:iam::ACCOUNT_ID:role/UpdateMovieRole \
 --handler main \
 --environment Variables={TABLE_NAME=movies} \
 --zip-file fileb://./deployment.zip \
 --region us-east-1

S3 bucket object: It provides the S3 bucket and object name with the --code option:

aws lambda create-function --function-name UpdateMovie \
 --description "Update an existing movie" \
 --runtime go1.x \
 --role arn:aws:iam::ACCOUNT_ID:role/UpdateMovieRole \
 --handler main \
 --environment Variables={TABLE_NAME=movies} \
 --code S3Bucket=movies-api-deployment-package,S3Key=deployment.zip \
 --region us-east-1

The aforementioned commands will return a summary of the function's settings in a JSON format, as follows:

It's worth mentioning that while creating your Lambda function, you might override the compute usage and network settings based on your function's behavior, with the following options:

--timeout: The default execution timeout is three seconds. When the three seconds are reached, AWS Lambda terminates your function. The maximum timeout you can set is five minutes.
--memory-size: The amount of memory given to your function when executed.
The default value is 128 MB and the maximum is 3,008 MB (in increments of 64 MB).
--vpc-config: This deploys the Lambda function in a private VPC. While it might be useful if the function requires communication with internal resources, it should ideally be avoided, as it impacts Lambda performance and scaling.

AWS doesn't allow you to set the CPU usage of your function, as it's calculated automatically based on the memory allocated for your function. CPU usage is proportional to the memory.

The update-function-code command

In addition to the AWS Management Console, you can update your Lambda function's code with the AWS CLI. The command requires the target Lambda function name and the new deployment package. Similarly to the previous command, you can provide the package as follows:

The path to the new .zip file:

aws lambda update-function-code --function-name UpdateMovie \
 --zip-file fileb://./deployment-1.0.0.zip \
 --region us-east-1

The S3 bucket where the .zip file is stored:

aws lambda update-function-code --function-name UpdateMovie \
 --s3-bucket movies-api-deployment-packages \
 --s3-key deployment-1.0.0.zip \
 --region us-east-1

This operation prints a new unique ID (called RevisionId) for each change in the Lambda function's code:

The get-function-configuration command

In order to retrieve the configuration information of a Lambda function, issue the following command:

aws lambda get-function-configuration --function-name UpdateMovie --region us-east-1

The preceding command will provide the same information in the output that was displayed when the create-function command was used. To retrieve configuration information for a specific Lambda version or alias (see the following section), you can use the --qualifier option.

The invoke command

So far, we have invoked our Lambda functions directly from the AWS Lambda Console and through HTTP events with API Gateway. In addition to that, Lambda can be invoked from the AWS CLI with the invoke command:

aws lambda invoke --function-name UpdateMovie result.json

The preceding command will invoke the UpdateMovie function and save the function's output in the result.json file:

The status code is 400, which is normal, as UpdateFunction is expecting a JSON input. Let's see how to provide a JSON to our function with the invoke command. Head back to the DynamoDB movies table, and pick a movie that you want to update. In this example, we will update the movie with the ID of 13, shown as follows:

Create a JSON file with a body attribute that contains the new movie item attribute, as the Lambda function is expecting the input to be in the API Gateway proxy request format:

{
  "body": "{\"id\":\"13\", \"name\":\"Deadpool 2\"}"
}

Finally, run the invoke command again with the JSON file as the input parameter:

aws lambda invoke --function-name UpdateMovie --payload file://input.json result.json

If you print the result.json content, the updated movie should be returned, shown as follows:

You can verify that the movie's name is updated in the DynamoDB table by invoking the FindAllMovies function:

aws lambda invoke --function-name FindAllMovies result.json

The body attribute should contain the new updated movie, shown as follows:

Head back to the DynamoDB Console; the movie with the ID of 13 should have a new name, as shown in the following screenshot:

The delete-function command

To delete a Lambda function, you can use the following command:

aws lambda delete-function --function-name UpdateMovie

By default, the command will delete all function versions and aliases.
To delete a specific version or alias, you might want to use the --qualifier option. By now, you should be familiar with all the AWS CLI commands you might need while building your serverless applications in AWS Lambda. In the upcoming section, we will see how to create different versions of your Lambda functions and maintain multiple environments with aliases.

Versions and aliases

When you're building your serverless application, you must separate your deployment environments to test new changes without impacting your production. Therefore, having multiple versions of your Lambda functions makes sense.

Versioning

A version represents a state of your function's code and configuration in time. By default, each Lambda function has the $LATEST version pointing to the latest changes of your function, as shown in the following screenshot:

In order to create a new version from the $LATEST version, click on Actions and Publish new version. Let's call it 1.0.0, as shown in the next screenshot:

The new version will be created with an ID=1 (incremental). Note the Lambda function ARN at the top of the window in the following screenshot; it has the version ID:

Once the version is created, you cannot update the function code, shown as follows:

Moreover, advanced settings, such as IAM roles, network configuration, and compute usage, cannot be changed, shown as follows:

Versions are called immutable, which means they cannot be changed once they're published; only the $LATEST version is editable.

Now we know how to publish a new version from the console. Let's publish a new version with the AWS CLI. But first, we need to update the FindAllMovies function, as we cannot publish a new version if no changes were made to $LATEST since publishing version 1.0.0. The new version will have a pagination system. The function will return only the number of items requested by the user. The following code will read the Count header parameter, convert it to a number, and use the Scan operation with the Limit parameter to fetch the movies from DynamoDB:

func findAll(request events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
  size, err := strconv.Atoi(request.Headers["Count"])
  if err != nil {
    return events.APIGatewayProxyResponse{
      StatusCode: http.StatusBadRequest,
      Body: "Count Header should be a number",
    }, nil
  }
  ...
  svc := dynamodb.New(cfg)
  req := svc.ScanRequest(&dynamodb.ScanInput{
    TableName: aws.String(os.Getenv("TABLE_NAME")),
    Limit: aws.Int64(int64(size)),
  })
  ...
}

Next, we update the FindAllMovies Lambda function's code with the update-function-code command:

aws lambda update-function-code --function-name FindAllMovies \
 --zip-file fileb://./deployment.zip

Then, publish a new version, 1.1.0, based on the current configuration and code with the following command:

aws lambda publish-version --function-name FindAllMovies --description 1.1.0

Go back to the AWS Lambda Console and navigate to your FindAllMovies function; a new version should be created with a new ID=2, as shown in the following screenshot:

Now that our versions are created, let's test them out by using the AWS CLI invoke command.
FindAllMovies v1.0.0

Invoke the FindAllMovies v1.0.0 version with its ID in the qualifier parameter with the following command:

aws lambda invoke --function-name FindAllMovies --qualifier 1 result.json

result.json should have all the movies in the DynamoDB movies table, shown as follows: the output showing all the movies in the DynamoDB movies table. To know more about the output of FindAllMovies v1.1.0 and more about semantic versioning, head over to the book.

Aliases

An alias is a pointer to a specific version; it allows you to promote a function from one environment to another (such as staging to production). Aliases are mutable, unlike versions, which are immutable. To illustrate the concept of aliases, we will create two aliases, as illustrated in the following diagram: a Production alias pointing to the FindAllMovies Lambda function's 1.0.0 version, and a Staging alias that points to the function's 1.1.0 version. Then, we will configure API Gateway to use these aliases instead of the $LATEST version:

Head back to the FindAllMovies configuration page. If you click on the Qualifiers drop-down list, you should see a default alias called Unqualified pointing to your $LATEST version, as shown in the following screenshot:

To create a new alias, click on Actions and then Create a new alias called Staging. Select version 5 as the target, shown as follows:

Once created, the new alias should be added to the list of aliases, shown as follows:

Next, create a new alias for the Production environment that points to version 1.0.0 using the AWS command line:

aws lambda create-alias --function-name FindAllMovies \
 --name Production --description "Production environment" \
 --function-version 1

Similarly, the new alias should be successfully created:

Now that our aliases have been created, let's configure API Gateway to use those aliases with stage variables.

Stage variables

Stage variables are environment variables that can be used to change the runtime behavior of the API Gateway methods for each deployment stage. The following section will illustrate how to use stage variables with API Gateway.
On the API Gateway Console, navigate to the Movies API, click on the GET method, and update the target Lambda function to use a stage variable instead of a hardcoded Lambda function name, as shown in the following screenshot:

When you save it, a new prompt will ask you to grant API Gateway the permissions to call your Lambda function aliases, as shown in the following screenshot:

Execute the following commands to allow API Gateway to invoke the Production and Staging aliases:

Production alias:

aws lambda add-permission --function-name "arn:aws:lambda:us-east-1:ACCOUNT_ID:function:FindAllMovies:Production" \
 --source-arn "arn:aws:execute-api:us-east-1:ACCOUNT_ID:API_ID/*/GET/movies" \
 --principal apigateway.amazonaws.com \
 --statement-id STATEMENT_ID \
 --action lambda:InvokeFunction

Staging alias:

aws lambda add-permission --function-name "arn:aws:lambda:us-east-1:ACCOUNT_ID:function:FindAllMovies:Staging" \
 --source-arn "arn:aws:execute-api:us-east-1:ACCOUNT_ID:API_ID/*/GET/movies" \
 --principal apigateway.amazonaws.com \
 --statement-id STATEMENT_ID \
 --action lambda:InvokeFunction

Then, create a new stage called production, as shown in the next screenshot:

Next, click on the Stage Variables tab, and create a new stage variable called lambda with FindAllMovies:Production as its value, shown as follows:

Do the same for the staging environment, with the lambda variable pointing to the Lambda function's Staging alias, shown as follows:

To test the endpoint, use the cURL command or any REST client you're familiar with. I opt for Postman. A GET request on the API Gateway production stage's invoke URL should return all the movies in the database, shown as follows:

Do the same for the staging environment, with a new header key called Count=4; you should have only four movie items in return, shown as follows:

That's how you can maintain multiple environments of your Lambda functions. You can now easily promote the 1.1.0 version into production by changing the Production pointer to point to 1.1.0 instead of 1.0.0, and roll back to the previous working version in case of failure, without changing the API Gateway settings.

To summarize, we learned how to deploy serverless applications using AWS Lambda functions. If you've enjoyed reading this and want to know more about how to set up a CI/CD pipeline from scratch to automate the process of deploying Lambda functions to production with the Go programming language, check out the book Hands-On Serverless Applications with Go.

Keep your serverless AWS applications secure [Tutorial]
Azure Functions 2.0 launches with better workload support for serverless
How Serverless computing is making AI development easier


Step Detector and Step Counters Sensors

Packt
14 Apr 2016
13 min read
In this article by Varun Nagpal, author of the book Android Sensor Programming By Example, we will focus on learning about the use of the step detector and step counter sensors. These sensors are very similar to each other and are used to count steps. Both sensors are based on a common hardware sensor, which internally uses the accelerometer, but Android still treats them as logically separate sensors. Both of these sensors are highly battery optimized and consume very low power. Now, let's look at each individual sensor in detail.

The step counter sensor

The step counter sensor is used to get the total number of steps taken by the user since the last reboot (power on) of the phone. When the phone is restarted, the value of the step counter sensor is reset to zero. In the onSensorChanged() method, the number of steps is given by event.values[0]; although it's a float value, the fractional part is always zero. The event timestamp represents the time at which the last step was taken. This sensor is especially useful for those applications that don't want to run in the background and maintain the history of steps themselves.

This sensor works in batches and in continuous mode. If we specify 0 or no latency in the SensorManager.registerListener() method, then it works in continuous mode; otherwise, if we specify any latency, then it groups the events in batches and reports them at the specified latency. For prolonged usage of this sensor, it's recommended to use the batch mode, as it saves power. The step counter uses the on-change reporting mode, which means it reports the event as soon as there is a change in the value.

The step detector sensor

The step detector sensor triggers an event each time a step is taken by the user. The value reported in the onSensorChanged() method is always one, the fractional part being always zero, and the event timestamp is the time when the user's foot hit the ground. The step detector sensor has very low latency in reporting the steps, which is generally within 1 to 2 seconds.

The step detector sensor has lower accuracy and produces more false positives compared to the step counter sensor. The step counter sensor is more accurate, but has more latency in reporting the steps, as it uses this extra time after each step to remove any false positive values. The step detector sensor is recommended for those applications that want to track the steps in real time and want to maintain their own history of each and every step with its timestamp.

Time for action – using the step counter sensor in an activity

Now, you will learn how to use the step counter sensor with a simple example. The good thing about the step counter is that, unlike other sensors, your app doesn't need to tell the sensor when to start counting the steps and when to stop counting them. It automatically starts counting as soon as the phone is powered on.
For using it, we just have to register the listener with the sensor manager and then unregister it after using it. In the following example, we will show, in an Android activity, the total number of steps taken by the user since the last reboot (power on) of the phone.

We created a PedometerActivity and implemented it with the SensorEventListener interface, so that it can receive the sensor events. We initiated the SensorManager and Sensor objects of the step counter and also checked the sensor availability in the onCreate() method of the activity. We registered the listener in the onResume() method and unregistered it in the onPause() method as a standard practice. We used a TextView to display the total number of steps taken and update its latest value in the onSensorChanged() method.

public class PedometerActivity extends Activity implements SensorEventListener {
  private SensorManager mSensorManager;
  private Sensor mSensor;
  private boolean isSensorPresent = false;
  private TextView mStepsSinceReboot;

  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_pedometer);
    mStepsSinceReboot = (TextView)findViewById(R.id.stepssincereboot);
    mSensorManager = (SensorManager) this.getSystemService(Context.SENSOR_SERVICE);
    if(mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_COUNTER) != null) {
      mSensor = mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_COUNTER);
      isSensorPresent = true;
    } else {
      isSensorPresent = false;
    }
  }

  @Override
  protected void onResume() {
    super.onResume();
    if(isSensorPresent) {
      mSensorManager.registerListener(this, mSensor, SensorManager.SENSOR_DELAY_NORMAL);
    }
  }

  @Override
  protected void onPause() {
    super.onPause();
    if(isSensorPresent) {
      mSensorManager.unregisterListener(this);
    }
  }

  @Override
  public void onSensorChanged(SensorEvent event) {
    mStepsSinceReboot.setText(String.valueOf(event.values[0]));
  }

Time for action – maintaining step history with the step detector sensor

The step counter sensor works well when we have to deal with the total number of steps taken by the user since the last reboot (power on) of the phone. It doesn't solve the purpose when we have to maintain the history of each and every step taken by the user. The step counter sensor may combine some steps and process them together, and it will only update with an aggregated count instead of reporting individual step details. For such cases, the step detector sensor is the right choice.

In our next example, we will use the step detector sensor to store the details of each step taken by the user, and we will show the total number of steps for each day since the application was installed. Our next example will consist of three major components of Android, namely a service, an SQLite database, and an activity. The Android service will be used to listen to all the individual step events using the step detector sensor when the app is in the background. All the individual step details will be stored in the SQLite database, and finally the activity will be used to display the list of total numbers of steps along with dates. Let's look at each component in detail.

The first component of our example is PedometerListActivity. We created a ListView in the activity to display the step count along with dates. Inside the onCreate() method of PedometerListActivity, we initiated the ListView and the ListAdapter required to populate the list.
Another important task that we do in the onCreate() method is starting the service (StepsService.class), which will listen to all the individual step events. We also make a call to the getDataForList() method, which is responsible for fetching the data for the ListView.

public class PedometerListActivity extends Activity {
  private ListView mSensorListView;
  private ListAdapter mListAdapter;

  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    mSensorListView = (ListView)findViewById(R.id.steps_list);
    getDataForList();
    mListAdapter = new ListAdapter();
    mSensorListView.setAdapter(mListAdapter);
    Intent mStepsIntent = new Intent(getApplicationContext(), StepsService.class);
    startService(mStepsIntent);
  }

In our example, the DateStepsModel class is used as a POJO (Plain Old Java Object) class, which is a handy way of grouping logical data together, to store the total number of steps and the date. We also use the StepsDBHelper class to read and write the steps data in the database (discussed further in the next section). Inside the getDataForList() method, we initiate the object of the StepsDBHelper class and call the readStepsEntries() method of the StepsDBHelper class, which returns an ArrayList of DateStepsModel objects containing the total number of steps along with dates after reading from the database. The ListAdapter class is used for populating the values for the ListView, which internally uses the ArrayList of DateStepsModel as the data source. Each individual list item is a string, which is the concatenation of the date and the total number of steps.

class DateStepsModel {
  public String mDate;
  public int mStepCount;
}

private StepsDBHelper mStepsDBHelper;
private ArrayList<DateStepsModel> mStepCountList;

public void getDataForList() {
  mStepsDBHelper = new StepsDBHelper(this);
  mStepCountList = mStepsDBHelper.readStepsEntries();
}

private class ListAdapter extends BaseAdapter {
  private TextView mDateStepCountText;

  @Override
  public int getCount() {
    return mStepCountList.size();
  }

  @Override
  public Object getItem(int position) {
    return mStepCountList.get(position);
  }

  @Override
  public long getItemId(int position) {
    return position;
  }

  @Override
  public View getView(int position, View convertView, ViewGroup parent) {
    if(convertView == null){
      convertView = getLayoutInflater().inflate(R.layout.list_rows, parent, false);
    }
    mDateStepCountText = (TextView)convertView.findViewById(R.id.sensor_name);
    mDateStepCountText.setText(mStepCountList.get(position).mDate + " - Total Steps: " + String.valueOf(mStepCountList.get(position).mStepCount));
    return convertView;
  }
}

The second component of our example is StepsService, which runs in the background and listens to the step detector sensor until the app is uninstalled. We implemented this service with the SensorEventListener interface so that it can receive the sensor events. We also initiated the objects of StepsDBHelper, SensorManager, and the step detector sensor inside the onCreate() method of the service. We only register the listener when the step detector sensor is available on the device. A point to note here is that we never unregister the listener, because we expect our app to log the step information indefinitely until the app is uninstalled. Both the step detector and step counter sensors are very low on battery consumption and are highly optimized at the hardware level, so if the app really requires it, it can use them for longer durations without affecting battery consumption much.
We get a step detector sensor callback in the onSensorChanged() method whenever the operating system detects a step, and from this callback we call the createStepsEntry() method of the StepsDBHelper class to store the step information in the database.

public class StepsService extends Service implements SensorEventListener {
  private SensorManager mSensorManager;
  private Sensor mStepDetectorSensor;
  private StepsDBHelper mStepsDBHelper;

  @Override
  public void onCreate() {
    super.onCreate();
    mSensorManager = (SensorManager) this.getSystemService(Context.SENSOR_SERVICE);
    if(mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_DETECTOR) != null) {
      mStepDetectorSensor = mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_DETECTOR);
      mSensorManager.registerListener(this, mStepDetectorSensor, SensorManager.SENSOR_DELAY_NORMAL);
      mStepsDBHelper = new StepsDBHelper(this);
    }
  }

  @Override
  public int onStartCommand(Intent intent, int flags, int startId) {
    return Service.START_STICKY;
  }

  @Override
  public void onSensorChanged(SensorEvent event) {
    mStepsDBHelper.createStepsEntry();
  }

The last component of our example is the SQLite database. We created a StepsDBHelper class and extended it from the SQLiteOpenHelper abstract utility class provided by the Android framework to easily manage database operations. In the class, we created a database called StepsDatabase, which is automatically created on the first object creation of the StepsDBHelper class by the onCreate() method. This database has one table, StepsSummary, which consists of only three columns (id, stepscount, and creationdate). The first column, id, is the unique integer identifier for each row of the table and is incremented automatically on creation of every new row. The second column, stepscount, is used to store the total number of steps taken on each date. The third column, creationdate, is used to store the date in the mm/dd/yyyy string format.

Inside the createStepsEntry() method, we first check whether there is an existing step count for the current date. If we find one, we read the existing step count of the current date and update the step count by incrementing it by 1. If there is no step count for the current date, then we assume that it is the first step of the current date and we create a new entry in the table with the current date and a step count value of 1. The createStepsEntry() method is called from onSensorChanged() of the StepsService class whenever a new step is detected by the step detector sensor.
public class StepsDBHelper extends SQLiteOpenHelper {
  private static final int DATABASE_VERSION = 1;
  private static final String DATABASE_NAME = "StepsDatabase";
  private static final String TABLE_STEPS_SUMMARY = "StepsSummary";
  private static final String ID = "id";
  private static final String STEPS_COUNT = "stepscount";
  private static final String CREATION_DATE = "creationdate"; // Date format is mm/dd/yyyy
  private static final String CREATE_TABLE_STEPS_SUMMARY = "CREATE TABLE " + TABLE_STEPS_SUMMARY + "(" + ID + " INTEGER PRIMARY KEY AUTOINCREMENT," + CREATION_DATE + " TEXT," + STEPS_COUNT + " INTEGER" + ")";

  StepsDBHelper(Context context) {
    super(context, DATABASE_NAME, null, DATABASE_VERSION);
  }

  @Override
  public void onCreate(SQLiteDatabase db) {
    db.execSQL(CREATE_TABLE_STEPS_SUMMARY);
  }

  public boolean createStepsEntry() {
    boolean isDateAlreadyPresent = false;
    boolean createSuccessful = false;
    int currentDateStepCounts = 0;
    Calendar mCalendar = Calendar.getInstance();
    String todayDate = String.valueOf(mCalendar.get(Calendar.MONTH)) + "/" + String.valueOf(mCalendar.get(Calendar.DAY_OF_MONTH)) + "/" + String.valueOf(mCalendar.get(Calendar.YEAR));
    String selectQuery = "SELECT " + STEPS_COUNT + " FROM " + TABLE_STEPS_SUMMARY + " WHERE " + CREATION_DATE + " = '" + todayDate + "'";
    try {
      SQLiteDatabase db = this.getReadableDatabase();
      Cursor c = db.rawQuery(selectQuery, null);
      if (c.moveToFirst()) {
        do {
          isDateAlreadyPresent = true;
          currentDateStepCounts = c.getInt((c.getColumnIndex(STEPS_COUNT)));
        } while (c.moveToNext());
      }
      db.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
    try {
      SQLiteDatabase db = this.getWritableDatabase();
      ContentValues values = new ContentValues();
      values.put(CREATION_DATE, todayDate);
      if(isDateAlreadyPresent) {
        values.put(STEPS_COUNT, ++currentDateStepCounts);
        int row = db.update(TABLE_STEPS_SUMMARY, values, CREATION_DATE + " = '" + todayDate + "'", null);
        if(row == 1) {
          createSuccessful = true;
        }
        db.close();
      } else {
        values.put(STEPS_COUNT, 1);
        long row = db.insert(TABLE_STEPS_SUMMARY, null, values);
        if(row != -1) {
          createSuccessful = true;
        }
        db.close();
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    return createSuccessful;
  }

The readStepsEntries() method is called from PedometerListActivity to display the total number of steps along with the date in the ListView. The readStepsEntries() method reads all the step counts along with their dates from the table and fills the ArrayList of DateStepsModel, which is used as the data source for populating the ListView in PedometerListActivity.

public ArrayList<DateStepsModel> readStepsEntries() {
  ArrayList<DateStepsModel> mStepCountList = new ArrayList<DateStepsModel>();
  String selectQuery = "SELECT * FROM " + TABLE_STEPS_SUMMARY;
  try {
    SQLiteDatabase db = this.getReadableDatabase();
    Cursor c = db.rawQuery(selectQuery, null);
    if (c.moveToFirst()) {
      do {
        DateStepsModel mDateStepsModel = new DateStepsModel();
        mDateStepsModel.mDate = c.getString((c.getColumnIndex(CREATION_DATE)));
        mDateStepsModel.mStepCount = c.getInt((c.getColumnIndex(STEPS_COUNT)));
        mStepCountList.add(mDateStepsModel);
      } while (c.moveToNext());
    }
    db.close();
  } catch (Exception e) {
    e.printStackTrace();
  }
  return mStepCountList;
}

What just happened?

We created a small pedometer utility app that maintains the step history along with dates using the step detector sensor. We used PedometerListActivity to display the list of the total number of steps along with their dates. StepsService is used to listen to all the steps detected by the step detector sensor in the background.
And finally, the StepsDBHelper class is used to create and update the total step count for each date, and to read the total step counts along with dates from the database.

Resources for Article:

Further resources on this subject:
Introducing the Android UI [article]
Building your first Android Wear Application [article]
Mobile Phone Forensics – A First Step into Android Forensics [article]


Brett Lantz on implementing a decision tree using C5.0 algorithm in R

Packt Editorial Staff
29 Mar 2019
9 min read
Decision tree learners are powerful classifiers that utilize a tree structure to model the relationships among the features and the potential outcomes. This structure earned its name due to the fact that it mirrors the way a literal tree begins at a wide trunk and splits into narrower and narrower branches as it is followed upward. In much the same way, a decision tree classifier uses a structure of branching decisions that channel examples into a final predicted class value. In this article, we demonstrate the implementation of a decision tree using the C5.0 algorithm in R.

This article is taken from the book Machine Learning with R, Fourth Edition, written by Brett Lantz. This 10th Anniversary Edition of the classic R data science book is updated to R 4.0.0 with newer and better libraries. This book features several new chapters that reflect the progress of machine learning in the last few years and help you build your data science skills and tackle more challenging problems.

There are numerous implementations of decision trees, but the most well-known is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5 (C4.5 itself is an improvement over his Iterative Dichotomiser 3 (ID3) algorithm). Although Quinlan markets C5.0 to commercial clients (see http://www.rulequest.com/ for details), the source code for a single-threaded version of the algorithm was made public, and has therefore been incorporated into programs such as R.

The C5.0 decision tree algorithm

The C5.0 algorithm has become the industry standard for producing decision trees because it does well for most types of problems directly out of the box. Compared to other advanced machine learning models, the decision trees built by C5.0 generally perform nearly as well but are much easier to understand and deploy. Additionally, as shown in the following lists, the algorithm's weaknesses are relatively minor and can be largely avoided.

Strengths:
An all-purpose classifier that does well on many types of problems.
Highly automatic learning process, which can handle numeric or nominal features, as well as missing data.
Excludes unimportant features.
Can be used on both small and large datasets.
Results in a model that can be interpreted without a mathematical background (for relatively small trees).
More efficient than other complex models.

Weaknesses:
Decision tree models are often biased toward splits on features having a large number of levels.
It is easy to overfit or underfit the model.
Can have trouble modeling some relationships due to reliance on axis-parallel splits.
Small changes in training data can result in large changes to decision logic.
Large trees can be difficult to interpret and the decisions they make may seem counterintuitive.

To keep things simple, our earlier decision tree example ignored the mathematics involved in how a machine would employ a divide and conquer strategy. Let's explore this in more detail to examine how this heuristic works in practice.

Choosing the best split

The first challenge that a decision tree will face is to identify which feature to split upon. In the previous example, we looked for a way to split the data such that the resulting partitions contained examples primarily of a single class. The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure.
There are various measurements of purity that can be used to identify the best decision tree splitting candidate. C5.0 uses entropy, a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values. Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality. The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups.

Typically, entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each case, the minimum value indicates that the sample is completely homogenous, while the maximum value indicates that the data are as diverse as possible, and no group has even a small plurality. In mathematical notation, entropy is specified as:

Entropy(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)

In this formula, for a given segment of data (S), the term c refers to the number of class levels, and p_i refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent). We can calculate the entropy as:

> -0.60 * log2(0.60) - 0.40 * log2(0.40)
[1] 0.9709506

We can visualize the entropy for all possible two-class arrangements. If we know the proportion of examples in one class is x, then the proportion in the other class is (1 - x). Using the curve() function, we can then plot the entropy for all possible values of x:

> curve(-x * log2(x) - (1 - x) * log2(1 - x),
    col = "red", xlab = "x", ylab = "Entropy", lwd = 4)

This results in the following figure (caption: The total entropy as the proportion of one class varies in a two-class outcome). As illustrated by the peak in entropy at x = 0.50, a 50-50 split results in the maximum entropy. As one class increasingly dominates the other, the entropy reduces to zero.

To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, a measure known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2):

InfoGain(F) = Entropy(S_1) - Entropy(S_2)

One complication is that after a split, the data is divided into more than one partition. Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighting each partition's entropy according to the proportion of all records falling into that partition. This can be stated in a formula as:

Entropy(S) = \sum_{i=1}^{n} w_i \, Entropy(P_i)

In simple terms, the total entropy resulting from a split is the sum of the entropy of each of the n partitions, weighted by the proportion of examples falling in that partition (w_i). The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature. If the information gain is zero, there is no reduction in entropy for splitting on this feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply the entropy after the split is zero, which means that the split results in completely homogeneous groups.

The previous formulas assume nominal features, but decision trees use information gain for splitting on numeric features as well.
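To make the entropy and information gain formulas above concrete, here is a minimal R sketch of our own (not code from the book). The helper names entropy() and info_gain(), along with the toy red/white labels and the "left"/"right" split, are assumptions introduced purely for illustration.

# Shannon entropy (in bits) of a vector of observed class labels
entropy <- function(labels) {
  p <- prop.table(table(labels))   # proportion of each observed class
  -sum(p * log2(p))
}

# Information gain from splitting `labels` according to a grouping vector `groups`
info_gain <- function(labels, groups) {
  w <- prop.table(table(groups))                   # weight w_i of each partition
  parts <- split(labels, groups)                   # the partitions themselves
  weighted <- sum(w[names(parts)] * sapply(parts, entropy))
  entropy(labels) - weighted                       # Entropy(S1) - Entropy(S2)
}

# Reproducing the 60 percent red / 40 percent white example from above
labels <- c(rep("red", 6), rep("white", 4))
entropy(labels)    # 0.9709506, matching the value computed earlier

# A hypothetical split: the first partition is pure, the second is mixed
groups <- c(rep("left", 5), rep("right", 5))
info_gain(labels, groups)

The same weighted-entropy arithmetic is what the algorithm repeats for every candidate feature before picking the split with the largest gain.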
For numeric features, a common practice is to test various splits that divide the values into groups greater than or less than a threshold. This reduces the numeric feature into a two-level categorical feature that allows information gain to be calculated as usual. The numeric cut point yielding the largest information gain is chosen for the split.

Note: Though it is used by C5.0, information gain is not the only splitting criterion that can be used to build decision trees. Other commonly used criteria are the Gini index, chi-squared statistic, and gain ratio. For a review of these (and many more) criteria, refer to An Empirical Comparison of Selection Measures for Decision-Tree Induction, Mingers, J, Machine Learning, 1989, Vol. 3, pp. 319-342.

Pruning the decision tree

As mentioned earlier, a decision tree can continue to grow indefinitely, choosing splitting features and dividing into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data.

One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions or when the decision nodes contain only a small number of examples. This is called early stopping or prepruning the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside to this approach is that there is no way to know whether the tree will miss subtle but important patterns that it would have learned had it grown to a larger size.

An alternative, called post-pruning, involves growing a tree that is intentionally too large and pruning leaf nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than prepruning because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all of the important data structures were discovered.

Note: The implementation details of pruning operations are very technical and beyond the scope of this book. For a comparison of some of the available methods, see A Comparative Analysis of Methods for Pruning Decision Trees, Esposito, F, Malerba, D, Semeraro, G, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, Vol. 19, pp. 476-491.

One of the benefits of the C5.0 algorithm is that it is opinionated about pruning: it takes care of many of the decisions automatically using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, the nodes and branches that have little effect on the classification errors are removed. In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively.

Getting the right balance of overfitting and underfitting is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see if they improve performance on the test dataset; a brief sketch of how these options are typically exposed in R follows below. To summarize, decision trees are widely used due to their high accuracy and ability to formulate a statistical model in plain language.
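As a rough illustration of those pruning controls, the following sketch shows how they are typically exposed by the C50 package on CRAN. The argument names (trials, CF, minCases, noGlobalPruning) reflect that package's interface as we recall it, and credit_train with its default outcome column is a hypothetical placeholder dataset, so verify the details against the package documentation before relying on them.

library(C50)   # assumed: the CRAN C50 package, which provides C5.0() and C5.0Control()

# credit_train and its 'default' column are hypothetical placeholders for your own data
ctrl <- C5.0Control(
  CF = 0.10,               # a lower confidence factor prunes more aggressively
  minCases = 10,           # require more examples per leaf, a simple form of prepruning
  noGlobalPruning = FALSE  # keep the final global post-pruning pass enabled
)

model <- C5.0(default ~ ., data = credit_train,
              trials = 1,          # boosting iterations; 1 builds a single pruned tree
              control = ctrl)

summary(model)   # inspect tree size and training error under these settings

Comparing the test-set accuracy of a few such configurations is usually a quicker way to find a good bias-variance balance than reasoning about tree depth in the abstract.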
Here, we looked at a highly popular and easily configurable decision tree algorithm, C5.0. The major strength of the C5.0 algorithm over other decision tree implementations is that it is very easy to adjust the training options. Harness the power of R to build flexible, effective, and transparent machine learning models with Brett Lantz's latest book, Machine Learning with R, Fourth Edition.

Dr. Brandon explains Decision Trees to Jon
Building a classification system with Decision Trees in Apache Spark 2.0
Implementing Decision Trees

“Rust is the future of systems programming, C is the new Assembly”: Intel principal engineer, Josh Triplett

Bhagyashree R
27 Aug 2019
10 min read
At the Open Source Technology Summit (OSTS) 2019, Josh Triplett, a Principal Engineer at Intel, gave an insight into what Intel is contributing to bring the most loved language, Rust, to full parity with C. In his talk, titled Intel and Rust: the Future of Systems Programming, he also spoke about the history of systems programming, how C became the "default" systems programming language, which features of Rust give it an edge over C, and much more.

Until now, OSTS was Intel's closed event where the company's business and tech leaders came together to discuss the various trends, technologies, and innovations that will help shape the open-source ecosystem. However, this year was different, as the company welcomed non-Intel attendees, including media, partners, and developers, for the first time. The event hosts keynotes, more than 50 technical sessions, panels, and demos covering all the open source technologies Intel is involved in. These include integrated software stacks (edge, AI, infrastructure), firmware, embedded and IoT projects, and cloud system software. This year the event took place from May 14 to 16 in Stevenson, Washington.

What is systems programming?

Systems programming is the development and management of software that serves as a platform for other software to be built upon. The system software also directly or closely interfaces with computer hardware in order to gain necessary performance and expose abstractions. Unlike application programming, where software is created to provide services to the user, it aims to produce software that provides services to the computer hardware.

Triplett broadly defines systems programming as "anything that isn't an app." It includes things like BIOS, firmware, boot loaders, operating system kernels, embedded and similar types of low-level code, and virtual machine implementations. Triplett also counts a web browser as system software, as browsers are more than "just an app"; they are actually "platforms for websites and web apps," he says.

How C became the "default" systems programming language

Previously, most system software, including BIOS, boot loaders, and firmware, was written in Assembly. In the 1960s, experiments to bring hardware support to high-level languages started, which resulted in the creation of languages such as PL/S, BLISS, BCPL, and extended ALGOL. Then in the 1970s, Dennis Ritchie created the C programming language for the Unix operating system. Derived from the typeless B programming language, C was packed with powerful high-level functionality and detailed features that were best suited for writing an operating system. Several UNIX components, including its kernel, were eventually rewritten in C. Much other system software, including the Oracle database, a large portion of the Windows source code, and the Linux operating system, was written in C. C was seeing huge adoption at this point.

But what exactly made developers comfortable moving to C? Triplett believes that in order to make this move from one language to another, developers have to be comfortable in terms of two things: features and parity. First, the language should offer "sufficiently compelling" features. "It can't just be a little bit better. It has to be substantially better to warrant the effort and engineering time needed to move," he adds. As compared to Assembly, C had a lot to offer. It had some degree of type safety, provided portability, and offered better productivity with high-level constructs and much more readable code.
Second, the language has to provide parity, which means developers had to be confident that it is no less capable than Assembly. He states, "It can't just be better, it also has to be no worse." In addition to being faster and expressing any type of data that Assembly was able to, C also had what Triplett calls an "escape hatch." This means you are allowed to make the move incrementally and also combine Assembly if required.

Triplett believes that C is now becoming what Assembly was years ago. "C is the new Assembly," he concludes. Developers are looking for a high-level language that not only addresses the problems in C that can't be fixed but also leverages other exciting features that these languages provide. A language that aims to be compelling enough to make developers move away from C should be memory safe, provide automatic memory management and security, and much more. "Any language that wants to be better than C has to offer a lot more than just protection from buffer overflows if it's actually going to be a compelling alternative. People care about usability and productivity. They care about writing code that is self-explanatory, which accomplishes more work in less code. It also needs to address security issues. Usability and productivity go hand in hand with security. The less code you need to write to accomplish something, the less chance you have of introducing bugs, security bugs or otherwise," he explains.

Comparing Rust with C

Back in 2006, Graydon Hoare, a Mozilla employee, started writing Rust as a personal project. Mozilla, in 2009, started sponsoring the project and also expanded the team to drive further development of the language. One of the reasons why Mozilla got interested is that Firefox was written in more than 4 million lines of C++ code and had quite a few highly critical vulnerabilities. Rust was built with safety and concurrency in mind, making it the perfect choice for rewriting many components of Firefox under Project Quantum. Mozilla is also using Rust to develop Servo, an HTML rendering engine that will eventually replace Firefox's rendering engine. Many other companies have also started using Rust for their projects, including Microsoft, Google, Facebook, Amazon, Dropbox, Fastly, Chef, Baidu, and more.

Rust addresses the memory management problem in C. It offers automatic memory management so that developers do not have to manually call free on every object. What sets it apart from other modern languages is that it does not have a garbage collector or runtime system of any kind. Rust instead has the concepts of ownership, borrowing, references, and lifetimes (a minimal sketch of ownership and borrowing follows below). "Rust has a system of declaring whether any given use of an object is the owner of that object or whether it's just borrowing that object temporarily. If you're just borrowing an object the compiler will keep track of that. It'll make sure that the original sticks around as long as you reference it. Rust makes sure that the owner of the object frees it when it's done and it inserts the call to free at compile time with no extra runtime overhead," Triplett explains.

Not having a runtime is also a plus for Rust. Triplett believes that languages that have a runtime are difficult to use for systems programming. He adds, "You have to initialize that runtime before you can call any code, you have to use that runtime to call functions, and the runtime itself might run extra code behind your back at unexpected times." Rust also aims to provide safe concurrent programming.
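Before turning to concurrency, here is a minimal, self-contained Rust sketch of the ownership and borrowing rules Triplett describes above; this is our own illustration rather than code from the talk.

fn measure(s: &str) -> usize {
    s.len() // read-only access through a borrow; the caller keeps ownership
}

fn main() {
    let owner = String::from("systems programming"); // `owner` owns the heap allocation

    let len = measure(&owner);                // lend the value out as an immutable borrow
    println!("{} has {} bytes", owner, len);  // still usable: it was only borrowed above

    let new_owner = owner;                    // ownership moves; `owner` can no longer be used
    println!("{}", new_owner);
    // println!("{}", owner);                 // uncommenting this would be a compile-time error

    // When `new_owner` goes out of scope, the compiler inserts the call to free for us.
}

The compile-time move and borrow checks are exactly the "no runtime overhead" bookkeeping described in the quote above.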
The same features that make it memory safe keep track of things like which thread owns which object, which objects can be passed between threads, and which objects require acquiring locks. These features make Rust compelling enough for developers to choose for systems programming. However, on the second criterion, Rust does not have enough parity with C yet. "Achieving parity with C is exactly what got me involved in Rust," says Triplett.

Teaching Rust about C-compatible unions

Triplett's first contribution to the Rust programming language was in the form of RFC 1444, which was started in 2015 and got accepted in 2016. This RFC proposed to bring native support for C-compatible unions in Rust that would be defined via a new "contextual keyword," union. Triplett understood the need for this proposal when he wanted to build a virtual machine in Rust and the Linux kernel interface for that, /dev/kvm, required unions. "I worked with the Rust community and with the language team to get unions into Rust and because of that work I'm actually now part of the Rust language governance team helping to evaluate and guide other changes into the language," he adds. He talked about this RFC in much detail at the very first RustConf in 2016: https://www.youtube.com/watch?v=U8Gl3RTXf88

Support for unnamed struct and union types

Another feature that Triplett worked on was the support for unnamed struct and union types in Rust. This has been a widespread C compiler extension for decades and was also included in the C11 standard. This allowed developers to group and lay out fields in arbitrary ways to match C data structures used in the Foreign Function Interface (FFI). With this proposal implemented, Rust will be able to represent such types using the same names as the structures, without interposing artificial field names that would confuse users of well-established interfaces from existing platforms.

Stabilized support for inline Assembly in Rust

Systems programming often involves low-level manipulations and requires low-level details of the processors, such as privileged instructions. For this, Rust supports inline Assembly via the 'asm!' macro. However, it is only present in the nightly compiler and not yet stabilized. Triplett, in collaboration with other Rust developers, is writing a proposal to introduce a more robust syntax for inline Assembly. To know more about support for inline Assembly, check out this pre-RFC.

BFLOAT16 support in Rust

Many Intel processors, including Xeon Scalable 'Cooper Lake-SP', now support BFLOAT16, a new floating-point format. This truncated 16-bit version of the 32-bit IEEE 754 single-precision floating-point format was mainly designed for deep learning. This format is also used in machine learning libraries like TensorFlow that work with huge datasets. It also makes interoperating with existing systems, functions, and storage much easier. This is why Triplett is working on adding support for BFLOAT16 in Rust, so that developers will be able to use the full capabilities of their hardware.

FFI/C Parity Working Group

This was one of the important announcements that Triplett made. He is starting a working group that will focus on achieving full parity with C. Under this group, he aims to collaborate with both the Rust community and other Intel developers to develop the specifications for the remaining features that need to be implemented in Rust for systems programming.
This group will also focus on bringing support for systems programming using the stable releases of Rust, not just experimental nightly releases of the compiler. In last week's Reddit discussion, Triplett shared the current status of the working group: "To pre-answer, one question: the FFI / C Parity working group is in the process of being launched, and hasn't quite kicked off yet. I'll be posting about it here and elsewhere when it is, along with the initial goals."

Watch Josh Triplett's full OSTS talk to know more about Intel's contribution to Rust: https://www.youtube.com/watch?v=l9hM0h6IQDo

Update: We have made the following corrections based on feedback from Josh Triplett: This year OSTS was open to Intel's partners and press. Previously, the article read 'escape patch', but it is 'escape hatch.' RFC 1444 wasn't last year; it was started in 2015 and accepted in 2016. 'dev KVM' is now corrected to '/dev/kvm'.

AMD competes with Intel by launching EPYC Rome, world's first 7 nm chip for data centers, luring in Twitter and Google
Hot Chips 31: IBM Power10, AMD's AI ambitions, Intel NNP-T, Cerebras largest chip with 1.2 trillion transistors and more
Intel's 10th gen 10nm 'Ice Lake' processor offers AI apps, new graphics and best connectivity

Setting up Logistic Regression model using TensorFlow

Packt Editorial Staff
25 Apr 2018
8 min read
TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with the wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API composed of Python modules to execute computation models. TensorFlow supports both CPU- and GPU-based computations. In this article, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup.

The tensorflow package in R calls the Python tensorflow API for execution, so it is essential to install the tensorflow package in both R and Python to make R work. The following are the dependencies for tensorflow:

- Python 2.7 / 3.x
- R (>3.2)
- devtools package in R for installing TensorFlow from GitHub
- TensorFlow in Python
- pip

Getting ready

The code for this section is created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python in the np variable:

library("tensorflow") # Load TensorFlow
np <- import("numpy") # Load numpy library

How to do it...

The data is imported using a standard function from R, as shown in the following code. The data is imported using the csv file and transformed into the matrix format, followed by selecting the features used to model as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run optimization:

# Loading input and test data
xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")
yFeatures = "Occupancy"
occupancy_train <-as.matrix(read.csv("datatraining.txt",stringsAsFactors = T))
occupancy_test <- as.matrix(read.csv("datatest.txt",stringsAsFactors = T))

# subset features for modeling and transform to numeric values
occupancy_train<-apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)
occupancy_test<-apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)

# Data dimensions
nFeatures<-length(xFeatures)
nRow<-nrow(occupancy_train)

Before setting up the graph, let's reset the graph using the following command:

# Reset the graph
tf$reset_default_graph()

Additionally, let's start an interactive session as it will allow us to execute variables without referring to the session-to-session object:

# Starting session as interactive session
sess<-tf$InteractiveSession()

Define the logistic regression model in TensorFlow:

# Setting-up Logistic regression graph
x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32)
W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L)))
b <- tf$Variable(tf$zeros(shape(1L)))
y <- tf$matmul(x, W) + b

The input feature x is defined as a constant as it will be an input to the system. The weight W and bias b are defined as variables that will be optimized during the optimization process. The y is set up as a symbolic representation between x, W, and b. The weight W is initialized with a random uniform distribution and b is assigned the value zero.
The next step is to set up the cost function for logistic regression:

# Setting-up cost function and optimizer
y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L))
cross_entropy<-tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=y_, logits=y, name="cross_entropy"))
optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entropy)

# Start a session
init <- tf$global_variables_initializer()
sess$run(init)

Execute the gradient descent algorithm for the optimization of weights using cross entropy as the loss function:

# Running optimization
for (step in 1:5000) {
  sess$run(optimizer)
  if (step %% 20 == 0)
    cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "\n")
}

How it works...

The performance of the model can be evaluated using AUC:

# Performance on Train
library(pROC)
ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))

# Performance on test
nRowt<-nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))

AUC can be visualized using the plot.auc function from the pROC package, as shown in the figure following this command. The performance for training and testing (hold-out) is very similar.

plot.roc(roc_obj, col = "green", lty=2, lwd=2)
plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2)

(Figure: Performance of logistic regression using TensorFlow)

Visualizing TensorFlow graphs

TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models.

Getting ready

TensorBoard can be started using the following command in the terminal:

$ tensorboard --logdir home/log --port 6006

The following are the major parameters for TensorBoard:

- --logdir: To map to the directory to load TensorFlow events
- --debug: To increase log verbosity
- --host: To define the host to listen to (localhost by default)
- --port: To define the port to which TensorBoard will serve

The preceding command will launch the TensorFlow service on localhost at port 6006, as shown in the screenshot (TensorBoard). The tabs on the TensorBoard capture relevant data generated during graph execution.

How to do it...

This section covers how to visualize TensorFlow models and output in TensorBoard. To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module. A default session graph can be added using the following command:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

The graph for logistic regression developed using the preceding code is shown in the screenshot (Visualization of the logistic regression graph in TensorBoard). Details about symbol descriptions on TensorBoard can be found at https://www.tensorflow.org/get_started/graph_viz.

Similarly, other variable summaries can be added to the TensorBoard using correct summaries, as shown in the following code:

# Adding histogram summary to weight and bias variable
w_hist = tf$histogram_summary("weights", W)
b_hist = tf$histogram_summary("biases", b)

Create a cross entropy evaluation for test.
An example script to generate the cross entropy cost function for test and train is shown in the following command:

# Set-up cross entropy for test
nRowt<-nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- tf$nn$sigmoid(tf$matmul(xt, W) + b)
yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L))
cross_entropy_tst<-tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=yt_, logits=ypredt, name="cross_entropy_tst"))

Add summary variables to be collected:

# Add summary ops to collect data
w_hist = tf$summary$histogram("weights", W)
b_hist = tf$summary$histogram("biases", b)
crossEntropySummary<-tf$summary$scalar("costFunction", cross_entropy)
crossEntropyTstSummary<-tf$summary$scalar("costFunction_test", cross_entropy_tst)

Open the writing object, log_writer. It writes the default graph to the location, c:/log:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

Run the optimization and collect the summaries:

for (step in 1:2500) {
  sess$run(optimizer)
  # Evaluate performance on training and test data after every 50 iterations
  if (step %% 50 == 0){
    ### Performance on Train
    ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
    roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))
    ### Performance on Test
    ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
    roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))
    cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "\n")
    # Save summary of Bias and weights
    log_writer$add_summary(sess$run(b_hist), global_step=step)
    log_writer$add_summary(sess$run(w_hist), global_step=step)
    log_writer$add_summary(sess$run(crossEntropySummary), global_step=step)
    log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step)
  }
}

Collect all the summaries to a single tensor using the merge_all command from the summary module:

summary = tf$summary$merge_all()

Write the summaries to the log file using the log_writer object:

log_writer = tf$summary$FileWriter('c:/log', sess$graph)
summary_str = sess$run(summary)
log_writer$add_summary(summary_str, step)
log_writer$close()

We have learned how to perform logistic regression using TensorFlow and covered the application of TensorFlow in setting up a logistic regression model.

This article is a book excerpt taken from R Deep Learning Cookbook, co-authored by PKS Prakash & Achyutuni Sri Krishna Rao. This book contains powerful and independent recipes to build deep learning models in different application areas using R libraries.

Read More:
Getting started with Linear and logistic regression
Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions
Using Logistic regression to predict market direction in algorithmic trading

Connecting Cloud Object Storage with Databricks Unity Catalog

Pulkit Chadha
22 Oct 2024
10 min read
This article is an excerpt from the book Data Engineering with Databricks Cookbook, by Pulkit Chadha. This book shows you how to use Apache Spark, Delta Lake, and Databricks to build data pipelines, manage and transform data, optimize performance, and more. Additionally, you'll implement DataOps and DevOps practices, and orchestrate data workflows.

Introduction

Databricks Unity Catalog allows you to manage and access data in cloud object storage using a unified namespace and a consistent set of APIs. With Unity Catalog, you can do the following:

- Create and manage storage credentials, external locations, storage locations, and volumes using SQL commands or the Unity Catalog UI
- Access data from various cloud platforms (AWS S3, Azure Blob Storage, or Google Cloud Storage) and storage formats (Parquet, Delta Lake, CSV, or JSON) using the same SQL syntax or Spark APIs
- Apply fine-grained access control and data governance policies to your data using Databricks SQL Analytics or Databricks Runtime

In this article, you will learn what Unity Catalog is and how it integrates with AWS S3.

Getting ready

Before you start setting up and configuring Unity Catalog, you need to have the following prerequisites:

- A Databricks workspace with administrator privileges
- A Databricks workspace with the Unity Catalog feature enabled
- A cloud storage account (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) with the necessary permissions to read and write data

How to do it…

In this section, we will first create a storage credential, the IAM role, with access to an S3 bucket. Then, we will create an external location in Databricks Unity Catalog that will use the storage credential to access the S3 bucket.

Creating a storage credential

You must create a storage credential to access data from an external location or a volume. In this example, you will create a storage credential that uses an IAM role to access the S3 bucket. The steps are as follows:

1. Go to Catalog Explorer: Click on Catalog in the left panel and go to Catalog Explorer.

2. Create storage credentials: Click on +Add and select Add a storage credential.
Figure 10.1 – Add a storage credential

3. Enter storage credential details: Give the credential a name, the IAM role ARN that allows Unity Catalog to access the storage location on your cloud tenant, and a comment if you want, and click on Create.
Figure 10.2 – Create a new storage credential

Important note: To learn more about IAM roles in AWS, you can reference the user guide here: https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html.

4. Get External ID: In the Storage credential created dialog, copy the External ID value and click on Done.
Figure 10.3 – External ID for the storage credential

5. Update the trust policy with an External ID: Update the trust policy associated with the IAM role and add the External ID value for sts:ExternalId.
Figure 10.4 – Updated trust policy with External ID

Creating an external location

An external location contains a reference to a storage credential and a cloud storage path. You need to create an external location to access data from a custom storage location that Unity Catalog uses to reference external tables. In this example, you will create an external location that points to the de-book-ext-loc folder in an S3 bucket. To create an external location, you can follow these steps:

1. Go to Catalog Explorer: Click on Catalog in the left panel to go to Catalog Explorer.
2. Create external location: Click on +Add and select Add an external location.
Figure 10.5 – Add an external location

3. Pick an external location creation method: Select Manual and then click on Next.
Figure 10.6 – Create a new external location

4. Enter external location details: Enter the external location name, select the storage credential, and enter the S3 URL; then, click on the Create button.
Figure 10.7 – Create a new external location manually

5. Test connection: Test the connection to make sure you have set up the credentials accurately and that Unity Catalog is able to access cloud storage.
Figure 10.8 – Test connection for external location

If everything is set up right, you should see a screen like the following. Click on Done.
Figure 10.9 – Test connection results

See also

- Databricks Unity Catalog: https://www.databricks.com/product/unity-catalog
- What is Unity Catalog: https://docs.databricks.com/en/data-governance/unity-catalog/index.html
- Databricks Unity Catalog documentation: https://docs.databricks.com/en/compute/access-mode-limitations.html
- Databricks SQL documentation: https://docs.databricks.com/en/data-governance/unity-catalog/create-tables.html
- Databricks Unity Catalog: A Comprehensive Guide to Features, Capabilities, and Architecture: https://atlan.com/databricks-unity-catalog/
- Step By Step Guide on Databricks Unity Catalog Setup and its key Features: https://medium.com/@sauravkum780/step-by-step-guide-on-databricks-unitycatalog-setup-and-its-features-1d0366c282b7

Conclusion

In summary, connecting to cloud object storage using Databricks Unity Catalog provides a streamlined approach to managing and accessing data across various cloud platforms such as AWS S3, Azure Blob Storage, and Google Cloud Storage. By utilizing a unified namespace, consistent APIs, and powerful governance features, Unity Catalog simplifies the process of creating and managing storage credentials and external locations. With built-in fine-grained access controls, you can securely manage data stored in different formats and cloud environments, all while leveraging Databricks' powerful data analytics capabilities. This guide walks through setting up an IAM role and creating an external location in AWS S3, demonstrating how easy it is to connect cloud storage with Unity Catalog.

Author Bio

Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit's tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.

Building RESTful web services with Kotlin

Natasha Mathur
01 Jun 2018
9 min read
Kotlin has been eating up the Java world. It has already become a hit in the Android Ecosystem which was dominated by Java and is welcomed with open arms. Kotlin is not limited to Android development and can be used to develop server-side and client-side web applications as well. Kotlin is 100% compatible with the JVM so you can use any existing frameworks such as Spring Boot, Vert.x, or JSF for writing Java applications. In this tutorial, we will learn how to implement RESTful web services using Kotlin. This article is an excerpt from the book 'Kotlin Programming Cookbook', written by, Aanand Shekhar Roy and Rashi Karanpuria. Setting up dependencies for building RESTful services In this recipe, we will lay the foundation for developing the RESTful service. We will see how to set up dependencies and run our first SpringBoot web application. SpringBoot provides great support for Kotlin, which makes it easy to work with Kotlin. So let's get started. We will be using IntelliJ IDEA and Gradle build system. If you don't have that, you can get it from https://www.jetbrains.com/idea/. How to do it… Let's follow the given steps to set up the dependencies for building RESTful services: First, we will create a new project in IntelliJ IDE. We will be using the Gradle build system for maintaining dependency, so create a Gradle project: When you have created the project, just add the following lines to your build.gradle file. These lines of code contain spring-boot dependencies that we will need to develop the web app: buildscript { ext.kotlin_version = '1.1.60' // Required for Kotlin integration ext.spring_boot_version = '1.5.4.RELEASE' repositories { jcenter() } dependencies { classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version" // Required for Kotlin integration classpath "org.jetbrains.kotlin:kotlin-allopen:$kotlin_version" // See https://kotlinlang.org/docs/reference/compiler-plugins.html#kotlin-spring-compiler-plugin classpath "org.springframework.boot:spring-boot-gradle-plugin:$spring_boot_version" } } apply plugin: 'kotlin' // Required for Kotlin integration apply plugin: "kotlin-spring" // See https://kotlinlang.org/docs/reference/compiler-plugins.html#kotlin-spring-compiler-plugin apply plugin: 'org.springframework.boot' jar { baseName = 'gs-rest-service' version = '0.1.0' } sourceSets { main.java.srcDirs += 'src/main/kotlin' } repositories { jcenter() } dependencies { compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version" // Required for Kotlin integration compile 'org.springframework.boot:spring-boot-starter-web' testCompile('org.springframework.boot:spring-boot-starter-test') } Let's now create an App.kt file in the following directory hierarchy: It is important to keep the App.kt file in a package (we've used the college package). Otherwise, you will get an error that says the following: ** WARNING ** : Your ApplicationContext is unlikely to start due to a `@ComponentScan` of the default package. The reason for this error is that if you don't include a package declaration, it considers it a "default package," which is discouraged and avoided. Now, let's try to run the App.kt class. 
We will put the following code to test if it's running: @SpringBootApplication open class App { } fun main(args: Array<String>) { SpringApplication.run(App::class.java, *args) } Now run the project; if everything goes well, you will see output with the following line at the end: Started AppKt in 5.875 seconds (JVM running for 6.445) We now have our application running on our embedded Tomcat server. If you go to http://localhost:8080, you will see an error as follows: The preceding error is 404 error and the reason for that is we haven't told our application to do anything when a user is on the / path. Creating a REST controller In the previous recipe, we learned how to set up dependencies for creating RESTful services. Finally, we launched our backend on the http://localhost:8080 endpoint but got 404 error as our application wasn't configured to handle requests at that path (/). We will start from that point and learn how to create a REST controller. Let's get started! We will be using IntelliJ IDE for coding purposes. For setting up of the environment, refer to the previous recipe. You can also find the source in the repository at https://gitlab.com/aanandshekharroy/kotlin-webservices. How to do it… In this recipe, we will create a REST controller that will fetch us information about students in a college. We will be using an in-memory database using a list to keep things simple: Let's first create a Student class having a name and roll number properties: package college class Student() { lateinit var roll_number: String lateinit var name: String constructor( roll_number: String, name: String): this() { this.roll_number = roll_number this.name = name } } Next, we will create the StudentDatabase endpoint, which will act as a database for the application: @Component class StudentDatabase { private val students = mutableListOf<Student>() } Note that we have annotated the StudentDatabase class with @Component, which means its lifecycle will be controlled by Spring (because we want it to act as a database for our application). We also need a @PostConstruct annotation, because it's an in-memory database that is destroyed when the application closes. So we would like to have a filled database whenever the application launches. So we will create an init method, which will add a few items into the "database" at startup time: @PostConstruct private fun init() { students.add(Student("2013001","Aanand Shekhar Roy")) students.add(Student("2013165","Rashi Karanpuria")) } Now, we will create a few other methods that will help us deal with our database: getStudent: Gets the list of students present in our database: fun getStudents()=students addStudent: This method will add a student to our database: fun addStudent(student: Student): Boolean { students.add(student) return true } Now let's put this database to use. We will be creating a REST controller that will handle the request. We will create a StudentController and annotate it with @RestController. Using @RestController is simple, and it's the preferred method for creating MVC RESTful web services. Once created, we need to provide our database using Spring dependency injection, for which we will need the @Autowired annotation. Here's how our StudentController looks: @RestController class StudentController { @Autowired private lateinit var database: StudentDatabase } Now we will set our response to the / path. We will show the list of students in our database. 
For that, we will simply create a method that lists out students. We will need to annotate it with @RequestMapping and provide parameters such as path and request method (GET, POST, and such): @RequestMapping("", method = arrayOf(RequestMethod.GET)) fun students() = database.getStudents() This is what our controller looks like now. It is a simple REST controller: package college import org.springframework.beans.factory.annotation.Autowired import org.springframework.web.bind.annotation.RequestMapping import org.springframework.web.bind.annotation.RequestMethod import org.springframework.web.bind.annotation.RestController @RestController class StudentController { @Autowired private lateinit var database: StudentDatabase @RequestMapping("", method = arrayOf(RequestMethod.GET)) fun students() = database.getStudents() } Now when you restart the server and go to http://localhost:8080, we will see the response as follows: As you can see, Spring is intelligent enough to provide the response in the JSON format, which makes it easy to design APIs. Now let's try to create another endpoint that will fetch a student's details from a roll number: @GetMapping("/student/{roll_number}") fun studentWithRollNumber( @PathVariable("roll_number") roll_number:String) = database.getStudentWithRollNumber(roll_number) Now, if you try the http://localhost:8080/student/2013001 endpoint, you will see the given output: {"roll_number":"2013001","name":"Aanand Shekhar Roy"} Next, we will try to add a student to the database. We will be doing it via the POST method: @RequestMapping("/add", method = arrayOf(RequestMethod.POST)) fun addStudent(@RequestBody student: Student) = if (database.addStudent(student)) student else throw Exception("Something went wrong") There's more… So far, our server has been dependent on IDE. We would definitely want to make it independent of an IDE. Thanks to Gradle, it is very easy to create a runnable JAR just with the following: ./gradlew clean bootRepackage The preceding command is platform independent and uses the Gradle build system to build the application. Now, you just need to type the mentioned command to run it: java -jar build/libs/gs-rest-service-0.1.0.jar You can then see the following output as before: Started AppKt in 4.858 seconds (JVM running for 5.548) This means your server is running successfully. Creating the Application class for Spring Boot The SpringApplication class is used to bootstrap our application. We've used it in the previous recipes; we will see how to create the Application class for Spring Boot in this recipe. We will be using IntelliJ IDE for coding purposes. To set up the environment, read previous recipes, especially the Setting up dependencies for building RESTful services recipe. How to do it… If you've used Spring Boot before, you must be familiar with using @Configuration, @EnableAutoConfiguration, and @ComponentScan in your main class. These were used so frequently that Spring Boot provides a convenient @SpringBootApplication alternative. The Spring Boot looks for the public static main method, and we will use a top-level function outside the Application class. If you noted, while setting up the dependencies, we used the kotlin-spring plugin, hence we don't need to make the Application class open. 
Here's an example of the Spring Boot application: package college import org.springframework.boot.SpringApplication import org.springframework.boot.autoconfigure.SpringBootApplication @SpringBootApplication class Application fun main(args: Array<String>) { SpringApplication.run(Application::class.java, *args) } The Spring Boot application executes the static run() method, which takes two parameters and starts a autoconfigured Tomcat web server when Spring application is started. When everything is set, you can start the application by executing the following command: ./gradlew bootRun If everything goes well, you will see the following output in the console: This is along with the last message—Started AppKt in xxx seconds. This means that your application is up and running. In order to run it as an independent server, you need to create a JAR and then you can execute as follows: ./gradlew clean bootRepackage Now, to run it, you just need to type the following command: java -jar build/libs/gs-rest-service-0.1.0.jar We learned how to set up dependencies for building RESTful services, creating a REST controller, and creating the application class for Spring boot. If you are interested in learning more about Kotlin then be sure to check out the 'Kotlin Programming Cookbook'. Build your first Android app with Kotlin 5 reasons to choose Kotlin over Java Getting started with Kotlin programming Forget C and Java. Learn Kotlin: the next universal programming language

Mastering Code Generation: Exploring the LLVM Backend

Kai Nacke, Amy Kwan
24 Oct 2024
10 min read
This article is an excerpt from the book, Learn LLVM 17 - Second Edition, by Kai Nacke, Amy Kwan. Learn how to build your own compiler, from reading the source to emitting optimized machine code. This book guides you through the JIT compilation framework, extending LLVM in a variety of ways, and using the right tools for troubleshooting.Introduction Generating optimized machine code is a critical task in the compilation process, and the LLVM backend plays a pivotal role in this transformation. The backend translates the LLVM Intermediate Representation (IR), derived from the Abstract Syntax Tree (AST), into machine code that can be executed on target architectures. Understanding how to generate this IR effectively is essential for leveraging LLVM's capabilities. This article delves into the intricacies of generating LLVM IR using a simple expression language example. We'll explore the necessary steps, from declaring library functions to implementing a code generation visitor, ensuring a comprehensive understanding of the LLVM backend's functionality. Generating code with the LLVM backend The task of the backend is to create optimized machine code from the LLVM IR of a module. The IR is the interface to the backend and can be created using a C++ interface or in textual form. Again, the IR is generated from the AST. Textual representation of LLVM IR Before trying to generate the LLVM IR, it should be clear what we want to generate. For our example expression language, the high-level plan is as follows: 1. Ask the user for the value of each variable. 2. Calculate the value of the expression. 3. Print the result. To ask the user to provide a value for a variable and to print the result, two library functions are used: calc_read() and calc_write(). For the with a: 3*a expression, the generated IR is as follows: 1. The library functions must be declared, like in C. The syntax also resembles C. The type before the function name is the return type. The type names surrounded by parenthesis are the argument types. The declaration can appear anywhere in the file: declare i32 @calc_read(ptr) declare void @calc_write(i32) 2. The calc_read() function takes the variable name as a parameter. The following construct defines a constant, holding a and the null byte used as a string terminator in C: @a.str = private constant [2 x i8] c"a\00" 3. It follows the main() function. The parameter names are omitted because they are not used.  Just as in C, the body of the function is enclosed in braces: define i32 @main(i32, ptr) { 4. Each basic block must have a label. Because this is the first basic block of the function, we name it entry: entry: 5. The calc_read() function is called to read the value for the a variable. The nested getelemenptr instruction performs an index calculation to compute the pointer to the first element of the string constant. The function result is assigned to the unnamed %2 variable.  %2 = call i32 @calc_read(ptr @a.str) 6. Next, the variable is multiplied by 3:  %3 = mul nsw i32 3, %2 7. The result is printed on the console via a call to the calc_write() function:  call void @calc_write(i32 %3) 8. Last, the main() function returns 0 to indicate a successful execution:  ret i32 0 } Each value in the LLVM IR is typed, with i32 denoting the 32-bit bit integer type and ptr denoting a pointer. Note: previous versions of LLVM used typed pointers. For example, a pointer to a byte was expressed as i8* in LLVM. Since L LVM 16, opaque pointers are the default. 
An opaque pointer is just a pointer to memory, without carrying any type information about it. The notation in LLVM IR is ptr.Previous versions of LLVM used typed pointers. For example, a pointer to a byte was expressed as i8* in LLVM. Since L LVM 16, opaque pointers are the default. An opaque pointer is just a pointer to memory, without carrying any type information about it. The notation in LLVM IR is ptr. Since it is now clear what the IR looks like, let’s generate it from the AST.  Generating the IR from the AST The interface, provided in the CodeGen.h header file, is very small:  #ifndef CODEGEN_H #define CODEGEN_H #include "AST.h" class CodeGen { public: void compile(AST *Tree); }; #endifBecause the AST contains the information, the basic idea is to use a visitor to walk the AST. The CodeGen.cpp file is implemented as follows: 1. The required includes are at the top of the file: #include "CodeGen.h" #include "llvm/ADT/StringMap.h" #include "llvm/IR/IRBuilder.h" #include "llvm/IR/LLVMContext.h" #include "llvm/Support/raw_ostream.h" 2. The namespace of the LLVM libraries is used for name lookups: using namespace llvm; 3. First, some private members are declared in the visitor. Each compilation unit is represented in LLVM by the Module class and the visitor has a pointer to the module called M. For easy IR generation, the Builder (of type IRBuilder<>) is used. LLVM has a class hierarchy to represent types in IR. You can look up the instances for basic types such as i32 from the LLVM context. These basic types are used very often. To avoid repeated lookups, we cache the needed type instances: VoidTy, Int32Ty, PtrTy, and Int32Zero. The V member is the current calculated value, which is updated through the tree traversal. And last, nameMap maps a variable name to the value returned from the calc_read() function: namespace { class ToIRVisitor : public ASTVisitor { Module *M; IRBuilder<> Builder; Type *VoidTy; Type *Int32Ty; PointerType *PtrTy; Constant *Int32Zero; Value *V; StringMap<Value *> nameMap;4. The constructor initializes all members: public: ToIRVisitor(Module *M) : M(M), Builder(M->getContext()) { VoidTy = Type::getVoidTy(M->getContext()); Int32Ty = Type::getInt32Ty(M->getContext()); PtrTy = PointerType::getUnqual(M->getContext()); Int32Zero = ConstantInt::get(Int32Ty, 0, true); }5. For each function, a FunctionType instance must be created. In C++ terminology, this is a function prototype. A function itself is defined with a Function instance. The run() method defines the main() function in the LLVM IR first:  void run(AST *Tree) { FunctionType *MainFty = FunctionType::get( Int32Ty, {Int32Ty, PtrTy}, false); Function *MainFn = Function::Create( MainFty, GlobalValue::ExternalLinkage, "main", M); 6. Then we create the BB basic block with the entry label, and attach it to the IR builder: BasicBlock *BB = BasicBlock::Create(M->getContext(), "entry", MainFn); Builder.SetInsertPoint(BB);7. With this preparation done, the tree traversal can begin:     Tree->accept(*this); 8. After the tree traversal, the computed value is printed via a call to the calc_write() function. Again, a function prototype (an instance of FunctionType) has to be created. The only parameter is the current value, V:  FunctionType *CalcWriteFnTy = FunctionType::get(VoidTy, {Int32Ty}, false); Function *CalcWriteFn = Function::Create( CalcWriteFnTy, GlobalValue::ExternalLinkage, "calc_write", M); Builder.CreateCall(CalcWriteFnTy, CalcWriteFn, {V});9. 
The generation finishes  by returning 0 from the main() function: Builder.CreateRet(Int32Zero); }10. A WithDecl node holds the names of the declared variables. First, we create a function prototype for the calc_read() function:  virtual void visit(WithDecl &Node) override { FunctionType *ReadFty = FunctionType::get(Int32Ty, {PtrTy}, false); Function *ReadFn = Function::Create( ReadFty, GlobalValue::ExternalLinkage, "calc_read", M);11. The method loops through the variable names:  for (auto I = Node.begin(), E = Node.end(); I != E; ++I) { 12. For each  variable, a string  with a variable name is created:  StringRef Var = *I; Constant *StrText = ConstantDataArray::getString( M->getContext(), Var); GlobalVariable *Str = new GlobalVariable( *M, StrText->getType(), /*isConstant=*/true, GlobalValue::PrivateLinkage, StrText, Twine(Var).concat(".str"));13. Then the IR code to call the calc_read() function is created. The string created in the previous step is passed as a parameter:  CallInst *Call = Builder.CreateCall(ReadFty, ReadFn, {Str});14. The returned value is stored in the mapNames map for later use:  nameMap[Var] = Call; }15. The tree traversal continues with the expression:  Node.getExpr()->accept(*this); };16. A Factor node is either a variable name or a number. For a variable name, the value is looked up in the mapNames map. For a number, the value is converted to an integer and turned into a constant value: virtual void visit(Factor &Node) override { if (Node.getKind() == Factor::Ident) { V = nameMap[Node.getVal()]; } else { int intval; Node.getVal().getAsInteger(10, intval); V = ConstantInt::get(Int32Ty, intval, true); } };17. And last, for a BinaryOp node, the right calculation operation must be used:  virtual void visit(BinaryOp &Node) override { Node.getLeft()->accept(*this); Value *Left = V; Node.getRight()->accept(*this); Value *Right = V; switch (Node.getOperator()) { case BinaryOp::Plus: V = Builder.CreateNSWAdd(Left, Right); break; case BinaryOp::Minus: V = Builder.CreateNSWSub(Left, Right); break; case BinaryOp::Mul: V = Builder.CreateNSWMul(Left, Right); break; case BinaryOp::Div: V = Builder.CreateSDiv(Left, Right); break;    }       };       };       }18. With this, the visitor class is complete. The compile() method creates the global context and the  module, runs the tree traversal, and dumps the generated IR to the console:  void CodeGen::compile(AST *Tree) { LLVMContext Ctx; Module *M = new Module("calc.expr", Ctx); ToIRVisitor ToIR(M); ToIR.run(Tree); M->print(outs(), nullptr); }We now have implemented the frontend of the compiler, from reading the source up to generating the IR. Of course, all these components must work together on user input, which is the task of the compiler driver. We also need to implement the functions needed at runtime. Both are topics of the next section covered in the book. Conclusion In conclusion, the process of generating LLVM IR from an AST involves multiple steps, each crucial for producing efficient machine code. This article highlighted the structure and components necessary for this task, including function declarations, basic block management, and tree traversal using a visitor pattern. By carefully managing these elements, developers can harness the power of LLVM to create optimized and reliable machine code. The integration of all these components, alongside user input and runtime functions, completes the frontend implementation of the compiler. 
Conclusion

In conclusion, the process of generating LLVM IR from an AST involves multiple steps, each crucial for producing efficient machine code. This article highlighted the structure and components necessary for this task, including function declarations, basic block management, and tree traversal using a visitor pattern. By carefully managing these elements, developers can harness the power of LLVM to create optimized and reliable machine code. The integration of all these components, alongside user input and runtime functions, completes the frontend implementation of the compiler. This sets the stage for the next phase, focusing on the compiler driver and runtime functions, ensuring seamless execution and integration of the compiled code.

Author Bio

Kai Nacke is a professional IT architect currently residing in Toronto, Canada. He holds a diploma in computer science from the Technical University of Dortmund, Germany, and his diploma thesis on universal hash functions was recognized as the best of the semester. With over 20 years of experience in the IT industry, Kai has extensive expertise in the development and architecture of business and enterprise applications. In his current role, he evolves an LLVM/clang-based compiler. For several years, Kai served as the maintainer of LDC, the LLVM-based D compiler. He is the author of D Web Development and Learn LLVM 12, both published by Packt. In the past, he was a speaker in the LLVM developer room at the Free and Open Source Software Developers' European Meeting (FOSDEM).

Amy Kwan is a compiler developer currently residing in Toronto, Canada. Originally from the Canadian prairies, Amy holds a Bachelor of Science in Computer Science from the University of Saskatchewan. In her current role, she leverages LLVM technology as a backend compiler developer. Previously, Amy was a speaker at the LLVM Developer Conference in 2022 alongside Kai Nacke.

PowerShell Basics for IT Professionals

Savia Lobo
16 Dec 2019
6 min read
PowerShell is Microsoft's automation platform for IT pros. Of late, there have been a lot of questions around the complexity of this latest automation tool by Microsoft. At Microsoft Ignite 2018, Jason Himmelstein, Director of Technical Strategy and Strategic Partnerships, Office Apps & Services MVP, explained the basics of PowerShell and how to truly optimize your SharePoint implementation using this powerful IT pro toolset. While in this post we look at the big picture, you can check out the complete video here: 'Introduction to PowerShell for the anxious IT pro'.

Want to do more with PowerShell? After learning the basics, you can learn how to use PowerShell to automate complex Windows server tasks. You can also improve PowerShell's usability, and control and manage Windows-based environments, by working through the recipes given in Windows Server 2019 Automation with PowerShell Cookbook - Third Edition written by Thomas Lee.

Himmelstein starts off by saying that PowerShell isn't a packaged executable, nor is it developer-centric in a way that requires one to understand code; it is easy for an IT pro to understand.

What is PowerShell?

Windows PowerShell is Microsoft's task automation framework, consisting of a command-line shell and an associated scripting language built on the .NET Framework. It provides full access to COM and WMI, enabling administrators to perform administrative tasks on both local and remote Windows systems. In simple words, PowerShell is an object-based, not a text-based, command-line interface for Microsoft technologies. This means results in PowerShell can be acted upon and not just read. One can cause huge damage to an environment using PowerShell, as there is no back button in PowerShell. To check what went wrong you can read the logs, but you cannot undo actions.

Why PowerShell matters

Regardless of the platform a person uses, such as Office 365, Azure, and so on, PowerShell can be easily implemented due to its cross-platform capability. Himmelstein also highlights that one can get started with Azure PowerShell by trying it out in an Azure Cloud Shell environment, an interactive, authenticated, browser-accessible shell for managing Azure resources. Azure Cloud Shell comes equipped with commonly used CLI tools, including Linux shell interpreters, PowerShell modules, Azure tools, text editors, source control, build tools, container tools, database tools, and more. Cloud Shell also includes language support for several popular programming languages such as Node.js, .NET, and Python. Cloud Shell also authenticates securely and automatically for instant access to your resources through the Azure CLI or Azure PowerShell cmdlets. Users can use PowerShell in Cloud Shell. One can also develop applications using PowerShell, or use PowerShell via Source Control Management (SCM).

Basics of PowerShell

PowerShell Hardware

There are two ways one can use PowerShell: one is via the PowerShell Console, which is similar to a command line; the other is the PowerShell ISE (Integrated Scripting Environment). One thing Himmelstein encourages is, "we run PowerShell in the Console and we write PowerShell in the ISE." The reason is that there are certain functionalities that do not work in the ISE when one hits the 'Run' command. In such cases, the user will have to take that PowerShell out, copy it, save the file, and run it in a command window.

cmdlets

Cmdlets are the main building blocks of PowerShell. These are mini commands that perform one action.
These have the ability to pipe the output of one cmdlet into further cmdlets. They can also perform equality tests with expressions such as -eq, -lt, and -match, so one can easily compare values within PowerShell.

Modules

There are four types of modules in PowerShell:

Script: A script module is a file (.psm1) that contains any valid Windows PowerShell code.
Binary: A binary module is a .NET Framework assembly (.dll) that contains compiled code.
Manifest: A module manifest is a Windows PowerShell data file (.psd1) that describes the contents of a module and determines how a module is processed.
Dynamic: A dynamic module does not persist to disk. It is created using New-Module, is intended to be short-lived, and cannot be accessed by Get-Module. Himmelstein prefers not to use dynamic modules as they persist for just one session.

Objects and Members

Objects are instances of classes and have properties and methods. Members are the properties and methods of an object. Properties define what an object is, and methods define what you can do with the object. Himmelstein puts all these terms together in a simple way:

Objects = stuff
Cmdlets = things you can do with the stuff
Modules = lists of things you can do with the stuff
Properties = details about the stuff
Methods = instructions for things you can do with the stuff

Pipeline

Using pipelines, one can chain objects together for processing. The output of one pipelined cmdlet becomes the input of the next.

Functional Explanation

Get-Command: Gets all the cmdlets installed on your computer.
Get-Help: Displays additional information about a cmdlet.
Get-Member: Lists the properties and methods of a command or object.
Get-Verb: Gets the approved Windows PowerShell verbs.
Start-Transcript: Logs everything you do in that PowerShell window to a file.
Get-History: If you didn't start a transcript, you can still review your history before closing your shell or ISE window.

Tips for PowerShell beginners

Use variables: You can use any variables except the ones reserved by the system; you will be prompted when you try to enter a reserved variable.
Call one thing at a time.
Comment your scripts, as this may save you a lot of time.
Create scripts using an ISE/IDE; you can also use Visual Studio Code and then execute the script in the shell.
Dispose of your objects.
Close the command window by typing Exit.
Test before using in production.
Write reusable scripts.

What PowerShell beginners should avoid

Rewriting your variables.
Hard-coding values such as passwords into your scripts.
Taking code from the internet or a vendor and just running it in your environment (you should read every piece of code before you run it in your environment).
Assuming the code is not harmful; it is. There is no back button in PowerShell and you cannot undo things.
Running your code in an IDE/ISE and expecting everything to work.

PowerShell Syntax and Bracketology

Syntax
'#' is for a comment
'+' is for add
'=' and '-eq' are for equal
'!', '-ne', and '-not' are for not equal

Brackets
'()' Curved brackets, also known as parentheses, are used for required options, compulsory arguments, or control structures.
'{}' Curly brackets are used for block expressions within a command block and are also used to open a code block.
'[]' Square brackets are used to denote optional elements or parameters and are also used for match functions.

Now that you know the basics of PowerShell, you can start performing key admin tasks on Windows Server 2019. A short example that ties cmdlets, the pipeline, and objects together is sketched below.
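As an illustration only (this snippet is not from Himmelstein's session), here is a minimal sketch of how cmdlets, objects, and the pipeline fit together. It lists the five processes using the most memory and saves the result to a hypothetical C:\Temp\top-processes.csv path; adjust the path for your own environment.

# Get process objects, sort them by memory usage, and keep the top five
Get-Process |
    Sort-Object -Property WorkingSet -Descending |
    Select-Object -First 5 -Property Name, Id, WorkingSet |
    Export-Csv -Path 'C:\Temp\top-processes.csv' -NoTypeInformation

# Inspect the members (properties and methods) of the objects a cmdlet returns
Get-Process | Get-Member -MemberType Property

Because every stage of the pipeline passes full .NET objects rather than plain text, Sort-Object and Select-Object can work directly with the WorkingSet property instead of parsing strings.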
To further learn how to employ best practices for writing PowerShell scripts and configuring Windows Server 2019 and leverage PowerShell to automate complex Windows server tasks, check out our book, Windows Server 2019 Automation with PowerShell Cookbook - Third Edition written by Thomas Lee. Weaponizing PowerShell with Metasploit and how to defend against PowerShell attacks [Tutorial] Scripting with Windows Powershell Desired State Configuration [Video] Automate tasks using Azure PowerShell and Azure CLI [Tutorial]

Bayesian Network Fundamentals

Packt
10 Aug 2015
25 min read
In this article by Ankur Ankan and Abinash Panda, the authors of Mastering Probabilistic Graphical Models Using Python, we'll cover the basics of random variables, probability theory, and graph theory. We'll also see the Bayesian models and the independencies in Bayesian models. A graphical model is essentially a way of representing joint probability distribution over a set of random variables in a compact and intuitive form. There are two main types of graphical models, namely directed and undirected. We generally use a directed model, also known as a Bayesian network, when we mostly have a causal relationship between the random variables. Graphical models also give us tools to operate on these models to find conditional and marginal probabilities of variables, while keeping the computational complexity under control. (For more resources related to this topic, see here.) Probability theory To understand the concepts of probability theory, let's start with a real-life situation. Let's assume we want to go for an outing on a weekend. There are a lot of things to consider before going: the weather conditions, the traffic, and many other factors. If the weather is windy or cloudy, then it is probably not a good idea to go out. However, even if we have information about the weather, we cannot be completely sure whether to go or not; hence we have used the words probably or maybe. Similarly, if it is windy in the morning (or at the time we took our observations), we cannot be completely certain that it will be windy throughout the day. The same holds for cloudy weather; it might turn out to be a very pleasant day. Further, we are not completely certain of our observations. There are always some limitations in our ability to observe; sometimes, these observations could even be noisy. In short, uncertainty or randomness is the innate nature of the world. The probability theory provides us the necessary tools to study this uncertainty. It helps us look into options that are unlikely yet probable. Random variable Probability deals with the study of events. From our intuition, we can say that some events are more likely than others, but to quantify the likeliness of a particular event, we require the probability theory. It helps us predict the future by assessing how likely the outcomes are. Before going deeper into the probability theory, let's first get acquainted with the basic terminologies and definitions of the probability theory. A random variable is a way of representing an attribute of the outcome. Formally, a random variable X is a function that maps a possible set of outcomes ? to some set E, which is represented as follows: X : ? ? E As an example, let us consider the outing example again. To decide whether to go or not, we may consider the skycover (to check whether it is cloudy or not). Skycover is an attribute of the day. Mathematically, the random variable skycover (X) is interpreted as a function, which maps the day (?) to its skycover values (E). So when we say the event X = 40.1, it represents the set of all the days {?} such that  , where  is the mapping function. Formally speaking, . Random variables can either be discrete or continuous. A discrete random variable can only take a finite number of values. For example, the random variable representing the outcome of a coin toss can take only two values, heads or tails; and hence, it is discrete. Whereas, a continuous random variable can take infinite number of values. 
For example, a variable representing the speed of a car can take any number values. For any event whose outcome is represented by some random variable (X), we can assign some value to each of the possible outcomes of X, which represents how probable it is. This is known as the probability distribution of the random variable and is denoted by P(X). For example, consider a set of restaurants. Let X be a random variable representing the quality of food in a restaurant. It can take up a set of values, such as {good, bad, average}. P(X), represents the probability distribution of X, that is, if P(X = good) = 0.3, P(X = average) = 0.5, and P(X = bad) = 0.2. This means there is 30 percent chance of a restaurant serving good food, 50 percent chance of it serving average food, and 20 percent chance of it serving bad food. Independence and conditional independence In most of the situations, we are rather more interested in looking at multiple attributes at the same time. For example, to choose a restaurant, we won't only be looking just at the quality of food; we might also want to look at other attributes, such as the cost, location, size, and so on. We can have a probability distribution over a combination of these attributes as well. This type of distribution is known as joint probability distribution. Going back to our restaurant example, let the random variable for the quality of food be represented by Q, and the cost of food be represented by C. Q can have three categorical values, namely {good, average, bad}, and C can have the values {high, low}. So, the joint distribution for P(Q, C) would have probability values for all the combinations of states of Q and C. P(Q = good, C = high) will represent the probability of a pricey restaurant with good quality food, while P(Q = bad, C = low) will represent the probability of a restaurant that is less expensive with bad quality food. Let us consider another random variable representing an attribute of a restaurant, its location L. The cost of food in a restaurant is not only affected by the quality of food but also the location (generally, a restaurant located in a very good location would be more costly as compared to a restaurant present in a not-very-good location). From our intuition, we can say that the probability of a costly restaurant located at a very good location in a city would be different (generally, more) than simply the probability of a costly restaurant, or the probability of a cheap restaurant located at a prime location of city is different (generally less) than simply probability of a cheap restaurant. Formally speaking, P(C = high | L = good) will be different from P(C = high) and P(C = low | L = good) will be different from P(C = low). This indicates that the random variables C and L are not independent of each other. These attributes or random variables need not always be dependent on each other. For example, the quality of food doesn't depend upon the location of restaurant. So, P(Q = good | L = good) or P(Q = good | L = bad)would be the same as P(Q = good), that is, our estimate of the quality of food of the restaurant will not change even if we have knowledge of its location. Hence, these random variables are independent of each other. In general, random variables  can be considered as independent of each other, if: They may also be considered independent if: We can easily derive this conclusion. 
We know the following from the chain rule of probability: P(X, Y) = P(X) P(Y | X) If Y is independent of X, that is, if X | Y, then P(Y | X) = P(Y). Then: P(X, Y) = P(X) P(Y) Extending this result on multiple variables, we can easily get to the conclusion that a set of random variables are independent of each other, if their joint probability distribution is equal to the product of probabilities of each individual random variable. Sometimes, the variables might not be independent of each other. To make this clearer, let's add another random variable, that is, the number of people visiting the restaurant N. Let's assume that, from our experience we know the number of people visiting only depends on the cost of food at the restaurant and its location (generally, lesser number of people visit costly restaurants). Does the quality of food Q affect the number of people visiting the restaurant? To answer this question, let's look into the random variable affecting N, cost C, and location L. As C is directly affected by Q, we can conclude that Q affects N. However, let's consider a situation when we know that the restaurant is costly, that is, C = high and let's ask the same question, "does the quality of food affect the number of people coming to the restaurant?". The answer is no. The number of people coming only depends on the price and location, so if we know that the cost is high, then we can easily conclude that fewer people will visit, irrespective of the quality of food. Hence,  . This type of independence is called conditional independence. Installing tools Let's now see some coding examples using pgmpy, to represent joint distributions and independencies. Here, we will mostly work with IPython and pgmpy (and a few other libraries) for coding examples. So, before moving ahead, let's get a basic introduction to these. IPython IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, which offers enhanced introspection, rich media, additional shell syntax, tab completion, and a rich history. IPython provides the following features: Powerful interactive shells (terminal and Qt-based) A browser-based notebook with support for code, text, mathematical expressions, inline plots, and other rich media Support for interactive data visualization and use of GUI toolkits Flexible and embeddable interpreters to load into one's own projects Easy-to-use and high performance tools for parallel computing You can install IPython using the following command: >>> pip3 install ipython To start the IPython command shell, you can simply type ipython3 in the terminal. For more installation instructions, you can visit http://ipython.org/install.html. pgmpy pgmpy is a Python library to work with Probabilistic Graphical models. As it's currently not on PyPi, we will need to build it manually. You can get the source code from the Git repository using the following command: >>> git clone https://github.com/pgmpy/pgmpy Now cd into the cloned directory switch branch for version used and build it with the following code: >>> cd pgmpy >>> git checkout book/v0.1 >>> sudo python3 setup.py install For more installation instructions, you can visit http://pgmpy.org/install.html. With both IPython and pgmpy installed, you should now be able to run the examples. 
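Before moving on to the pgmpy classes, here is a small plain-Python sanity check of the conditional independence idea from the restaurant example. The numbers are made up purely for illustration: the number of visitors N is constructed to depend only on the cost C, so conditioning on C makes N independent of the quality Q.

# Toy CPDs (invented numbers): P(Q), P(C|Q), P(N|C)
P_Q = {'good': 0.3, 'average': 0.5, 'bad': 0.2}
P_C_given_Q = {'good': {'high': 0.8, 'low': 0.2},
               'average': {'high': 0.6, 'low': 0.4},
               'bad': {'high': 0.1, 'low': 0.9}}
P_N_given_C = {'high': {'few': 0.7, 'many': 0.3},
               'low': {'few': 0.2, 'many': 0.8}}

# Build the joint distribution P(Q, C, N) = P(Q) * P(C|Q) * P(N|C)
joint = {(q, c, n): P_Q[q] * P_C_given_Q[q][c] * P_N_given_C[c][n]
         for q in P_Q for c in ('high', 'low') for n in ('few', 'many')}

def p_few_given(c, q=None):
    """P(N = 'few' | C = c), optionally also conditioning on Q = q."""
    keep = [(k, v) for k, v in joint.items()
            if k[1] == c and (q is None or k[0] == q)]
    total = sum(v for _, v in keep)
    return sum(v for k, v in keep if k[2] == 'few') / total

print(p_few_given('high'))           # 0.7
print(p_few_given('high', 'good'))   # also 0.7, so N is independent of Q given C

Conditioning on Q does not change the distribution of N once C is known, which is exactly the conditional independence statement made for the restaurant example above.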
Representing independencies using pgmpy To represent independencies, pgmpy has two classes, namely IndependenceAssertion and Independencies. The IndependenceAssertion class is used to represent individual assertions of the form of  or  . Let's see some code to represent assertions: # Firstly we need to import IndependenceAssertion In [1]: from pgmpy.independencies import IndependenceAssertion # Each assertion is in the form of [X, Y, Z] meaning X is # independent of Y given Z. In [2]: assertion1 = IndependenceAssertion('X', 'Y') In [3]: assertion1 Out[3]: (X _|_ Y) Here, assertion1 represents that the variable X is independent of the variable Y. To represent conditional assertions, we just need to add a third argument to IndependenceAssertion: In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z') In [5]: assertion2 Out [5]: (X _|_ Y | Z) In the preceding example, assertion2 represents . IndependenceAssertion also allows us to represent assertions in the form of  . To do this, we just need to pass a list of random variables as arguments: In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z') In [5]: assertion2 Out[5]: (X _|_ Y | Z) Moving on to the Independencies class, an Independencies object is used to represent a set of assertions. Often, in the case of Bayesian or Markov networks, we have more than one assertion corresponding to a given model, and to represent these independence assertions for the models, we generally use the Independencies object. Let's take a few examples: In [8]: from pgmpy.independencies import Independencies # There are multiple ways to create an Independencies object, we # could either initialize an empty object or initialize with some # assertions.   In [9]: independencies = Independencies() # Empty object In [10]: independencies.get_assertions() Out[10]: []   In [11]: independencies.add_assertions(assertion1, assertion2) In [12]: independencies.get_assertions() Out[12]: [(X _|_ Y), (X _|_ Y | Z)] We can also directly initialize Independencies in these two ways: In [13]: independencies = Independencies(assertion1, assertion2) In [14]: independencies = Independencies(['X', 'Y'],                                          ['A', 'B', 'C']) In [15]: independencies.get_assertions() Out[15]: [(X _|_ Y), (A _|_ B | C)] Representing joint probability distributions using pgmpy We can also represent joint probability distributions using pgmpy's JointProbabilityDistribution class. Let's say we want to represent the joint distribution over the outcomes of tossing two fair coins. So, in this case, the probability of all the possible outcomes would be 0.25, which is shown as follows: In [16]: from pgmpy.factors import JointProbabilityDistribution as         Joint In [17]: distribution = Joint(['coin1', 'coin2'],                              [2, 2],                              [0.25, 0.25, 0.25, 0.25]) Here, the first argument includes names of random variable. The second argument is a list of the number of states of each random variable. The third argument is a list of probability values, assuming that the first variable changes its states the slowest. 
So, the preceding distribution represents the following: In [18]: print(distribution) +--------------------------------------+ ¦ coin1   ¦ coin2   ¦   P(coin1,coin2) ¦ ¦---------+---------+------------------¦ ¦ coin1_0 ¦ coin2_0 ¦   0.2500         ¦ +---------+---------+------------------¦ ¦ coin1_0 ¦ coin2_1 ¦   0.2500         ¦ +---------+---------+------------------¦ ¦ coin1_1 ¦ coin2_0 ¦   0.2500         ¦ +---------+---------+------------------¦ ¦ coin1_1 ¦ coin2_1 ¦   0.2500         ¦ +--------------------------------------+ We can also conduct independence queries over these distributions in pgmpy: In [19]: distribution.check_independence('coin1', 'coin2') Out[20]: True Conditional probability distribution Let's take an example to understand conditional probability better. Let's say we have a bag containing three apples and five oranges, and we want to randomly take out fruits from the bag one at a time without replacing them. Also, the random variables  and  represent the outcomes in the first try and second try respectively. So, as there are three apples and five oranges in the bag initially,  and  . Now, let's say that in our first attempt we got an orange. Now, we cannot simply represent the probability of getting an apple or orange in our second attempt. The probabilities in the second attempt will depend on the outcome of our first attempt and therefore, we use conditional probability to represent such cases. Now, in the second attempt, we will have the following probabilities that depend on the outcome of our first try:  ,  ,  , and  . The Conditional Probability Distribution (CPD) of two variables  and  can be represented as  , representing the probability of  given  that is the probability of  after the event  has occurred and we know it's outcome. Similarly, we can have  representing the probability of  after having an observation for . The simplest representation of CPD is tabular CPD. In a tabular CPD, we construct a table containing all the possible combinations of different states of the random variables and the probabilities corresponding to these states. Let's consider the earlier restaurant example. Let's begin by representing the marginal distribution of the quality of food with Q. As we mentioned earlier, it can be categorized into three values {good, bad, average}. For example, P(Q) can be represented in the tabular form as follows: Quality P(Q) Good 0.3 Normal 0.5 Bad 0.2 Similarly, let's say P(L) is the probability distribution of the location of the restaurant. Its CPD can be represented as follows: Location P(L) Good 0.6 Bad 0.4 As the cost of restaurant C depends on both the quality of food Q and its location L, we will be considering P(C | Q, L), which is the conditional distribution of C, given Q and L: Location Good Bad Quality Good Normal Bad Good Normal Bad Cost             High 0.8 0.6 0.1 0.6 0.6 0.05 Low 0.2 0.4 0.9 0.4 0.4 0.95 Representing CPDs using pgmpy Let's first see how to represent the tabular CPD using pgmpy for variables that have no conditional variables: In [1]: from pgmpy.factors import TabularCPD   # For creating a TabularCPD object we need to pass three # arguments: the variable name, its cardinality that is the number # of states of the random variable and the probability value # corresponding each state. 
In [2]: quality = TabularCPD(variable='Quality',                              variable_card=3,                                values=[[0.3], [0.5], [0.2]]) In [3]: print(quality) +----------------------+ ¦ ['Quality', 0] ¦ 0.3 ¦ +----------------+-----¦ ¦ ['Quality', 1] ¦ 0.5 ¦ +----------------+-----¦ ¦ ['Quality', 2] ¦ 0.2 ¦ +----------------------+ In [4]: quality.variables Out[4]: OrderedDict([('Quality', [State(var='Quality', state=0),                                  State(var='Quality', state=1),                                  State(var='Quality', state=2)])])   In [5]: quality.cardinality Out[5]: array([3])   In [6]: quality.values Out[6]: array([0.3, 0.5, 0.2]) You can see here that the values of the CPD are a 1D array instead of a 2D array, which you passed as an argument. Actually, pgmpy internally stores the values of the TabularCPD as a flattened numpy array. In [7]: location = TabularCPD(variable='Location',                               variable_card=2,                              values=[[0.6], [0.4]]) In [8]: print(location) +-----------------------+ ¦ ['Location', 0] ¦ 0.6 ¦ +-----------------+-----¦ ¦ ['Location', 1] ¦ 0.4 ¦ +-----------------------+ However, when we have conditional variables, we also need to specify them and the cardinality of those variables. Let's define the TabularCPD for the cost variable: In [9]: cost = TabularCPD(                      variable='Cost',                      variable_card=2,                      values=[[0.8, 0.6, 0.1, 0.6, 0.6, 0.05],                              [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]],                      evidence=['Q', 'L'],                      evidence_card=[3, 2]) Graph theory The second major framework for the study of probabilistic graphical models is graph theory. Graphs are the skeleton of PGMs, and are used to compactly encode the independence conditions of a probability distribution. Nodes and edges The foundation of graph theory was laid by Leonhard Euler when he solved the famous Seven Bridges of Konigsberg problem. The city of Konigsberg was set on both sides by the Pregel river and included two islands that were connected and maintained by seven bridges. The problem was to find a walk to exactly cross all the bridges once in a single walk. To visualize the problem, let's think of the graph in Fig 1.1: Fig 1.1: The Seven Bridges of Konigsberg graph Here, the nodes a, b, c, and d represent the land, and are known as vertices of the graph. The line segments ab, bc, cd, da, ab, and bc connecting the land parts are the bridges and are known as the edges of the graph. So, we can think of the problem of crossing all the bridges once in a single walk as tracing along all the edges of the graph without lifting our pencils. Formally, a graph G = (V, E) is an ordered pair of finite sets. The elements of the set V are known as the nodes or the vertices of the graph, and the elements of  are the edges or the arcs of the graph. The number of nodes or cardinality of G, denoted by |V|, are known as the order of the graph. Similarly, the number of edges denoted by |E| are known as the size of the graph. Here, we can see that the Konigsberg city graph shown in Fig 1.1 is of order 4 and size 7. In a graph, we say that two vertices, u, v ? V are adjacent if u, v ? E. In the City graph, all the four vertices are adjacent to each other because there is an edge for every possible combination of two vertices in the graph. Also, for a vertex v ? V, we define the neighbors set of v as  . 
In the City graph, we can see that b and d are neighbors of c. Similarly, a, b, and c are neighbors of d. We define an edge to be a self loop if the start vertex and the end vertex of the edge are the same. We can put it more formally as, any edge of the form (u, u), where u ? V is a self loop. Until now, we have been talking only about graphs whose edges don't have a direction associated with them, which means that the edge (u, v) is same as the edge (v, u). These types of graphs are known as undirected graphs. Similarly, we can think of a graph whose edges have a sense of direction associated with it. For these graphs, the edge set E would be a set of ordered pair of vertices. These types of graphs are known as directed graphs. In the case of a directed graph, we also define the indegree and outdegree for a vertex. For a vertex v ? V, we define its outdegree as the number of edges originating from the vertex v, that is,  . Similarly, the indegree is defined as the number of edges that end at the vertex v, that is,  . Walk, paths, and trails For a graph G = (V, E) and u,v ? V, we define a u - v walk as an alternating sequence of vertices and edges, starting with u and ending with v. In the City graph of Fig 1.1, we can have an example of a - d walk as . If there aren't multiple edges between the same vertices, then we simply represent a walk by a sequence of vertices. As in the case of the Butterfly graph shown in Fig 1.2, we can have a walk W : a, c, d, c, e: Fig 1.2: Butterfly graph—a undirected graph A walk with no repeated edges is known as a trail. For example, the walk  in the City graph is a trail. Also, a walk with no repeated vertices, except possibly the first and the last, is known as a path. For example, the walk  in the City graph is a path. Also, a graph is known as cyclic if there are one or more paths that start and end at the same node. Such paths are known as cycles. Similarly, if there are no cycles in a graph, it is known as an acyclic graph. Bayesian models In most of the real-life cases when we would be representing or modeling some event, we would be dealing with a lot of random variables. Even if we would consider all the random variables to be discrete, there would still be exponentially large number of values in the joint probability distribution. Dealing with such huge amount of data would be computationally expensive (and in some cases, even intractable), and would also require huge amount of memory to store the probability of each combination of states of these random variables. However, in most of the cases, many of these variables are marginally or conditionally independent of each other. By exploiting these independencies, we can reduce the number of values we need to store to represent the joint probability distribution. For instance, in the previous restaurant example, the joint probability distribution across the four random variables that we discussed (that is, quality of food Q, location of restaurant L, cost of food C, and the number of people visiting N) would require us to store 23 independent values. By the chain rule of probability, we know the following: P(Q, L, C, N) = P(Q) P(L|Q) P(C|L, Q) P(N|C, Q, L) Now, let us try to exploit the marginal and conditional independence between the variables, to make the representation more compact. Let's start by considering the independency between the location of the restaurant and quality of food over there. As both of these attributes are independent of each other, P(L|Q) would be the same as P(L). 
Therefore, we need to store only one parameter to represent it. From the conditional independence that we have seen earlier, we know that  . Thus, P(N|C, Q, L) would be the same as P(N|C, L); thus needing only four parameters. Therefore, we now need only (2 + 1 + 6 + 4 = 13) parameters to represent the whole distribution. We can conclude that exploiting independencies helps in the compact representation of joint probability distribution. This forms the basis for the Bayesian network. Representation A Bayesian network is represented by a Directed Acyclic Graph (DAG) and a set of Conditional Probability Distributions (CPD) in which: The nodes represent random variables The edges represent dependencies For each of the nodes, we have a CPD In our previous restaurant example, the nodes would be as follows: Quality of food (Q) Location (L) Cost of food (C) Number of people (N) As the cost of food was dependent on the quality of food (Q) and the location of the restaurant (L), there will be an edge each from Q ? C and L ? C. Similarly, as the number of people visiting the restaurant depends on the price of food and its location, there would be an edge each from L ? N and C ? N. The resulting structure of our Bayesian network is shown in Fig 1.3: Fig 1.3: Bayesian network for the restaurant example Factorization of a distribution over a network Each node in our Bayesian network for restaurants has a CPD associated to it. For example, the CPD for the cost of food in the restaurant is P(C|Q, L), as it only depends on the quality of food and location. For the number of people, it would be P(N|C, L) . So, we can generalize that the CPD associated with each node would be P(node|Par(node)) where Par(node) denotes the parents of the node in the graph. Assuming some probability values, we will finally get a network as shown in Fig 1.4: Fig 1.4: Bayesian network of restaurant along with CPDs Let us go back to the joint probability distribution of all these attributes of the restaurant again. Considering the independencies among variables, we concluded as follows: P(Q,C,L,N) = P(Q)P(L)P(C|Q, L)P(N|C, L) So now, looking into the Bayesian network (BN) for the restaurant, we can say that for any Bayesian network, the joint probability distribution  over all its random variables {X1,X2,...,Xn} can be represented as follows: This is known as the chain rule for Bayesian networks. Also, we say that a distribution P factorizes over a graph G, if P can be encoded as follows: Here, ParG(X) is the parent of X in the graph G. Summary In this article, we saw how we can represent a complex joint probability distribution using a directed graph and a conditional probability distribution associated with each node, which is collectively known as a Bayesian network. Resources for Article:   Further resources on this subject: Web Scraping with Python [article] Exact Inference Using Graphical Models [article] wxPython: Design Approaches and Techniques [article]
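To connect the restaurant example back to code, here is a hedged sketch of how the network in Fig 1.3 could be declared with pgmpy. It assumes the same pgmpy era used in this article (where the model class is BayesianModel; in recent releases the classes live under pgmpy.models.BayesianNetwork and pgmpy.factors.discrete.TabularCPD), and the CPD numbers for N are invented for illustration.

In [1]: from pgmpy.models import BayesianModel
In [2]: from pgmpy.factors import TabularCPD

# Structure from Fig 1.3: Q -> C, L -> C, L -> N, C -> N
In [3]: model = BayesianModel([('Q', 'C'), ('L', 'C'), ('L', 'N'), ('C', 'N')])

In [4]: quality = TabularCPD('Q', 3, [[0.3], [0.5], [0.2]])
In [5]: location = TabularCPD('L', 2, [[0.6], [0.4]])
In [6]: cost = TabularCPD('C', 2,
                          [[0.8, 0.6, 0.1, 0.6, 0.6, 0.05],
                           [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]],
                          evidence=['Q', 'L'], evidence_card=[3, 2])
# Invented numbers for the number of visitors, conditioned on C and L
In [7]: visitors = TabularCPD('N', 2,
                              [[0.6, 0.8, 0.1, 0.3],
                               [0.4, 0.2, 0.9, 0.7]],
                              evidence=['C', 'L'], evidence_card=[2, 2])

In [8]: model.add_cpds(quality, location, cost, visitors)
In [9]: model.check_model()   # True if the CPDs are consistent with the graph
Out[9]: True

The four CPDs attached to the graph are exactly the factors in P(Q, L, C, N) = P(Q) P(L) P(C|Q, L) P(N|C, L), which is the chain rule for Bayesian networks described above.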

Optimizing Salesforce Flows

Andy Forbes, Philip Safir, Joseph Kubon, Francisco Fálder
24 Jan 2024
13 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, ChatGPT for Accelerating Salesforce Development, by Andy Forbes, Philip Safir, Joseph Kubon, Francisco Fálder. Harness ChatGPT for streamlined flows, effective configuration, proficient code writing, and enhanced project activities.IntroductionEmbark on a journey through Salesforce wizardry in this article as we delve into the art of optimizing Flows for email functionality within the Snowflake data realm. Picture this: you, the savvy sales rep, armed with a wizard that guides you through the intricate process of selecting the perfect contact for your email endeavors. Join us in dissecting a case study, navigating through user stories, and tackling challenges head-on. We'll explore the nitty-gritty details, gaps, and tricks to ensure a seamless Salesforce experience. Get ready to enhance your Flow game and witness the magic unfold in the world of Salesforce efficiency!Enhancing Email Selection Wizards for Sales EfficiencyThe case study from Appendix A has several user stories that can be addressed with Flows.Let’s use the base flow request:“Please provide detailed directions on how to create a Salesforce Flow to address the following user story. Please note any gaps in the user story that leave an opportunity for misunderstandings, as well as any parts of the flow that may be tricky to create. Include error handling in the Flow, and write a description of what the Flow does for the project documentation.”We’ll also use epic 3 and user story 2 from the case study to investigate ChatGPT’s capabilities.Epic 3 – email functionality with Snowflake dataUser Story 2 – create a Flow-based wizard to select a contact:• Goal: As a sales rep, I want a wizard to guide me through the process of selecting a contact to send the email to so that I can choose the correct recipient.• Acceptance criteria:‚ Given a sales rep clicks the Email button in the Snowflake section‚ When the wizard starts‚ Then the sales rep should be prompted to select a contact associated with the accountLet’s take a look:Prompt:Please provide detailed directions on how to create a Salesforce flow to address the following user story. Please note any gaps in the user story that leave an opportunity for misunderstandings, as well as any parts of the flow that may be tricky to create. Include error handling in the flow, and write a description of what the Flow does for the project documentation.Epic 3 – email functionality with Snowflake dataUser story 2 – create a Flow-based wizard to select a contact:•    Goal: As a sales rep, I want a wizard to guide me through the process of selecting a contact to send the email to so that I can choose the correct recipient•    Acceptance criteria:‚    Given a sales rep clicks the Email button in the Snowflake section‚    When the wizard starts‚    Then the sales rep should be prompted to select a contact associated with the accountResponse: Certainly! Let’s address the user story and design a Flow-based wizard to select a contact for emailing in Salesforce. We’ll also examine any potential gaps and challenges in the process.Flow design for email functionality with Snowflake data:1. 
Start the Flow:Trigger: The Flow will be triggered by clicking an Email button within the Snowflake sectionAction: A wizard screen will appear, prompting the sales rep to proceed with selecting a contact associated with the account2. Screen Element for Contact Selection:Create a screen element with a dropdown that lists contacts associated with the accountThis list should be dynamically generated based on the account related to the Snowflake section3. Error handling and validation:Include logic to check if there are any contacts associated with the account If no contacts are found, an error message should be displayed to the sales rep4. Finish the Flow:Once the contact has been selected, the wizard should end, and the selection should be available for further processing, such as sending an email Gaps and challenges:Ambiguity in the Snowflake section: The story does not clearly define what the “Snowflake section” refers to. Understanding its context is essential to properly link the Email button to the flow.Selection process clarity: The story does not provide details on what happens after contact selection, which might require collaboration with other stakeholders for complete understanding.Account association: Ensuring that the listed contacts are relevant to the specific account can be tricky and needs careful handling.Project documentationFlow name: Contact selection wizard for Snowflake email functionality.Description: This Salesforce flow enables sales reps to select a contact associated with an account within the Snowflake section by guiding them through a wizard. Triggered by clicking the Email button, the Flow provides a drop-down list of relevant contacts for selection, includes error handling to deal with scenarios where no contacts are found and ends with the selected contact being available for email communication.Associated epic: Epic 3 – email functionality with Snowflake data.User story: User story 2 – create a Flow-based wizard to select a contact.Goal: To facilitate the correct selection of an email recipient by guiding the sales rep through a structured process.By carefully considering these steps, challenges, and potential gaps, this Flow aims to provide an intuitive and error-free way for sales reps to select the appropriate contact for their email communications within the Snowflake section. It underscores the importance of alignment with stakeholders to ensure the full context and intended behavior of the Flow are understood and accurately implemented.ChatGPT identified three gaps and challenges. Let’s prompt ChatGPT to refine the answer based on answers to the gaps and challenges:Prompt:The Snowflake section on the contact page layout is a tab dedicated to displaying information from Snowflake. There are additional user stories that address what happens after this user story. The contacts to be listed are the ones where the account lookup field on the contact is all the same account. With these answers, please update and rewrite the response. Anything unaffected by these answers does not need to change. Response:Certainly! With the additional context provided, let’s refine the Flow design to create a Salesforce Flow-based wizard that aligns with the user story and its requirements.Flow design for email functionality with Snowflake data:1. 
Start the Flow:Trigger: The flow will be triggered by clicking an Email button within the Snowflake tab on the contact page layoutAction: A wizard screen will appear, prompting the sales rep to proceed with selecting a contact associated with the account2. Screen element for contact selection:Create a screen element with a dropdown that lists contacts associated with the same account, where the account lookup field on the contact matches the specific accountThis list should be dynamically generated based on the account related to theSnowflake tab3. Error handling and validation:Include logic to check if there are any contacts associated with the specific account. If no contacts are found, an error message should be displayed to the sales rep4. Finish the Flow:Once the contact has been selected, the wizard should end, and the selection should be available for further processing, as defined in the subsequent user stories.Gaps and challenges:Account association: Ensuring that the listed contacts are relevant to the specific account and filtered by the account lookup field can be tricky and needs careful handling.Project documentationFlow name: Contact selection wizard for Snowflake email functionality.Description: This Salesforce flow enables sales reps to select a contact associated with an account within the Snowflake tab on the contact page layout. Triggered by clicking the Email button, the flow provides a drop-down list of contacts where the account lookup field matches the specific account. It includes error handling to deal with scenarios where no contacts are found and ends with the selected contact being available for further processing as detailed in subsequent user stories.Associated Epic: Epic 3 – mail functionality with Snowflake data.User Story: User story 2 – create a Flow-based wizard to select a contact.Goal: To facilitate the correct selection of an email recipient by guiding the sales rep through a structured process within the Snowflake tab.By integrating the details about the Snowflake section and the specific Account relationship, this refined flow design aims to provide a seamless experience for Sales Reps in selecting the appropriate contact for email communications. It emphasizes the precision needed in filtering contacts based on the Account lookup field, thus catering to the specific requirements of the given user story.Even though the “Account Association” concern seems to have been addressed, ChatGPT is still listing it as a concern. This could be a ChatGPT quirk or there could still be a gap. Let’s directly ask for an explanation:Prompt: Under “Gaps and Challenges” is Account Association: ensuring that the listed contacts are relevant to the specific account and filtered by the account lookup field can be tricky and needs careful handling.I thought I addressed this by telling you that “The contacts to be listed are the ones where the account lookup field on the contact are all the same account.”Why is this tricky? What is needed to resolve this “Gap and Challenge”?Response:My apologies for the confusion. Your direction does indeed clarify the requirement, but let me further explain why I considered this as a potential challenge.The “tricky” part I mentioned relates to the design and implementation phase of the flow, specifically in the context of Salesforce. Here’s why:1. Query complexity: Retrieving contacts based on the specific account lookup field requires a well-constructed SOQL query. 
While it’s a straightforward requirement, the developer must ensure that the query is efficient, especially if there are large numbers of contacts related to an account.2. Contextual handling: The Flow must recognize the specific account related to the Snowflake tab from which the Email button is clicked. This requires proper handling of contextual information, ensuring that the correct account ID is used in filtering the contacts.3. Dynamic filtering: If the Account field is ever updated, or if the context changes dynamically within the user interface, the flow must respond accordingly, updating the list of contacts. Ensuring that this dynamic filtering works smoothly can require careful testing and validation.4. Permissions and security: Ensuring that the user has the appropriate permissions to view and select the contacts related to the specific account must be considered.This involves compliance with the Salesforce org’s security model.Resolution:1. Construct a robust SOQL query: Build a well-designed SOQL query to fetch contacts based on the specific account lookup field.2. Contextual Handling: Utilize the context of the button click within the Snowflake tab to identify the relevant account, possibly using Apex if necessary.3. Test dynamically: Implement tests to ensure that the dynamic filtering of contacts works as expected as the context changes.4. Consider security: Align the flow with the Salesforce org’s security protocols to ensure that it adheres to the relevant permissions and profiles.In the hands of an experienced Salesforce developer like yourself, these challenges would be well within reach to overcome. I mentioned them to highlight potential areas that might require attention, and they can be seen as aspects to validate rather than insurmountable obstacles.And since the authors want to get along with our AI overlords, here’s the prompt we provided:Prompt:Thank you for your explanation of how to address this “Gap and Challenge” as well as for your help with this user story!Response:You’re welcome! It’s been a pleasure working alongside you on this user story. Your in-depth knowledge and detailed questions have enabled a clear and precise understanding of the requirements. Should you have any more questions or need further assistance, feel free to reach out. Here’s to a successful implementation of the Flow, and may it greatly enhance the functionality and user experience within your Salesforce environment. Happy developing!ConclusionAnd there you have it – a deep dive into the world of Salesforce wizardry! From refining Flows to addressing challenges and optimizing for Snowflake data, we've navigated the intricate landscape of email functionality. Picture your sales team empowered with intuitive wizards, seamlessly selecting contacts for emails. As we wrap up, remember the key: precision in design and collaboration with stakeholders. Let this be a guide for a successful implementation, enhancing your Salesforce environment. Here's to smooth workflows, error-free communication, and a Salesforce experience that truly works its magic for you! Happy developing! Author BioAndy Forbes stands at the vanguard of technological innovation. With a career spanning four decades, he is currently focusing his robust IT background on exploring the transformative potential of artificial intelligence, particularly Generative AI, on Salesforce project delivery. 
Andy has honed a deep expertise in CRM and project management which is backed by his repertoire of ITIL and Salesforce certifications. Andy's eight-year tenure at a Global Systems Integrator, coupled with his entrepreneurial spirit, has propelled him to the leadership of multiple Salesforce projects for Fortune 500 clients.Philip Safir is a consulting executive and business architect. He serves enterprises via delivery of technology roadmaps, process improvement, and solutions on the Salesforce platform. As the Head of Salesforce Professional Services Delivery & Talent for a Global Systems Integrator, he is responsible for a team of 250 consultants and a $100M portfolio. His career spans Fortune 500, start-up, and international clients across various industry domains like Manufacturing, Retail, Financial Services, Telecom, and so on.Joseph Kubon, an experienced Solution Architect for global enterprise deliveries, Salesforce MVP with 40+ Salesforce certifications and inventor (holding several patents), navigates the realms of manufacturing, health and media industries with a results-driven approach. Skilled in Agile methodologies, Business Process, and Architecture values, he carries a toolkit replete with Salesforce configuration and customization expertise for groundbreaking development. Guided by the wisdom that 'just because it can be built doesn't mean it should be', Joseph embraces the multiplicity of solutions to tomorrow's challenges measuring success with his “Time to Value” principles.Francisco Fálder is a seasoned Salesforce maven and master of digital transformation. His career stands as a testament to a commitment to delivering top-tier, complex projects, seamlessly merging business and tech to deliver standout customer experiences. With every project, Paco re-imagines and redefines the digital landscape, fostering an environment where innovation is not only encouraged but celebrated. Passionate about the ever-evolving tech world, he has honed a deep-rooted affinity for Artificial Intelligence, Continuous Integration/Continuous Deployment (CI/CD), and Agile methodologies.

What are Slowly changing Dimensions (SCD) and why you need them in your Data Warehouse?

Savia Lobo
07 Dec 2017
8 min read
[box type="note" align="" class="" width=""]Below given post is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. The book is a quick guide to explore Informatica PowerCenter and its features such as working on sources, targets, transformations, performance optimization, and managing your data at speed. [/box] Our article explores what Slowly Changing Dimensions (SCD) are and how to implement them in Informatica PowerCenter. As the name suggests, SCD allows maintaining changes in the Dimension table in the data warehouse. These are dimensions that gradually change with time, rather than changing on a regular basis. When you implement SCDs, you actually decide how you wish to maintain historical data with the current data. Dimensions present within data warehousing and in data management include static data about certain entities such as customers, geographical locations, products, and so on. Here we talk about general SCDs: SCD1, SCD2, and SCD3. Apart from these, there are also Hybrid SCDs that you might come across. A Hybrid SCD is nothing but a combination of multiple SCDs to serve your complex business requirements. Types of SCD The various types of SCD are described as follows: Type 1 dimension mapping (SCD1): This keeps only current data and does not maintain historical data. Note : Use SCD1 mapping when you do not want history of previous data. Type 2 dimension/version number mapping (SCD2): This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_VERSION_NUMBER) by maintaining the version number in the table to track the changes. We use a new column PM_PRIMARYKEY to maintain the history. Note : Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using a version number. Consider there is a column LOCATION in the EMPLOYEE table and you wish to track the changes in the location on employees. Consider a record for Employee ID 1001 present in your EMPLOYEE dimension table. Steve was initially working in India and then shifted to USA. We are willing to maintain history on the LOCATION field. Type 2 dimension/flag mapping: This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_CURRENT_FLAG) by maintaining the flag in the table to track the changes. We use a new column PRIMARY_KEY to maintain the history. Note : Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using a flag. Let's take an example to understand different SCDs. Type 2 dimension/effective date range mapping: This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using two new columns (PM_BEGIN_DATE and PM_END_DATE) by maintaining the date range in the table to track the changes. We use a new column PRIMARY_KEY to maintain the history. Note : Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using start date and end date. Type 3 Dimension mapping: This keeps current as well as historical data in the table. We maintain only partial history by adding a new column PM_PREV_COLUMN_NAME, that is, we do not maintain full history. Note: Use SCD3 mapping when you wish to maintain only partial history. 
EMPLOYEE_ID NAME LOCATION 1001 STEVE INDIA Your data warehouse table should reflect the current status of Steve. To implement this, we have different types of SCDs. SCD1 As you can see in the following table, INDIA will be replaced with USA, so we end up having only current data, and we lose historical data: PM_PRIMARY_KEY EMPLOYEE_ID NAME LOCATION 100 1001 STEVE USA Now if Steve is again shifted to JAPAN, the LOCATION data will be replaced from USA to JAPAN: PM_PRIMARY_KEY EMPLOYEE_ID NAME LOCATION 100 1001 STEVE JAPAN The advantage of SCD1 is that we do not consume a lot of space in maintaining the data. The disadvantage is that we don't have historical data. SCD2 - Version number As you can see in the following table, we are maintaining the full history by adding a new record to maintain the history of the previous records: PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_VERSION_NUMBER 100 1001 STEVE INDIA 0 101 1001 STEVE USA 1 102 1001 STEVE JAPAN 2 200 1002 MIKE UK 0 We add two new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID (supposed to be the primary key) column, and PM_VERSION_NUMBER to understand current and history records. SCD2 - FLAG As you can see in the following table, we are maintaining the full history by adding new records to maintain the history of the previous records:   PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_CURRENT_FLAG 100 1001 STEVE INDIA 0 101 1001 STEVE USA 1 We add two new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_CURRENT_FLAG to understand current and history records. Again, if Steve is shifted, the data looks like this: PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_CURRENT_FLAG 100 1001 STEVE INDIA 0 101 1001 STEVE USA 0 102 1001 STEVE JAPAN 1 SCD2 - Date range As you can see in the following table, we are maintaining the full history by adding new records to maintain the history of the previous records: PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_BEGIN_DATE PM_END_DATE 100 1001 STEVE INDIA 01-01-14 31-05-14 101 1001 STEVE USA 01-06-14 99-99-9999 We add three new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_BEGIN_DATE and PM_END_DATE to understand the versions in the data. The advantage of SCD2 is that you have complete history of the data, which is a must for data warehouse. The disadvantage of SCD2 is that it consumes a lot of space. SCD3 As you can see in the following table, we are maintaining the history by adding new columns: PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_PREV_LOCATION 100 1001 STEVE USA INDIA An optional column PM_PRIMARYKEY can be added to maintain the primary key constraints. We add a new column PM_PREV_LOCATION in the table to store the changes in the data. As you can see, we added a new column to store data as against SCD2,where we added rows to maintain history. If Steve is now shifted to JAPAN, the data changes to this: PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_PREV_LOCATION 100 1001 STEVE JAPAN USA As you can notice, we lost INDIA from the data warehouse, that is why we say we are maintaining partial history. Note : To implement SCD3, decide how many versions of a particular column you wish to maintain. Based on this, the columns will be added in the table. SCD3 is best when you are not interested in maintaining the complete but only partial history. 
The drawback of SCD3 is that it doesn't store the full history. At this point, you should be very clear about the different types of SCDs. We need to implement these concepts practically in Informatica PowerCenter. Informatica PowerCenter provides a utility called wizard to implement SCD. Using the wizard, you can easily implement any SCD. In the next topics, you will learn how to use the wizard to implement SCD1, SCD2, and SCD3. Before you proceed to the next section, please make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. Wizard creates a mapping using all these transformations to implement the SCD functionality. When we implement SCD, there will be some new records that need to be loaded into the target table, and there will be some existing records for which we need to maintain the history. Note : The record that comes for the first time in the table will be referred to as the NEW record, and the record for which we need to maintain history will be referred to as the CHANGED record. Based on the comparison of the source data with the target data, we will decide which one is the NEW record and which is the CHANGED record. To start with, we will use a sample file as our source and the Oracle table as the target to implement SCDs. Before we implement SCDs, let's talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD. Extract all records from the source. Look up on the target table, and cache all the data. Compare the source data with the target data to flag the NEW and CHANGED records. Filter the data based on the NEW and CHANGED flags. Generate the primary key for every new row inserted into the table. Load the NEW record into the table, and update the existing record if needed. In this article we concentrated on a very important table feature called slowly changing dimensions. We also discussed different types of SCDs, i.e., SCD1, SCD2, and SCD3. If you are looking to explore more in Informatica Powercentre, go ahead and check out the book Learning Informatica Powercentre 10.x.  
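The six-step logic above is what the Informatica wizard generates for you, but it can help to see it outside the tool. The following is a minimal, illustrative Python sketch of SCD2 (date-range) processing; the column names mirror the tables above, while the sample data and function are invented for illustration. It is not Informatica code and only makes the NEW/CHANGED decision concrete.

from datetime import date

HIGH_DATE = date(9999, 12, 31)   # open-ended PM_END_DATE for the current row

# Invented sample data: current target rows and an incoming source extract
target = [
    {"PM_PRIMARYKEY": 100, "EMPLOYEE_ID": 1001, "NAME": "STEVE",
     "LOCATION": "INDIA", "PM_BEGIN_DATE": date(2014, 1, 1), "PM_END_DATE": HIGH_DATE},
]
source = [
    {"EMPLOYEE_ID": 1001, "NAME": "STEVE", "LOCATION": "USA"},   # CHANGED record
    {"EMPLOYEE_ID": 1002, "NAME": "MIKE", "LOCATION": "UK"},     # NEW record
]

def load_scd2(source, target, load_date):
    """Flag each source row as NEW or CHANGED and apply SCD2 date-range logic."""
    # Cache the current target rows, like the lookup transformation does
    current = {row["EMPLOYEE_ID"]: row for row in target
               if row["PM_END_DATE"] == HIGH_DATE}
    # Stand-in for the sequence generator transformation
    next_key = max(row["PM_PRIMARYKEY"] for row in target) + 1

    for row in source:
        existing = current.get(row["EMPLOYEE_ID"])
        if existing is None:
            pass                                      # NEW record: just insert
        elif existing["LOCATION"] != row["LOCATION"]:
            existing["PM_END_DATE"] = load_date       # CHANGED: expire old version
        else:
            continue                                  # unchanged: nothing to do

        target.append({"PM_PRIMARYKEY": next_key, **row,
                       "PM_BEGIN_DATE": load_date, "PM_END_DATE": HIGH_DATE})
        next_key += 1
    return target

for row in load_scd2(source, target, date(2014, 6, 1)):
    print(row)

Running this expires Steve's INDIA row by setting its PM_END_DATE, inserts a new USA version with an open-ended date range, and inserts Mike as a brand-new record, which is exactly the behavior the wizard-generated SCD2 mapping implements with lookup, expression, filter, update strategy, and sequence generator transformations.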