Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

Tech Guides

852 Articles
article-image-do-you-need-to-be-polyglot-great-programmer
Amit Kothari
19 Jan 2018
6 min read
Save for later

Do you need to be a polyglot to be a great programmer?

Amit Kothari
19 Jan 2018
6 min read
Recently, I was talking to someone who has been working as a developer for over a year. They asked me which programming languages they should learn in order to improve their employability and to grow as a developer. This made me think: Do we really need to be a polyglot to be a good programmer? A polyglot programmer is someone who can write code in multiple languages. Most of us are already using multiple programming languages. Someone working on web apps uses HTML, CSS, and JavaScript. Similarly, backend services might be written in a specific language, but the developer might still be using SQL for database queries or YAML for configuration files. As developers, we like to try and learn new programming languages and frameworks. We do this for many reasons, to solve specific problems, to find a better alternative, or simply to keep ourselves up to date with what's new and trending. The benefits of being a polyglot programmer There are obvious benefits of being a polyglot developer. It increases your employability. Being proficient in multiple languages looks very good on your resume. It shows your experience as a developer and also indicates that you are flexible, able to work with different tools in different situations. It provides you with more opportunities and greater variety. When you’re looking for a new job or maybe even in your current role, if you are able to write code in multiple languages many more opportunities open up to you. When you're a polyglot you become much more in control of your career destiny! Developer happiness. Many developers simply feel more productive when they are using a specific language. But to know what you enjoy, you need to be open minded and willing to explore lots of different languages. Polyglots get to try out different syntaxes, get to know different communities – and this exploration is surely one of the best things about being a developer. Along with all these benefits, working with different languages give us a chance to learn about different programming paradigms. We can learn different ways of solving a problem and different ways of thinking. We can then bring all this learning together to write better code. The challenges While there are many benefits of learning and knowing multiple programming languages, this constant learning comes with its own challenges. Lack of proficiency: In his book "JavaScript: The Good Parts," Douglas Crockford talks about good and bad parts of JavaScript. Similarly, other languages also have certain aspects that should be approached with caution. If you’re frequently changing programming languages without spending enough time to learn one properly, you might run into issues around things like performance and security. Maintenance becomes a nightmare. Having too many languages in a tech stack will likely become a maintenance nightmare for both the development and the operations side. This will take you somewhere that is the opposite of agile and efficient. Developer fatigue. Constantly learning and adapting to new languages and technology may result in developer fatigue. It’s a fact of tech today that developers feel stressed and under pressure – this is bound to affect not only their productivity but their health as well. From an organization's perspective, there are tradeoffs when adding a new language to their tech stack. There may be operational costs and costs to up-skill the team. On the upside, code quality and productivity may improve. Companies who avoid investing in up-skilling their teams and upgrading their tech stack may end up with systems that are difficult to maintain. Even small changes may take weeks to deliver and finding skilled developers can become challenging. On the other hand, constantly changing programming languages and technology may result in features not getting delivered for months; in some cases years. There are many cases where a project started in one programming language and after years of development, the team decided to rewrite the whole system in a newer language or framework. While architectures like microservices solve some of these problems by allowing us to write different parts of a given system in different languages without needing to rewrite the whole system, it is important to understand the cost of introducing a new language. The benefits we get out of it should always outweigh the cost. "Any fool can write code that a computer can understand. Good programmers write code that humans can understand." - Martin Fowler How to become a better developer Learning different programming languages is one way to grow as a developer but there are others things we can do to improve. Write clean code. As developers, we spend more time reading code than writing it. Writing code that is easy to read and understand is one of the key traits of a good developer. Write easy to maintain code. A good programmer puts in extra effort to make sure that the code is easy to maintain. Use design principles and test-driven development to make sure that the code can be modified with ease and confidence that it is not going to affect the existing functionality. Understand the problem. A good developer will try and understand the problem and then pick the appropriate tool to solve the problem instead of starting with a technology just because it's trending. There are lots of obvious advantages of learning multiple programming languages. Not only does it look good on a resume, it also helps you to improve as a developer. However, it is just as important to understand the business problems you’re trying to solve. Whether you’re a polyglot or not, the most important thing any developer can do is focus on the problems instead of the tool. I hope you enjoyed this post; please let us know what you think! Are you a polyglot? Do you think trying to become one is important today? Amit Kothari is a full stack software developer based in Melbourne, Australia. He has 10+ years experience in designing and implementing software mainly in Java/JEE. His recent experience is in building web applications using JavaScript frameworks like React and AngularJS and backend micro services/ REST API in Java. He is passionate about lean software development and continuous delivery.
Read more
  • 0
  • 2
  • 12699

article-image-is-novelty-ruining-web-development
Antonio Cucciniello
17 Jan 2018
5 min read
Save for later

Is novelty ruining web development?

Antonio Cucciniello
17 Jan 2018
5 min read
If you have been paying attention to the world of web development lately, it can seem a little chaotic. There are brand new frameworks and libraries that come out each and every day. These frameworks are sometimes related to previous ones that recently came out, or they are attempting to develop something entirely new. As new technologies emerge, they change things that could have been standard for a long time. With these changes happening fairly often, it can be beneficial or frustrating, depending on the situation. Let's take a look at why the creation of new technologies in web development could be a benefit for some developers, or a negative for others. Why change and novelty in web development is awesome Let's first take a look at how the rapid changes in web development can be a wonderful thing for some developers. New tools and techniques to learn With new tech constantly emerging, you will always have something new to learn as a developer. This keeps the field interesting (at least for me and other developers I know that actually like the field). It allows you to continuously add to your skillset as well. You will constantly be challenged with the newer frameworks when learning them, which will help you learn future technologies faster. Having this skill of being a constant learner is crucial in a field that is always improving. Competition When there are a high number of frameworks that do similar things, the best ones will be the ones that are used by the majority of people. For instance, there are tons of front-end frameworks like React and Angular, but React and Angular are the ones that survive simply because of their popularity and ease of use. This is similar to how capitalism works: Only the best will survive. This creates a culture of innovation in the web development community and causes even more tech to be developed, but at a higher quality. Even better products A large amount of technology being released in a short period of time allows for developers to develop creative and stunning web pages using various combinations of technologies working together. If websites are stunning and easy to use, businesses are more likely to get customers to use their products. If customers are more likely to use products, that probably means they are spending money and therefore growing the economy. Who does not love awesome products anyway? Workflows become more efficient and agile When better web development tools are created, it becomes easier for other web developers out there to create their own web apps. Usually newer technologies present a brand new way of accomplishing something that happened to previously be more difficult. With this increased ability it allows you to build on top of the shoulder of giants, allowing new developers to create something that previously was too difficult or time consuming. Why change and novelty in web development is a pain Now let's take a look at how the ever-changing state of web development can be a bad thing for web developers. New tools require more learning time With each new technology, the user must learn exactly how it works and how it can even benefit their company or project. There is some time in the beginning that must be spent on actually figuring out how to get the new technology to work. Depending on the documentation, this can sometimes be easier than others, but that extra time can definitely hurt if you are attempting to hit a hard deadline. Identifying risk v. reward can be a challenge With attempting something new, there is always a risk involved. It can turn out that this framework will take up a large portion of your time to implement, and it may only give you a minor performance increase, or minor reduction in development time. You must make this tradeoff yourself. Sometimes it is worth it, other times it definitely is not. Support lifespans are getting shorter for many tools Something that is not popular or widely used will tend to lose support. You may have been an early adopter, when you thought that this technology would be great. Just because the technology was supported today, does not mean it will be supported in the future, when you plan on using it. Support can sometimes make or break the usage of an application and it can sometimes be safer to go with a more stable framework. In my opinion, an ever-changing web development landscape is a good thing, and you just need to keep up. But I've attempted to give you both sides of the coin in order for you to make a decision on your own. Antonio Cucciniello is a Software Engineer with a background in C, C++ and JavaScript (Node.js) from New Jersey. His most recent project called Edit Docs is an Amazon Echo skill that allows users to edit Google Drive files using your voice. He loves building cool things with software, reading books on self-help and improvement, finance, and entrepreneurship. Follow him on twitter @antocucciniello, and follow him on GitHub.
Read more
  • 0
  • 0
  • 18401

article-image-ai-rescue-5-ways-machine-learning-can-assist-emergency-situations
Sugandha Lahoti
16 Jan 2018
9 min read
Save for later

AI to the rescue: 5 ways machine learning can assist during emergency situations

Sugandha Lahoti
16 Jan 2018
9 min read
At the wee hours of the night, on January 4 this year, over 9.8 million people experienced a magnitude 4.4 earthquake that rumbled across the San Francisco Bay Area. This was followed by a magnitude 7.6 earthquake in the Caribbean sea on January 9, following which a tsunami advisory was in effect for Puerto Rico and the U.S. and British Virgin Islands. In the past 6 months, the United States alone has witnessed four back-to-back storms from one brutal hurricane season, and a massive wildfire with almost 2 million acres of land ablaze. Natural Disasters across the globe are increasingly becoming more damaging and frequent. Since 1970, the number of disasters worldwide has more than quadrupled to around 400 a year. These series of natural disasters have strained the emergency services and disaster relief operations beyond capacity. We now need to look for newer ways to assist the affected people and automate the recovery process. Artificial intelligence and machine learning have advanced to the state where they are highly proficient in making predictions, and in identification and classification tasks. These use cases of AI can also be applied to prevent disasters or respond quickly, in case of an emergency.   Here are 5 ways how AI can lend a helping hand during emergency situations. 1. Machine Learning for targeted disaster relief management In case of any disaster, the first step is to formulate a critical response team to help those in distress. Before the team goes into action, it is important to analyze and assess the extent of damage and to ensure that the right aid goes first to those who need it the most. AI techniques such as image recognition and classification can be quite helpful in assessing the damage as they can analyze and observe images from the satellites. They can immediately and efficiently filter these images, which would have required months to be sorted manually. AI can identify objects and features such as damaged buildings, flooding, blocked roads from these images. They can also identify temporary settlements which may indicate that people are homeless, and so the first care could be directed towards them.   Artificial intelligence and machine learning tools can also aggregate and crunch data from multiple resources such as crowd-sourced mapping materials or Google maps. Machine learning approaches then combine all this data together, remove unreliable data, and identify informative sources to generate heat maps. These heat maps can identify areas in need of urgent assistance and direct relief efforts to those areas. Heat maps are also helpful for government and other humanitarian agencies in deciding where to conduct aerial assessments. DigitalGlobe provides space imagery and geospatial content. Their Open Data Program is a special program for disaster response. The software learns how to recognize buildings on satellite photos by learning from the crowd. DigitalGlobe releases pre- and post-event imagery for select natural disasters each year, and their crowdsourcing platform, Tomnod, will prioritize micro-tasking to accelerate damage assessments. Following the Nepal Earthquakes in 2015, Rescue Global and academicians from the Orchid Project used machine learning to carry out rescue activities. They took pre and post-disaster imagery and utilized crowd-sourced data analysis and machine learning to identify locations affected by the quakes that had not yet been assessed or received aid. This information was then shared with relief workforces to facilitate their activities. 2. Next Generation 911 911 is the first source of contact during any emergency situation. 911 dispatch centers are already overloaded with calls on a regular day. In case of a disaster or calamity, the number gets quadrupled, or even more. This calls for augmenting traditional 911 emergency centers with newer technologies for better management. Traditional 911 centers rely on voice-based calls alone. Next-gen dispatch services are upgrading their emergency dispatch technology with machine learning to receive more types of data. So now they can ingest the data from not just calls but also from text, video, audio, and pictures, to analyze them to make quick assessments. The insights gained from all this information can be passed on to the emergency response teams out in the field to efficiently carry out critical tasks. The Association of Public-Safety Communications (APCO) have employed IBM’s Watson to listen to 911 calls. This initiative is to help emergency call centers improve operations and public safety by using Watson’s speech-to-text and analytics programs. Using Watson's speech-to-text function, the context of each call is fed into the AI's analytics program allowing improvements in how call centers respond to emergencies. It also helps in reducing call times, provide accurate information, and help accelerate time-sensitive emergency services. 3. Sentiment analysis on social media data for disaster management and recovery Social media channels are a major source of news in present times. Some of the most actionable information, during a disaster, comes from social media users. Real-time images and comments from Facebook, Twitter, Instagram, and YouTube can be analyzed and validated by AI to filter real information from fake ones. These vital stats can help on-the-ground aid workers to reach the point of crisis sooner and direct their efforts to the needy. This data can also help rescue workers in reducing the time needed to find victims. In addition, AI and predictive analytics software can analyze digital content from Twitter, Facebook, and Youtube to provide early warnings, ground-level location data, and real-time report verification. In fact, AI could also be used to view the unstructured data and background of pictures and videos posted to social channels and compare them to find missing people. AI-powered chatbots can help residents affected by a calamity. The chatbot can interact with the victim, or other citizens in the vicinity via popular social media channels and ask them to upload information such as location, a photo, and some description. The AI can then validate and check this information from other sources and pass on the relevant details to the disaster relief committee. This type of information can assist them with assessing damage in real time and help prioritize response efforts. AI for Digital Response (AIDR) is a free and open platform which uses machine intelligence to automatically filter and classify social media messages related to emergencies, disasters, and humanitarian crises. For this, it uses a Collector and a Tagger. The Collector helps in collecting and filtering tweets using keywords and hashtags such as "cyclone" and "#Irma," for example. The Collector works as a word-filter. The Tagger is a topic-filter which classifies tweets by topics of interest, such as "Infrastructure Damage," and "Donations," for example. The Tagger automatically applies the classifier to incoming tweets collected in real-time using the Collector. 4. AI answers distress and help-calls Emergency relief services are flooded with distress and help calls in the event of any emergency situation. Managing such a huge amount of calls is time-consuming and expensive when done manually. The chances of a critical information being lost or unobserved is also a possibility. In such cases, AI can work as a 24/7 dispatcher. AI systems and voice assistants can analyze massive amounts of calls, determine what type of incident occurred and verify the location. They can not only interact with callers naturally and process those calls, but can also instantly transcribe and translate languages. AI systems can analyze the tone of voice for urgency, filtering redundant or less urgent calls and prioritizing them based on the emergency. Blueworx is a powerful IVR platform which uses AI to replace call center officials. Using AI technology is especially useful when unexpected events such as natural disasters drive up call volume. Their AI engine is well suited to respond to emergency calls as unlike a call center agent, it can know who a customer is even before they call. It also provides intelligent call routing, proactive outbound notifications, unified messaging, and Interactive Voice Response. 5. Predictive analytics for proactive disaster management Machine learning and other data science approaches are not limited to assisting the on-ground relief teams or assisting only after the actual emergency. Machine learning approaches such as predictive analytics can also analyze past events to identify and extract patterns and populations vulnerable to natural calamities. A large number of supervised and unsupervised learning approaches are used to identify at-risk areas and improve predictions of future events. For instance, clustering algorithms can classify disaster data on the basis of severity. They can identify and segregate climatic patterns which may cause local storms with the cloud conditions which may lead to a widespread cyclone. Predictive machine learning models can also help officials distribute supplies to where people are going, rather than where they were by analyzing real-time behavior and movement of people. In addition, predictive analytics techniques can also provide insight for understanding the economic and human impact of natural calamities. Artificial neural networks take in information such as region, country, and natural disaster type to predict the potential monetary impact of natural disasters. Recent advances in cloud technologies and numerous open source tools have enabled predictive analytics with almost no initial infrastructure investment. So agencies with limited resources can also build systems based on data science and develop more sophisticated models to analyze disasters.   Optima Predict, a suite of software by Intermedix collects and reads information about disasters such as viral outbreaks or criminal activity in real time. The software spots geographical clusters of reported incidents before humans notice the trend and then alerts key officials about it. The data can also be synced with FirstWatch, which is an online dashboard for an EMS (Emergency Medical Services) personnel. Thanks to the multiple benefits of AI, government agencies and NGOs can start utilizing machine learning to deal with disasters. As AI and allied fields like robotics further develop and expand, we may see a fleet of drone services, equipped with sophisticated machine learning. These advanced drones could expedite access to real-time information at disaster sites using video capturing capabilities and also deliver lightweight physical goods to hard to reach areas. As with every progressing technology, AI will also build on its existing capabilities. It has the potential to eliminate outages before they are detected and give disaster response leaders an informed, clearer picture of the disaster area, ultimately saving lives.
Read more
  • 0
  • 0
  • 29793

article-image-how-ai-is-changing-game-development
Raka Mahesa
16 Jan 2018
4 min read
Save for later

How AI is changing game development

Raka Mahesa
16 Jan 2018
4 min read
Artificial intelligence is transforming many areas, but one of the most interesting and underrated is gaming. It’s not that surprising; after all, an AI last year beat the world's top Go player, and it’s not a giant leap to see how artificial intelligence might be used in modern video gaming. However, things aren’t straightforward. This is because applying artificial intelligence to a video game is different from artificial intelligence applied to the development process of video game. Before we talk further about this, let's first discuss artificial intelligence by itself. What exactly is artificial intelligence? Artificial intelligence is, to put it simply, the ability of machines to learn and apply knowledge. Learning and applying knowledge is such a broad scope though. Being able to do simple math calculations is learning and applying arithmetic knowledge, but no one would call that capability artificial intelligence. Artificial intelligence is a shifting field Surprisingly, there is no definite scope for the kind of knowledge that artificial intelligence must have. It's a moving goalpost, the scope is updated whenever a new breakthrough in the field of artificial intelligence has occurred. There is a saying in the artificial intelligence community that "AI is whatever hasn't been done yet," meaning that whenever an AI problem is solved and figured out, it no longer counts as artificial intelligence and will be branched out to its own field. This has happened to searching algorithm, path finding, optical character recognition, and so on.  So, in short, we have a divide here. When experts are saying artificial intelligence, they usually mean the most cutting-edge part of artificial intelligence like neural networks or advanced voice recognition. Meanwhile, problems that are seen as solved, like path finding, are usually no longer being researched and not counted as artificial intelligence anymore. And it's also why all the advancement that has happened to artificial intelligence doesn't really impact video games. Yes, video games have many, many applications of artificial intelligence, but most of those applications are limited to problems that have already been solved like path finding and decision making. Meanwhile, the latest artificial intelligence technique like machine learning doesn't really have a place in video games. Things are a bit different though for video game development. Even though video games themselves don't use the latest and greatest artificial intelligence, their creation is still an engineering process that can be further aided with the advancement in AI. Do note that most of the artificial intelligence used for game development is still being explored and still in an experimental phase.  There has been research done recently about using an intelligent agent to learn about level layout of a game and then using that trained agent to layout another level. Having an artificial intelligence handle the level creation aspect of a game development will definitely increase the development speed because level designers now can just tweak existing layout instead of starting from scratch. Automation in game development  Another aspect of video game development that already uses a lot of automation and can be aided further by artificial intelligence is quality assurance (QA). By having an artificial intelligence that can learn how to play the game will great assist developers when they need to check for bugs and other issues. But of course, artificial intelligence can only detect stuff that can be measured like bugs and crashes, and they won't be able to see if the game is fun or not. Using AI to improve game design  Having an AI that can automatically play a video game isn't only good for testing purposes, but it could also help improve game design. There was a mobile game studio that uses artificial intelligence to mimic human behavior so they can determine the difficulty of the game. By checking how the artificial intelligence performs on a certain level, designers can project how real players would perform on that level and adjust the difficulty properly.  And that's how AI is changing game development. Despite all the advancement, there is still plenty of work to be done for artificial intelligence to truly assist game developers in their work. That said, game development is one of those jobs that don't need to worry about being replaced by robots, because more than anything, game development is a creative endeavor that only humans can do. Well, so far at least. Raka Mahesa is a game developer at Chocoarts (http://chocoarts.com/) who is interested in digital technology in general. Outside of work, he enjoys working on his own projects, with Corridoom VR being his latest relesed gme.
Read more
  • 0
  • 0
  • 24130

article-image-why-data-science-needs-great-communicators
Erik Kappelman
16 Jan 2018
4 min read
Save for later

Why data science needs great communicators

Erik Kappelman
16 Jan 2018
4 min read
One of the biggest problems facing data science (and many other technical industries) today is communication. This is true on both an individual level, but at a much bigger organizational and cultural level. On the one hand, we can all be better communicators, but at the same time organizations and businesses can do a lot more to facilitate knowledge and information sharing. At an individual level, it’s important to recognize that some people find communication very difficult. Obviously it’s a cliché that many of these people find themselves in technical industries, and while we shouldn’t get stuck on stereotypes, there is certainly an element of truth in it. The reasons why this might be the case is incredibly complex, but it may be true that part of the problem is how technology has been viewed within institutions and other organizations. This is the sort of attitude that says “those smart people just have bad social skills. We should let them do what they're good at and leave them alone.” There are lots of problems with this and it isn’t doing anyone any favors, from the people that struggle with communication to the organizations who encourage this attitude. Statistics and communicating insights Let’s take a field like statistics. There is a notion that you do not need to be good at communicating to be good at statistics; it is often viewed as a primarily numerical and technical skill. However, when you think about what statistics really is, it becomes clear that that is nonsensical. The primary purpose of the field is to tease out information and insights from noisy data and then communicate those insights. If you don’t do that you’re not doing statistics. Some forms of communication are inherent to statistical research; graphs and charts communicate the meaning of data and most statisticians or data scientists have a well worn skill of chart making. But there’s more than just charts – great visualizations, great presentations can all be the work of talented statisticians.  Of course, there are some data-related roles where communication is less important. If you’re working on data processing and storage, for example, being a great communicator may not be quite as valuable. But consider this: if you can’t properly discuss and present why you’re doing what you’re doing to the people that you work with and the people that matter in your organization you’re immediately putting up a barrier to success. The data explosion makes communication even more important There is an even bigger reason data science needs great communicators and it has nothing to do with individual success. We have entered what I like to call the Data Century. Computing power and tools using computers, like the Internet, hit a sweet spot somewhere around the new millennium and the data and analysis now available to the world is unprecedented. Who knows what kind of answers this mass of data holds? Data scientists are at the frontier of the greatest human exploration since the settling of the New World. This exploration is faced inward, as we try to understand how and why human beings do various things by examining the ever growing piles of data. If data scientists cannot relay their findings, we all miss out on this wonderful time of exploration and discovery. People need data scientists to tell them about the whole new world of data that we are just entering. It would be a real shame if the data scientists didn’t know how. Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for the Department of Transportation as a transportation demand modeler.
Read more
  • 0
  • 0
  • 14300

article-image-10-tools-that-will-improve-your-development-workflow
Owen McGab
15 Jan 2018
6 min read
Save for later

10 tools that will improve your web development workflow

Owen McGab
15 Jan 2018
6 min read
Web development is one part of a larger project. Devs work away in the background to make sure everything put together by designers, copywriters and marketers looks pretty and works well. Throughout each stage of a project, developers have to perform their own tasks efficiently while ensuring other members are kept in the loop. Therefore, any tools that can help improve your agency’s workflow management process and simplify task management are greatly valued. Every developer, as they move through the ranks from junior to senior status, relies on their stack — a collection of go-to tools to help them get stuff done. Here are 10 tools that you might want to incorporate into a stack of your own. Slack Slack is the outstanding platform for team communication. The tool is designed to simplify messaging allowing users to communicate between groups (Slack allows the creation of multiple conversation “channels” that can be used for different departments or certain aspects of a project) and upload files within conversations. Another great thing about the platform is its integration with other apps — over 600 of them, from bots to project management to developer tools. Apps mean you can perform tasks without having to leave the platform, which is a wonderful thing for productivity. Dropbox When it comes to cloud storage, Dropbox stands tall. It was the first major folder syncing platform and — despite obvious competition from the likes of Google Drive, iCloud, OneDrive and the rest — it remains the best. It's simple, secure, reliable, feature-packed, and great for collaboration. For developers that work in teams, across devices or remotely, Dropbox is essential for seamless access to files. Work can easily be shared, teams managed, and feedback provided from within the platform, improving productivity and ensuring tasks stay on track. It’s also one of the few platforms to work across iOS, Android, Windows, Linux and BlackBerry devices, so no one misses out. GitLab GitLab is one of the best Git repository managers around. The open-source platform provides a cloud-based location where development projects can be stored, tested and edited between teams. If you’re working with another dev on your code, this is the best place from which to do it. While GitHub is also popular with developers, GitLab’s additional features like snippet support (for sharing small parts of code without sharing the entire project), Work in Progress (WIP) labelling, project milestones, and authentication levels, as well as its slick user-interface, make it a superior choice. Postman for Chrome Postman is a free Chrome app designed to make testing and creating APIs easy. And it does a brilliant job at it. Hey, 3.5 million devs can’t be wrong! It’s GUI makes defining HTTP requests (and analyzing responses) more pleasant than it has any right to be, which can’t always be said of similar tools.   Use it in your own work, or when collaborating with larger teams. Just make sure you keep Postman handy whenever you’re working with web apps. Trello Trello is a project management tool that’ll help you keep everything in order, and on schedule. There’s nothing flashy about it, but that's its biggest selling point. By arranging tasks into lists and cards, it makes it easy to manage everything from solo projects to large-scale assignments involving multiple (up to 10) team members. Users can tag each other in cards, leave comments, create checklists, and set deadlines. And tasks can also be color coded for easy identification. Give it a try, you’ll be glad you did. Bitbucket The makers of Trello also created Bitbucket, a version control system that promotes collaboration and scales with your business. The best thing about Bitbucket, besides being able to work with team members (wherever in the world they happen to be), is the ability to leave comments directly in the source code — how about that for productivity? What’s more, it also integrates with Slack, so that Bitbucket notifications show up in channels. Clipy Clipy is a clipboard extension for Mac OS. Nothing special about that you might think. But it’s not until you use the tool that you appreciate its true value. Clipy lets you copy multiple plain text and image items to the clipboard, and offers a simple shortcut menu to bring up clipboard history in an instant. It’s simple, unobtrusive, and no other utility comes close. Plus, it’s free! Brackets Brackets is “a modern, open-source text editor that understands web design”. Most text editors are boring, this one isn’t. By using visual tools in the editor, Brackets makes writing and editing code an easier, more aesthetically pleasing task. It’s written in a combination of CSS, HTML and JavaScript, giving anyone looking to hack or extend it, the opportunity to work in a language they’re well versed in. If you work on front-end applications, makes this tool part of your development arsenal. Grunt Grunt is a JavaScript task runner that allows repetitive tasks to be automated. Time-consuming things like compilation, unit testing, and minification can all be automated without interfering with overall workflow, giving you more time to focus on other, more fun, aspects of a project. Grunt boasts hundreds of plugins and continues to grow by the day, giving you the chance to run every boring task you can think of without any effort. It’s a workflow essential! Yeoman Finally, there’s Yeoman, a development stack featuring all the tools you need to build and adapt web apps. Yeoman features three types of tools to make building easy, thus improving productivity. Firstly, the ‘yo’ scaffolding tool puts the, er, scaffolding in place, collating any build tasks and package manager dependencies that you need. Then, the build tool (Grunt) is used for the heavy lifting — building, previewing, and testing, while the package manager manages dependencies, remove the need for you to do that side of things manually. Of course, a web development stack can, and probably will, consist of more than just 10 tools. But by adding the tools listed here to any that you already use, you’ll notice a definite improvement in how efficiently and effectively work gets done.
Read more
  • 0
  • 0
  • 21336
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-2018-year-opportunity-tech-training
James Cronin
08 Jan 2018
3 min read
Save for later

2018 is a year of opportunity and change in tech training

James Cronin
08 Jan 2018
3 min read
Why is open source software changing the enterprise? Emerging technologies are increasingly relevant in the enterprise world. Thanks to the growth of the internet and open source technology, a market once controlled solely by the big vendors has been opened and democratised. Rather than buying and using one vendor technology stack, developers and businesses are now choosing and combining the best tools for the job. As a result, the way technology is purchased and adopted is changing, with new tools (and consequently new skills) in increasing demand. ‘Open Source is the norm’, says Sam Lambert, GitHub’s Senior Director of Infrastructure Engineering. ‘A lot of large enterprises view being open source as an essential way of propagating the use of their technologies, and they're open sourcing stuff quickly.’ With the rapid pace of change, it can be difficult to keep up with what’s happening in the software world. As open source becomes mainstream, things get even more complicated - but also much more exciting. Why open source software is always an opportunity for businesses For businesses and developers today, software engineering isn’t just about putting software to work and shipping code, it’s also about problem solving at an even more fundamental level. Understanding what tools to use for different types of problems, and what stacks to develop and deploy with, creates huge opportunities for businesses in the age of digital transformation. As businesses search for new technologies and skill sets to gain advantage, training must adapt to support them - and herein lies the opportunity. As technology stacks are increasingly driven by choice, efficiency and practical application, training must follow suit. So the opportunity lies in: Understanding the technologies themselves, and crucially why they are relevant for enterprise teams and their objectives Understanding the way developers work and learn, and providing hands on material allowing them to skill up as efficiently as possible How Packt can help businesses leverage open source software Packt was founded to help make sense of this emerging technology, and to provide practical resources to help developers put software to work in new ways. This has given us a great understanding of the technologies themselves and how they are used, as well as the way our customers learn and put their skills into practice. We are capturing and sharing this insight and content with our partners, helping developers and businesses to better understand and utilise these tools and technologies. Through our book & video content, Courseware packs and our Live & On Demand training offering, we are providing practical, expert-written material, giving you everything you need to deliver high impact technical training on the emerging technologies driving digital transformation. If you are interested to talk further about this story of industry change, and to explore these emerging technologies and the opportunity in more detail, feel free to get in touch with me: jamesc@packtpub.com.
Read more
  • 0
  • 0
  • 2942

article-image-machine-learningweb-deeplearn-js
Savia Lobo
02 Jan 2018
4 min read
Save for later

Machine Learning slings its web: Deeplearn.js is here!

Savia Lobo
02 Jan 2018
4 min read
Machine learning has been the talk of the town! With implementations in large number of organizations to carry out prediction and classification tasks. Machine learning is cutting edge in  identifying data and processing it to generate meaningful insights based on predictive analytics. But to leverage machine learning, huge computational resources are required. While many may think of it as rocket science, Google has simplified machine learning access to everyone through Deeplearn.js - an initiative that allows ML to run entirely on a web browser. Deeplearn.js is an open source WebGL- accelerated JS library. This Google PAIR’s initiative (to study and redesign human interactions with ML) aims to make ML available for everyone. This implies that it will not be restricted to specific groups of people such as developers or any businesses implementing it. Deeplearn.js + browser: A perfect match? We can say browsers such as Chrome, Internet explorer, Safari, etc are an integral part of our life as it connects us with the world. Their accessibility feature is visible in PWAs’(Progressive Web Apps) wherein applications can run on browsers without the need to download them. In a similar way, machine learning can be carried out within browsers without the fuss of downloading or installing any computational resources. Wonder how? With Deeplearn.js! Deeplearn.js specifically written in Javascript, is exclusively tailored for machine learning to function on web browsers. It offers an interactive client-side platform which helps them carry out rapid prototyping and visualizations. Machine learning involves rapid computations with huge CPU requirements and is a complete  mismatch for Javascript because of its speed limit. Deeplearn.js is a work-around that allows ML to be implemented using Javascript via the WebGL Javascript API. Additionally, you can use hardware accelerators such as GPUs via the webGL to perform faster and excellent computations with 2D and 3D graphics. Basic Mechanism - The structure of Deeplearn.js is a blend of Tensorflow and NumPy, which are Python-based packages for scientific computing. The NumPy acts as a quick execution model and the TensorFlow API provides a delayed execution model. Though TensorFlow is a fast and scalable framework widely used by researchers and developers. However, creating web applications on the browser with TensorFlow is difficult as it lacks runtime support to create web applications. Deeplearn.js allows TensorFlow model capabilities to be imported on the browser. By using the tools within Deeplearn.js, weights from the TensorFlow model can be exported. Opportunities for business - Traditional businesses shy away from using latest ML tools as computational resources are expensive and complicated. Also, due to the complexities in ML, there is a need to hire a technical expert. Through Deeplearn.js, firms can now easily access advanced ML tools and resources. It can not only help them solve data centric business problems but also additionally provide them with innovative strategies, increased competition and improved advantages to stay ahead of their competitors. Differentiating factor - Deeplearn.js is not the only inbrowser ML library. There are other competing frameworks such as ConvNetJS and Tensorfire, a much recent and almost identical framework to deeplearn.js. A unique feature that differentiates deeplearn.js is its capability to perform faster inference, along with full back propagation. Implementations with Deeplearn.js Performance RNN aids in generating music with expressive timing and dynamics. It has been successfully ported into the browser using the Deeplearn.js environment after being trained in TensorFlow. The training data used was the Yamaha e-Piano Competition dataset, which includes MIDI captures of ~1400 performances by skilled pianists. Teachable Machine is built using Deeplearn.js library. It allows users to teach a machine via a camera with live teaching and without any requirement to code.    Faster Neural Style Transfer algorithm allows in-browser image style transfer. It transfers the style of an image into the content of another image. To explore other practical projects on Deeplearn.js, you may visit the GitHub repository here. Deeplearn.js, with the fusion of Machine learning has opened new opportunities and focus areas for businesses and non-developers. SME’s (Subject Matter Expertise) within a business can now grasp deeper insights on how to achieve desired results with Machine learning. The browser is home for many developments which are yet to be revealed in the future. Deeplearn.js truly is a milestone in bringing the web and ML a step closer. However being at the early stage, it would be exciting to see how it unfolds ML for anyone on the planet.      
Read more
  • 0
  • 0
  • 9779

article-image-15-ways-to-make-blockchains-scalable-secure-and-safe
Aaron Lazar
27 Dec 2017
14 min read
Save for later

15 ways to make Blockchains scalable, secure and safe!

Aaron Lazar
27 Dec 2017
14 min read
[box type="note" align="" class="" width=""]This article is a book extract from  Mastering Blockchain, written by Imran Bashir. In the book he has discussed distributed ledgers, decentralization and smart contracts for implementation in the real world.[/box] Today we will look at various challenges that need to be addressed before blockchain becomes a mainstream technology. Even though various use cases and proof of concept systems have been developed and the technology works well for most scenarios, some fundamental limitations continue to act as barriers to mainstream adoption of blockchains in key industries. We’ll start by listing down the main challenges that we face while working on Blockchain and propose ways to over each of those challenges: Scalability This problem has been a focus of intense debate, rigorous research, and media attention for the last few years. This is the single most important problem that could mean the difference between wider adaptability of blockchains or limited private use only by consortiums. As a result of substantial research in this area, many solutions have been proposed, which are discussed here: Block size increase: This is the most debated proposal for increasing blockchain performance (transaction processing throughput). Currently, bitcoin can process only about three to seven transactions per second, which is a major inhibiting factor in adapting the bitcoin blockchain for processing micro-transactions. Block size in bitcoin is hardcoded to be 1 MB, but if block size is increased, it can hold more transactions and can result in faster confirmation time. There are several Bitcoin Improvement Proposals (BIPs) made in favor of block size increase. These include BIP 100, BIP 101, BIP 102, BIP 103, and BIP 109. In Ethereum, the block size is not limited by hardcoding; instead, it is controlled by gas limit. In theory, there is no limit on the size of a block in Ethereum because it's dependent on the amount of gas, which can increase over time. This is possible because miners are allowed to increase the gas limit for subsequent blocks if the limit has been reached in the previous block. Block interval reduction: Another proposal is to reduce the time between each block generation. The time between blocks can be decreased to achieve faster finalization of blocks but may result in less security due to the increased number of forks. Ethereum has achieved a block time of approximately 14 seconds and, at times, it can increase. This is a significant improvement from the bitcoin blockchain, which takes 10 minutes to generate a new block. In Ethereum, the issue of high orphaned blocks resulting from smaller times between blocks is mitigated by using the Greedy Heaviest Observed Subtree (GHOST) protocol whereby orphaned blocks (uncles) are also included in determining the valid chain. Once Ethereum moves to Proof of Stake, this will become irrelevant as no mining will be required and almost immediate finality of transactions can be achieved. Invertible Bloom lookup tables: This is another approach that has been proposed to reduce the amount of data required to be transferred between bitcoin nodes. Invertible Bloom lookup tables (IBLTs) were originally proposed by Gavin Andresen, and the key attraction in this approach is that it does not result in a hard fork of bitcoin if implemented. The key idea is based on the fact that there is no need to transfer all transactions between nodes; instead, only those that are not already available in the transaction pool of the synching node are transferred. This allows quicker transaction pool synchronization between nodes, thus increasing the overall scalability and speed of the bitcoin network. Sharding: Sharding is not a new technique and has been used in distributed databases for scalability such as MongoDB and MySQL. The key idea behind sharding is to split up the tasks into multiple chunks that are then processed by multiple nodes. This results in improved throughput and reduced storage requirements. In blockchains, a similar scheme is employed whereby the state of the network is partitioned into multiple shards. The state usually includes balances, code, nonce, and storage. Shards are loosely coupled partitions of a blockchain that run on the same network. There are a few challenges related to inter-shard communication and consensus on the history of each shard. This is an open area for research. State channels: This is another approach proposed for speeding up the transaction on a blockchain network. The basic idea is to use side channels for state updating and processing transactions off the main chain; once the state is finalized, it is written back to the main chain, thus offloading the time-consuming operations from the main blockchain. State channels work by performing the following three steps: First, a part of the blockchain state is locked under a smart contract, ensuring the agreement and business logic between participants. Now off-chain transaction processing and interaction is started between the participants that update the state only between themselves for now. In this step, almost any number of transactions can be performed without requiring the blockchain and this is what makes the process fast and a best candidate for solving blockchain scalability issues. However, it could be argued that this is not a real on-blockchain solution such as, for example, sharding, but the end result is a faster, lighter, and robust network which can prove very useful in micropayment networks, IoT networks, and many other applications. Once the final state is achieved, the state channel is closed and the final state is written back to the main blockchain. At this stage, the locked part of the blockchain is also unlocked. This technique has been used in the bitcoin lightning network and Ethereum's Raiden. 6. Subchains: This is a relatively new technique recently proposed by Peter R. Rizun which is based on the idea of weak blocks that are created in layers until a strong block is found. Weak blocks can be defined as those blocks that have not been able to be mined by meeting the standard network difficulty criteria but have done enough work to meet another weaker difficulty target. Miners can build sub-chains by layering weak blocks on top of each other, unless a block is found that meets the standard difficulty target. At this point, the subchain is closed and becomes the strong block. Advantages of this approach include reduced waiting time for the first verification of a transaction. This technique also results in a reduced chance of orphaning blocks and speeds up transaction processing. This is also an indirect way of addressing the scalability issue. Subchains do not require any soft fork or hard fork to implement but need acceptance by the community. 7. Tree chains: There are also other proposals to increase bitcoin scalability, such as tree chains that change the blockchain layout from a linearly sequential model to a tree. This tree is basically a binary tree which descends from the main bitcoin chain. This approach is similar to sidechain implementation, eliminating the need for major protocol change or block size increase. It allows improved transaction throughput. In this scheme, the blockchains themselves are fragmented and distributed across the network in order to achieve scalability. Moreover, mining is not required to validate the blocks on the tree chains; instead, users can independently verify the block header. However, this idea is not ready for production yet and further research is required in order to make it practical. Privacy Privacy of transactions is a much desired property of blockchains. However, due to its very nature, especially in public blockchains, everything is transparent, thus inhibiting its usage in various industries where privacy is of paramount importance, such as finance, health, and many others. There are different proposals made to address the privacy issue and some progress has already been made. Several techniques, such as indistinguishability obfuscation, usage of homomorphic encryption, zero knowledge proofs, and ring signatures. All these techniques have their merits and demerits and are discussed in the following sections. Indistinguishability obfuscation: This cryptographic technique may serve as a silver bullet to all privacy and confidentiality issues in blockchains but the technology is not yet ready for production deployments. Indistinguishability obfuscation (IO) allows for code obfuscation, which is a very ripe research topic in cryptography and, if applied to blockchains, can serve as an unbreakable obfuscation mechanism that will turn smart contracts into a black box. The key idea behind IO is what's called by researchers a multilinear jigsaw puzzle, which basically obfuscates program code by mixing it with random elements, and if the program is run as intended, it will produce expected output but any other way of executing would render the program look random and garbage. This idea was first proposed by Sahai and others in their research paper Candidate Indistinguishability Obfuscation and Functional Encryption for All Circuits. Homomorphic encryption: This type of encryption allows operations to be performed on encrypted data. Imagine a scenario where the data is sent to a cloud server for processing. The server processes it and returns the output without knowing anything about the data that it has processed. This is also an area ripe for research and fully homomorphic encryption that allows all operations on encrypted data is still not fully deployable in production; however, major progress in this field has already been made. Once implemented on blockchains, it can allow processing on cipher text which will allow privacy and confidentiality of transactions inherently. For example, the data stored on the blockchain can be encrypted using homomorphic encryption and computations can be performed on that data without the need for decryption, thus providing privacy service on blockchains. This concept has also been implemented in a project named Enigma by MIT's Media Lab. Enigma is a peer-to-peer network which allows multiple parties to perform computations on encrypted data without revealing anything about the data. Zero knowledge proofs: Zero knowledge proofs have recently been implemented in Zcash successfully, as seen in previous chapters. More specifically, SNARKs have been implemented in order to ensure privacy on the blockchain. The same idea can be implemented in Ethereum and other blockchains also. Integrating Zcash on Ethereum is already a very active research project being run by the Ethereum R&D team and the Zcash Company. State channels: Privacy using state channels is also possible, simply due to the fact that all transactions are run off-chain and the main blockchain does not see the transaction at all except the final state output, thus ensuring privacy and confidentiality. Secure multiparty computation: The concept of secure multiparty computation is not new and is based on the notion that data is split into multiple partitions between participating parties under a secret sharing mechanism which then does the actual processing on the data without the need of the reconstructing data on single machine. The output produced after processing is also shared between the parties. MimbleWimble: The MimbleWimble scheme was proposed somewhat mysteriously on the bitcoin IRC channel and since then has gained a lot of popularity. MimbleWimble extends the idea of confidential transactions and Coinjoin, which allows aggregation of transactions without requiring any interactivity. However, it does not support the use of bitcoin scripting language along with various other features of standard Bitcoin protocol. This makes it incompatible with existing Bitcoin protocol. Therefore, it can either be implemented as a sidechain to bitcoin or on its own as an alternative cryptocurrency. This scheme can address privacy and scalability issues both at once. The blocks created using the MimbleWimble technique do not contain transactions as in traditional bitcoin blockchains; instead, these blocks are composed of three lists: an input list, output list, and something called excesses which are lists of signatures and differences between outputs and inputs. The input list is basically references to the old outputs, and the output list contains confidential transactions outputs. These blocks are verifiable by nodes by using signatures, inputs, and outputs to ensure the legitimacy of the block. In contrast to bitcoin, MimbleWimble transaction outputs only contain pubkeys, and the difference between old and new outputs is signed by all participants involved in the transactions. Coinjoin: Coinjoin is a technique which is used to anonymize the bitcoin transactions by mixing them interactively. The idea is based on forming a single transaction from multiple entities without causing any change in inputs and outputs. It removes the direct link between senders and receivers, which means that a single address can no longer be associated with transactions, which could lead to identification of the users. Coinjoin needs cooperation between multiple parties that are willing to create a single transaction by mixing payments. Therefore, it should be noted that, if any single participant in the Coinjoin scheme does not keep up with the commitment made to cooperate for creating a single transaction by not signing the transactions as required, then it can result in a denial of service attack. In this protocol, there is no need for a single trusted third party. This concept is different from mixing a service which acts as a trusted third party or intermediary between the bitcoin users and allows shuffling of transactions. This shuffling of transactions results in the prevention of tracing and the linking of payments to a particular user. Security Even though blockchains are generally secure and make use of asymmetric and symmetric cryptography as required throughout the blockchain network, there still are few caveats that can result in compromising the security of the blockchain. There are two types of Contract Verification that can solve the issue of Security. Why3 formal verification: Formal verification of solidity code is now available as a feature in the solidity browser. First the code is converted into Why3 language that the verifier can understand. In the example below, a simple solidity code that defines the variable z as maximum limit of uint is shown. When this code runs, it will result in returning 0, because uint z will overrun and start again from 0. This can also be verified using Why3, which is shown below: Once the solidity is compiled and available in the formal verification tab, it can be copied into the Why3 online IDE available at h t t p ://w h y 3. l r i . f r /t r y /. The example below shows that it successfully checks and reports integer overflow errors. This tool is under heavy development but is still quite useful. Also, this tool or any other similar tool is not a silver bullet. Even formal verification generally should not be considered a panacea because specifications in the first place should be defined appropriately: Oyente tool: Currently, Oyente is available as a Docker image for easy testing and installation. It is available at https://github.com/ethereum/oyente, and can be quickly downloaded and tested. In the example below, a simple contract taken from solidity documentation that contains a re-entrancy bug has been tested and it is shown that Oyente successfully analyzes the code and finds the bug: This sample code contains a re-entrancy bug which basically means that if a contract is interacting with another contract or transferring ether, it is effectively handing over the control to that other contract. This allows the called contract to call back into the function of the contract from which it has been called without waiting for completion. For example, this bug can allow calling back into the withdraw function shown in the preceding example again and again, resulting in getting Ethers multiple times. This is possible because the share value is not set to 0 until the end of the function, which means that any later invocations will be successful, resulting in withdrawing again and again. An example is shown of Oyente running to analyze the contract shown below and as can be seen in the following output, the analysis has successfully found the re-entrancy bug. The bug is proposed to be handled by a combination of the Checks-Effects-Interactions pattern described in the solidity documentation: We discussed the scalability, security, confidentiality, and privacy aspects of blockchain technology in this article. If you found it useful, go ahead and buy the book, Mastering Blockchain by Imran Bashir to become a professional Blockchain developer.      
Read more
  • 0
  • 0
  • 14528

article-image-two-popular-data-analytics-methodologies-every-data-professional-should-know-tdsp-crisp-dm
Amarabha Banerjee
21 Dec 2017
7 min read
Save for later

Two popular Data Analytics methodologies every data professional should know: TDSP & CRISP-DM

Amarabha Banerjee
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This is a book excerpt taken from Advanced Analytics with R and Tableau authored by Jen Stirrup & Ruben Oliva Ramos. This book will help you make quick, cogent, and data driven decisions for your business using advanced analytical techniques on Tableau and R.[/box] Today we explore popular data analytics methods such as Microsoft TDSP Process and the CRISP- DM methodology. Introduction There is an increasing amount of data in the world, and in our databases. The data deluge is not going to go away anytime soon! Businesses risk wasting the useful business value of information contained in databases, unless they are able to excise useful knowledge from the data. It can be hard to know how to get started. Fortunately, there are a number of frameworks in data science that help us to work our way through an analytics project. Processes such as Microsoft Team Data Science Process (TDSP) and CRISP-DM position analytics as a repeatable process that is part of a bigger vision. Why are they important? The Microsoft TDSP Process and the CRISP-DM frameworks are frameworks for analytics projects that lead to standardized delivery for organizations, both large and small. In this chapter, we will look at these frameworks in more detail, and see how they can inform our own analytics projects and drive collaboration between teams. How can we have the analysis shaped so that it follows a pattern so that data cleansing is included? Industry standard methodologies for analytics There are a few main methodologies: the Microsoft TDSP Process and the CRISP-DM Methodology. Ultimately, they are all setting out to achieve the same objectives as an analytics framework. There are differences, of course, and these are highlighted here. CRISP-DM and TDSP focus on the business value and the results derived from analytics projects. Both of these methodologies are described in the following sections. CRISP-DM One common methodology is the CRISP-DM methodology (the modeling agency). The Cross Industry Standard Process for Data Mining or (CRISP-DM) model as it is known, is a process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions. The process is loosely divided into six main phases. The phases can be seen in the following diagram: Initially, the process starts with a business idea and a general consideration of the data. Each stage is briefly discussed in the following sections. Business understanding/data understanding The first phase looks at the machine learning solution from a business standpoint, rather than a technical standpoint. The business idea is defined, and a draft project plan is generated. Once the business idea is defined, the data understanding phase focuses on data collection and familiarity. At this point, missing data may be identified, or initial insights may be revealed. This process feeds back to the business understanding phase. CRISP-DM model — data preparation In this stage, data will be cleansed and transformed, and it will be shaped ready for the modeling phase. CRISP-DM — modeling phase In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the data preparation phase in order to correct any unexpected issues. CRISP-DM — evaluation The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the business understanding phase. Otherwise, we may have built models that do not answer the business question. CRISP-DM — deployment The models are published so that the customer can make use of them. This is not the end of the story, however. CRISP-DM — process restarted We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated. CRISP-DM summary CRISP-DM is the most commonly used framework for implementing machine learning projects specifically, and it applies to analytics projects as well. It has a good focus on the business understanding piece. However, one major drawback is that the model no longer seems to be actively maintained. The official site, CRISP-DM.org, is no longer being maintained. Furthermore, the framework itself has not been updated on issues on working with new technologies, such as big data. Big data technologies means that there can be additional effort spend in the data understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of big data sources. Team Data Science Process The TDSP process model provides a dynamic framework to machine learning solutions that have been through a robust process of planning, producing, constructing, testing, and deploying models. Here is an example of the TDSP process: The process is loosely divided into four main phases: Business Understanding Data Acquisition and Understanding Modeling Deployment The phases are described in the following paragraphs. Business understanding The Business understanding process starts with a business idea, which is solved with a machine learning solution. The business idea is defined from the business perspective, and possible scenarios are identified and evaluated. Ultimately, a project plan is generated for delivering the solution. Data acquisition and understanding Following on from the business understanding phase is the data acquisition and understanding phase, which concentrates on familiarity and fact-finding about the data. The process itself is not completely linear; the output of the data acquisition and understanding phase can feed back to the business understanding phase, for example. At this point, some of the essential technical pieces start to appear, such as connecting to data, and the integration of multiple data sources. From the user's perspective, there may be actions arising from this effort. For example, it may be noted that there is missing data from the dataset, which requires further investigation before the project proceeds further. Modeling In the modeling phase of the TDSP process, the R model is created, built, and verified against the original business question. In light of the business question, the model needs to make sense. It should also add business value, for example, by performing better than the existing solution that was in place prior to the new R model. This stage also involves examining key metrics in evaluating our R models, which need to be tested to ensure that the models meet the original business objectives set out in the initial business understanding phase. Deployment R models are published to production, once they are proven to be a fit solution to the original business question. This phase involves the creation of a strategy for ongoing review of the R model's performance as well as a monitoring and maintenance plan. It is recommended to carry out a recurrent evaluation of the deployed models. The models will live in a fluid, dynamic world of data and, over time, this environment will impact their efficacy. The TDSP process is a cycle rather than a linear process, and it does not finish, even if the model is deployed. It is comprised of a clear structure for you to follow throughout the Data Science process, and it facilitates teamwork and collaboration along the way. TDSP Summary The data science unicorn does not exist; that is, the person who is equally skilled in all areas of data science, right across the board. In order to ensure successful projects where each team player contributes according to their skill set, the Team Data Science Summary is a team-oriented solution that emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups, which can include open source technology. If you liked our post, please be sure to check out Advanced Analytics with R and Tableau that shows how to apply various data analytics techniques in R and Tableau across the different stages of a data science project highlighted in this article.  
Read more
  • 0
  • 0
  • 32371
article-image-6-key-skills-data-scientist-role
Amey Varangaonkar
21 Dec 2017
6 min read
Save for later

6 Key Areas to focus on while transitioning to a Data Scientist role

Amey Varangaonkar
21 Dec 2017
6 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Statistics for Data Science, authored by James D. Miller. The book dives into the different statistical approaches to discover hidden insights and patterns from different kinds of data.[/box] Being a data scientist is undoubtedly a very lucrative career prospect. In fact, it is one of the highest paying jobs in the world right now. That said, transitioning from a data developer role to a data scientist role needs careful planning and a clear understanding of what the role is all about. In this interesting article, the author highlights six key skills must give special attention to, during this transition. Let's start by taking a moment to state what I consider to be a few generally accepted facts about transitioning to a data scientist. We'll reaffirm these beliefs as we continue through this book: Academia: Data scientists are not all from one academic background. They are not all computer science or statistics/mathematics majors. They do not all possess an advanced degree (in fact, you can use statistics and data science with a bachelor's degree or even less). It's not magic-based: Data scientists can use machine learning and other accepted statistical methods to identify insights from data, not magic. They are not all tech or computer geeks: You don't need years of programming experience or expensive statistical software to be effective. You don't need to be experienced to get started. You can start today, right now. (Well, you already did when you bought this book!) Okay, having made the previous declarations, let's also be realistic. As always, there is an entry-point for everything in life, and, to give credit where it is due, the more credentials you can acquire to begin out with, the better off you will most likely be. Nonetheless, (as we'll see later in this chapter), there is absolutely no valid reason why you cannot begin understanding, using, and being productive with data science and statistics immediately. [box type="info" align="" class="" width=""]As with any profession, certifications and degrees carry the weight that may open the doors, while experience, as always, might be considered the best teacher. There are, however, no fake data scientists but only those with currently more desire than practical experience.[/box] If you are seriously interested in not only understanding statistics and data science but eventually working as a full-time data scientist, you should consider the following common themes (you're likely to find in job postings for data scientists) as areas to focus on: Education Common fields of study here are Mathematics and Statistics, followed by Computer Science and Engineering (also Economics and Operations research). Once more, there is no strict requirement to have an advanced or even related degree. In addition, typically, the idea of a degree or an equivalent experience will also apply here. Technology You will hear SAS and R (actually, you will hear quite a lot about R) as well as Python, Hadoop, and SQL mentioned as key or preferable for a data scientist to be comfortable with, but tools and technologies change all the time so, as mentioned several times throughout this chapter, data developers can begin to be productive as soon as they understand the objectives of data science and various statistical mythologies without having to learn a new tool or language. [box type="info" align="" class="" width=""]Basic business skills such as Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool are assumed pretty much everywhere and don't really count as an advantage, but experience with programming languages (such as Java, PERL, or C++) or databases (such as MySQL, NoSQL, Oracle, and so on.) does help![/box] Data The ability to understand data and deal with the challenges specific to the various types of data, such as unstructured, machine-generated, and big data (including organizing and structuring large datasets). [box type="info" align="" class="" width=""]Unstructured data is a key area of interest in statistics and for a data scientist. It is usually described as data having no redefined model defined for it or is not organized in a predefined manner. Unstructured information is characteristically text-heavy but may also contain dates, numbers, and various other facts as well.[/box] Intellectual Curiosity I love this. This is perhaps well defined as a character trait that comes in handy (if not required) if you want to be a data scientist. This means that you have a continuing need to know more than the basics or want to go beyond the common knowledge about a topic (you don't need a degree on the wall for this!) Business Acumen To be a data developer or a data scientist you need a deep understanding of the industry you're working in, and you also need to know what business problems your organization needs to unravel. In terms of data science, being able to discern which problems are the most important to solve is critical in addition to identifying new ways the business should be leveraging its data. Communication Skills All companies look for individuals who can clearly and fluently translate their findings to a non-technical team, such as the marketing or sales departments. As a data scientist, one must be able to enable the business to make decisions by arming them with quantified insights in addition to understanding the needs of their non-technical colleagues to add value and be successful. This article paints a much clearer picture on why soft skills play an important role in becoming a better data scientist. So why should you, a data developer, endeavor to think like (or more like) a data scientist? Specifically, what might be the advantages of thinking like a data scientist? The following are just a few notions supporting the effort: Developing a better approach to understanding data Using statistical thinking during the process of program or database designing Adding to your personal toolbox Increased marketability If you found this article to be useful, make sure you check out the book Statistics for Data Science, which includes a comprehensive list of tips and tricks to becoming a successful data scientist by mastering the basic and not-so-basic concepts of statistics.  
Read more
  • 0
  • 0
  • 23218

article-image-how-do-data-structures-and-data-models-differ
Amey Varangaonkar
21 Dec 2017
7 min read
Save for later

How do Data Structures and Data Models differ?

Amey Varangaonkar
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Statistics for Data Science, authored by James D. Miller. The book presents interesting techniques through which you can leverage the power of statistics for data manipulation and analysis.[/box] In this article, we will be zooming the spotlight on data structures and data models, and also understanding the difference between both. Data structures Data developers will agree that whenever one is working with large amounts of data, the organization of that data is imperative. If that data is not organized effectively, it will be very difficult to perform any task on that data, or at least be able to perform the task in an efficient manner. If the data is organized effectively, then practically any operation can be performed easily on that data. A data or database developer will then organize the data into what is known as data structures. Following image is a simple binary tree, where the data is organized efficiently by structuring it: A data structure can be defined as a method of organizing large amounts of data more efficiently so that any operation on that data becomes easy. Data structures are created in such a way as to implement one or more particular abstract data type (ADT), which in turn will stipulate what operations can be performed on the data structure, as well as the computational complexity of those operations. [box type="info" align="" class="" width=""]In the field of statistics, an ADT is a model for data types where a data type is defined by its behavior from the point of view (POV) of users of that data, explicitly showing the possible values, the possible operations on data of this type, and the behavior of all of these operations.[/box] Database design is then the process of using the defined data structures to produce a detailed data model, which will become the database. This data model must contain all of the required logical and physical design selections, as well as the physical storage parameters needed to produce a design in a Data Definition Language (DDL), which can then be used to create an actual database. [box type="info" align="" class="" width=""]There are varying degrees of the data model, for example, a fully attributed data model would also contain detailed attributes for each entity in the model.[/box] So, is a data structure a data model? No, a data structure is used to create a data model. Is this data model the same as data models used in statistics? Let's see in the next section. Data models You will find that statistical data models are at the heart of statistical analytics. In the simplest terms, a statistical data model is defined as the following: A representation of a state, process, or system that we want to understand and reason about In the scope of the previous definition, the data or database developer might agree that in theory or in concept, one could use the same terms to define a financial reporting database, as it is designed to contain business transactions and is arranged in data structures that allow business analysts to efficiently review the data, so that they can understand or reason about particular interests they may have concerning the business. Data scientists develop statistical data models so that they can draw inferences from them and, more importantly, make predictions about a topic of concern. Data developers develop databases so that they can similarly draw inferences from them and, more importantly, make predictions about a topic of concern (although perhaps in some organizations, databases are more focused on past and current events (transactions) than on forward-thinking ones (predictions)). Statistical data models come in a multitude of different formats and flavours (as do databases). These models can be equations linking quantities that we can observe or measure; they can also be simply, sets of rules. Databases can be designed or formatted to simplify the entering of online transactions—say, in an order entry system—or for financial reporting when the accounting department must generate a balance sheet, income statement, or profit and loss statement for shareholders. [box type="info" align="" class="" width=""]I found this example of a simple statistical data model: Newton's Second Law of Motion, which states that the net sum of force acting on an object causes the object to accelerate in the direction of the force applied, and at a rate proportional to the resulting magnitude of the force and inversely proportional to the object's mass.[/box] What's the difference? Where or how does the reader find the difference between a data structure or database and a statistical model? At a high level, as we speculated in previous sections, one can conclude that a data structure/database is practically the same thing as a statistical data model, as shown in the following image: At a high level, as we speculated in previous sections, one can conclude that a data structure/database is practically the same thing as a statistical data model. When we take the time to drill deeper into the topic, you should consider the following key points: Although both the data structure/database and the statistical model could be said to represent a set of assumptions, the statistical model typically will be found to be much more keenly focused on a particular set of assumptions concerning the generation of some sample data, and similar data from a larger population, while the data structure/database more often than not will be more broadly based A statistical model is often in a rather idealized form, while the data structure/database may be less perfect in the pursuit of a specific assumption Both a data structure/database and a statistical model are built around relationships between variables The data structure/database relationship may focus on answering certain questions, such as: What are the total orders for specific customers? What are the total orders for a specific customer who has purchased from a certain salesperson? Which customer has placed the most orders? Statistical model relationships are usually very simple, and focused on proving certain questions: Females are shorter than males by a fixed amount Body mass is proportional to height The probability that any given person will partake in a certain sport is a function of age, sex, and socioeconomic status Data structures/databases are all about the act of summarizing data based on relationships between variables Relationships The relationships between variables in a statistical model may be found to be much more complicated than simply straightforward to recognize and understand. An illustration of this is awareness of effect statistics. An effect statistic is one that shows or displays a difference in value to one that is associated with a difference related to one or more other variables. Can you image the SQL query statements you'd use to establish a relationship between two database variables based upon one or more effect statistic? On this point, you may find that a data structure/database usually aims to characterize relationships between variables, while with statistical models, the data scientist looks to fit the model to prove a point or make a statement about the population in the model. That is, a data scientist endeavors to make a statement about the accuracy of an estimate of the effect statistic(s) describing the model! One more note of interest is that both a data structure/database and a statistical model can be seen as tools or vehicles that aim to generalize a population; a database uses SQL to aggregate or summarize data, and a statistical model summarizes its data using effect statistics. The above argument presented the notion that data structures/databases and statistical data models are, in many ways, very similar. If you found this excerpt to be useful, check out the book Statistics for Data Science, which demonstrates different statistical techniques for implementing various data science tasks such as pre-processing, mining, and analysis.  
Read more
  • 0
  • 0
  • 21672

article-image-when-why-and-how-to-use-graph-analytics-for-your-big-data
Sunith Shetty
20 Dec 2017
10 min read
Save for later

When, why and how to use Graph analytics for your big data

Sunith Shetty
20 Dec 2017
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform real-time streaming analytics on big data using machine learning algorithms and power of Java. [/box] From the article given below, you will learn why graph analytics is a favourable choice in order to analyze complex datasets. Graph analytics Vs Relational Databases The biggest advantage to using graphs is you can analyze these graphs and use them for analyzing complex datasets. You might ask what is so special about graph analytics that we can’t do by relational databases. Let’s try to understand this using an example, suppose we want to analyze your friends network on Facebook and pull information about your friends such as their name, their birth date, their recent likes, and so on. If Facebook had a relational database, then this would mean firing a query on some table using the foreign key of the user requesting this info. From the perspective of relational database, this first level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram). The query to get this becomes more and more complicated from a relational database perspective but this is a trivial task on a graph or graphical database (such as Neo4j). Graphs are extremely good on operations where you want to pull information from one end of the node to another, where the other node lies after a lot of joins and hops. As such, graph analytics is good for certain use cases (but not for all use cases, relational database are still good on many other use cases): As you can see, the preceding diagram depicts a huge social network (though the preceding diagram might just be depicting a network of a few friends only). The dots represent actual people in a social network. So if somebody asks to pick one user on the left-most side of the diagram and see and follow host connections to the right-most side and pull the friends at the say 10th level or more, this is something very difficult to do in a normal relational database and doing it and maintaining it could easily go out of hand. There are four particular use cases where graph analytics is extremely useful and used frequently (though there are plenty more use cases too): Path analytics: As the name suggests, this analytics approach is used to figure out the paths as you traverse along the nodes of a graph. There are many fields where this can be used—simplest being road networks and figuring out details such as shortest path between cities, or in flight analytics to figure out the shortest time taking flight or direct flights. Connectivity analytics: As the name suggests, this approach outlines how the nodes within a graph are connected to each other. So using this you can figure out how many edges are flowing into a node and how many are flowing out of the node. This kind of information is very useful in analysis. For example, in a social network if there is a person who receives just one message but gives out say ten messages within his network then this person can be used to market his favorite products as he is very good in responding to messages. Community Analytics: Some graphs on big data are huge. But within these huge graphs there might be nodes that are very close to each other and are almost stacked in a cluster of their own. This is useful information as based on this you can extract out communities from your data. For example, in a social network if there are people who are part of some community, say marathon runners, then they can be clubbed into a single community and further tracked. Centrality Analytics: This kind of analytical approach is useful in finding central nodes in a network or graph. This is useful in figuring out sources that are single handedly connected to many other sources. It is helpful in figuring out influential people in a social network, or a central computer in a computer network. From the perspective of this article, we will be covering some of these use cases in our sample case studies and for this we will be using a library on Apache Spark called GraphFrames. GraphFrames GraphX library is advanced and performs well on massive graphs, but, unfortunately, it’s currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for dataframe (now dataset) based graphs. It contains a lot of methods that are direct wrappers over the underlying sparkx methods. As such it provides similar functionality as GraphX except that GraphX acts on the Spark SRDD and GraphFrame works on the dataframe so GraphFrame is more user friendly (as dataframes are simpler to use). All the advantages of firing Spark SQL queries, joining datasets, filtering queries are all supported on this. To understand GraphFrames and representing massive big data graphs, we will take small baby steps first by building some simple programs using GraphFrames before building full-fledged case studies. First, let’s see how to build a graph using Spark and GraphFrames on some sample dataset. Building a graph using GraphFrames Consider that you have as simple graph as shown next. This graph depicts four people Kai, John, Tina, and Alex and the relation they share whether they follow each other or are friends. We will now try to represent this basic graph using the GraphFrame library on top of Apache Spark and in the meantime, we will also start learning the GraphFrame API. Since GraphFrame is a module on top of Spark, let’s first build the Spark configuration and spark sql context for brevity: SparkConfconf= ... JavaSparkContextsc= ... SQLContextsqlContext= ... We will now build the JavaRDD object that will contain the data for our vertices or the people Kai, John, Alex, and Tina in this small network. We will create some sample data using the RowFactory class of Spark API and provide the attributes (ID of the person, and their name and age) that we need per row of the data: JavaRDD<Row>verRow = sc.parallelize(Arrays.asList(RowFactory.create(101L,”Kai”,27),        RowFactory.create(201L,”John”,45),   RowFactory.create(301L,”Alex”,32),        RowFactory.create(401L,”Tina”,23))); Next we will define the structure or schema of the attributes used to build the data. The ID of the person is of type long and the name of the person is a string, and the age of the person is an integer as shown next in the code: List<StructField>verFields = newArrayList<StructField>();  verFields.add(DataTypes.createStructField(“id”,DataTypes.LongType, true));  verFields.add(DataTypes.createStructField(“name”,DataTypes.StringType, true));      verFields.add(DataTypes.createStructField(“age”,DataTypes.IntegerType, true)); Now, let’s build the sample data for the relations between these people and this can basically be represented as the edges of the graph later. This data item of relationship will have the IDs of the persons that are connected together and the type of relationship they share (that is friends or followers). Again we will use the Spark provided RowFactory and build some sample data per row and create the JavaRDD with this data: JavaRDD<Row>edgRow = sc.parallelize(Arrays.asList(        RowFactory.create(101L,301L,”Friends”),        RowFactory.create(101L,401L,”Friends”),        RowFactory.create(401L,201L,”Follow”),        RowFactory.create(301L,201L,”Follow”),        RowFactory.create(201L,101L,”Follow”))); Again, define the schema of the attributes added as part of the edges earlier. This schema is later used in building the dataset for the edges. The attributes passed are the source ID of the node, destination ID of the other node, as well as the relationType, which is a string: List<StructField>EdgFields = newArrayList<StructField>();    EdgFields.add(DataTypes.createStructField(“src”,DataTypes.LongType,true));  EdgFields.add(DataTypes.createStructField(“dst”,DataTypes.LongType,true));  EdgFields.add(DataTypes.createStructField(“relationType”,DataTypes.StringType,true)); Using the schemas that we have defined for the vertices and edges, let’s now build the actual dataset for the vertices and the edges. For this, first create the StructType  object that holds the schema details for the vertices and the edges data and using this structure and the actual data we will next build the dataset of the verticles (verDF) and the dataset for the edges (edgDF): StructTypeverSchema = DataTypes.createStructType(verFields); StructTypeedgSchema = DataTypes.createStructType(EdgFields);    Dataset<Row>verDF = sqlContext.createDataFrame(verRow, verSchema); Dataset<Row>edgDF = sqlContext.createDataFrame(edgRow, edgSchema); Finally, we will now use the vertices and the edges dataset and pass it as a parameter to the GraphFrame constructor and build the GraphFrame instance: GraphFrameg = newGraphFrame(verDF,edgDF); Time has now come to see some mild analytics on the graph we just created. Let’s first visualize our data for the graphs; let’s see the data on the vertices. For this, we will invoke the vertices method on the GraphFrame instance and invoke the standard show method on the generated vertices dataset (GraphFrame would generate a new dataset when the vertices method is invoked). g.vertices().show(); This would print the output as follows: Let’s also see the data on the edges: g.edges().show(); This would print as the output as follows: Let’s also see the number of edges and the number of vertices: System.out.println(“Number of Vertices : “ + g.vertices().count()); System.out.println(“Number of Edges : “ + g.edges().count()); This would print the result as follows: Number of Vertices : 4 Number of Edges : 5 GraphFrame has a handy method to find all the indegrees (out degree or degree) g.inDegrees().show(); This would print the in degrees of all the vertices as shown next: Finally, let’s see one more small thing on this simple graph. As GraphFrames work on the datasets, all the dataset handy methods such as filtering, map, and so on can be applied on them. We will use the filter method and run it on the vertices dataset to figure out the people in the graph with age greater than thirty: g.vertices().filter(“age > 30”).show(); This would print the result as follows: From this post, we learned about graph analytics. We saw how graphs can be built from massive big datasets in order to derive quick insights. You will understand when to implement graph analytics or relational database based on the growing challenges in your organization. To know more about preparing and refining big data and to perform smart data analytics using machine learning algorithms you can refer to the book Big Data Analytics with Java.
Read more
  • 0
  • 0
  • 29440
article-image-healthcare-analytics-logistic-regression-to-reduce-patient-readmissions
Guest Contributor
20 Dec 2017
8 min read
Save for later

Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions

Guest Contributor
20 Dec 2017
8 min read
[box type="info" align="" class="" width=""]We bring to you another guest post by Benjamin Rojogan on Logistic regression to aid healthcare sector in reducing patient readmission. Ben's previous post on ensemble methods to optimize machine learning models is also available for a quick read here.[/box] ER visits are not cheap for any party involved. Whether this be the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are due to lack of intervention for problems such as substance abuse, chronic diseases and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by playing a role in the overflowing of Emergency Departments (EDs). Research teams at UW and other universities are partnering with companies like Kensci to figure out how to approach the problem of reducing readmission rates. The ability to predict the likelihood of a patient’s readmission will allow for targeted intervention which in turn will help reduce the frequency of readmissions. Thus making the population healthier and hopefully reducing the estimated 41.3 billion USD healthcare costs for the entire system. How do they plan to do it? With big data and statistics, of course. A plethora of algorithms are available for data scientists to use to approach this problem. Many possible variables could affect the readmission and medical costs. Also, there are also many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have been heavily focused on reducing the readmission rate simply by trying to calculate whether a person would or would not be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have random flare ups that require immediate attention. Being able to predict if a patient will have an ER visit can lead to managing the cause more effectively. One approach taken by the data science team at UW as well as the Department of Family and Community Medicine at the University of Toronto was to utilize logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such logistic regression has been a useful model in my experience to approach this problem. Logistic Regression to predict patient readmissions Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other data algorithms? Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, etc. It is used for analyzing data that produces a binary classification based on one or many independent variables. This means, it produces two clear classifications (Yes or No, 1 or 0, etc). With the example above, the binary classification would be: is the patient readmitted or not? Other examples of this could be whether to give a customer a loan or not, whether a medical claim is fraud or not, whether a patient has diabetes or not. Despite its name, logistic regression does not provide the same output like linear regression (per se). There are some similarities, for instance, the linear model is somewhat consistent as you might notice in the equation below where you see what is very similar to a linear equation. But the final output is based on the log odds. Linear regression and multivariate regression both take one to many independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person’s age or the cost of a product an e-commerce should display to each customer. The output is not limited to only a few discrete classifications. Whereas logistic regression produces discrete classifiers. For instance, an algorithm using logistic regression could be used to classify whether or not a certain stock price would be either >$50 a share or <$50 a share. Linear regression would be used to predict if a stock share would be worth $50.01, $50.02….etc. Logistic regression is a calculation that uses the odds of a certain classification. In the equation above, the symbol you might know as pi actually represents the odds or probability. To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear classifier, a boundary that when the coefficients β0 + x · β has a p value that is p < 0.5 then Y = 0. By generating coefficients that help predict the logit transformation, the method allows to classify for the characteristic of interest. Now that is a lot of complex math mumbo jumbo. Let’s try to break it down into simpler terms. Probability vs. Odds Let’s start with probability. Let’s say a patient has the probability of 0.6 of being readmitted. Then the probability that the patient won’t be readmitted is .4. Now, we want to take this and convert it into odds. This is what the formula above is doing. You would take .6/.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability was .5 for both being readmitted and not being readmitted, then the odds would be 1:1. Now the next step in the logistic regression model would be to take the odds and get the “Log odds”. You do this by taking the 1.5 and put it into the log portion of the equation. Now you will get .18(rounded). In logistic regression, we don’t actually know p. That is what we are trying to essentially find and model using various coefficients and input variables. Each input provides a value that changes how much more likely an event will or will not occur. All of these coefficients are used to calculate the log odds. This model can take multiple variables like age, sex, height, etc. and specify how much of an effect each variable has on the odds an event will occur. Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer and translate it into action. Some of us like to say the “computers” are the easy part. Personally I find the hard part to be the “people”. After all, at the end of the day, it comes down to business value. Will an algorithm save money or not? That means it has to be applied in real life. This could take the form of a new initiative, strategy, product recommendation, etc. You need to find the outliers that are worth going after! For instance, if we go back to the patient readmission example again. The algorithm points out patients with high probabilities of being readmitted. However if the readmission costs are low, they will probably be ignored..sadly. That is how businesses (including hospitals) look at problems. Logistic regression is a great tool for binary classification. It is unlike many other algorithms that estimate continuous variables or estimate distributions. This statistical method can be utilized to classify whether a person will be likely to get cancer because of environmental variables like proximity to a highway, smoking habits, etc? This method has been used effectively in the medical, financial and insurance industry successfully for a while. Knowing when to use what algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees. Using logistic regression provides the opportunity for healthcare institutions to accurately target at risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals. [box type="shadow" align="" class="" width=""] About the Author Benjamin Rogojan Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems both solo as well as with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/box]
Read more
  • 0
  • 0
  • 35682

article-image-understanding-sentiment-analysis-and-other-key-nlp-concepts
Sunith Shetty
20 Dec 2017
12 min read
Save for later

Understanding Sentiment Analysis and other key NLP concepts

Sunith Shetty
20 Dec 2017
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. This book will help you learn to perform big data analytics tasks using machine learning concepts such as clustering, recommending products, data segmentation and more. [/box] With this post, you will learn what is sentiment analysis and how it is used to analyze emotions associated within the text. You will also learn key NLP concepts such as Tokenization, stemming among others and how they are used for sentiment analysis. What is sentiment analysis? One of the forms of text analysis is sentimental analysis. As the name suggests this technique is used to figure out the sentiment or emotion associated with the underlying text. So if you have a piece of text and you want to understand what kind of emotion it conveys, for example, anger, love, hate, positive, negative, and so on you can use the technique sentimental analysis. Sentimental analysis is used in various places, for example: To analyze the reviews of a product whether they are positive or negative This can be especially useful to predict how successful your new product is by analyzing user feedback To analyze the reviews of a movie to check if it's a hit or a flop Detecting the use of bad language (such as heated language, negative remarks, and so on) in forums, emails, and social media To analyze the content of tweets or information on other social media to check if a political party campaign was successful or not  Thus, sentimental analysis is a useful technique, but before we see the code for our sample sentimental analysis example, let's understand some of the concepts needed to solve this problem. [box type="shadow" align="" class="" width=""]For working on a sentimental analysis problem we will be using some techniques from natural language processing and we will be explaining some of those concepts.[/box] Concepts for sentimental analysis Before we dive into the fully-fledged problem of analyzing the sentiment behind text, we must understand some concepts from the NLP (Natural Language Processing) perspective. We will explain these concepts now. Tokenization From the perspective of machine learning one of the most important tasks is feature extraction and feature selection. When the data is plain text then we need some way to extract the information out of it. We use a technique called tokenization where the text content is pulled and tokens or words are extracted from it. The token can be a single word or a group of words too. There are various ways to extract the tokens, as follows: By using regular expressions: Regular expressions can be applied to textual content to extract words or tokens from it. By using a pre-trained model: Apache Spark ships with a pre-trained model (machine learning model) that is trained to pull tokens from a text. You can apply this model to a piece of text and it will return the predicted results as a set of tokens. To understand a tokenizer using an example, let's see a simple sentence as follows: Sentence: "The movie was awesome with nice songs" Once you extract tokens from it you will get an array of strings as follows: Tokens: ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs'] [box type="shadow" align="" class="" width=""]The type of tokens you extract depends on the type of tokens you are interested in. Here we extracted single tokens, but tokens can also be a group of words, for example, 'very nice', 'not good', 'too bad', and so on.[/box] Stop words removal Not all the words present in the text are important. Some words are common words used in the English language that are important for the purpose of maintaining the grammar correctly, but from conveying the information perspective or emotion perspective they might not be important at all, for example, common words such as is, was, were, the, and so. To remove these words there are again some common techniques that you can use from natural language processing, such as: Store stop words in a file or dictionary and compare your extracted tokens with the words in this dictionary or file. If they match simply ignore them. Use a pre-trained machine learning model that has been taught to remove stop words. Apache Spark ships with one such model in the Spark feature package. Let's try to understand stop words removal using an example: Sentence: "The movie was awesome" From the sentence we can see that common words with no special meaning to convey are the and was. So after applying the stop words removal program to this data you will get: After stop words removal: [ 'movie', 'awesome', 'nice', 'songs'] [box type="shadow" align="" class="" width=""]In the preceding sentence, the stop words the, was, and with are removed.[/box] Stemming Stemming is the process of reducing a word to its base or root form. For example, look at the set of words shown here: car, cars, car's, cars' From our perspective of sentimental analysis, we are only interested in the main words or the main word that it refers to. The reason for this is that the underlying meaning of the word in any case is the same. So whether we pick car's or cars we are referring to a car only. Hence the stem or root word for the previous set of words will be: car, cars, car's, cars' => car (stem or root word) For English words again you can again use a pre-trained model and apply it to a set of data for figuring out the stem word. Of course there are more complex and better ways (for example, you can retrain the model with more data), or you have to totally use a different model or technique if you are dealing with languages other than English. Diving into stemming in detail is beyond the scope of this book and we would encourage readers to check out some documentation on natural language processing from Wikipedia and the Stanford nlp website. [box type="shadow" align="" class="" width=""]To keep the sentimental analysis example in this book simple we will not be doing stemming of our tokens, but we will urge the readers to try the same to get better predictive results.[/box] N-grams Sometimes a single word conveys the meaning of context, other times a group of words can convey a better meaning. For example, 'happy' is a word in itself that conveys happiness, but 'not happy' changes the picture completely and 'not happy' is the exact opposite of 'happy'. If we are extracting only single words then in the example shown before, that is 'not happy', then 'not' and 'happy' would be two separate words and the entire sentence might be selected as positive by the classifier However, if the classifier picks the bi-grams (that is, two words in one token) in this case then it would be trained with 'not happy' and it would classify similar sentences with 'not happy' in it as 'negative'. Therefore, for training our models we can either use a uni-gram or a bi-gram where we have two words per token or as the name suggest an n-gram where we have 'n' words per token, it all depends upon which token set trains our model well and it improves its predictive results accuracy. To see examples of n-grams refer to the following table:   Sentence The movie was awesome with nice songs Uni-gram ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs'] Bi-grams ['The movie', 'was awesome', 'with nice', 'songs'] Tri-grams ['The movie was', 'awesome with nice', 'songs']   For the purpose of this case study we will be only looking at unigrams to keep our example simple. By now we know how to extract words from text and remove the unwanted words, but how do we measure the importance of words or the sentiment that originates from them? For this there are a few popular approaches and we will now discuss two such approaches. Term presence and term frequency Term presence just means that if the term is present we mark the value as 1 or else 0. Later we build a matrix out of it where the rows represent the words and columns represent each sentence. This matrix is later used to do text analysis by feeding its content to a classifier. Term Frequency, as the name suggests, just depicts the count or occurrences of the word or tokens within the document. Let's refer to the example in the following table where we find term frequency:   Sentence The movie was awesome with nice songs and nice dialogues. Tokens (Unigrams only for now) ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs', 'and', 'dialogues'] Term Frequency ['The = 1', 'movie = 1', 'was = 1', 'awesome = 1', 'with = 1', 'nice = 2', 'songs = 1', 'dialogues = 1']   As seen in the preceding table, the word 'nice' is repeated twice in the preceding sentence and hence it will get more weight in determining the opinion shown by the sentence. Bland term frequency is not a precise approach for the following reasons: There could be some redundant irrelevant words, for example, the, it, and they that might have a big frequency or count and they might impact the training of the model There could be some important rare words that could convey the sentiment regarding the document yet their frequency might be low and hence they might not be inclusive for greater impact on the training of the model Due to this reason, a better approach of TF-IDF is chosen as shown in the next sections. TF-IDF TF-IDF stands for Term Frequency and Inverse Document Frequency and in simple terms it means the importance of a term to a document. It works using two simple steps as follows: It counts the number of terms in the document, so the higher the number of terms the greater the importance of this term to the document. Counting just the frequency of words in a document is not a very precise way to find the importance of the words. The simple reason for this is there could be too many stop words and their count is high so their importance might get elevated above the importance of real good words. To fix this, TF-IDF checks for the availability of these stop words in other documents as well. If the words appear in other documents as well in large numbers that means these words could be grammatical words such as they, for, is, and so on, and TF-IDF decreases the importance or weight of such stop words. Let's try to understand TF-IDF using the following figure: As seen in the preceding figure, doc-1, doc-2, and so on are the documents from which we extract the tokens or words and then from those words we calculate the TF-IDFs. Words that are stop words or regular words such as for , is, and so on, have low TF-IDFs, while words that are rare such as 'awesome movie' have higher TF-IDFs. TF-IDF is the product of Term Frequency and Inverse document frequency. Both of them are explained here: Term Frequency: This is nothing but the count of the occurrences of the words in the document. There are other ways of measuring this, but the simplistic approach is to just count the occurrences of the tokens. The simple formula for its calculation is:      Term Frequency = Frequency count of the tokens Inverse Document Frequency: This is the measure of how much information the word provides. It scales up the weight of the words that are rare and scales down the weight of highly occurring words. The formula for inverse document frequency is: TF-IDF: TF-IDF is a simple multiplication of the Term Frequency and the Inverse Document Frequency. Hence: This simple technique is very popular and it is used in a lot of places for text analysis. Next let's look into another simple approach called bag of words that is used in text analytics too. Bag of words As the name suggests, bag of words uses a simple approach whereby we first extract the words or tokens from the text and then push them in a bag (imaginary set) and the main point about this is that the words are stored in the bag without any particular order. Thus the mere presence of a word in the bag is of main importance and the order of the occurrence of the word in the sentence as well as its grammatical context carries no value. Since the bag of words gives no importance to the order of words you can use the TF-IDFs of all the words in the bag and put them in a vector and later train a classifier (naïve bayes or any other model) with it. Once trained, the model can now be fed with vectors of new data to predict on its sentiment. Summing it up, we have got you well versed with sentiment analysis techniques and NLP concepts in order to apply sentimental analysis. If you want to implement machine learning algorithms to carry out predictive analytics and real-time streaming analytics you can refer to the book Big Data Analytics with Java.    
Read more
  • 0
  • 0
  • 19479