Tech Guides - Big Data

50 Articles

15 Useful Python Libraries to make your Data Science tasks Easier

Amey Varangaonkar
12 Feb 2018
10 min read
Python has become a big hit in the Data Science community over the last five years - so much so that it is slowly taking over R, the 'lingua franca of statistics', as the preferred tool for many. The recently published Stack Overflow Developer Survey 2018 suggests Python is the next big programming language, and its adoption in the industry is only going to increase. Python's rise has been staggering, but not really surprising. Its general-purpose nature, coupled with its efficiency and ease of use, makes it easier for you to build your data science solutions without any hassle. You also have a rich suite of Python libraries at your disposal for all your Data Science-related tasks - from basic web scraping to something as complex as training deep learning models. In this article, we take a look at some of the most popular and widely used Python libraries and their application areas.

Web Scraping

Web scraping is a popular technique for extracting information from the web over the HTTP protocol, often with the help of a web browser. The two most commonly used tools for web scraping are, unsurprisingly, Python-based.

1. Beautiful Soup

Beautiful Soup is a popular Python library for extracting information out of HTML and XML files. It provides a unique, easy way to navigate, search and modify the parsed data, potentially saving you hours of needless work. It works with both versions of Python, i.e. 2.7 and 3.x, and is very easy to use. Check out our latest tutorial on how to scrape a web page using Beautiful Soup.

Editor's Tip: If you're new to the concept of web scraping, Beautiful Soup should be your go-to library. You can learn more about how to use this library more efficiently in our book Python Web Scraping Cookbook.

2. Scrapy

Scrapy is a free, open source framework written in Python. Although developed for web scraping, it can also be used as a general web crawler and extract data using different APIs. Following the 'Don't Repeat Yourself' philosophy of frameworks such as Django, Scrapy includes a set of self-contained crawlers, each of them following specific instructions with a specific objective.

Editor's tip: To learn how to use Scrapy for your scraping projects, our book Python Web Scraping, Second Edition is definitely worth checking out.
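To give you a feel for the library, here is a minimal Beautiful Soup sketch. It assumes the requests package is installed alongside beautifulsoup4, and the URL is just a placeholder - point it at whatever page you want to scrape.

```python
# A minimal Beautiful Soup sketch: fetch a page and list the links it contains.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")        # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")    # parse the HTML

# Print the text and target of every anchor tag on the page
for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))
```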
Scientific Computation and Data Analysis

Data manipulation, analysis and mathematical computation are arguably the most common data science tasks, and here Python proves to be of great worth to data scientists by providing unique libraries for each of them.

3. NumPy

NumPy is the most popular library for scientific computing in Python and is part of the larger Python stack for scientific computation called SciPy (discussed below). Apart from its uses in linear algebra and other mathematical functions, it can also be used as a multi-dimensional container, or array, of generic data with arbitrary data types. NumPy integrates seamlessly with languages such as C/C++, and because of its support for multiple data types it works well with a variety of databases as well.

4. SciPy

SciPy is a Python-based framework containing open source libraries for mathematics, scientific computation and data analysis. The SciPy library is a collection of algorithms and tools for advanced mathematical computations, statistics and much more. The SciPy stack consists of the following libraries:

- NumPy - Python package for numerical computation
- SciPy - One of the core packages of the SciPy stack for signal processing, optimization and advanced statistics
- matplotlib - Popular Python library for data visualization
- SymPy - Library for symbolic mathematics and algebra
- pandas - Python library for data manipulation and analysis
- IPython - Interactive console to run Python-based code

5. pandas

pandas is a widely used Python package providing data structures and tools for effective data manipulation and analysis. It is a popular tool for quantitative analysis and finds a lot of application in algorithmic trading and risk analysis. With a large community of dedicated users, pandas is regularly updated with new API changes, performance improvements and bug fixes. This is one library you definitely need to work with to truly realize its power.

Editor's Tip: To get a more hands-on understanding of how to effectively use pandas for data analysis, make sure you check out our highly popular title pandas Cookbook.
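Here is a minimal NumPy and pandas sketch using made-up numbers, just to show how little code a basic analysis takes:

```python
# A minimal NumPy/pandas sketch: build a small DataFrame and summarize it.
import numpy as np
import pandas as pd

prices = np.array([101.2, 99.8, 103.5, 104.1, 102.7])   # hypothetical closing prices
returns = np.diff(prices) / prices[:-1]                  # simple day-over-day returns

df = pd.DataFrame({"close": prices})
df["return"] = df["close"].pct_change()                  # same calculation, pandas-style

print(df.describe())                      # count, mean, std, min, quartiles, max
print("Mean return:", returns.mean())
```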
Machine Learning and Deep Learning

Python trumps all other languages when it comes to implementing efficient machine learning and deep learning models, simply by virtue of its diverse, effective and easy-to-use set of libraries. It is worth having a look at the experts' take on why Python is great for machine learning and Artificial Intelligence. In this section, we look at some of the most popular and commonly used Python libraries for machine learning and deep learning:

6. scikit-learn

scikit-learn is the most popular Python library for data mining, analysis and machine learning. It is built using the capabilities of NumPy, SciPy and matplotlib, and is commercially usable. You can implement a variety of machine learning techniques such as classification, regression, clustering and more using scikit-learn. It is very easy to install and has clean, slick documentation for anyone looking to get started with it.

Editor's tip: To understand how to use scikit-learn in your machine learning projects, our bestselling book Python Machine Learning, Second Edition is all you need. If you're looking to specifically master scikit-learn, Mastering Machine Learning with scikit-learn will prove to be a very useful resource. Check it out!

7. TensorFlow

TensorFlow is the popular machine learning library everyone seems to be talking about today. It is a Python-based framework for effective machine learning and deep learning using multiple CPUs or GPUs. Backed by Google, it was initially developed by the Google Brain research team and is now the most widely used framework in the world for machine intelligence. It enjoys the support of a large community of active users and is finding widespread application for advanced machine learning across a multitude of industrial domains - from manufacturing and retail to healthcare and smart cars. If you are interested in knowing more about TensorFlow, you can quickly check out the tutorial here.

Editor's Tip: TensorFlow is the most popular framework for machine learning and deep learning, and it is one library you should definitely master. Check out the following books to skill up quickly!

- Machine Learning with TensorFlow 1.x
- TensorFlow Machine Learning Cookbook
- Deep Learning with TensorFlow
- Tensorflow 1.x Deep Learning Cookbook
- Mastering Tensorflow 1.x

8. Keras

Keras is a Python-based neural networks API that offers a simplified interface to train and deploy your deep learning models with ease. It has support for a variety of deep learning frameworks such as TensorFlow, Deeplearning4j and CNTK. Keras is very user-friendly, follows a modular approach and supports both CPU- and GPU-based computations. If you want to make the deep learning process simpler and more effective, this library is definitely worth checking out!

Editor's Tip: If you're looking for a resource that teaches you how to use Keras effectively, our trending book Deep Learning with Keras will be of great help to you!

9. PyTorch

One of the more recent additions to the Python deep learning family is PyTorch, a neural network modeling library with strong GPU support. Although still in a beta stage, this project is backed by bigwigs such as Facebook and Twitter. PyTorch builds on the architecture of Torch, another popular deep learning library, to enable more efficient tensor computation and the implementation of dynamic neural networks.

Editor's Tip: Here is Deep Learning with PyTorch to get you started with this amazing tool.
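As a quick illustration of the scikit-learn workflow, here is a minimal sketch that trains and evaluates a classifier on the library's built-in Iris dataset; the choice of model and parameters is purely illustrative:

```python
# A minimal scikit-learn sketch: split data, fit a model, measure accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)                                         # train
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))   # evaluate
```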
Natural Language Processing

Natural Language Processing pertains to the design of systems that process, interpret and analyze human language, spoken or written. Python offers unique libraries for performing a variety of tasks, such as working with structured and unstructured text, predictive analytics and much more.

10. NLTK

NLTK is a popular Python library for language processing. It offers easy-to-use interfaces for a variety of NLP tasks such as text classification, tokenization, text parsing, semantic reasoning and much more. It is an open source, community-driven project, and has support for both Python 2 and Python 3.

11. spaCy

spaCy is another library for advanced natural language processing, based on Python and Cython. It has extensive support for various deep learning libraries and frameworks such as TensorFlow and PyTorch. With spaCy, you can build complex statistical models for NLP with relative ease. spaCy is easy to install and use, and proves to be of great help when it comes to extracting and analyzing textual information at scale.

Editor's Tip: To know more about how these libraries are used for natural language processing, make sure you check out the book Natural Language Processing with Python Cookbook.

Data Visualization

Data visualization is a popular Data Science technique for visually analysing and communicating information and valuable business insights through graphs, charts, dashboards and reports. Python offers a lot of popular libraries for effective data storytelling. Some of them are listed below:

12. matplotlib

matplotlib is the most popular Python library for data visualization, allowing for enterprise-grade 2D and 3D plotting. With matplotlib, you can build different kinds of visualizations such as histograms, bar charts, scatter plots and much more with just a few lines of code. The popularity of matplotlib rivals that of R's highly acclaimed ggplot2, and deciding which library is better has been a hot topic of debate for many years now. matplotlib runs seamlessly on all Python consoles, including IPython and Jupyter notebooks, giving you all the necessary tools to create and share your data visualizations with others.

Editor's Tip: Get started with matplotlib today, with the help of Matplotlib 2.x By Example.

13. Seaborn

Seaborn is a Python-based data visualization library which finds its roots in matplotlib. Apart from offering attractive and insightful data visualizations, Seaborn also offers strong support for other Python libraries such as NumPy and pandas. Per the official Seaborn page: "If matplotlib 'tries to make easy things easy and hard things possible', seaborn tries to make a well-defined set of hard things easy too."

14. Bokeh

Bokeh is an interactive data visualization library based on Python. It aims to provide D3.js-style elegant graphics and visualizations and runs primarily on modern web browsers. Apart from the ability to create a wide variety of visualizations, Bokeh also supports large-scale interactivity and visualizations of real-time datasets.

15. Plotly

Plotly is a Python library used across the world for making publication-quality plots and graphs. With Plotly, you can build interactive dashboards, scatter plots, histograms, candlestick charts, heat maps, and a whole host of other data visualizations with ease. With superior interactivity, deployment and publication capabilities, Plotly is used across different domains - most notably the finance and geospatial industries - for effective data storytelling.

So there you have it! Python has an extensive suite of libraries for every data science related task, each equipped with unique features to make the task fast and hassle-free. While there are a lot more Python libraries out there, we cherry-picked these 15 based on their popularity, usefulness and the value they bring to the table. Also, the extensive community support for Python means you can get help for any kind of problem you might come across while using these tools. It's time now for you to go out there and crunch some data with some of these Python-powered libraries!


What does a data science team look like?

Fatema Patrawala
21 Nov 2019
11 min read
Until a couple of years ago, people barely knew the term 'data science' which has now evolved into an extremely popular career field. The Harvard Business Review dubbed data scientist within the data science team as the sexiest job of the 21st century and expert professionals jumped on the data is the new oil bandwagon. As per the Figure Eight Report 2018, which takes the pulse of the data science community in the US, a lot has changed rapidly in the data science field over the years. For the 2018 report, they surveyed approximately 240 data scientists and found out that machine learning projects have multiplied and more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs. And the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, it is evident for data science teams to evolve and change among various organizations. The data science team is responsible for delivering complex projects where system analysis, software engineering, data engineering, and data science is used to deliver the final solution. To achieve all of this, the team does not only have a data scientist or a data analyst but also includes other roles like business analyst, data engineer or architect, and chief data officer. In this post, we will differentiate and discuss various job roles within a data science team, skill sets required and the compensation benefit for each one of them. For an in-depth understanding of data science teams, read the book, Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps.  Now let's get into understanding individual data science roles and functions, but before that we take a look at the structure of the team.There are three basic team structures to match different stages of AI/ML adoption: IT centric team structure At times for companies hiring a data science team is not an option, and they have to leverage in-house talent. During such situations, they take advantage of the fully functional in-house IT department. The IT team manages functions like data preparation, training models, creating user interfaces, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold. The documentation helps in fast training and early deployment of models even without an expert data scientists on board. Integrated team structure Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resource is the most viable option for constant and scalable machine learning operations. Unlike the IT centric approach, the integrated method requires having an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques. 
Additionally, the team leverages deeper understanding of machine learning tools and libraries – like TensorFlow or Theano which are specifically for researchers and data science experts. Specialized data science team Companies can also have an independent data science department to build an all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. It doesn't necessarily mean that all team members should have a data science background, but they should have technology background with certain service management skills. A specialized structure model aids in addressing complex data science tasks that include research, use of multiple ML models tailored to various aspects of decision-making, or multiple ML backed services. Today's most successful Silicon Valley tech operates with specialized data science teams. Additionally they are custom-built and wired for specific tasks to achieve different business goals. For example, the team structure at Airbnb is one of the most interesting use cases. Martin Daniel, a data scientist at Airbnb in this talk explains how the team emphasizes on having an experimentation-centric culture and apply machine learning rigorously to address unique product challenges. Job roles and responsibilities within data science team As discussed earlier, there are many roles within a data science team. As per Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis. Individuals involved in Type A are statisticians that make sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc. Type B stands for building. These individuals use data in production. They're good software engineers with strong programming knowledge and statistics background. They build recommendation systems, personalization use cases, etc. Though it is rare that one expert will fit into a single category. But understanding these data science functions can help make sense of the roles described further. Chief data officer/Chief analytics officer The chief data officer (CDO) role has been taking organizations by storm. A recent NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization appointed a chief data officer. The role of chief data officer involves overseeing a range of data-related functions that may include data management, ensuring data quality and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, it is often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and how a company should treat its data. It does make sense as analytics provide insights and value to the data. Hence, with a CDO+CAO combination companies can take advantage of a good data strategy and proper data management without losing on quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually. 
Skill sets required: Data science and analytics, programming skills, domain expertise, leadership and visionary abilities are required. Data analyst The data analyst role implies proper data collection and interpretation activities. The person in this job role will ensure that collected data is relevant and exhaustive while also interpreting the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States. Skill sets required: Programming languages like R, Python, JavaScript, C/C++, SQL. With this critical thinking, data visualization and presentation skills will be good to have. Data scientist Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore what problems are needed to be solved. A data scientist is an individual who develops machine learning models to make predictions and is well versed in algorithm development and computer science. This person will also know the complete lifecycle of the model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data, using various types of analytics and reporting tools to detect patterns, trends and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709. Skills set required: A data scientist will require knowledge of big data platforms and tools like  Seahorse powered by Apache Spark, JupyterLab, TensorFlow and MapReduce; and programming languages that include SQL, Python, Scala and Perl; and statistical computing languages, such as R. They should also have cloud computing capabilities and knowledge of various cloud platforms like AWS, Microsoft Azure etc.You can also read this post on how to ace a data science interview to know more. Machine learning engineer At times a data scientist is confused with machine learning engineers, but a machine learning engineer is a distinct role that involves different responsibilities. A machine learning engineer is someone who is responsible for combining software engineering and machine modeling skills. This person determines which model to use and what data should be used for each model. Probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer's salary is $146,085 in the US, and is ranked No.1 on the Indeed's Best Jobs in 2019 list. Skill sets required: Machine learning engineers will be required to have expertise in computer science and programming languages like R, Python, Scala, Java etc. They would also be required to have probability techniques, data modelling and evaluation techniques. Data architects and data engineers The data architects and data engineers work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer can use to build a digital framework. The data engineering role has recently evolved from the traditional software-engineering field.  
Recent enterprise data management experiments indicate that data-focused software engineers need to work along with data architects to build a strong data architecture. The average salary for a data architect in the US ranges from $122,000 to $129,000 annually, as per a recent LinkedIn survey.

Skill sets required: A data architect or engineer should have a keen interest in, and experience with, programming languages and frameworks like HTML5, RESTful services, Spark, Python, Hive, Kafka and CSS. They should also have the knowledge and experience to handle database technologies such as PostgreSQL, MapReduce and MongoDB, and visualization platforms such as Tableau and Spotfire.

Business analyst

A business analyst (BA) essentially handles the chief analytics officer's role, but at the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap. They are responsible for using data analytics to assess processes, determine requirements and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes will be implemented to processes, products, services, software and hardware. They further articulate these ideas and balance them against what is technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed.

Skill sets required: Excellent domain and industry expertise. Good communication skills, data visualization skills and knowledge of business intelligence tools are also good to have.

Data visualization engineer

This specific role is not present in every data science team, as some of its responsibilities are covered by either a data analyst or a data architect; hence, this role is only necessary for a specialized data science model. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for your stakeholders. Regardless of the technology, successful data visualization engineers have to understand the principles of design, both graphical and, more generally, user-centered design. As per PayScale, the average salary for a data visualization engineer is $98,264.

Skill sets required: A data visualization engineer needs to have rigorous knowledge of data visualization methods and be able to produce various charts and graphs to represent data. Additionally, they must understand the fundamentals of design principles and the visual display of information.

To sum it up, the data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building the team from scratch and find it hard to figure out where to start. If you are facing a similar dilemma, check out the book Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing and sustaining a successful data science team.

Read More

How to learn data science: from data mining to machine learning
How to ace a data science interview
Data science vs. machine learning: understanding the difference and what it means today
30 common data science terms explained
9 Data Science Myths Debunked


Is Python edging R out in the data science wars?

Amey Varangaonkar
28 Aug 2017
7 min read
When it comes to the ‘lingua franca’ of data science, there seems to be a face-off between R and Python. R has long been established as the language of researchers and statisticians but Python has come up quickly as a bona-fide challenger, helping embed analytics as a necessity for businesses and other organizations in 2017. If  a tech war does exist between the two languages, it’s a battle fought not so much on technical features but instead on the wider changes within modern business and technology. R is a language purpose-built for statistics, for performing accurate and intensive analysis. So, the fact that R is being challenged by Python — a language that is flexible, fast, and relatively easy to learn — suggests we are seeing a change in who’s actually doing data science, where they’re doing it, and what they’re trying to achieve. Python versus R — A Closer Look Let’s make a quick comparison of the two languages on aspects important to those working with data and see what we can learn about the two worlds where R and Python operate. Learning curve Python is the easier language to learn. While R certainly isn’t impenetrable, Python’s syntax marks it as a great language to learn even if you’re completely new to programming. The fact that such an easy language would come to rival R within data science indicates the pace at which the field is expanding. More and more people are taking on data-related roles, possibly without a great deal of programming knowledge — Python makes the barrier to entry much lower than R. That said, once you get to grips with the basics of R, it becomes relatively easier to learn the more advanced stuff. This is why statisticians and experienced programmers find R easier to use. Packages and libraries Many R packages are in-built. Python, meanwhile, depends upon a range of external packages. This obviously makes R much more efficient as a statistical tool — it means that if you’re using Python you need to know exactly what you’re trying to do and what external support you’re going to need. Data Visualization R is well-known for its excellent graphical capabilities. This makes it easy to present and communicate data in varied forms. For statisticians and researchers, the importance of that is obvious. It means you can perform your analysis and present your work in a way that is relatively seamless. The ggplot2 package in R, for example, allows you to create complex and elegant plots with ease and as a result, its popularity in the R community has increased over the years. Python also offers a wide range of libraries which can be used for effective data storytelling. The breadth of external packages available with Python means the scope of what’s possible is always expanding. Matplotlib has been a mainstay of Python data visualization. It’s also worth remarking on upcoming libraries like Seaborn. Seaborn is a neat little library that sits on top of Matplotlib, wrapping its functionality and giving you a neater API for specific applications. So, to sum up, you have sufficient options to perform your data visualization tasks effectively — using either R or Python! Analytics and Machine Learning Thanks to libraries like scikit-learn, Python helps you build machine learning systems with relative ease. This takes us back to the point about barrier to entry. If machine learning is upending how we use and understand data, it makes sense that more people want a piece of the action without having to put in too much effort. 
But Python also has another advantage; it’s great for creating web services where data can be uploaded by different people. In a world where accessibility and data empowerment have never been more important (i.e., where everyone takes an interest in data, not just the data team), this could prove crucial. With packages such as caret, MICE, and e1071, R too gives you the power to perform effective machine learning in order to get crucial insights out of your data. However, R falls short in comparison to Python, thanks to the latter’s superior libraries and more diverse use-cases. Deep Learning Both R and Python have libraries for deep learning. It’s much easier and more efficient with Python though — most likely because the Python world changes much more quickly, new libraries and tools springing up as quickly as the data science world hooks on to a new buzzword. Theano, and most recently Keras and TensorFlow have all made a huge impact on making it relatively easy to build incredibly complex and sophisticated deep learning systems. If you’re clued-up and experienced with R it shouldn’t be too hard to do the same, using libraries such as MXNetR, deepr, and H2O — that said, if you want to switch models, you may need to switch tools, which could be a bit of a headache. Big Data With Python, you can write efficient MapReduce applications with ease, or scale your R program on Hadoop to work with petabytes of data. Both R and Python are equally good when it comes to working with Big Data, as they can be seamlessly integrated with Big Data tools such as Apache Spark and Apache Hadoop, among many others. It’s likely that it’s in this field that we’re going to see R moving more and more into industry as businesses look for a concise way to handle large datasets. This is true in industries such as bioinformatics which have a close connection with the academic world and necessarily depend upon a combination of size and accuracy when it comes to working with data. So, where does this comparison leave us? Ultimately, what we see are two different languages offering great solutions to very different problems in data science. In Python, we have a flexible and adaptable language with a vibrant community of developers working on a huge range of problems and tasks, each one trying to find more effective and more intelligent ways of doing things. In R, we have a purely statistical language with a large repository of over 8000 packages for data analysis and visualization. While Python is production-ready and is better suited for organizations looking to harness technical innovation to its advantage, R’s analytical and data visualization capabilities can make your life as a statistician or data analyst easier. Recent surveys indicate that Python commands a higher salary than R — that is because it’s a language that can be used across domains; a problem-solving language. That’s not to say that R isn’t a valuable language; rather, Python is the language that just seems to fit the times at the moment. In the end, it all boils down to your background, and the kind of data problems you want to solve. If you come from a statistics or research background and your problems only revolve around statistical analysis and visualization, then R would best fit your bill. However, if you’re a Computer Science graduate looking to build a general-purpose, enterprise-wide data model which can integrate seamlessly with the other business workflows, you will find Python easier to use. R and Python are two different animals. 
Instead of comparing the two, maybe it’s time we understood where and how each can be best used and then harnessed their power to the fullest to solve our data problems. One thing is for sure, though — neither is going away anytime soon. Both R and Python occupy a large chunk of the data science market-share today, and it will take a major disruption to take either one of them out of the equation completely.


Top 5 programming languages for crunching Big Data effectively

Amey Varangaonkar
04 Apr 2018
8 min read
One of the most important decisions Big Data professionals have to make, especially the ones who are new to the scene or are just starting out, is choosing the best programming language for big data manipulation and analysis. Understanding the Big Data problem and framing the architecture to solve it is not quite enough these days - the execution needs to be perfect as well, and choosing the right language goes a long way.

The best languages for big data

In this article, we look at 5 of the most popular - not to mention highly effective - programming languages for developing Big Data solutions.

Scala

A beautiful crossover of the object-oriented and functional programming paradigms, Scala is fast and robust, and a popular choice of language for many Big Data professionals. The fact that two of the most popular Big Data processing frameworks, Apache Spark and Apache Kafka, have been built on top of Scala tells you everything you need to know about the power of Scala. Scala runs on the JVM, which means code written in Scala can be easily used within a Java-based Big Data ecosystem. One significant factor that differentiates Scala from Java, though, is that Scala is a lot less verbose in comparison: logic that takes hundreds of lines of confusing-looking Java code can often be written in fewer than 15 lines of Scala. One negative aspect of Scala, though, is its steep learning curve when compared to languages like Go and Python, and this may put off beginners looking to use it.

Why use Scala for big data?
- Fast and robust
- Suitable for working with Big Data tools like Apache Spark for distributed Big Data processing
- JVM compliant, can be used in a Java-based ecosystem

Python

Python has been declared one of the fastest growing programming languages in 2018 as per the recently held Stack Overflow Developer Survey. Its general-purpose nature means it can be used across a broad spectrum of use-cases, and Big Data programming is one major area of application. Many libraries for data analysis and manipulation which are increasingly being used in a Big Data framework to clean and manipulate large chunks of data - such as pandas, NumPy and SciPy - are all Python-based. Not just that, most popular machine learning and deep learning frameworks such as scikit-learn, TensorFlow and many more are also written in Python and are finding increasing application within the Big Data ecosystem. One drawback of using Python, and a reason why it is not a first-class citizen when it comes to Big Data programming yet, is that it's slow. Although very easy to use, Big Data professionals have found systems built with languages such as Java or Scala faster and more robust than the systems built with Python. However, Python makes up for this limitation with other qualities. As Python is primarily a scripting language, interactive coding and development of analytical solutions for Big Data becomes very easy. Python can integrate effortlessly with existing Big Data frameworks such as Apache Hadoop and Apache Spark, allowing you to perform predictive analytics at scale without any problem - the short sketch after the list below gives a taste of what that looks like.

Why use Python for big data?
- General-purpose
- Rich libraries for data analysis and machine learning
- Easy to use
- Supports iterative development
- Rich integration with Big Data tools
- Interactive computing through Jupyter notebooks
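Here is a minimal PySpark sketch of that Python-Spark integration; the file path and column names are placeholders, so adapt them to your own data:

```python
# A minimal PySpark sketch: load a CSV from HDFS and run a simple aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# Placeholder path and columns -- replace with your own dataset
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# Count events and average value per category, distributed across the cluster
summary = (df.groupBy("category")
             .agg(F.count("*").alias("events"),
                  F.avg("value").alias("avg_value")))
summary.show()

spark.stop()
```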
R

It won't come as a surprise to many that those who love statistics love R. The 'language of statistics', as it is popularly called, R is used to build data models for effective and accurate data analysis. Powered by a large repository of R packages (CRAN, also called the Comprehensive R Archive Network), with R you have just about every type of tool to accomplish any task in Big Data processing - right from analysis to data visualization. R can be integrated seamlessly with Apache Hadoop and Apache Spark, among other popular frameworks, for Big Data processing and analytics. One issue with using R as a programming language for Big Data is that it is not very general-purpose. This means the code written in R is not production-deployable and generally has to be translated into some other programming language such as Python or Java. That said, if your goal is only to build statistical models for Big Data analytics, R is an option you should definitely consider.

Why use R for big data?
- Built for data science
- Support for Hadoop and Spark
- Strong statistical modeling and visualization capabilities
- Support for Jupyter notebooks

Java

Then there's always good old Java. Some of the traditional Big Data frameworks such as Apache Hadoop, and all the tools within its ecosystem, are Java-based and still in use today in many enterprises. Not to mention the fact that Java is the most stable and production-ready language among all the languages we have discussed so far! Using Java to develop your Big Data applications gives you the ability to use a large ecosystem of tools and libraries for interoperability, monitoring and much more, most of which have already been tried and tested. One major drawback of Java is its verbosity. The fact that you have to write hundreds of lines of code in Java for a task which can be written in barely 15-20 lines of code in Python or Scala can turn off many budding programmers. However, the introduction of lambda functions in Java 8 does make life quite a bit easier. Java also does not support iterative development, unlike newer languages like Python, and this is an area of focus for future Java releases. Despite the flaws, Java remains a strong contender when it comes to the preferred language for Big Data programming because of its history and the continued reliance on traditional Big Data tools and frameworks.

Why use Java for big data?
- Traditional Big Data tools and frameworks are written in Java
- Stable and production-ready
- Large ecosystem of tried and tested tools and libraries

Go

Last but not least, there's Go - one of the fastest rising programming languages in recent times. Designed by a group of Google engineers who were frustrated with C++, Go is a good shout for this list simply because it powers so many of the tools used in Big Data infrastructure, including Kubernetes, Docker and many more. Go is fast, easy to learn, and fairly easy to develop applications with, not to mention deploy them. More importantly, as businesses look at building data analysis systems that can operate at scale, Go-based systems are being used to integrate machine learning and parallel processing of data. It is also possible to interface other languages with Go-based systems with relative ease.

Why use Go for big data?
- Fast, easy to use
- Many tools used in the Big Data infrastructure are Go-based
- Efficient distributed computing

There are a few other languages you might want to consider - Julia, SAS and MATLAB being some major ones which are useful in their own right.
However, when compared to the languages we talked about above, we thought they fell a bit short in some aspects - be it speed, efficiency, ease of use, documentation, or community support, among other things. As a quick recap, we weighed the five languages above - Scala, Python, R, Java and Go - against eight criteria: speed, ease of use, quick learning curve, data analysis capability, general-purpose applicability, Big Data support, interfacing with other languages, and production-readiness. Big Data support is the one criterion on which all five score well; on the rest, each language has the particular strengths discussed above. This is just our view, and that's not to say that the other languages are any worse!

So... which language should you choose?

To answer the question in short - it all depends on the use-case you want to develop. If your focus is hardcore data analysis which involves a lot of statistical computing, R would be your go-to language. On the other hand, if you want to develop streaming applications for your Big Data, Scala can be a preferable choice. If you wish to use Machine Learning to leverage your Big Data and build predictive models, Python will come to your rescue. Lastly, if you plan to build Big Data solutions using just the traditionally-available tools, Java is the language for you. You also have the option of combining the power of two languages to get a more efficient and powerful solution. For example, you can train your machine learning model in Python and deploy it on Spark in distributed mode. Ultimately, it all depends on how efficiently your solution can function, and more importantly, how fast and accurate it is. Which language do you prefer for crunching your Big Data? Do let us know!


Say hello to Streaming Analytics

Amey Varangaonkar
05 Oct 2017
5 min read
In this data-driven age, businesses want fast, accurate insights from their huge data repositories in the shortest time span — and in real time when possible. These insights are essential — they help businesses understand relevant trends, improve their existing processes, enhance customer satisfaction, improve their bottom line, and most importantly, build, and sustain their competitive advantage in the market.   Doing all of this is quite an ask - one that is becoming increasingly difficult to achieve using just the traditional data processing systems where analytics is limited to the back-end. There is now a burning need for a newer kind of system where larger, more complex data can be processed and analyzed on the go. Enter: Streaming Analytics Streaming Analytics, also referred to as real-time event processing, is the processing and analysis of large streams of data in real-time. These streams are basically events that occur as a result of some action. Actions like a transaction or a system failure, or a trigger that changes the state of a system at any point in time. Even something as minor or granular as a click would then constitute as an event, depending upon the context. Consider this scenario - You are the CTO of an organization that deals with sensor data from wearables. Your organization would have to deal with terabytes of data coming in on a daily basis, from thousands of sensors. One of your biggest challenges as a CTO would be to implement a system that processes and analyzes the data from these sensors as it enters the system. Here’s where streaming analytics can help you by giving you the ability to derive insights from your data on the go. According to IBM, a streaming system demonstrates the following qualities: It can handle large volumes of data It can handle a variety of data and analyze it efficiently — be it structured or unstructured, and identifies relevant patterns accordingly It can process every event as it occurs unlike traditional analytics systems that rely on batch processing Why is Streaming Analytics important? The humongous volume of data that companies have to deal with today is almost unimaginable. Add to that the varied nature of data that these companies must handle, and the urgency with which value needs to be extracted from this data - it all makes for a pretty tricky proposition. In such scenarios, choosing a solution that integrates seamlessly with different data sources, is fine-tuned for performance, is fast, reliable, and most importantly one that is flexible to changes in technology, is critical. Streaming analytics offers all these features - thereby empowering organizations to gain that significant edge over their competition. Another significant argument in favour of streaming analytics is the speed at which one can derive insights from the data. Data in a real-time streaming system is processed and analyzed before it registers in a database. This is in stark contrast to analytics on traditional systems where information is gathered, stored, and then the analytics is performed. Thus, streaming analytics supports much faster decision-making than the traditional data analytics systems. Is Streaming Analytics right for my business? Not all organizations need streaming analytics, especially those that deal with static data or data that hardly change over longer intervals of time, or those that do not require real-time insights for decision-making.   For instance, consider the HR unit of a call centre. 
It is sufficient and efficient to use a traditional analytics solution to analyze thousands of past employee records, rather than run them through a streaming analytics system. On the other hand, the same call centre can find real value in applying streaming analytics to something like a real-time customer log monitoring system - one where customer interactions and context-sensitive information are processed on the go. This can help the organization find opportunities to provide unique customer experiences, improve its customer satisfaction score, and unlock a whole host of other benefits. Streaming analytics is slowly finding adoption in a variety of domains where companies are looking to get that crucial competitive advantage - sensor data analytics, mobile analytics and business activity monitoring being some of them. With the rise of the Internet of Things, data from IoT devices is also increasing exponentially; streaming analytics is the way to go here as well. In short, streaming analytics is ideal for businesses dealing with time-critical missions and those working with continuous streams of incoming data, where decision-making has to be instantaneous. Companies that obsess about real-time monitoring of their businesses will also find streaming analytics useful - just integrate your dashboards with your streaming analytics platform!

What next?

It is safe to say that with time, the amount of information businesses will manage is going to rise exponentially, and so will its variety. As a result, it will get increasingly difficult to process volumes of unstructured data and gain insights from them using just the traditional analytics systems. Adopting streaming analytics into the business workflow will therefore become a necessity for many businesses. Apache Flink, Spark Streaming, Microsoft's Azure Stream Analytics, SQLstream Blaze, Oracle Stream Analytics and SAS Event Processing are all good places to begin your journey through the fleeting world of streaming analytics. You can browse through this list of learning resources from Packt to know more:

Learning Apache Flink
Learning Real Time processing with Spark Streaming
Real Time Streaming using Apache Spark Streaming (video)
Real Time Analytics with SAP Hana
Real-Time Big Data Analytics
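To make the idea of processing events as they arrive a little more concrete, here is a minimal sketch using Spark Structured Streaming (one of the engines mentioned above). The socket source and word-count logic are purely illustrative; in practice the stream would typically come from something like Kafka or a sensor gateway.

```python
# A minimal Spark Structured Streaming sketch: count words arriving on a TCP socket.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()

# Placeholder source: a local socket you could feed with `nc -lk 9999`
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print updated counts as new events arrive
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```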


Why is Hadoop dying?

Aaron Lazar
23 Apr 2018
5 min read
Hadoop has been the definitive big data platform for some time; the name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful?

Hadoop was essential to the growth of big data

When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to data, as against bringing data to compute. Organisations had the opportunity to scale their data without having to worry too much about the cost. It obviously had initial hiccups with security, the complexity of querying and querying speeds, but all of that was taken care of in the long run. Still, although querying speeds remained quite a pain, that wasn't the real reason behind Hadoop's slow decline.

As cloud grew, Hadoop started falling

One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. The cloud vendor market was pretty crowded, and each vendor provided its own big data processing services. These services all basically did what Hadoop was doing, but they also did it in an even more efficient and hassle-free way. Customers didn't have to think about administration, security or maintenance in the way they had to with Hadoop.

One person's big data is another person's small data

Well, this is clearly a fact. Several organisations that used big data technologies without really gauging the amount of data they actually needed to process have suffered. Imagine sitting on 10TB Hadoop clusters when you don't have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their heavy reliance on Hadoop. Customers weren't pleased with the nature of Hadoop's limitations.

Apache Hadoop vs Apache Spark

Hadoop processing is way behind in terms of processing speed. In 2014, Spark took the world by storm, and you can guess which way the interest curves for the two have been heading since. Spark was a general-purpose, easy-to-use platform built after studying the pitfalls of Hadoop. Spark was not bound to just HDFS (the Hadoop Distributed File System), which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes - a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas where Spark wins are iterative algorithms in machine learning, interactive data mining and data processing, stream processing, sensor data processing, and so on.

Machine Learning in Hadoop is not straightforward

Unlike MLlib in Spark, machine learning is not possible in Hadoop unless tied to a third-party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of three R packages, have grown for ML, but they are still nowhere comparable to the power of the modern-day MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe.
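For contrast, here is a minimal sketch of what fitting a model with Spark's MLlib looks like from Python - no third-party bolt-ons required. The tiny DataFrame and its columns are made up purely for illustration.

```python
# A minimal Spark MLlib sketch: fit a logistic regression on a tiny, made-up DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# Hypothetical rows: (age, prior_visits, label)
data = spark.createDataFrame(
    [(34.0, 1.0, 0), (51.0, 3.0, 1), (23.0, 0.0, 0), (60.0, 5.0, 1)],
    ["age", "prior_visits", "label"])

assembler = VectorAssembler(inputCols=["age", "prior_visits"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label") \
            .fit(assembler.transform(data))

print(model.coefficients, model.intercept)
spark.stop()
```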
Hadoop is not only Hadoop

The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, HBase, Spark, Kudu, Impala, and 20 other products. While folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other. As for Cloudera and Hortonworks, though the market has seen a downward trend, they're in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark.

Is Hadoop dying? Perhaps not...

In the long run, it's not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much-improved version of the framework. Some of its most noteworthy features are an improved shell script, a more powerful YARN, improved fault tolerance with erasure coding, and many more. Although that hasn't caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use another alternative like Spark along with another framework from the Hadoop family. So, Hadoop's not going away anytime soon.

Read More

Pandas is an effective tool to explore and analyze data - Interview insights

Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions

Guest Contributor
20 Dec 2017
8 min read
Editor's note: We bring to you another guest post by Benjamin Rogojan, on using logistic regression to help the healthcare sector reduce patient readmissions. Ben's previous post on ensemble methods to optimize machine learning models is also available for a quick read here.

ER visits are not cheap for any party involved, whether that be the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are due to a lack of intervention for problems such as substance abuse, chronic diseases and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by playing a role in the overcrowding of Emergency Departments (EDs). Research teams at UW and other universities are partnering with companies like KenSci to figure out how to approach the problem of reducing readmission rates. The ability to predict the likelihood of a patient's readmission will allow for targeted intervention, which in turn will help reduce the frequency of readmissions - making the population healthier and hopefully reducing the estimated 41.3 billion USD in healthcare costs for the entire system.

How do they plan to do it? With big data and statistics, of course. A plethora of algorithms are available for data scientists to use to approach this problem. Many possible variables could affect readmission and medical costs, and there are also many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have been heavily focused on reducing the readmission rate simply by trying to calculate whether a person would or would not be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have random flare-ups that require immediate attention, and being able to predict if a patient will have an ER visit can lead to managing the cause more effectively.

One approach taken by the data science team at UW, as well as the Department of Family and Community Medicine at the University of Toronto, was to utilize logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such, logistic regression has, in my experience, been a useful model for approaching this problem.

Logistic Regression to predict patient readmissions

Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other data algorithms? Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, and so on. It is used for analyzing data and produces a binary classification based on one or many independent variables. This means it produces two clear classifications (Yes or No, 1 or 0, etc.). With the example above, the binary classification would be: is the patient readmitted or not? Other examples could be whether to give a customer a loan or not, whether a medical claim is fraudulent or not, or whether a patient has diabetes or not. Despite its name, logistic regression does not provide the same output as linear regression (per se). There are some similarities; for instance, the linear model is still visible in the equation below, where you see something very similar to a linear equation. But the final output is based on the log odds.
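For reference, the model being described is the standard logit (log-odds) form; writing π for the probability that the patient is readmitted, it reads:

\[ \log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + x \cdot \beta \]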
Linear regression and multivariate regression both take one to many independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person's age, or the cost of a product an e-commerce site should display to each customer. The output is not limited to only a few discrete classifications. Logistic regression, by contrast, produces discrete classifications. For instance, an algorithm using logistic regression could be used to classify whether a certain stock would trade at >$50 a share or <$50 a share. Linear regression would be used to predict whether a share would be worth $50.01, $50.02….etc.
Logistic regression is a calculation that uses the odds of a certain classification. In the equation above, the symbol you might know as pi actually represents the probability of the event, from which the odds are derived. To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear decision boundary: whenever the linear predictor β0 + x · β corresponds to a probability p < 0.5, we predict Y = 0. By generating coefficients that help predict the logit transformation, the method allows us to classify for the characteristic of interest.
Now that is a lot of complex math mumbo jumbo. Let's try to break it down into simpler terms.
Probability vs. Odds
Let's start with probability. Say a patient has a probability of 0.6 of being readmitted. Then the probability that the patient won't be readmitted is 0.4. Now, we want to take this and convert it into odds. This is what the formula above is doing. You would take 0.6/0.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability was 0.5 for both being readmitted and not being readmitted, then the odds would be 1:1. The next step in the logistic regression model is to take the odds and get the "log odds". You do this by taking the natural log of 1.5, which gives roughly 0.41 (rounded).
In logistic regression, we don't actually know p. That is what we are essentially trying to find and model using various coefficients and input variables. Each input provides a value that changes how much more likely an event will or will not occur. All of these coefficients are used to calculate the log odds. This model can take multiple variables like age, sex, height, etc. and specify how much of an effect each variable has on the odds that an event will occur.
Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer to translating it into action? Some of us like to say the "computers" are the easy part. Personally, I find the hard part to be the "people". After all, at the end of the day, it comes down to business value. Will an algorithm save money or not? That means it has to be applied in real life. This could take the form of a new initiative, strategy, product recommendation, etc. You need to find the outliers that are worth going after!
For instance, if we go back to the patient readmission example: the algorithm points out patients with high probabilities of being readmitted. However, if the readmission costs are low, they will probably be ignored, sadly. That is how businesses (including hospitals) look at problems.
Logistic regression is a great tool for binary classification. It is unlike many other algorithms that estimate continuous variables or estimate distributions.
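To make the probability-to-odds-to-log-odds arithmetic concrete, here is a minimal Python sketch. It uses a toy, made-up readmission dataset (the feature values and column meanings are purely illustrative, not taken from any study mentioned above) and scikit-learn's LogisticRegression, which is one common way to fit such a model.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Probability -> odds -> log odds, as described above
p = 0.6                      # probability of readmission
odds = p / (1 - p)           # 0.6 / 0.4 = 1.5
log_odds = np.log(odds)      # natural log of 1.5 ~= 0.41
print(f"odds = {odds:.2f}, log odds = {log_odds:.2f}")

# Toy readmission data: [age, number of chronic conditions] per patient
# (values are illustrative only)
X = np.array([[45, 1], [62, 3], [70, 4], [33, 0], [58, 2], [80, 5]])
y = np.array([0, 1, 1, 0, 0, 1])   # 1 = readmitted, 0 = not readmitted

model = LogisticRegression()
model.fit(X, y)

# The model outputs a probability; >= 0.5 maps to the class Y = 1
new_patient = np.array([[65, 3]])
prob = model.predict_proba(new_patient)[0, 1]
print(f"predicted readmission probability: {prob:.2f}")
print(f"predicted class: {int(prob >= 0.5)}")

The fitted coefficients in model.coef_ are the per-feature contributions to the log odds, which ties directly back to the equation discussed above.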
This statistical method can be used to classify whether a person is likely to get cancer because of environmental variables such as proximity to a highway, smoking habits, and so on. It has been used effectively in the medical, financial, and insurance industries for a while. Knowing when to use which algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees. Using logistic regression gives healthcare institutions the opportunity to accurately target at-risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals.
[box type="shadow" align="" class="" width=""] About the Author Benjamin Rogojan Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems, both solo and with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/box]


Two popular Data Analytics methodologies every data professional should know: TDSP & CRISP-DM

Amarabha Banerjee
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This is a book excerpt taken from Advanced Analytics with R and Tableau authored by Jen Stirrup & Ruben Oliva Ramos. This book will help you make quick, cogent, and data driven decisions for your business using advanced analytical techniques on Tableau and R.[/box] Today we explore popular data analytics methods such as Microsoft TDSP Process and the CRISP- DM methodology. Introduction There is an increasing amount of data in the world, and in our databases. The data deluge is not going to go away anytime soon! Businesses risk wasting the useful business value of information contained in databases, unless they are able to excise useful knowledge from the data. It can be hard to know how to get started. Fortunately, there are a number of frameworks in data science that help us to work our way through an analytics project. Processes such as Microsoft Team Data Science Process (TDSP) and CRISP-DM position analytics as a repeatable process that is part of a bigger vision. Why are they important? The Microsoft TDSP Process and the CRISP-DM frameworks are frameworks for analytics projects that lead to standardized delivery for organizations, both large and small. In this chapter, we will look at these frameworks in more detail, and see how they can inform our own analytics projects and drive collaboration between teams. How can we have the analysis shaped so that it follows a pattern so that data cleansing is included? Industry standard methodologies for analytics There are a few main methodologies: the Microsoft TDSP Process and the CRISP-DM Methodology. Ultimately, they are all setting out to achieve the same objectives as an analytics framework. There are differences, of course, and these are highlighted here. CRISP-DM and TDSP focus on the business value and the results derived from analytics projects. Both of these methodologies are described in the following sections. CRISP-DM One common methodology is the CRISP-DM methodology (the modeling agency). The Cross Industry Standard Process for Data Mining or (CRISP-DM) model as it is known, is a process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions. The process is loosely divided into six main phases. The phases can be seen in the following diagram: Initially, the process starts with a business idea and a general consideration of the data. Each stage is briefly discussed in the following sections. Business understanding/data understanding The first phase looks at the machine learning solution from a business standpoint, rather than a technical standpoint. The business idea is defined, and a draft project plan is generated. Once the business idea is defined, the data understanding phase focuses on data collection and familiarity. At this point, missing data may be identified, or initial insights may be revealed. This process feeds back to the business understanding phase. CRISP-DM model — data preparation In this stage, data will be cleansed and transformed, and it will be shaped ready for the modeling phase. CRISP-DM — modeling phase In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the data preparation phase in order to correct any unexpected issues. CRISP-DM — evaluation The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the business understanding phase. 
Otherwise, we may have built models that do not answer the business question.
CRISP-DM — deployment
The models are published so that the customer can make use of them. This is not the end of the story, however.
CRISP-DM — process restarted
We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.
CRISP-DM summary
CRISP-DM is the most commonly used framework for implementing machine learning projects specifically, and it applies to analytics projects as well. It has a good focus on the business understanding piece. However, one major drawback is that the model no longer seems to be actively maintained. The official site, CRISP-DM.org, is no longer being maintained. Furthermore, the framework itself has not been updated to address issues in working with new technologies, such as big data. Big data technologies mean that there can be additional effort spent in the data understanding phase, for example, as the business grapples with the additional complexities involved in the shape of big data sources.
Team Data Science Process
The TDSP process model provides a dynamic framework for machine learning solutions that have been through a robust process of planning, producing, constructing, testing, and deploying models. Here is an example of the TDSP process:
The process is loosely divided into four main phases: Business Understanding, Data Acquisition and Understanding, Modeling, and Deployment. The phases are described in the following paragraphs.
Business understanding
The business understanding process starts with a business idea, which is solved with a machine learning solution. The business idea is defined from the business perspective, and possible scenarios are identified and evaluated. Ultimately, a project plan is generated for delivering the solution.
Data acquisition and understanding
Following on from the business understanding phase is the data acquisition and understanding phase, which concentrates on familiarity and fact-finding about the data. The process itself is not completely linear; the output of the data acquisition and understanding phase can feed back to the business understanding phase, for example. At this point, some of the essential technical pieces start to appear, such as connecting to data, and the integration of multiple data sources. From the user's perspective, there may be actions arising from this effort. For example, it may be noted that there is missing data in the dataset, which requires further investigation before the project proceeds further.
Modeling
In the modeling phase of the TDSP process, the R model is created, built, and verified against the original business question. In light of the business question, the model needs to make sense. It should also add business value, for example, by performing better than the existing solution that was in place prior to the new R model. This stage also involves examining key metrics in evaluating our R models, which need to be tested to ensure that the models meet the original business objectives set out in the initial business understanding phase.
Deployment
R models are published to production once they are proven to be a fit solution to the original business question. This phase involves the creation of a strategy for ongoing review of the R model's performance as well as a monitoring and maintenance plan. It is recommended to carry out a recurrent evaluation of the deployed models.
The models will live in a fluid, dynamic world of data and, over time, this environment will impact their efficacy. The TDSP process is a cycle rather than a linear process, and it does not finish even when the model is deployed. It provides a clear structure for you to follow throughout the data science process, and it facilitates teamwork and collaboration along the way.
TDSP Summary
The data science unicorn does not exist; that is, the person who is equally skilled in all areas of data science, right across the board. In order to ensure successful projects where each team player contributes according to their skill set, the Team Data Science Process is a team-oriented methodology that emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver data science projects. It also offers useful information on the importance of having standardized source control and backups, which can include open source technology.
If you liked our post, please be sure to check out Advanced Analytics with R and Tableau, which shows how to apply various data analytics techniques in R and Tableau across the different stages of a data science project highlighted in this article.


Python Data Stack

Akram Hussain
31 Oct 2014
3 min read
The Python programming language has grown significantly in popularity and importance, both as a general programming language and as one of the most advanced providers of data science tools. There are 6 key libraries every Python analyst should be aware of, and they are:
1 - NumPy
Also known as Numerical Python, NumPy is an open source Python library used for scientific computing. NumPy gives both speed and higher productivity using arrays and matrices. This basically means it's super useful when analyzing basic mathematical data and calculations. This was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPy is that it takes care of all your mathematical problems with useful functions that are cleaner and faster to write than normal Python code. This is all thanks to its similarities with the C language.
2 - SciPy
Also known as Scientific Python, SciPy is built on top of NumPy and takes scientific computing to another level. It is an advanced form of NumPy and allows users to carry out tasks such as differential equation solving, special functions, optimization, and integration. SciPy can be viewed as a library that saves time and has predefined complex algorithms that are fast and efficient. However, there is a plethora of SciPy tools that might confuse users more than help them.
3 - Pandas
Pandas is a key data manipulation and analysis library in Python. Pandas' strengths lie in its ability to provide rich data functions that work amazingly well with structured data. There have been a lot of comparisons between pandas and R packages due to their similarities in data analysis, but the general consensus is that it is very easy for anyone using R to migrate to pandas, as it supposedly executes the best features of R and Python programming all in one.
4 - Matplotlib
Matplotlib is a visualization powerhouse for Python programming, and it offers a large library of customizable tools to help visualize complex datasets. Providing appealing visuals is vital in the fields of research and data analysis. Python's 2D plotting library is used to produce plots and make them interactive with just a few lines of code. The plotting library additionally offers a range of graphs including histograms, bar charts, error charts, scatter plots, and much more.
5 - scikit-learn
scikit-learn is Python's most comprehensive machine learning library and is built on top of NumPy and SciPy. One of the advantages of scikit-learn is the all-in-one resource approach it takes, which contains various tools to carry out machine learning tasks, such as supervised and unsupervised learning.
6 - IPython
IPython makes life easier for Python developers working with data. It's a great interactive web notebook that provides an environment for exploration with prewritten Python programs and equations. The ultimate goal behind IPython is improved efficiency, by allowing scientific computation and data analysis to happen concurrently using multiple third-party libraries.
Continue learning Python with a fun (and potentially lucrative!) way to use decision trees. Read on to find out more.
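Before moving on, here is a minimal sketch of how a few of these libraries fit together in practice. The data is generated on the fly purely for demonstration; it uses NumPy for the raw arrays, pandas for tabular manipulation, and Matplotlib for a quick plot.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: generate some illustrative numerical data
ages = np.random.randint(18, 80, size=100)
spend = ages * 2.5 + np.random.normal(0, 20, size=100)

# pandas: wrap the arrays in a DataFrame for easy manipulation
df = pd.DataFrame({"age": ages, "spend": spend})
print(df.describe())

# Matplotlib: visualize the relationship
plt.scatter(df["age"], df["spend"])
plt.xlabel("Age")
plt.ylabel("Spend")
plt.title("Illustrative age vs. spend data")
plt.show()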


Top 4 chatbot development frameworks for developers

Sugandha Lahoti
20 Oct 2017
8 min read
The rise of the bots is nigh! If you can imagine a situation involving a dialog, there is probably a chatbot for that. Just look at the chatbot market - text-based email/SMS bots, voice-based bots, bots for customer support, transaction-based bots, entertainment bots and many others. A large number of enterprises, from startups to established organizations, are seeking to invest in this sector. This has also led to an increase in the number of platforms used for chatbot building. These frameworks incorporate AI techniques along with natural language processing capabilities to assist developers in building and deploying chatbots. Let’s start with how a chatbot typically works before diving into some of the frameworks. Understand: The first step for any chatbot is to understand the user input. This is made possible using pattern matching and intent classification techniques. ‘Intents’ are the tasks that users might want to perform with a chatbot. Machine learning, NLP and speech recognition techniques are typically used to identify the intent of the message and extract named entities. Entities are the specific pieces of information extracted from the user’s response i.e. the content associated with an intent. Respond: After understanding, the next goal is to generate a response. This is based on the current input message and the context of the conversation. After specifying the intents and entities, a dialog flow is constructed. This is basically the replies/feedback expected from a chatbot. Learn: Chatbots use AI techniques such as natural language understanding and pattern recognition to store and distinguish between the context of the information provided, and elicit a suitable response for future replies. This is important because different requests might have different meanings depending on previous requests. Top chatbot development frameworks A bot development framework is a set of predefined classes, functions, and utilities that a developer can use to build chatbots easier and faster. They vary in the level of complexity, integration capabilities, and functionalities. Let us look at some of the development platforms utilized for chatbot building. API.AI API.AI, a code based framework with a simple web-based interface, allows users to build engaging voice and text-based conversational apps using a large number of libraries and SDKs including Android, iOS, Webkit HTML5, Node.js, and Python API. It also supports nearly 32 one-click platform integrations such as Google, Facebook Messenger, Twitter and Skype to name a few. API.AI makes use of an agent - a container that transforms natural language based user requests into actionable data. The software tries to find the intent behind a user’s reply and matches it to the default or the closest match. After intent matching, it executes the actions and responses the developer has defined for that intent. API.AI also makes use of entities. Once the intents and entities are specified, the bot is trained. API.AI’s training module efficiently tracks each user’s request and lets developers see how they are parsed and matched to an intent. It also allows for correction of any errors and change requests thus retraining the bot. API.AI streamlines the entire bot-creating process by helping developers provide domain-specific knowledge that is unique to a bot’s needs while working on speech recognition, intent and context management in the backend. Google has recently partnered with API.AI to help them build conversational tools like Apple’s Siri. 
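Before moving on to the other frameworks, here is a deliberately toy Python sketch of the "understand" step described earlier: a keyword-based intent matcher with a simple entity extractor. It is not how API.AI or any of the platforms below actually work (they rely on trained NLP models rather than hand-written rules), and every intent name, keyword list, and the tracking-id format are made up purely for illustration.

import re

# Toy intent definitions: intent name -> trigger keywords (illustrative only)
INTENTS = {
    "track_package": ["track", "tracking", "package", "parcel"],
    "book_flight": ["flight", "book", "fly"],
}

def classify_intent(message):
    """Return the intent whose keywords overlap most with the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    scores = {name: len(words & set(kw)) for name, kw in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def extract_entities(message):
    """Pull out one simple entity: a tracking id such as 'AB123456' (made-up format)."""
    match = re.search(r"\b[A-Z]{2}\d{6}\b", message)
    return {"tracking_id": match.group()} if match else {}

msg = "Can you track my package AB123456?"
print(classify_intent(msg))   # -> track_package
print(extract_entities(msg))  # -> {'tracking_id': 'AB123456'}

Real frameworks replace the keyword lookup with statistical intent classification and trained entity recognizers, but the input-to-intent-to-entities flow is the same.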
Microsoft Bot Framework Microsoft Bot Framework allows building and deployment of chatbots across multiple platforms and services such as web, SMS, non-Microsoft platforms, Office 365, Skype etc. The Bot Framework includes two components - The Bot Builder and the Microsoft Cognitive Services. The Bot Builder comprises of two full-featured SDKs - for the.NET and the Node.js platforms along with an emulator for testing and debugging. There’s also a set of RESTful APIs for building code in other languages. The SDKs support features for simple and easy interactions between bots. They also have a large collection of prebuilt sample bots for the developer to choose from. The Microsoft Cognitive Services is a collection of intelligent APIs that simplify a variety of AI tasks such as allowing the system to understand and interpret the user's needs using natural language in just a few lines of code. These APIs allow integration to most modern languages and platforms and constantly improve, learn, and get smarter. Microsoft created the AI Inner Circle Partner Program to work hand in hand with industry to create AI solutions. Their only partner in the UK is ICS.AI who build conversational AI solutions for the UK's public sector. ICS are the first choice for many organisations due to their smart solutions that scale and serve to improve services for the general public. Developers can build bots in the Bot Builder SDK using C# or Node.js. They can then add AI capabilities with Cognitive Services. Finally, they can register the bots on the developer portal, connecting it to users across platforms such as Facebook and Microsoft Teams and also deploy it on the cloud like Microsoft Azure. For a step-by-step guide for chatbot building using Microsoft Bot Framework, you can refer to one of our books on the topic. Sabre Corporation, a customer service provider for travel agencies, have recently announced the development of an AI-powered chatbot that leverages Microsoft Bot Framework and Microsoft Cognitive Services. Watson Conversation IBM’s Watson Conversation helps build chatbot solutions that understand natural-language input and use machine learning to respond to customers in a way that simulates conversations between humans. It is built on a neural network of one million Wikipedia words. It offers deployment across a variety of platforms including mobile devices, messaging platforms, and robots. The platform is robust and secure as IBM allows users to opt out of data sharing. The IBM Watson Tone Analyzer service can help bots understand the tone of the user’s input for better management of the experience. The basic steps to create a chatbot using Watson Conversation are as follows. We first create a workspace - a place for configuring information to maintain separate intents, user examples, entities, and dialogues for each application. One workspace corresponds to one bot. Next, we create Intents. Watson Conversation makes use of multiple conditioned responses to distinguish between similar intents. For example, instead of building specific intents for locations of different places, it creates a general intent “location” and adds an entity to capture the response, like the “location- bedroom” - to the right, near the stairs, “location-kitchen”- to the left. The third step is entity establishment. This involves grouping entities that might trigger a similar response in the dialog. 
The dialog flow thus generated after specifying the intents and entities goes through testing, and is then embedded into an application. It is then connected with other services by using the conversation API. Staples, an office supply retailing firm, uses Watson Conversation in their "Easy Systems" to simplify the customer's shopping experience.
CXP Designer and Aspect NLU
Aspect Customer Experience Platform is an application lifecycle management tool to build text and voice-based applications such as chatbots. It provides deployment options across multiple communication channels like text, voice, mobile web and social media networks. The Aspect CXP typically includes a CXP designer to build chatbots and the inbuilt Aspect NLU to provide advanced natural language capabilities. CXP designer works by creating dialog objects to provide a menu of options for the frontend as well as the backend. Menu items for the frontend are used to create intents and modules within those intents. The developer can then modify labels (of those intents and modules) manually or use the Aspect NLU to disambiguate similar questions for successful extraction of meaning and intent. The Aspect NLU includes tools for spelling correction, linguistic lexicons such as nouns, verbs etc. and options for detecting and extracting common data types such as date, time, numbers, etc. It also allows a developer to modify the meaning extraction to suit their needs, if they want to. CXP designer also allows skipping of certain steps in chatbots. For instance, if the user has already provided their tracking id for a particular package, the chatbot will skip the prompt asking them for the tracking id again. With Aspect CXP, developers can create and deploy complex chatbots. Radisson Blu Edwardian, a hotel in London, has collaborated with Aspect Software to build an SMS-based, AI virtual host.
Conclusion
Another popular chatbot development platform worth mentioning is Facebook Messenger, with over 100,000 monthly active bots, but without cross-platform deployment features. The above bot frameworks are typically used by developers to build chatbots from scratch and require some programming skills. However, there has been a rise in automated bot development tools of late. Some of these include Chatfuel and Motion AI, and they typically involve drag-and-drop functionality. With such tools, beginners and non-programmers can create and deploy chatbots within a few minutes. But they lack the extended functionality supported by typical code-based frameworks, such as the flexibility to store data, produce analytics or incorporate customized AI tasks. Every chatbot development system, whether framework or tool, serves a different purpose. Choosing the right one depends on the type of application to build, organizational needs, and the developer's expertise.

NewSQL: What the hype is all about

Amey Varangaonkar
06 Nov 2017
6 min read
First, there was data. Data became database. Then came SQL. Next came NoSQL. And now comes NewSQL.
NewSQL Origins
For decades, the relational database, or SQL, was the reigning data management standard in enterprises all over the world. With the advent of Big Data and cloud-based storage rose the need for a faster, more flexible and scalable data management system, which didn't necessarily comply with SQL's ACID guarantees. This was popularly dubbed NoSQL, and databases like MongoDB, Neo4j, and others gained prominence in no time. We can attribute the emergence and eventual adoption of NoSQL databases to a couple of very important factors. The high costs and lack of flexibility of the traditional relational databases drove many SQL users away. Also, NoSQL databases are mostly open source, and their enterprise versions are comparatively cheaper too. They are schema-less, meaning they can be used to manage unstructured data effectively. In addition, they can scale well horizontally - i.e. you could add more machines to increase computing power and use it to handle high volumes of data. All these features of NoSQL come with an important tradeoff, however - these systems can't simultaneously ensure total consistency. Of late, there has been a rise in another type of database system, with the aim of combining 'the best of both worlds'. Popularly dubbed 'NewSQL', this system promises to combine the relational data model of SQL with the scalability and speed of NoSQL.
NewSQL - The dark horse in the databases race
NewSQL is 'SQL on steroids', say many. This is mainly because all NewSQL systems start with the relational data model and the SQL query language, but also incorporate the features that have led to the rise of NoSQL - addressing the issues of scalability, flexibility, and high performance. They offer the assurance of ACID transactions as in the relational models. However, what makes them really unique is that they allow the horizontal scaling functionality of NoSQL, and can process large volumes of data with high performance and reliability. This is why businesses really like the concept of NewSQL - the performance of NoSQL and the reliability and consistency of the SQL model, all packed in one. To understand what the hype surrounding NewSQL is all about, it's worth comparing NewSQL database systems with the traditional SQL and NoSQL database systems, and seeing where they stand out:
Characteristic | Relational (SQL) | NoSQL | NewSQL
ACID compliance | Yes | No | Yes
OLTP/OLAP support | Yes | No | Yes
Rigid schema structure | Yes | No | In some cases
Support for unstructured data | No | Yes | In some cases
Performance with large data | Moderate | Fast | Very fast
Performance overhead | Huge | Moderate | Minimal
Support from community | Very high | High | Low
As we can see from the table above, NewSQL really comes through as the best when you're dealing with larger datasets and want to lower performance overheads. To give you a practical example, consider an organization that has to work with a large number of short transactions, access a limited amount of data, but executes those queries repeatedly. For such an organization, a NewSQL database system would be a perfect fit. These features are leading to the gradual growth of NewSQL systems. However, it will take some time for more industries to adopt them.
Not all NewSQL databases are created equal
Today, one has a host of NewSQL solutions to choose from. Some popular solutions are Clustrix, MemSQL, VoltDB and CockroachDB.
Cloud Spanner, the latest NewSQL offering by Google, became generally available in February 2017 - indicating Google’s interest in the NewSQL domain and the value a NewSQL database can offer to their existing cloud offerings. It is important to understand that there are significant differences among these various NewSQL solutions. As such you should choose a NewSQL solution carefully after evaluating your organization’s data requirements and problems. As this article on Dataconomy points out, while some databases handle transactional workloads well, they do not offer the benefit of native clustering - SAP HANA is one such example. NuoDB focuses on cloud deployments, but its overall throughput is found to be rather sub-par. MemSQL is a suitable choice when it comes to clustered analytics but falls short when it comes to consistency. Thus, the choice of the database purely depends on the task you want to do, and what trade-offs you are ready to allow without letting it affect your workflow too much. DBAs and Programmers in the NewSQL world Regardless of which database system an enterprise adopts, the role of DBAs will continue to be important going forward. Core database administration and maintenance tasks such as backup, recovery, replication, etc. will need to be taken care of. The major challenge for the NewSQL DBAs will be in choosing and then customizing the right database solution that fits the organizational requirements. Some degree of capacity planning and overall database administration skills might also have to be recalibrated. Likewise, NewSQL database programmers may find themselves dealing with data manipulation and querying tasks similar to those faced while working with traditional database systems. But NewSQL programmers will be doing these tasks at a much larger, or shall we say, at a more ‘distributed’ scale. In conclusion When it comes to solving a particular problem related to data management, it’s often said that 80% of the solution comes down to selecting the right tool, and 20% is about understanding the problem at hand! In order to choose the right database system for your organization, you must ask yourself these two questions: What is the nature of the data you will work with? What are you willing to trade-off? In other words, how important are factors such as the scalability and performance of the database system? For example, if you primarily work with mostly transactional data with a priority on high performance and high scalability, then NewSQL databases might fit your bill just perfectly. If you’re going to work with volatile data, NewSQL might help you there as well, however, there are better NoSQL solutions to tackle your data problem. As we have seen earlier, NewSQL databases have been designed to combine the advantages and power of both relational and NoSQL systems. It is important to know that NewSQL databases are not designed to replace either NoSQL or SQL relational models. They are rather intentionally-built alternatives for data processing, which mask the flaws and shortcomings of both relational and nonrelational database systems. The ultimate goal of NewSQL is to deliver a high performance, highly available solution to handle modern data, without compromising on data consistency and high-speed transaction capabilities.


When, why and how to use Graph analytics for your big data

Sunith Shetty
20 Dec 2017
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform real-time streaming analytics on big data using machine learning algorithms and power of Java. [/box] From the article given below, you will learn why graph analytics is a favourable choice in order to analyze complex datasets. Graph analytics Vs Relational Databases The biggest advantage to using graphs is you can analyze these graphs and use them for analyzing complex datasets. You might ask what is so special about graph analytics that we can’t do by relational databases. Let’s try to understand this using an example, suppose we want to analyze your friends network on Facebook and pull information about your friends such as their name, their birth date, their recent likes, and so on. If Facebook had a relational database, then this would mean firing a query on some table using the foreign key of the user requesting this info. From the perspective of relational database, this first level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram). The query to get this becomes more and more complicated from a relational database perspective but this is a trivial task on a graph or graphical database (such as Neo4j). Graphs are extremely good on operations where you want to pull information from one end of the node to another, where the other node lies after a lot of joins and hops. As such, graph analytics is good for certain use cases (but not for all use cases, relational database are still good on many other use cases): As you can see, the preceding diagram depicts a huge social network (though the preceding diagram might just be depicting a network of a few friends only). The dots represent actual people in a social network. So if somebody asks to pick one user on the left-most side of the diagram and see and follow host connections to the right-most side and pull the friends at the say 10th level or more, this is something very difficult to do in a normal relational database and doing it and maintaining it could easily go out of hand. There are four particular use cases where graph analytics is extremely useful and used frequently (though there are plenty more use cases too): Path analytics: As the name suggests, this analytics approach is used to figure out the paths as you traverse along the nodes of a graph. There are many fields where this can be used—simplest being road networks and figuring out details such as shortest path between cities, or in flight analytics to figure out the shortest time taking flight or direct flights. Connectivity analytics: As the name suggests, this approach outlines how the nodes within a graph are connected to each other. So using this you can figure out how many edges are flowing into a node and how many are flowing out of the node. This kind of information is very useful in analysis. For example, in a social network if there is a person who receives just one message but gives out say ten messages within his network then this person can be used to market his favorite products as he is very good in responding to messages. Community Analytics: Some graphs on big data are huge. But within these huge graphs there might be nodes that are very close to each other and are almost stacked in a cluster of their own. 
This is useful information, as based on this you can extract communities from your data. For example, in a social network, if there are people who are part of some community, say marathon runners, then they can be clubbed into a single community and further tracked.
Centrality analytics: This kind of analytical approach is useful in finding central nodes in a network or graph. This is useful in figuring out sources that are single-handedly connected to many other sources. It is helpful in figuring out influential people in a social network, or a central computer in a computer network.
From the perspective of this article, we will be covering some of these use cases in our sample case studies, and for this we will be using a library on Apache Spark called GraphFrames.
GraphFrames
The GraphX library is advanced and performs well on massive graphs, but, unfortunately, it's currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for dataframe (now dataset) based graphs. It contains a lot of methods that are direct wrappers over the underlying GraphX methods. As such, it provides similar functionality to GraphX, except that GraphX acts on the Spark RDD and GraphFrames works on the dataframe, so GraphFrames is more user friendly (as dataframes are simpler to use). All the advantages of firing Spark SQL queries, joining datasets and filtering queries are supported on this. To understand GraphFrames and representing massive big data graphs, we will take small baby steps first by building some simple programs using GraphFrames before building full-fledged case studies.
Building a graph using GraphFrames
Consider that you have a simple graph as shown next. This graph depicts four people Kai, John, Tina, and Alex and the relations they share, whether they follow each other or are friends. We will now try to represent this basic graph using the GraphFrame library on top of Apache Spark and in the meantime, we will also start learning the GraphFrame API. Since GraphFrames is a module on top of Spark, let's first build the Spark configuration and Spark SQL context for brevity:
SparkConf conf = ...
JavaSparkContext sc = ...
SQLContext sqlContext = ...
We will now build the JavaRDD object that will contain the data for our vertices, or the people Kai, John, Alex, and Tina in this small network. We will create some sample data using the RowFactory class of the Spark API and provide the attributes (ID of the person, and their name and age) that we need per row of the data:
JavaRDD<Row> verRow = sc.parallelize(Arrays.asList(
    RowFactory.create(101L, "Kai", 27),
    RowFactory.create(201L, "John", 45),
    RowFactory.create(301L, "Alex", 32),
    RowFactory.create(401L, "Tina", 23)));
Next we will define the structure, or schema, of the attributes used to build the data.
The ID of the person is of type long, the name of the person is a string, and the age of the person is an integer, as shown next in the code:
List<StructField> verFields = new ArrayList<StructField>();
verFields.add(DataTypes.createStructField("id", DataTypes.LongType, true));
verFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
verFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
Now, let's build the sample data for the relations between these people, which can later be represented as the edges of the graph. This relationship data will have the IDs of the persons that are connected together and the type of relationship they share (that is, friends or followers). Again we will use the Spark-provided RowFactory, build some sample data per row, and create the JavaRDD with this data:
JavaRDD<Row> edgRow = sc.parallelize(Arrays.asList(
    RowFactory.create(101L, 301L, "Friends"),
    RowFactory.create(101L, 401L, "Friends"),
    RowFactory.create(401L, 201L, "Follow"),
    RowFactory.create(301L, 201L, "Follow"),
    RowFactory.create(201L, 101L, "Follow")));
Again, define the schema of the attributes added as part of the edges earlier. This schema is later used in building the dataset for the edges. The attributes passed are the source ID of the node, the destination ID of the other node, as well as the relationType, which is a string:
List<StructField> EdgFields = new ArrayList<StructField>();
EdgFields.add(DataTypes.createStructField("src", DataTypes.LongType, true));
EdgFields.add(DataTypes.createStructField("dst", DataTypes.LongType, true));
EdgFields.add(DataTypes.createStructField("relationType", DataTypes.StringType, true));
Using the schemas that we have defined for the vertices and edges, let's now build the actual datasets for the vertices and the edges. For this, first create the StructType objects that hold the schema details for the vertices and the edges data, and using these structures and the actual data we will next build the dataset of the vertices (verDF) and the dataset for the edges (edgDF):
StructType verSchema = DataTypes.createStructType(verFields);
StructType edgSchema = DataTypes.createStructType(EdgFields);
Dataset<Row> verDF = sqlContext.createDataFrame(verRow, verSchema);
Dataset<Row> edgDF = sqlContext.createDataFrame(edgRow, edgSchema);
Finally, we will now use the vertices and the edges datasets and pass them as parameters to the GraphFrame constructor to build the GraphFrame instance:
GraphFrame g = new GraphFrame(verDF, edgDF);
The time has now come to see some mild analytics on the graph we just created. Let's first visualize our data for the graph; let's see the data on the vertices. For this, we will invoke the vertices method on the GraphFrame instance and invoke the standard show method on the generated vertices dataset (GraphFrame would generate a new dataset when the vertices method is invoked).
g.vertices().show();
This would print the output as follows:
Let's also see the data on the edges:
g.edges().show();
This would print the output as follows:
Let's also see the number of edges and the number of vertices:
System.out.println("Number of Vertices : " + g.vertices().count());
System.out.println("Number of Edges : " + g.edges().count());
This would print the result as follows:
Number of Vertices : 4
Number of Edges : 5
GraphFrames has a handy method to find all the in-degrees (and similarly the out-degrees or degrees):
g.inDegrees().show();
This would print the in-degrees of all the vertices as shown next:
Finally, let's see one more small thing on this simple graph. As GraphFrames work on datasets, all the handy dataset methods such as filtering, map, and so on can be applied to them. We will use the filter method and run it on the vertices dataset to figure out the people in the graph with age greater than thirty:
g.vertices().filter("age > 30").show();
This would print the result as follows:
From this post, we learned about graph analytics. We saw how graphs can be built from massive big datasets in order to derive quick insights. You will understand when to choose graph analytics over a relational database based on the growing challenges in your organization. To know more about preparing and refining big data and performing smart data analytics using machine learning algorithms, you can refer to the book Big Data Analytics with Java.


GDPR is pushing the Chief Data Officer role center stage

Aaron Lazar
13 Apr 2018
9 min read
Gartner predicted that by 2020, 90% of large organizations in regulated industries will have a Chief Data Officer role. With the recent heat around Facebook, Mark Zuckerberg, and the fast-approaching GDPR compliance deadline, it's quite likely that 2018 will be the year of the Chief Data Officer. This article was first published in October 2017 and has been updated to keep up with the latest trends in GDPR.
In 2014, around 400 CEOs and top business execs were asked how they recognize data as a corporate asset. They responded with a mixed set of reactions and viewed the worth of data in their organization in varied ways. Now, in 2018, these reactions have drastically changed – more and more organizations have realized the importance of Data-as-an-Asset. More importantly, the European Union (EU) has made it mandatory that General Data Protection Regulation (GDPR) compliance be sought by the 25th of May, 2018.
The primary reason for creating the Chief Data Officer role was to bridge the gap between functional management and IT teams in an organization. But now, it looks like the CDO will also be primarily focusing on setting up and driving GDPR compliance, to avoid a fine of up to €20 million or 4% of the annual global turnover of the previous year, whichever is higher.
We're going to spend a few minutes breaking down the Chief Data Officer role for you, revealing several interesting insights along the way. Let's start with the obvious question.
What might the Chief Data Officer's responsibilities be?
Like other C-suite execs, a Chief Data Officer is expected to have a well-blended mix of technical know-how and business acumen. Their role is very diverse, and its scope can sometimes be hard to define. Here are some of the key responsibilities of a Chief Data Officer:
Data Policies and GDPR Compliance
Data security is one of the most important elements that any business must consider. It needs to comply with the regulatory standards and requirements of the country where it operates. A Chief Data Officer is responsible for ensuring the compliance of policies across all branches of the business and the associated compliance requirement taxonomies, on a global level.
What is GDPR?
If you're thinking there were no data protection laws before the General Data Protection Regulation 2016/679, that's not true. The major difference, however, is that GDPR focuses more on customer data privacy and protection. GDPR requirements will change the way organizations store, process and protect their customers' personal data. They will need solutions for assessing, implementing and maintaining GDPR compliance, and that's where Chief Data Officers fit in.
Using data to gain a competitive edge
Chief Data Officers need to have sound knowledge of the business' customers, the markets where the business operates, and strong analytical skills to leverage the right data, at the right time, in the right place. This would eventually give the business an edge over its competitors in the market. For example, a ferry service could use data to identify the rates that customers would be willing to pay at a certain time of the day.
Setting in motion the best practices of data governance
Organizations span across the globe these days and often employees from different parts of the world work on the same data. This can often result in data moving through unconnected systems, and ending up as inefficient or disjointed pieces of business information.
A Chief Data Officer needs to ensure that this information is aggregated and maintained in such a way that clear information ownership across the organization is established. Architecting future-proof data solutions A Chief Data Officer often acts like a Data Architect. They will sometimes take on responsibility for planning, designing, and building Big Data systems and ensuring successful integration with other systems in an organization. Designing systems that can provide answers to the user’s problems now and in the future is vital. Chief Data Officers are often found asking themselves key questions like how to generate data with maximum reusability, while also making sure it’s as accurate and relevant as possible. Defining Information Management tools Different business units across the globe tend to use different tools, technologies, and systems to work on, store, and share information on an enterprise level. This greatly affects a company’s ability to access and leverage data for effective decision making and various other duties. A Chief Data Officer is responsible for establishing data-oriented standards across the business and for driving all arms of the business to comply with the standards and embrace change, to ensure the integrity of data. Spotting new opportunities A Chief Data Officer is responsible for spotting new opportunities where the business can venture into through careful analysis of data and past records. For example, a motor company would leverage certain sales information to make an informed decision on what age group to target with their new SUV in-the-making, to maximize sales. These are just the tip of the iceberg when it comes to a Chief Data Officer’s responsibilities. Responsibilities go hand-in-hand with skill and traits required to execute those responsibilities. Below are some key capabilities sought after in a CDO. Key Chief Data Officer Skills The right person for the job is expected to possess impeccable leadership and C-Suite level communication skills, as well as strong business acumen. They are expected to have strong knowledge of GDPR software tools and solutions to enable the organizations hiring them to swiftly transform and adopt the new regulations. They are also expected to possess knowledge of IT architecture, including a familiarity with leading architectural standards such as TOGAF or the Zachman framework. They need to be experienced in driving data governance as well as data quality and integrity, while also possessing a strong knowledge of data analytics, visualization, and storytelling. Familiar with Big Data solutions like Hadoop, MapReduce, and HBase is a plus. How much do Chief Data Officers earn?  Now, let’s take a look at what kind of compensation a Chief Data Officer is likely to be offered. To tell you the truth, the answer to this question is still a bit hazy, but it’s sure to pick up speed with the recent developments in the regulatory and legal areas related to data. About a year ago, a blog post from careeraddict revealed that the salary for a CDO in the US was around $112,000 annually. A job listing seen on Indeed quoted $200,000 as the annual salary. Indeed shows 7 jobs posted for a CDO in the last 15 days. We took the estimated salaries of CDOs and compared them with those of CIOs and CTOs in the same company. It turns out that most were on par, with a few CDO compensations falling slightly short of the CIO and CTO salary. These are just basic salary figures. Bonuses add on the side, amounting up to 50% in some cases. 
However, please note that these salary figures vary heavily based on the type of organization and the industry.
Do businesses even need a Chief Data Officer?
One might argue that some of the skills expected of a Chief Data Officer would also be held by the CIO or the Chief Digital Officer of the organization, or the Data Protection Officer (if they have one). Then why have a Chief Data Officer at all and incur a significant extra cost to the company? With the rapid change in tech and the rate at which data is generated, used, and discarded, most data pointers point in the direction of having a separate Chief Data Officer, working alongside the CIO. It's critical to have a clearly defined need for both roles to co-exist. Blurring the boundaries of the two roles can be detrimental and organizations must, therefore, be painstakingly mindful of the defined KRAs. The organisation should clearly define the two roles to keep the business structure running smoothly.
A Chief Data Officer's main focus will be on the latest data-centric technological innovations and their compliance with the new standards, while also boosting customer engagement, privacy and, in turn, loyalty and the business's competitive advantage. The CIO, on the other hand, focuses on improving the bottom line by owning business productivity metrics, cost-cutting initiatives, making IT investments etc. – i.e., an inward facing data-management and architecture role. The CIO is the person who is therefore responsible for leading digital initiatives at a board level.
In addition to managing data and governing information, if the CIO's responsibilities were to also include implementing analytics in fresh ways to generate value for the business, it is going to be a tall order. To put it simply, it is more practical for the CIO to own the systems and the CDO to oversee all the bits and bytes that flow through these systems. Moreover, in several cases, the Chief Data Officer will act as a liaison between the business and IT. Thus, Chief Data Officers and CIOs both need to work together and support each other for a better functioning business.
The bottom line: a Chief Data Officer is essential
For an organization dealing with a lot of data, a Chief Data Officer is a must. Failure to have one on board can result in being fined €10 million or 2% of the organization's worldwide turnover (depending on which is higher). Here are the criteria for an organization to need dedicated personnel managing data protection. The organization's core activities should:
Have data processing operations which require regular and systematic monitoring of data subjects on a large scale or monitoring of individuals
Be processing a large scale of special categories of data (i.e. sensitive data such as health, religion, race, sexual orientation etc.)
Have data processing carried out by a public authority or a body processing personal data, except for courts operating in their judicial capacity
Apart from this mandate, a Chief Data Officer can add immense value by aligning data-driven insights with the organization's vision and goals. A CDO can bridge the gap between the CMO and the CIO by focusing on meeting customer requirements through data-driven products. For those in data and insights centric roles such as data scientists, data engineers, data analysts and others, the CDO is a natural destination for their career progression journey. The Chief Data Officer role is highly attractive in terms of the scope of responsibilities, the capabilities and, of course, the pay.
Certification courses like this one are popping up to help individuals shape themselves for the role. All-in-all, this new C-suite position in most organizations, is the perfect pivot between old and new, bridging silos, and making a future where data privacy is intact.

How to create a strong data science project portfolio that lands you a job

Aaron Lazar
13 Feb 2018
8 min read
Okay, you’re probably here because you’ve got just a few months to graduate and the projects section of your resume is blank. Or you’re just an inquisitive little nerd scraping the WWW for ways to crack that dream job. Either way, you’re not alone and there are ten thousand others trying to build a great Data Science portfolio to land them a good job. Look no further, we’ll try our best to help you on how to make a portfolio that catches the recruiter’s eye! David “Trent” Salazar‘s portfolio is a great example of a wholesome one and Sajal Sharma’s, is a good example of how one can display their Data Science Portfolios on a platform like Github. Companies are on the lookout for employees who can add value to the business. To showcase this on your resume effectively, the first step is to understand the different ways in which you can add value. 4 things you need to show in a data science portfolio Data science can be broken down into 4 broad areas: Obtaining insights from data and presenting them to the business leaders Designing an application that directly benefits the customer Designing an application or system that directly benefits other teams in the organisation Sharing expertise on data science with other teams You’ll need to ensure that your portfolio portrays all or at least most of the above, in order to easily make it through a job selection. So let’s see what we can do to make a great portfolio. Demonstrate that you know what you're doing So the idea is to show the recruiter that you’re capable of performing the critical aspects of Data Science, i.e. import a data set, clean the data, extract useful information from the data using various techniques, and finally visualise the findings and communicate them. Apart from the technical skills, there are a few soft skills that are expected as well. For instance, the ability to communicate and collaborate with others, the ability to reason and take the initiative when required. If your project is actually able to communicate these things, you’re in! Stay focused and be specific You might know a lot, but rather than throwing all your skills, projects and knowledge in the employer’s face, it’s always better to be focused on doing something and doing it right. Just as you’d do in your resume, keeping things short and sweet, you can implement this while building your portfolio too. Always remember, the interviewer is looking for specific skills. Research the data science job market Find 5-6 jobs, probably from Linkedin or Indeed, that interest you and go through their descriptions thoroughly. Understand what kind of skills the employer is looking for. For example, it could be classification, machine learning, statistical modeling or regression. Pick up the tools that are required for the job - for example, Python, R, TensorFlow, Hadoop, or whatever might get the job done. If you don’t know how to use that tool, you’ll want to skill-up as you work your way through the projects. Also, identify the kind of data that they would like you to be working on, like text or numerical, etc. Now, once you have this information at hand, start building your project around these skills and tools. Be a problem solver Working on projects that are not actual ‘problems’ that you’re solving, won’t stand out in your portfolio. The closer your projects are to the real-world, the easier it will be for the recruiter to make their decision to choose you. This will also showcase your analytical skills and how you’ve applied data science to solve a prevailing problem. 
Put at least 3 diverse projects in your data science portfolio

A nice way to create a portfolio is to list 3 good projects that are diverse in nature. Here are some interesting project types to get you started:

Data cleaning and wrangling

Data cleaning is one of the most critical tasks a data scientist performs. By taking a group of diverse data sets, consolidating them and making sense of them, you're giving the recruiter confidence that you know how to prep data for analysis. For example, you can take Twitter or WhatsApp data and clean it for analysis. The process is pretty simple: find a "dirty" data set, spot an interesting angle to approach the data from, clean it up, perform your analysis, and finally present your findings.

Data storytelling

Storytelling showcases not only your ability to draw insight from raw data, but also how well you can convey those insights to others and persuade them. For example, you could use data from the bus system in your country to identify which stops incur the most delays - something that could then be fixed by changing routes. Make sure your analysis is descriptive and that your code and logic can be followed. Here's what you do: find a good dataset, explore it and spot correlations, then visualise it before you start writing up your narrative. Tackle the data from various angles and pick the most interesting one - if it's interesting to you, it will most probably be interesting to anyone else reviewing it. Break down and explain each step, and each code snippet, as if you were describing it to a friend. The idea is to teach the reviewer something new as you run through the analysis.

End to end data science

If you're more into machine learning or algorithm writing, you should do an end-to-end data science project: one that takes in data, processes it, and finally learns from it, every step of the way. For example, you could pick up fuel pricing data for your city, or stock market data; the data needs to be dynamic and updated regularly. The trick here is to keep the code simple so that it's easy to set up and run. First identify a good topic. Understand that you won't be handed a single tidy dataset - you'll need to import and parse the data yourself and bring it together into one dataset. Next, get the training and test data ready to make predictions. Document your code and your findings, and you're good to go. A minimal sketch of the train/test step follows below.
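To make the end-to-end idea concrete, here is a minimal, hypothetical sketch of the "prepare training and test data, fit a model, make predictions" step using scikit-learn. The fuel-price CSV and its column names are invented for illustration only; a real project would also include the data collection and parsing stages described above.

```python
# A minimal end-to-end modelling step: split, fit, predict, evaluate.
# "fuel_prices.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Assume the earlier import/parse stage has produced one consolidated dataset
df = pd.read_csv("fuel_prices.csv")

# Features and target (purely illustrative column names)
X = df[["crude_oil_price", "exchange_rate", "month"]]
y = df["pump_price"]

# Hold out a test set so you can honestly report how the model generalises
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a simple, easy-to-explain baseline model first
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on unseen data and report a single, easy-to-read metric
predictions = model.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, predictions))
```

Keeping the model this simple is deliberate: a reviewer should be able to clone the project, run one script, and see a result within minutes.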
Prove you have the data science skill set

If you want to get that job, you've got to have the appropriate tools to get the job done. Here's a list of some of the most popular tools to skill up on:

Data science languages

There are a number of key languages in data science that are essential. It might seem obvious, but making sure they're on your resume and demonstrated in your portfolio is incredibly important. Include things like:
Python
R
Java
Scala
SQL

Big Data tools

If you're applying for big data roles, demonstrating your experience with the key technologies is a must. It not only proves you have the skills, but also shows that you have an awareness of what tools can be used to build a big data solution or project. You'll need:
Hadoop
Spark
Hive

Machine learning frameworks

With machine learning so in demand, if you can prove you've used a number of machine learning frameworks, you've already done a lot to impress. Remember, many organizations won't actually know as much about machine learning as you think; in fact, they might even be hiring you with a view to building out this capability. Remember to include:
TensorFlow
Caffe2
Keras
PyTorch

Data visualisation tools

Data visualisation is a crucial component of any data science project. If you can visualise and communicate data effectively, you're immediately demonstrating that you can collaborate with others and make your insights accessible and useful to the wider business. Include tools like these in your resume and portfolio:
D3.js
Excel charts
Tableau
ggplot2

So there you have it - you know what to do to build a decent data science portfolio. It's also really worth attending competitions and challenges: they will not only keep your skills up to date and well oiled, but also give you a broader picture of what people are actually working on and which tools they're using to solve problems.

6 Key Areas to focus on while transitioning to a Data Scientist role

Amey Varangaonkar
21 Dec 2017
6 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Statistics for Data Science, authored by James D. Miller. The book dives into the different statistical approaches to discover hidden insights and patterns from different kinds of data.[/box] Being a data scientist is undoubtedly a very lucrative career prospect. In fact, it is one of the highest paying jobs in the world right now. That said, transitioning from a data developer role to a data scientist role needs careful planning and a clear understanding of what the role is all about. In this interesting article, the author highlights six key skills must give special attention to, during this transition. Let's start by taking a moment to state what I consider to be a few generally accepted facts about transitioning to a data scientist. We'll reaffirm these beliefs as we continue through this book: Academia: Data scientists are not all from one academic background. They are not all computer science or statistics/mathematics majors. They do not all possess an advanced degree (in fact, you can use statistics and data science with a bachelor's degree or even less). It's not magic-based: Data scientists can use machine learning and other accepted statistical methods to identify insights from data, not magic. They are not all tech or computer geeks: You don't need years of programming experience or expensive statistical software to be effective. You don't need to be experienced to get started. You can start today, right now. (Well, you already did when you bought this book!) Okay, having made the previous declarations, let's also be realistic. As always, there is an entry-point for everything in life, and, to give credit where it is due, the more credentials you can acquire to begin out with, the better off you will most likely be. Nonetheless, (as we'll see later in this chapter), there is absolutely no valid reason why you cannot begin understanding, using, and being productive with data science and statistics immediately. [box type="info" align="" class="" width=""]As with any profession, certifications and degrees carry the weight that may open the doors, while experience, as always, might be considered the best teacher. There are, however, no fake data scientists but only those with currently more desire than practical experience.[/box] If you are seriously interested in not only understanding statistics and data science but eventually working as a full-time data scientist, you should consider the following common themes (you're likely to find in job postings for data scientists) as areas to focus on: Education Common fields of study here are Mathematics and Statistics, followed by Computer Science and Engineering (also Economics and Operations research). Once more, there is no strict requirement to have an advanced or even related degree. In addition, typically, the idea of a degree or an equivalent experience will also apply here. Technology You will hear SAS and R (actually, you will hear quite a lot about R) as well as Python, Hadoop, and SQL mentioned as key or preferable for a data scientist to be comfortable with, but tools and technologies change all the time so, as mentioned several times throughout this chapter, data developers can begin to be productive as soon as they understand the objectives of data science and various statistical mythologies without having to learn a new tool or language. 
[box type="info" align="" class="" width=""]Basic business skills such as Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool are assumed pretty much everywhere and don't really count as an advantage, but experience with programming languages (such as Java, PERL, or C++) or databases (such as MySQL, NoSQL, Oracle, and so on.) does help![/box] Data The ability to understand data and deal with the challenges specific to the various types of data, such as unstructured, machine-generated, and big data (including organizing and structuring large datasets). [box type="info" align="" class="" width=""]Unstructured data is a key area of interest in statistics and for a data scientist. It is usually described as data having no redefined model defined for it or is not organized in a predefined manner. Unstructured information is characteristically text-heavy but may also contain dates, numbers, and various other facts as well.[/box] Intellectual Curiosity I love this. This is perhaps well defined as a character trait that comes in handy (if not required) if you want to be a data scientist. This means that you have a continuing need to know more than the basics or want to go beyond the common knowledge about a topic (you don't need a degree on the wall for this!) Business Acumen To be a data developer or a data scientist you need a deep understanding of the industry you're working in, and you also need to know what business problems your organization needs to unravel. In terms of data science, being able to discern which problems are the most important to solve is critical in addition to identifying new ways the business should be leveraging its data. Communication Skills All companies look for individuals who can clearly and fluently translate their findings to a non-technical team, such as the marketing or sales departments. As a data scientist, one must be able to enable the business to make decisions by arming them with quantified insights in addition to understanding the needs of their non-technical colleagues to add value and be successful. This article paints a much clearer picture on why soft skills play an important role in becoming a better data scientist. So why should you, a data developer, endeavor to think like (or more like) a data scientist? Specifically, what might be the advantages of thinking like a data scientist? The following are just a few notions supporting the effort: Developing a better approach to understanding data Using statistical thinking during the process of program or database designing Adding to your personal toolbox Increased marketability If you found this article to be useful, make sure you check out the book Statistics for Data Science, which includes a comprehensive list of tips and tricks to becoming a successful data scientist by mastering the basic and not-so-basic concepts of statistics.  