Tech Guides - Big Data

50 Articles

15 Useful Python Libraries to make your Data Science tasks Easier

Amey Varangaonkar
12 Feb 2018
10 min read
Python has become a big hit in the Data Science community over the last five years - so much so that it is slowly taking over R, the 'lingua franca of statistics', as the preferred tool for many. The recently published Stack Overflow Developer Survey 2018 suggests Python is the next big programming language, and its adoption in the industry is only going to increase. Python's rise has been staggering, but not really surprising. Its general-purpose nature, coupled with its efficiency and ease of use, makes it easier for you to build your data science solutions without any hassle. You also have a rich suite of Python libraries at your disposal for all your Data Science-related tasks - from basic web scraping to something as complex as training deep learning models. In this article, we take a look at some of the most popular and widely used Python libraries and their application areas.

Web Scraping

Web scraping is a popular technique for extracting information from the web over the HTTP protocol, often with the help of a web browser. The two most commonly used tools for web scraping are, unsurprisingly, Python-based.

1. Beautiful Soup

Beautiful Soup is a popular Python library for extracting information out of HTML and XML files. It provides a unique, easy way to navigate, search and modify the parsed data, potentially saving you hours of needless work. It works with both versions of Python, i.e. 2.7 and 3.x, and is very easy to use. Check out our latest tutorial on how to scrape a web page using Beautiful Soup.

Editor's Tip: If you're new to the concept of web scraping, Beautiful Soup should be your go-to library. You can learn more about how to use this library more efficiently in our book Python Web Scraping Cookbook.

2. Scrapy

Scrapy is a free, open source framework written in Python. Although developed for web scraping, it can also be used as a general web crawler and extract data using different APIs. Following the 'Don't Repeat Yourself' philosophy of frameworks such as Django, Scrapy includes a set of self-contained crawlers, each of them following specific instructions with a specific objective.

Editor's tip: To learn how to use Scrapy for your scraping projects, our book Python Web Scraping, Second Edition is definitely worth checking out.
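To give you a feel for the library, here is a minimal Beautiful Soup sketch. It assumes the requests package is installed alongside beautifulsoup4, and the URL is just a placeholder - point it at whatever page you want to scrape.

```python
# A minimal Beautiful Soup sketch: fetch a page and list the links it contains.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")        # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")    # parse the HTML

# Print the text and target of every anchor tag on the page
for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))
```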
Scientific Computation and Data Analysis

Data manipulation, analysis and mathematical computation are arguably the most common data science tasks, and here Python proves to be of great worth to data scientists by providing unique libraries for each of them.

3. NumPy

NumPy is the most popular library for scientific computing in Python and is part of the larger Python stack for scientific computation called SciPy (discussed below). Apart from its uses in linear algebra and other mathematical functions, it can also be used as a multi-dimensional container, or array, of generic data with arbitrary data types. NumPy integrates seamlessly with languages such as C/C++, and because of its support for multiple data types it works well with a variety of databases as well.

4. SciPy

SciPy is a Python-based framework containing open source libraries for mathematics, scientific computation and data analysis. The SciPy library is a collection of algorithms and tools for advanced mathematical computations, statistics and much more. The SciPy stack consists of the following libraries:

- NumPy - Python package for numerical computation
- SciPy - One of the core packages of the SciPy stack for signal processing, optimization and advanced statistics
- matplotlib - Popular Python library for data visualization
- SymPy - Library for symbolic mathematics and algebra
- pandas - Python library for data manipulation and analysis
- IPython - Interactive console to run Python-based code

5. pandas

pandas is a widely used Python package providing data structures and tools for effective data manipulation and analysis. It is a popular tool for quantitative analysis and finds a lot of application in algorithmic trading and risk analysis. With a large community of dedicated users, pandas is regularly updated with new API changes, performance improvements and bug fixes. This is one library you definitely need to work with to truly realize its power.

Editor's Tip: To get a more hands-on understanding of how to effectively use pandas for data analysis, make sure you check out our highly popular title pandas Cookbook.
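Here is a minimal NumPy and pandas sketch using made-up numbers, just to show how little code a basic analysis takes:

```python
# A minimal NumPy/pandas sketch: build a small DataFrame and summarize it.
import numpy as np
import pandas as pd

prices = np.array([101.2, 99.8, 103.5, 104.1, 102.7])   # hypothetical closing prices
returns = np.diff(prices) / prices[:-1]                  # simple day-over-day returns

df = pd.DataFrame({"close": prices})
df["return"] = df["close"].pct_change()                  # same calculation, pandas-style

print(df.describe())                      # count, mean, std, min, quartiles, max
print("Mean return:", returns.mean())
```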
Machine Learning and Deep Learning

Python trumps all other languages when it comes to implementing efficient machine learning and deep learning models, simply by virtue of its diverse, effective and easy-to-use set of libraries. It is worth having a look at the experts' take on why Python is great for machine learning and Artificial Intelligence. In this section, we look at some of the most popular and commonly used Python libraries for machine learning and deep learning:

6. scikit-learn

scikit-learn is the most popular Python library for data mining, analysis and machine learning. It is built using the capabilities of NumPy, SciPy and matplotlib, and is commercially usable. You can implement a variety of machine learning techniques such as classification, regression, clustering and more using scikit-learn. It is very easy to install and has clean, slick documentation for anyone looking to get started with it.

Editor's tip: To understand how to use scikit-learn in your machine learning projects, our bestselling book Python Machine Learning, Second Edition is all you need. If you're looking to specifically master scikit-learn, Mastering Machine Learning with scikit-learn will prove to be a very useful resource. Check it out!

7. TensorFlow

TensorFlow is the popular machine learning library everyone seems to be talking about today. It is a Python-based framework for effective machine learning and deep learning using multiple CPUs or GPUs. Backed by Google, it was initially developed by the Google Brain research team and is now the most widely used framework in the world for machine intelligence. It enjoys the support of a large community of active users and is finding widespread application for advanced machine learning across a multitude of industrial domains - from manufacturing and retail to healthcare and smart cars. If you are interested in knowing more about TensorFlow, you can quickly check out the tutorial here.

Editor's Tip: TensorFlow is the most popular framework for machine learning and deep learning, and it is one library you should definitely master. Check out the following books to skill up quickly!

- Machine Learning with TensorFlow 1.x
- TensorFlow Machine Learning Cookbook
- Deep Learning with TensorFlow
- Tensorflow 1.x Deep Learning Cookbook
- Mastering Tensorflow 1.x

8. Keras

Keras is a Python-based neural networks API that offers a simplified interface to train and deploy your deep learning models with ease. It has support for a variety of deep learning frameworks such as TensorFlow, Deeplearning4j and CNTK. Keras is very user-friendly, follows a modular approach and supports both CPU- and GPU-based computations. If you want to make the deep learning process simpler and more effective, this library is definitely worth checking out!

Editor's Tip: If you're looking for a resource that teaches you how to use Keras effectively, our trending book Deep Learning with Keras will be of great help to you!

9. PyTorch

One of the more recent additions to the Python deep learning family is PyTorch, a neural network modeling library with strong GPU support. Although still in a beta stage, this project is backed by bigwigs such as Facebook and Twitter. PyTorch builds on the architecture of Torch, another popular deep learning library, to enable more efficient tensor computation and the implementation of dynamic neural networks.

Editor's Tip: Here is Deep Learning with PyTorch to get you started with this amazing tool.
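As a quick illustration of the scikit-learn workflow, here is a minimal sketch that trains and evaluates a classifier on the library's built-in Iris dataset; the choice of model and parameters is purely illustrative:

```python
# A minimal scikit-learn sketch: split data, fit a model, measure accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)                                         # train
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))   # evaluate
```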
Natural Language Processing

Natural Language Processing pertains to the design of systems that process, interpret and analyze human language, spoken or written. Python offers unique libraries for performing a variety of tasks, such as working with structured and unstructured text, predictive analytics and much more.

10. NLTK

NLTK is a popular Python library for language processing. It offers easy-to-use interfaces for a variety of NLP tasks such as text classification, tokenization, text parsing, semantic reasoning and much more. It is an open source, community-driven project, and has support for both Python 2 and Python 3.

11. spaCy

spaCy is another library for advanced natural language processing, based on Python and Cython. It has extensive support for various deep learning libraries and frameworks such as TensorFlow and PyTorch. With spaCy, you can build complex statistical models for NLP with relative ease. spaCy is easy to install and use, and proves to be of great help when it comes to extracting and analyzing textual information at scale.

Editor's Tip: To know more about how these libraries are used for natural language processing, make sure you check out the book Natural Language Processing with Python Cookbook.

Data Visualization

Data visualization is a popular Data Science technique for visually analysing and communicating information and valuable business insights through graphs, charts, dashboards and reports. Python offers a lot of popular libraries for effective data storytelling. Some of them are listed below:

12. matplotlib

matplotlib is the most popular Python library for data visualization, allowing for enterprise-grade 2D and 3D plotting. With matplotlib, you can build different kinds of visualizations such as histograms, bar charts, scatter plots and much more with just a few lines of code. The popularity of matplotlib rivals that of R's highly acclaimed ggplot2, and deciding which library is better has been a hot topic of debate for many years now. matplotlib runs seamlessly on all Python consoles, including IPython and Jupyter notebooks, giving you all the necessary tools to create and share your data visualizations with others.

Editor's Tip: Get started with matplotlib today, with the help of Matplotlib 2.x By Example.

13. Seaborn

Seaborn is a Python-based data visualization library which finds its roots in matplotlib. Apart from offering attractive and insightful data visualizations, Seaborn also offers strong support for other Python libraries such as NumPy and pandas. Per the official Seaborn page: "If matplotlib 'tries to make easy things easy and hard things possible', seaborn tries to make a well-defined set of hard things easy too."

14. Bokeh

Bokeh is an interactive data visualization library based on Python. It aims to provide D3.js-style elegant graphics and visualizations and runs primarily on modern web browsers. Apart from the ability to create a wide variety of visualizations, Bokeh also supports large-scale interactivity and visualizations of real-time datasets.

15. Plotly

Plotly is a Python library used across the world for making publication-quality plots and graphs. With Plotly, you can build interactive dashboards, scatter plots, histograms, candlestick charts, heat maps, and a whole host of other data visualizations with ease. With superior interactivity, deployment and publication capabilities, Plotly is used across different domains - most notably the finance and geospatial industries - for effective data storytelling.

So there you have it! Python has an extensive suite of libraries for every data science related task, each equipped with unique features to make the task fast and hassle-free. While there are a lot more Python libraries out there, we cherry-picked these 15 based on their popularity, usefulness and the value they bring to the table. Also, the extensive community support for Python means you can get help for any kind of problem you might come across while using these tools. It's time now for you to go out there and crunch some data with some of these Python-powered libraries!


What does a data science team look like?

Fatema Patrawala
21 Nov 2019
11 min read
Until a couple of years ago, people barely knew the term 'data science' which has now evolved into an extremely popular career field. The Harvard Business Review dubbed data scientist within the data science team as the sexiest job of the 21st century and expert professionals jumped on the data is the new oil bandwagon. As per the Figure Eight Report 2018, which takes the pulse of the data science community in the US, a lot has changed rapidly in the data science field over the years. For the 2018 report, they surveyed approximately 240 data scientists and found out that machine learning projects have multiplied and more and more data is required to power them. Data science and machine learning jobs are LinkedIn's fastest growing jobs. And the internet is creating 2.5 quintillion bytes of data to process and analyze each day. With all these changes, it is evident for data science teams to evolve and change among various organizations. The data science team is responsible for delivering complex projects where system analysis, software engineering, data engineering, and data science is used to deliver the final solution. To achieve all of this, the team does not only have a data scientist or a data analyst but also includes other roles like business analyst, data engineer or architect, and chief data officer. In this post, we will differentiate and discuss various job roles within a data science team, skill sets required and the compensation benefit for each one of them. For an in-depth understanding of data science teams, read the book, Managing Data Science by Kirill Dubovikov, which has interesting case studies on building successful data science teams. He also explores how the team can efficiently manage data science projects through the use of DevOps and ModelOps.  Now let's get into understanding individual data science roles and functions, but before that we take a look at the structure of the team.There are three basic team structures to match different stages of AI/ML adoption: IT centric team structure At times for companies hiring a data science team is not an option, and they have to leverage in-house talent. During such situations, they take advantage of the fully functional in-house IT department. The IT team manages functions like data preparation, training models, creating user interfaces, and model deployment within the corporate IT infrastructure. This approach is fairly limited, but it is made practical by MLaaS solutions. Environments like Microsoft Azure or Amazon Web Services (AWS) are equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy. Microsoft Azure, for instance, supports its users with detailed documentation for a low entry threshold. The documentation helps in fast training and early deployment of models even without an expert data scientists on board. Integrated team structure Within the integrated structure, companies have a data science team which focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure for model deployment. Combining machine learning expertise with IT resource is the most viable option for constant and scalable machine learning operations. Unlike the IT centric approach, the integrated method requires having an experienced data scientist within the team. This approach ensures better operational flexibility in terms of available techniques. 
Additionally, the team leverages deeper understanding of machine learning tools and libraries – like TensorFlow or Theano which are specifically for researchers and data science experts. Specialized data science team Companies can also have an independent data science department to build an all-encompassing machine learning applications and frameworks. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are handled by a dedicated data science team. It doesn't necessarily mean that all team members should have a data science background, but they should have technology background with certain service management skills. A specialized structure model aids in addressing complex data science tasks that include research, use of multiple ML models tailored to various aspects of decision-making, or multiple ML backed services. Today's most successful Silicon Valley tech operates with specialized data science teams. Additionally they are custom-built and wired for specific tasks to achieve different business goals. For example, the team structure at Airbnb is one of the most interesting use cases. Martin Daniel, a data scientist at Airbnb in this talk explains how the team emphasizes on having an experimentation-centric culture and apply machine learning rigorously to address unique product challenges. Job roles and responsibilities within data science team As discussed earlier, there are many roles within a data science team. As per Michael Hochster, Director of Data Science at Stitch Fix, there are two types of data scientists: Type A and Type B. Type A stands for analysis. Individuals involved in Type A are statisticians that make sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc. Type B stands for building. These individuals use data in production. They're good software engineers with strong programming knowledge and statistics background. They build recommendation systems, personalization use cases, etc. Though it is rare that one expert will fit into a single category. But understanding these data science functions can help make sense of the roles described further. Chief data officer/Chief analytics officer The chief data officer (CDO) role has been taking organizations by storm. A recent NewVantage Partners' Big Data Executive Survey 2018 found that 62.5% of Fortune 1000 business and technology decision-makers said their organization appointed a chief data officer. The role of chief data officer involves overseeing a range of data-related functions that may include data management, ensuring data quality and creating data strategy. He or she may also be responsible for data analytics and business intelligence, the process of drawing valuable insights from data. Even though chief data officer and chief analytics officer (CAO) are two distinct roles, it is often handled by the same person. Expert professionals and leaders in analytics also own the data strategy and how a company should treat its data. It does make sense as analytics provide insights and value to the data. Hence, with a CDO+CAO combination companies can take advantage of a good data strategy and proper data management without losing on quality. According to compensation analysis from PayScale, the median chief data officer salary is $177,405 per year, including bonuses and profit share, ranging from $118,427 to $313,791 annually. 
Skill sets required: Data science and analytics, programming skills, domain expertise, leadership and visionary abilities are required. Data analyst The data analyst role implies proper data collection and interpretation activities. The person in this job role will ensure that collected data is relevant and exhaustive while also interpreting the results of the data analysis. Some companies also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics. As per Indeed, the average salary for a data analyst is $68,195 per year in the United States. Skill sets required: Programming languages like R, Python, JavaScript, C/C++, SQL. With this critical thinking, data visualization and presentation skills will be good to have. Data scientist Data scientists are data experts who have the technical skills to solve complex problems and the curiosity to explore what problems are needed to be solved. A data scientist is an individual who develops machine learning models to make predictions and is well versed in algorithm development and computer science. This person will also know the complete lifecycle of the model development. A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze customer and market trends. Basic responsibilities include gathering and analyzing data, using various types of analytics and reporting tools to detect patterns, trends and relationships in data sets. According to Glassdoor, the current U.S. average salary for a data scientist is $118,709. Skills set required: A data scientist will require knowledge of big data platforms and tools like  Seahorse powered by Apache Spark, JupyterLab, TensorFlow and MapReduce; and programming languages that include SQL, Python, Scala and Perl; and statistical computing languages, such as R. They should also have cloud computing capabilities and knowledge of various cloud platforms like AWS, Microsoft Azure etc.You can also read this post on how to ace a data science interview to know more. Machine learning engineer At times a data scientist is confused with machine learning engineers, but a machine learning engineer is a distinct role that involves different responsibilities. A machine learning engineer is someone who is responsible for combining software engineering and machine modeling skills. This person determines which model to use and what data should be used for each model. Probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer's job. The average machine learning engineer's salary is $146,085 in the US, and is ranked No.1 on the Indeed's Best Jobs in 2019 list. Skill sets required: Machine learning engineers will be required to have expertise in computer science and programming languages like R, Python, Scala, Java etc. They would also be required to have probability techniques, data modelling and evaluation techniques. Data architects and data engineers The data architects and data engineers work in tandem to conceptualize, visualize, and build an enterprise data management framework. The data architect visualizes the complete framework to create a blueprint, which the data engineer can use to build a digital framework. The data engineering role has recently evolved from the traditional software-engineering field.  
Recent enterprise data management experiments indicate that data-focused software engineers need to work along with data architects to build a strong data architecture. The average salary for a data architect in the US ranges from $122,000 to $129,000 annually, as per a recent LinkedIn survey.

Skill sets required: A data architect or engineer should have a keen interest in, and experience with, programming languages and frameworks like HTML5, RESTful services, Spark, Python, Hive, Kafka and CSS. They should also have the knowledge and experience to handle database technologies such as PostgreSQL, MapReduce and MongoDB, and visualization platforms such as Tableau and Spotfire.

Business analyst

A business analyst (BA) essentially handles the chief analytics officer's role, but at the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst can bridge the gap. They are responsible for using data analytics to assess processes, determine requirements and deliver data-driven recommendations and reports to executives and stakeholders. BAs engage with business leaders and users to understand how data-driven changes will be implemented to processes, products, services, software and hardware. They further articulate these ideas and balance them against what is technologically feasible and financially reasonable. The average salary for a business analyst is $75,078 per year in the United States, as per Indeed.

Skill sets required: Excellent domain and industry expertise. Good communication skills, data visualization skills and knowledge of business intelligence tools are also good to have.

Data visualization engineer

This specific role is not present in every data science team, as some of its responsibilities are covered by either a data analyst or a data architect; hence, this role is only necessary for a specialized data science model. The role of a data visualization engineer involves having a solid understanding of UI development to create custom data visualization elements for your stakeholders. Regardless of the technology, successful data visualization engineers have to understand the principles of design, both graphical and, more generally, user-centered design. As per PayScale, the average salary for a data visualization engineer is $98,264.

Skill sets required: A data visualization engineer needs to have rigorous knowledge of data visualization methods and be able to produce various charts and graphs to represent data. Additionally, they must understand the fundamentals of design principles and the visual display of information.

To sum it up, the data science team has evolved to create a number of job roles and opportunities, but companies still face challenges in building the team from scratch and find it hard to figure out where to start. If you are facing a similar dilemma, check out the book Managing Data Science, written by Kirill Dubovikov. It covers concepts and methodologies to manage and deliver top-notch data science solutions, while also providing guidance on hiring, growing and sustaining a successful data science team.

Read More

How to learn data science: from data mining to machine learning
How to ace a data science interview
Data science vs. machine learning: understanding the difference and what it means today
30 common data science terms explained
9 Data Science Myths Debunked


Is Python edging R out in the data science wars?

Amey Varangaonkar
28 Aug 2017
7 min read
When it comes to the ‘lingua franca’ of data science, there seems to be a face-off between R and Python. R has long been established as the language of researchers and statisticians but Python has come up quickly as a bona-fide challenger, helping embed analytics as a necessity for businesses and other organizations in 2017. If  a tech war does exist between the two languages, it’s a battle fought not so much on technical features but instead on the wider changes within modern business and technology. R is a language purpose-built for statistics, for performing accurate and intensive analysis. So, the fact that R is being challenged by Python — a language that is flexible, fast, and relatively easy to learn — suggests we are seeing a change in who’s actually doing data science, where they’re doing it, and what they’re trying to achieve. Python versus R — A Closer Look Let’s make a quick comparison of the two languages on aspects important to those working with data and see what we can learn about the two worlds where R and Python operate. Learning curve Python is the easier language to learn. While R certainly isn’t impenetrable, Python’s syntax marks it as a great language to learn even if you’re completely new to programming. The fact that such an easy language would come to rival R within data science indicates the pace at which the field is expanding. More and more people are taking on data-related roles, possibly without a great deal of programming knowledge — Python makes the barrier to entry much lower than R. That said, once you get to grips with the basics of R, it becomes relatively easier to learn the more advanced stuff. This is why statisticians and experienced programmers find R easier to use. Packages and libraries Many R packages are in-built. Python, meanwhile, depends upon a range of external packages. This obviously makes R much more efficient as a statistical tool — it means that if you’re using Python you need to know exactly what you’re trying to do and what external support you’re going to need. Data Visualization R is well-known for its excellent graphical capabilities. This makes it easy to present and communicate data in varied forms. For statisticians and researchers, the importance of that is obvious. It means you can perform your analysis and present your work in a way that is relatively seamless. The ggplot2 package in R, for example, allows you to create complex and elegant plots with ease and as a result, its popularity in the R community has increased over the years. Python also offers a wide range of libraries which can be used for effective data storytelling. The breadth of external packages available with Python means the scope of what’s possible is always expanding. Matplotlib has been a mainstay of Python data visualization. It’s also worth remarking on upcoming libraries like Seaborn. Seaborn is a neat little library that sits on top of Matplotlib, wrapping its functionality and giving you a neater API for specific applications. So, to sum up, you have sufficient options to perform your data visualization tasks effectively — using either R or Python! Analytics and Machine Learning Thanks to libraries like scikit-learn, Python helps you build machine learning systems with relative ease. This takes us back to the point about barrier to entry. If machine learning is upending how we use and understand data, it makes sense that more people want a piece of the action without having to put in too much effort. 
But Python also has another advantage; it’s great for creating web services where data can be uploaded by different people. In a world where accessibility and data empowerment have never been more important (i.e., where everyone takes an interest in data, not just the data team), this could prove crucial. With packages such as caret, MICE, and e1071, R too gives you the power to perform effective machine learning in order to get crucial insights out of your data. However, R falls short in comparison to Python, thanks to the latter’s superior libraries and more diverse use-cases. Deep Learning Both R and Python have libraries for deep learning. It’s much easier and more efficient with Python though — most likely because the Python world changes much more quickly, new libraries and tools springing up as quickly as the data science world hooks on to a new buzzword. Theano, and most recently Keras and TensorFlow have all made a huge impact on making it relatively easy to build incredibly complex and sophisticated deep learning systems. If you’re clued-up and experienced with R it shouldn’t be too hard to do the same, using libraries such as MXNetR, deepr, and H2O — that said, if you want to switch models, you may need to switch tools, which could be a bit of a headache. Big Data With Python, you can write efficient MapReduce applications with ease, or scale your R program on Hadoop to work with petabytes of data. Both R and Python are equally good when it comes to working with Big Data, as they can be seamlessly integrated with Big Data tools such as Apache Spark and Apache Hadoop, among many others. It’s likely that it’s in this field that we’re going to see R moving more and more into industry as businesses look for a concise way to handle large datasets. This is true in industries such as bioinformatics which have a close connection with the academic world and necessarily depend upon a combination of size and accuracy when it comes to working with data. So, where does this comparison leave us? Ultimately, what we see are two different languages offering great solutions to very different problems in data science. In Python, we have a flexible and adaptable language with a vibrant community of developers working on a huge range of problems and tasks, each one trying to find more effective and more intelligent ways of doing things. In R, we have a purely statistical language with a large repository of over 8000 packages for data analysis and visualization. While Python is production-ready and is better suited for organizations looking to harness technical innovation to its advantage, R’s analytical and data visualization capabilities can make your life as a statistician or data analyst easier. Recent surveys indicate that Python commands a higher salary than R — that is because it’s a language that can be used across domains; a problem-solving language. That’s not to say that R isn’t a valuable language; rather, Python is the language that just seems to fit the times at the moment. In the end, it all boils down to your background, and the kind of data problems you want to solve. If you come from a statistics or research background and your problems only revolve around statistical analysis and visualization, then R would best fit your bill. However, if you’re a Computer Science graduate looking to build a general-purpose, enterprise-wide data model which can integrate seamlessly with the other business workflows, you will find Python easier to use. R and Python are two different animals. 
Instead of comparing the two, maybe it’s time we understood where and how each can be best used and then harnessed their power to the fullest to solve our data problems. One thing is for sure, though — neither is going away anytime soon. Both R and Python occupy a large chunk of the data science market-share today, and it will take a major disruption to take either one of them out of the equation completely.


Top 5 programming languages for crunching Big Data effectively

Amey Varangaonkar
04 Apr 2018
8 min read
One of the most important decisions Big Data professionals have to make, especially the ones who are new to the scene or are just starting out, is choosing the best programming language for big data manipulation and analysis. Understanding the Big Data problem and framing the architecture to solve it is not quite enough these days - the execution needs to be perfect as well, and choosing the right language goes a long way.

The best languages for big data

In this article, we look at 5 of the most popular - not to mention highly effective - programming languages for developing Big Data solutions.

Scala

A beautiful crossover of the object-oriented and functional programming paradigms, Scala is fast and robust, and a popular choice of language for many Big Data professionals. The fact that two of the most popular Big Data processing frameworks, Apache Spark and Apache Kafka, have been built on top of Scala tells you everything you need to know about the power of Scala. Scala runs on the JVM, which means code written in Scala can be easily used within a Java-based Big Data ecosystem. One significant factor that differentiates Scala from Java, though, is that Scala is a lot less verbose in comparison: logic that takes hundreds of lines of confusing-looking Java code can often be written in fewer than 15 lines of Scala. One negative aspect of Scala, though, is its steep learning curve when compared to languages like Go and Python, and this may put off beginners looking to use it.

Why use Scala for big data?
- Fast and robust
- Suitable for working with Big Data tools like Apache Spark for distributed Big Data processing
- JVM compliant, can be used in a Java-based ecosystem

Python

Python has been declared one of the fastest growing programming languages in 2018 as per the recently held Stack Overflow Developer Survey. Its general-purpose nature means it can be used across a broad spectrum of use-cases, and Big Data programming is one major area of application. Many libraries for data analysis and manipulation which are increasingly being used in a Big Data framework to clean and manipulate large chunks of data - such as pandas, NumPy and SciPy - are all Python-based. Not just that, most popular machine learning and deep learning frameworks such as scikit-learn, TensorFlow and many more are also written in Python and are finding increasing application within the Big Data ecosystem. One drawback of using Python, and a reason why it is not a first-class citizen when it comes to Big Data programming yet, is that it's slow. Although very easy to use, Big Data professionals have found systems built with languages such as Java or Scala faster and more robust than the systems built with Python. However, Python makes up for this limitation with other qualities. As Python is primarily a scripting language, interactive coding and development of analytical solutions for Big Data becomes very easy. Python can integrate effortlessly with existing Big Data frameworks such as Apache Hadoop and Apache Spark, allowing you to perform predictive analytics at scale without any problem - the short sketch after the list below gives a taste of what that looks like.

Why use Python for big data?
- General-purpose
- Rich libraries for data analysis and machine learning
- Easy to use
- Supports iterative development
- Rich integration with Big Data tools
- Interactive computing through Jupyter notebooks
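Here is a minimal PySpark sketch of that Python-Spark integration; the file path and column names are placeholders, so adapt them to your own data:

```python
# A minimal PySpark sketch: load a CSV from HDFS and run a simple aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# Placeholder path and columns -- replace with your own dataset
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# Count events and average value per category, distributed across the cluster
summary = (df.groupBy("category")
             .agg(F.count("*").alias("events"),
                  F.avg("value").alias("avg_value")))
summary.show()

spark.stop()
```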
R

It won't come as a surprise to many that those who love statistics love R. The 'language of statistics', as it is popularly called, R is used to build data models for effective and accurate data analysis. Powered by a large repository of R packages (CRAN, also called the Comprehensive R Archive Network), with R you have just about every type of tool to accomplish any task in Big Data processing - right from analysis to data visualization. R can be integrated seamlessly with Apache Hadoop and Apache Spark, among other popular frameworks, for Big Data processing and analytics. One issue with using R as a programming language for Big Data is that it is not very general-purpose. This means the code written in R is not production-deployable and generally has to be translated into some other programming language such as Python or Java. That said, if your goal is only to build statistical models for Big Data analytics, R is an option you should definitely consider.

Why use R for big data?
- Built for data science
- Support for Hadoop and Spark
- Strong statistical modeling and visualization capabilities
- Support for Jupyter notebooks

Java

Then there's always good old Java. Some of the traditional Big Data frameworks such as Apache Hadoop, and all the tools within its ecosystem, are Java-based and still in use today in many enterprises. Not to mention the fact that Java is the most stable and production-ready language among all the languages we have discussed so far! Using Java to develop your Big Data applications gives you the ability to use a large ecosystem of tools and libraries for interoperability, monitoring and much more, most of which have already been tried and tested. One major drawback of Java is its verbosity. The fact that you have to write hundreds of lines of code in Java for a task which can be written in barely 15-20 lines of code in Python or Scala can turn off many budding programmers. However, the introduction of lambda functions in Java 8 does make life quite a bit easier. Java also does not support iterative development, unlike newer languages like Python, and this is an area of focus for future Java releases. Despite the flaws, Java remains a strong contender when it comes to the preferred language for Big Data programming because of its history and the continued reliance on traditional Big Data tools and frameworks.

Why use Java for big data?
- Traditional Big Data tools and frameworks are written in Java
- Stable and production-ready
- Large ecosystem of tried and tested tools and libraries

Go

Last but not least, there's Go - one of the fastest rising programming languages in recent times. Designed by a group of Google engineers who were frustrated with C++, Go is a good shout for this list simply because it powers so many of the tools used in Big Data infrastructure, including Kubernetes, Docker and many more. Go is fast, easy to learn, and fairly easy to develop applications with, not to mention deploy them. More importantly, as businesses look at building data analysis systems that can operate at scale, Go-based systems are being used to integrate machine learning and parallel processing of data. It is also possible to interface other languages with Go-based systems with relative ease.

Why use Go for big data?
- Fast, easy to use
- Many tools used in the Big Data infrastructure are Go-based
- Efficient distributed computing

There are a few other languages you might want to consider - Julia, SAS and MATLAB being some major ones which are useful in their own right.
However, when compared to the languages we talked about above, we thought they fell a bit short in some aspects - be it speed, efficiency, ease of use, documentation, or community support, among other things. As a quick recap, we weighed the five languages above - Scala, Python, R, Java and Go - against eight criteria: speed, ease of use, quick learning curve, data analysis capability, general-purpose applicability, Big Data support, interfacing with other languages, and production-readiness. Big Data support is the one criterion on which all five score well; on the rest, each language has the particular strengths discussed above. This is just our view, and that's not to say that the other languages are any worse!

So... which language should you choose?

To answer the question in short - it all depends on the use-case you want to develop. If your focus is hardcore data analysis which involves a lot of statistical computing, R would be your go-to language. On the other hand, if you want to develop streaming applications for your Big Data, Scala can be a preferable choice. If you wish to use Machine Learning to leverage your Big Data and build predictive models, Python will come to your rescue. Lastly, if you plan to build Big Data solutions using just the traditionally-available tools, Java is the language for you. You also have the option of combining the power of two languages to get a more efficient and powerful solution. For example, you can train your machine learning model in Python and deploy it on Spark in distributed mode. Ultimately, it all depends on how efficiently your solution can function, and more importantly, how fast and accurate it is. Which language do you prefer for crunching your Big Data? Do let us know!


Say hello to Streaming Analytics

Amey Varangaonkar
05 Oct 2017
5 min read
In this data-driven age, businesses want fast, accurate insights from their huge data repositories in the shortest time span — and in real time when possible. These insights are essential — they help businesses understand relevant trends, improve their existing processes, enhance customer satisfaction, improve their bottom line, and most importantly, build, and sustain their competitive advantage in the market.   Doing all of this is quite an ask - one that is becoming increasingly difficult to achieve using just the traditional data processing systems where analytics is limited to the back-end. There is now a burning need for a newer kind of system where larger, more complex data can be processed and analyzed on the go. Enter: Streaming Analytics Streaming Analytics, also referred to as real-time event processing, is the processing and analysis of large streams of data in real-time. These streams are basically events that occur as a result of some action. Actions like a transaction or a system failure, or a trigger that changes the state of a system at any point in time. Even something as minor or granular as a click would then constitute as an event, depending upon the context. Consider this scenario - You are the CTO of an organization that deals with sensor data from wearables. Your organization would have to deal with terabytes of data coming in on a daily basis, from thousands of sensors. One of your biggest challenges as a CTO would be to implement a system that processes and analyzes the data from these sensors as it enters the system. Here’s where streaming analytics can help you by giving you the ability to derive insights from your data on the go. According to IBM, a streaming system demonstrates the following qualities: It can handle large volumes of data It can handle a variety of data and analyze it efficiently — be it structured or unstructured, and identifies relevant patterns accordingly It can process every event as it occurs unlike traditional analytics systems that rely on batch processing Why is Streaming Analytics important? The humongous volume of data that companies have to deal with today is almost unimaginable. Add to that the varied nature of data that these companies must handle, and the urgency with which value needs to be extracted from this data - it all makes for a pretty tricky proposition. In such scenarios, choosing a solution that integrates seamlessly with different data sources, is fine-tuned for performance, is fast, reliable, and most importantly one that is flexible to changes in technology, is critical. Streaming analytics offers all these features - thereby empowering organizations to gain that significant edge over their competition. Another significant argument in favour of streaming analytics is the speed at which one can derive insights from the data. Data in a real-time streaming system is processed and analyzed before it registers in a database. This is in stark contrast to analytics on traditional systems where information is gathered, stored, and then the analytics is performed. Thus, streaming analytics supports much faster decision-making than the traditional data analytics systems. Is Streaming Analytics right for my business? Not all organizations need streaming analytics, especially those that deal with static data or data that hardly change over longer intervals of time, or those that do not require real-time insights for decision-making.   For instance, consider the HR unit of a call centre. 
It is sufficient and efficient to use a traditional analytics solution to analyze thousands of past employee records, rather than run them through a streaming analytics system. On the other hand, the same call centre can find real value in applying streaming analytics to something like a real-time customer log monitoring system - one where customer interactions and context-sensitive information are processed on the go. This can help the organization find opportunities to provide unique customer experiences, improve its customer satisfaction score, and unlock a whole host of other benefits. Streaming analytics is slowly finding adoption in a variety of domains where companies are looking to get that crucial competitive advantage - sensor data analytics, mobile analytics and business activity monitoring being some of them. With the rise of the Internet of Things, data from IoT devices is also increasing exponentially; streaming analytics is the way to go here as well. In short, streaming analytics is ideal for businesses dealing with time-critical missions and those working with continuous streams of incoming data, where decision-making has to be instantaneous. Companies that obsess about real-time monitoring of their businesses will also find streaming analytics useful - just integrate your dashboards with your streaming analytics platform!

What next?

It is safe to say that with time, the amount of information businesses will manage is going to rise exponentially, and so will its variety. As a result, it will get increasingly difficult to process volumes of unstructured data and gain insights from them using just the traditional analytics systems. Adopting streaming analytics into the business workflow will therefore become a necessity for many businesses. Apache Flink, Spark Streaming, Microsoft's Azure Stream Analytics, SQLstream Blaze, Oracle Stream Analytics and SAS Event Processing are all good places to begin your journey through the fleeting world of streaming analytics. You can browse through this list of learning resources from Packt to know more:

Learning Apache Flink
Learning Real Time processing with Spark Streaming
Real Time Streaming using Apache Spark Streaming (video)
Real Time Analytics with SAP Hana
Real-Time Big Data Analytics
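To make the idea of processing events as they arrive a little more concrete, here is a minimal sketch using Spark Structured Streaming (one of the engines mentioned above). The socket source and word-count logic are purely illustrative; in practice the stream would typically come from something like Kafka or a sensor gateway.

```python
# A minimal Spark Structured Streaming sketch: count words arriving on a TCP socket.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()

# Placeholder source: a local socket you could feed with `nc -lk 9999`
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print updated counts as new events arrive
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```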


Why is Hadoop dying?

Aaron Lazar
23 Apr 2018
5 min read
Hadoop has been the definitive big data platform for some time; the name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful?

Hadoop was essential to the growth of big data

When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to data, as against bringing data to compute. Organisations had the opportunity to scale their data without having to worry too much about the cost. It obviously had initial hiccups with security, the complexity of querying and querying speeds, but all of that was taken care of in the long run. Still, although querying speeds remained quite a pain, that wasn't the real reason behind Hadoop's slow decline.

As cloud grew, Hadoop started falling

One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. The cloud vendor market was pretty crowded, and each vendor provided its own big data processing services. These services all basically did what Hadoop was doing, but they also did it in an even more efficient and hassle-free way. Customers didn't have to think about administration, security or maintenance in the way they had to with Hadoop.

One person's big data is another person's small data

Well, this is clearly a fact. Several organisations that used big data technologies without really gauging the amount of data they actually needed to process have suffered. Imagine sitting on 10TB Hadoop clusters when you don't have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their heavy reliance on Hadoop. Customers weren't pleased with the nature of Hadoop's limitations.

Apache Hadoop vs Apache Spark

Hadoop processing is way behind in terms of processing speed. In 2014, Spark took the world by storm, and you can guess which way the interest curves for the two have been heading since. Spark was a general-purpose, easy-to-use platform built after studying the pitfalls of Hadoop. Spark was not bound to just HDFS (the Hadoop Distributed File System), which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes - a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas where Spark wins are iterative algorithms in machine learning, interactive data mining and data processing, stream processing, sensor data processing, and so on.

Machine Learning in Hadoop is not straightforward

Unlike MLlib in Spark, machine learning is not possible in Hadoop unless tied to a third-party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of three R packages, have grown for ML, but they are still nowhere comparable to the power of the modern-day MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe.
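For contrast, here is a minimal sketch of what fitting a model with Spark's MLlib looks like from Python - no third-party bolt-ons required. The tiny DataFrame and its columns are made up purely for illustration.

```python
# A minimal Spark MLlib sketch: fit a logistic regression on a tiny, made-up DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# Hypothetical rows: (age, prior_visits, label)
data = spark.createDataFrame(
    [(34.0, 1.0, 0), (51.0, 3.0, 1), (23.0, 0.0, 0), (60.0, 5.0, 1)],
    ["age", "prior_visits", "label"])

assembler = VectorAssembler(inputCols=["age", "prior_visits"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label") \
            .fit(assembler.transform(data))

print(model.coefficients, model.intercept)
spark.stop()
```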
Hadoop is not only Hadoop

The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, HBase, Spark, Kudu, Impala, and 20 other products. While folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other. As for Cloudera and Hortonworks, though the market has seen a downward trend, they're in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark.

Is Hadoop dying? Perhaps not...

In the long run, it's not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much-improved version of the framework. Some of its most noteworthy features are an improved shell script, a more powerful YARN, improved fault tolerance with erasure coding, and many more. Although that hasn't caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use another alternative like Spark along with another framework from the Hadoop family. So, Hadoop's not going away anytime soon.

Read More

Pandas is an effective tool to explore and analyze data - Interview insights

Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions

Guest Contributor
20 Dec 2017
8 min read
Editor's note: We bring to you another guest post by Benjamin Rogojan, on using logistic regression to help the healthcare sector reduce patient readmissions. Ben's previous post on ensemble methods to optimize machine learning models is also available for a quick read here.

ER visits are not cheap for any party involved, whether that be the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are due to a lack of intervention for problems such as substance abuse, chronic diseases and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by playing a role in the overcrowding of Emergency Departments (EDs). Research teams at UW and other universities are partnering with companies like KenSci to figure out how to approach the problem of reducing readmission rates. The ability to predict the likelihood of a patient's readmission will allow for targeted intervention, which in turn will help reduce the frequency of readmissions - making the population healthier and hopefully reducing the estimated 41.3 billion USD in healthcare costs for the entire system.

How do they plan to do it? With big data and statistics, of course. A plethora of algorithms are available for data scientists to use to approach this problem. Many possible variables could affect readmission and medical costs, and there are also many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have been heavily focused on reducing the readmission rate simply by trying to calculate whether a person would or would not be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have random flare-ups that require immediate attention, and being able to predict if a patient will have an ER visit can lead to managing the cause more effectively.

One approach taken by the data science team at UW, as well as the Department of Family and Community Medicine at the University of Toronto, was to utilize logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such, logistic regression has, in my experience, been a useful model for approaching this problem.

Logistic Regression to predict patient readmissions

Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other data algorithms? Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, and so on. It is used for analyzing data and produces a binary classification based on one or many independent variables. This means it produces two clear classifications (Yes or No, 1 or 0, etc.). With the example above, the binary classification would be: is the patient readmitted or not? Other examples could be whether to give a customer a loan or not, whether a medical claim is fraudulent or not, or whether a patient has diabetes or not. Despite its name, logistic regression does not provide the same output as linear regression (per se). There are some similarities; for instance, the linear model is still visible in the equation below, where you see something very similar to a linear equation. But the final output is based on the log odds.
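For reference, the model being described is the standard logit (log-odds) form; writing π for the probability that the patient is readmitted, it reads:

\[ \log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + x \cdot \beta \]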
Linear regression and multivariate regression both take one to many independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person's age, or the cost of a product an e-commerce site should display to each customer. The output is not limited to only a few discrete classifications. Logistic regression, by contrast, produces discrete classifications. For instance, an algorithm using logistic regression could be used to classify whether a certain stock would trade at >$50 a share or <$50 a share. Linear regression would be used to predict whether a share would be worth $50.01, $50.02….etc.
Logistic regression is a calculation that uses the odds of a certain classification. In the equation above, the symbol you might know as pi actually represents the probability of the event, from which the odds are derived. To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear decision boundary: whenever the linear predictor β0 + x · β corresponds to a probability p < 0.5, we predict Y = 0. By generating coefficients that help predict the logit transformation, the method allows us to classify for the characteristic of interest.
Now that is a lot of complex math mumbo jumbo. Let's try to break it down into simpler terms.
Probability vs. Odds
Let's start with probability. Say a patient has a probability of 0.6 of being readmitted. Then the probability that the patient won't be readmitted is 0.4. Now, we want to take this and convert it into odds. This is what the formula above is doing. You would take 0.6/0.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability was 0.5 for both being readmitted and not being readmitted, then the odds would be 1:1. The next step in the logistic regression model is to take the odds and get the "log odds". You do this by taking the natural log of 1.5, which gives roughly 0.41 (rounded).
In logistic regression, we don't actually know p. That is what we are essentially trying to find and model using various coefficients and input variables. Each input provides a value that changes how much more likely an event will or will not occur. All of these coefficients are used to calculate the log odds. This model can take multiple variables like age, sex, height, etc. and specify how much of an effect each variable has on the odds that an event will occur.
Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer to translating it into action? Some of us like to say the "computers" are the easy part. Personally, I find the hard part to be the "people". After all, at the end of the day, it comes down to business value. Will an algorithm save money or not? That means it has to be applied in real life. This could take the form of a new initiative, strategy, product recommendation, etc. You need to find the outliers that are worth going after!
For instance, if we go back to the patient readmission example: the algorithm points out patients with high probabilities of being readmitted. However, if the readmission costs are low, they will probably be ignored, sadly. That is how businesses (including hospitals) look at problems.
Logistic regression is a great tool for binary classification. It is unlike many other algorithms that estimate continuous variables or estimate distributions.
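To make the probability-to-odds-to-log-odds arithmetic concrete, here is a minimal Python sketch. It uses a toy, made-up readmission dataset (the feature values and column meanings are purely illustrative, not taken from any study mentioned above) and scikit-learn's LogisticRegression, which is one common way to fit such a model.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Probability -> odds -> log odds, as described above
p = 0.6                      # probability of readmission
odds = p / (1 - p)           # 0.6 / 0.4 = 1.5
log_odds = np.log(odds)      # natural log of 1.5 ~= 0.41
print(f"odds = {odds:.2f}, log odds = {log_odds:.2f}")

# Toy readmission data: [age, number of chronic conditions] per patient
# (values are illustrative only)
X = np.array([[45, 1], [62, 3], [70, 4], [33, 0], [58, 2], [80, 5]])
y = np.array([0, 1, 1, 0, 0, 1])   # 1 = readmitted, 0 = not readmitted

model = LogisticRegression()
model.fit(X, y)

# The model outputs a probability; >= 0.5 maps to the class Y = 1
new_patient = np.array([[65, 3]])
prob = model.predict_proba(new_patient)[0, 1]
print(f"predicted readmission probability: {prob:.2f}")
print(f"predicted class: {int(prob >= 0.5)}")

The fitted coefficients in model.coef_ are the per-feature contributions to the log odds, which ties directly back to the equation discussed above.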
This statistical method can be used to classify whether a person is likely to get cancer because of environmental variables such as proximity to a highway, smoking habits, and so on. It has been used effectively in the medical, financial, and insurance industries for a while. Knowing when to use which algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees. Using logistic regression gives healthcare institutions the opportunity to accurately target at-risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals.
[box type="shadow" align="" class="" width=""] About the Author Benjamin Rogojan Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems, both solo and with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/box]


Two popular Data Analytics methodologies every data professional should know: TDSP & CRISP-DM

Amarabha Banerjee
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This is a book excerpt taken from Advanced Analytics with R and Tableau authored by Jen Stirrup & Ruben Oliva Ramos. This book will help you make quick, cogent, and data driven decisions for your business using advanced analytical techniques on Tableau and R.[/box] Today we explore popular data analytics methods such as Microsoft TDSP Process and the CRISP- DM methodology. Introduction There is an increasing amount of data in the world, and in our databases. The data deluge is not going to go away anytime soon! Businesses risk wasting the useful business value of information contained in databases, unless they are able to excise useful knowledge from the data. It can be hard to know how to get started. Fortunately, there are a number of frameworks in data science that help us to work our way through an analytics project. Processes such as Microsoft Team Data Science Process (TDSP) and CRISP-DM position analytics as a repeatable process that is part of a bigger vision. Why are they important? The Microsoft TDSP Process and the CRISP-DM frameworks are frameworks for analytics projects that lead to standardized delivery for organizations, both large and small. In this chapter, we will look at these frameworks in more detail, and see how they can inform our own analytics projects and drive collaboration between teams. How can we have the analysis shaped so that it follows a pattern so that data cleansing is included? Industry standard methodologies for analytics There are a few main methodologies: the Microsoft TDSP Process and the CRISP-DM Methodology. Ultimately, they are all setting out to achieve the same objectives as an analytics framework. There are differences, of course, and these are highlighted here. CRISP-DM and TDSP focus on the business value and the results derived from analytics projects. Both of these methodologies are described in the following sections. CRISP-DM One common methodology is the CRISP-DM methodology (the modeling agency). The Cross Industry Standard Process for Data Mining or (CRISP-DM) model as it is known, is a process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions. The process is loosely divided into six main phases. The phases can be seen in the following diagram: Initially, the process starts with a business idea and a general consideration of the data. Each stage is briefly discussed in the following sections. Business understanding/data understanding The first phase looks at the machine learning solution from a business standpoint, rather than a technical standpoint. The business idea is defined, and a draft project plan is generated. Once the business idea is defined, the data understanding phase focuses on data collection and familiarity. At this point, missing data may be identified, or initial insights may be revealed. This process feeds back to the business understanding phase. CRISP-DM model — data preparation In this stage, data will be cleansed and transformed, and it will be shaped ready for the modeling phase. CRISP-DM — modeling phase In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the data preparation phase in order to correct any unexpected issues. CRISP-DM — evaluation The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the business understanding phase. 
Otherwise, we may have built models that do not answer the business question.
CRISP-DM — deployment
The models are published so that the customer can make use of them. This is not the end of the story, however.
CRISP-DM — process restarted
We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.
CRISP-DM summary
CRISP-DM is the most commonly used framework for implementing machine learning projects specifically, and it applies to analytics projects as well. It has a good focus on the business understanding piece. However, one major drawback is that the model no longer seems to be actively maintained. The official site, CRISP-DM.org, is no longer being maintained. Furthermore, the framework itself has not been updated to address issues in working with new technologies, such as big data. Big data technologies mean that there can be additional effort spent in the data understanding phase, for example, as the business grapples with the additional complexities involved in the shape of big data sources.
Team Data Science Process
The TDSP process model provides a dynamic framework for machine learning solutions that have been through a robust process of planning, producing, constructing, testing, and deploying models. Here is an example of the TDSP process:
The process is loosely divided into four main phases: Business Understanding, Data Acquisition and Understanding, Modeling, and Deployment. The phases are described in the following paragraphs.
Business understanding
The business understanding process starts with a business idea, which is solved with a machine learning solution. The business idea is defined from the business perspective, and possible scenarios are identified and evaluated. Ultimately, a project plan is generated for delivering the solution.
Data acquisition and understanding
Following on from the business understanding phase is the data acquisition and understanding phase, which concentrates on familiarity and fact-finding about the data. The process itself is not completely linear; the output of the data acquisition and understanding phase can feed back to the business understanding phase, for example. At this point, some of the essential technical pieces start to appear, such as connecting to data, and the integration of multiple data sources. From the user's perspective, there may be actions arising from this effort. For example, it may be noted that there is missing data in the dataset, which requires further investigation before the project proceeds further.
Modeling
In the modeling phase of the TDSP process, the R model is created, built, and verified against the original business question. In light of the business question, the model needs to make sense. It should also add business value, for example, by performing better than the existing solution that was in place prior to the new R model. This stage also involves examining key metrics in evaluating our R models, which need to be tested to ensure that the models meet the original business objectives set out in the initial business understanding phase.
Deployment
R models are published to production once they are proven to be a fit solution to the original business question. This phase involves the creation of a strategy for ongoing review of the R model's performance as well as a monitoring and maintenance plan. It is recommended to carry out a recurrent evaluation of the deployed models.
The models will live in a fluid, dynamic world of data and, over time, this environment will impact their efficacy. The TDSP process is a cycle rather than a linear process, and it does not finish even when the model is deployed. It provides a clear structure for you to follow throughout the data science process, and it facilitates teamwork and collaboration along the way.
TDSP Summary
The data science unicorn does not exist; that is, the person who is equally skilled in all areas of data science, right across the board. In order to ensure successful projects where each team player contributes according to their skill set, the Team Data Science Process is a team-oriented methodology that emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver data science projects. It also offers useful information on the importance of having standardized source control and backups, which can include open source technology.
If you liked our post, please be sure to check out Advanced Analytics with R and Tableau, which shows how to apply various data analytics techniques in R and Tableau across the different stages of a data science project highlighted in this article.


Python Data Stack

Akram Hussain
31 Oct 2014
3 min read
The Python programming language has grown significantly in popularity and importance, both as a general programming language and as one of the most advanced providers of data science tools. There are 6 key libraries every Python analyst should be aware of, and they are:
1 - NumPy
Also known as Numerical Python, NumPy is an open source Python library used for scientific computing. NumPy gives both speed and higher productivity using arrays and matrices. This basically means it's super useful when analyzing basic mathematical data and calculations. This was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPy is that it takes care of all your mathematical problems with useful functions that are cleaner and faster to write than normal Python code. This is all thanks to its similarities with the C language.
2 - SciPy
Also known as Scientific Python, SciPy is built on top of NumPy and takes scientific computing to another level. It is an advanced form of NumPy and allows users to carry out tasks such as differential equation solving, special functions, optimization, and integration. SciPy can be viewed as a library that saves time and has predefined complex algorithms that are fast and efficient. However, there is a plethora of SciPy tools that might confuse users more than help them.
3 - Pandas
Pandas is a key data manipulation and analysis library in Python. Pandas' strengths lie in its ability to provide rich data functions that work amazingly well with structured data. There have been a lot of comparisons between pandas and R packages due to their similarities in data analysis, but the general consensus is that it is very easy for anyone using R to migrate to pandas, as it supposedly executes the best features of R and Python programming all in one.
4 - Matplotlib
Matplotlib is a visualization powerhouse for Python programming, and it offers a large library of customizable tools to help visualize complex datasets. Providing appealing visuals is vital in the fields of research and data analysis. Python's 2D plotting library is used to produce plots and make them interactive with just a few lines of code. The plotting library additionally offers a range of graphs including histograms, bar charts, error charts, scatter plots, and much more.
5 - scikit-learn
scikit-learn is Python's most comprehensive machine learning library and is built on top of NumPy and SciPy. One of the advantages of scikit-learn is the all-in-one resource approach it takes, which contains various tools to carry out machine learning tasks, such as supervised and unsupervised learning.
6 - IPython
IPython makes life easier for Python developers working with data. It's a great interactive web notebook that provides an environment for exploration with prewritten Python programs and equations. The ultimate goal behind IPython is improved efficiency, by allowing scientific computation and data analysis to happen concurrently using multiple third-party libraries.
Continue learning Python with a fun (and potentially lucrative!) way to use decision trees. Read on to find out more.
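Before moving on, here is a minimal sketch of how a few of these libraries fit together in practice. The data is generated on the fly purely for demonstration; it uses NumPy for the raw arrays, pandas for tabular manipulation, and Matplotlib for a quick plot.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: generate some illustrative numerical data
ages = np.random.randint(18, 80, size=100)
spend = ages * 2.5 + np.random.normal(0, 20, size=100)

# pandas: wrap the arrays in a DataFrame for easy manipulation
df = pd.DataFrame({"age": ages, "spend": spend})
print(df.describe())

# Matplotlib: visualize the relationship
plt.scatter(df["age"], df["spend"])
plt.xlabel("Age")
plt.ylabel("Spend")
plt.title("Illustrative age vs. spend data")
plt.show()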


Top 4 chatbot development frameworks for developers

Sugandha Lahoti
20 Oct 2017
8 min read
The rise of the bots is nigh! If you can imagine a situation involving a dialog, there is probably a chatbot for that. Just look at the chatbot market - text-based email/SMS bots, voice-based bots, bots for customer support, transaction-based bots, entertainment bots and many others. A large number of enterprises, from startups to established organizations, are seeking to invest in this sector. This has also led to an increase in the number of platforms used for chatbot building. These frameworks incorporate AI techniques along with natural language processing capabilities to assist developers in building and deploying chatbots. Let’s start with how a chatbot typically works before diving into some of the frameworks. Understand: The first step for any chatbot is to understand the user input. This is made possible using pattern matching and intent classification techniques. ‘Intents’ are the tasks that users might want to perform with a chatbot. Machine learning, NLP and speech recognition techniques are typically used to identify the intent of the message and extract named entities. Entities are the specific pieces of information extracted from the user’s response i.e. the content associated with an intent. Respond: After understanding, the next goal is to generate a response. This is based on the current input message and the context of the conversation. After specifying the intents and entities, a dialog flow is constructed. This is basically the replies/feedback expected from a chatbot. Learn: Chatbots use AI techniques such as natural language understanding and pattern recognition to store and distinguish between the context of the information provided, and elicit a suitable response for future replies. This is important because different requests might have different meanings depending on previous requests. Top chatbot development frameworks A bot development framework is a set of predefined classes, functions, and utilities that a developer can use to build chatbots easier and faster. They vary in the level of complexity, integration capabilities, and functionalities. Let us look at some of the development platforms utilized for chatbot building. API.AI API.AI, a code based framework with a simple web-based interface, allows users to build engaging voice and text-based conversational apps using a large number of libraries and SDKs including Android, iOS, Webkit HTML5, Node.js, and Python API. It also supports nearly 32 one-click platform integrations such as Google, Facebook Messenger, Twitter and Skype to name a few. API.AI makes use of an agent - a container that transforms natural language based user requests into actionable data. The software tries to find the intent behind a user’s reply and matches it to the default or the closest match. After intent matching, it executes the actions and responses the developer has defined for that intent. API.AI also makes use of entities. Once the intents and entities are specified, the bot is trained. API.AI’s training module efficiently tracks each user’s request and lets developers see how they are parsed and matched to an intent. It also allows for correction of any errors and change requests thus retraining the bot. API.AI streamlines the entire bot-creating process by helping developers provide domain-specific knowledge that is unique to a bot’s needs while working on speech recognition, intent and context management in the backend. Google has recently partnered with API.AI to help them build conversational tools like Apple’s Siri. 
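Before moving on to the other frameworks, here is a deliberately toy Python sketch of the "understand" step described earlier: a keyword-based intent matcher with a simple entity extractor. It is not how API.AI or any of the platforms below actually work (they rely on trained NLP models rather than hand-written rules), and every intent name, keyword list, and the tracking-id format are made up purely for illustration.

import re

# Toy intent definitions: intent name -> trigger keywords (illustrative only)
INTENTS = {
    "track_package": ["track", "tracking", "package", "parcel"],
    "book_flight": ["flight", "book", "fly"],
}

def classify_intent(message):
    """Return the intent whose keywords overlap most with the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    scores = {name: len(words & set(kw)) for name, kw in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def extract_entities(message):
    """Pull out one simple entity: a tracking id such as 'AB123456' (made-up format)."""
    match = re.search(r"\b[A-Z]{2}\d{6}\b", message)
    return {"tracking_id": match.group()} if match else {}

msg = "Can you track my package AB123456?"
print(classify_intent(msg))   # -> track_package
print(extract_entities(msg))  # -> {'tracking_id': 'AB123456'}

Real frameworks replace the keyword lookup with statistical intent classification and trained entity recognizers, but the input-to-intent-to-entities flow is the same.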
Microsoft Bot Framework Microsoft Bot Framework allows building and deployment of chatbots across multiple platforms and services such as web, SMS, non-Microsoft platforms, Office 365, Skype etc. The Bot Framework includes two components - The Bot Builder and the Microsoft Cognitive Services. The Bot Builder comprises of two full-featured SDKs - for the.NET and the Node.js platforms along with an emulator for testing and debugging. There’s also a set of RESTful APIs for building code in other languages. The SDKs support features for simple and easy interactions between bots. They also have a large collection of prebuilt sample bots for the developer to choose from. The Microsoft Cognitive Services is a collection of intelligent APIs that simplify a variety of AI tasks such as allowing the system to understand and interpret the user's needs using natural language in just a few lines of code. These APIs allow integration to most modern languages and platforms and constantly improve, learn, and get smarter. Microsoft created the AI Inner Circle Partner Program to work hand in hand with industry to create AI solutions. Their only partner in the UK is ICS.AI who build conversational AI solutions for the UK's public sector. ICS are the first choice for many organisations due to their smart solutions that scale and serve to improve services for the general public. Developers can build bots in the Bot Builder SDK using C# or Node.js. They can then add AI capabilities with Cognitive Services. Finally, they can register the bots on the developer portal, connecting it to users across platforms such as Facebook and Microsoft Teams and also deploy it on the cloud like Microsoft Azure. For a step-by-step guide for chatbot building using Microsoft Bot Framework, you can refer to one of our books on the topic. Sabre Corporation, a customer service provider for travel agencies, have recently announced the development of an AI-powered chatbot that leverages Microsoft Bot Framework and Microsoft Cognitive Services. Watson Conversation IBM’s Watson Conversation helps build chatbot solutions that understand natural-language input and use machine learning to respond to customers in a way that simulates conversations between humans. It is built on a neural network of one million Wikipedia words. It offers deployment across a variety of platforms including mobile devices, messaging platforms, and robots. The platform is robust and secure as IBM allows users to opt out of data sharing. The IBM Watson Tone Analyzer service can help bots understand the tone of the user’s input for better management of the experience. The basic steps to create a chatbot using Watson Conversation are as follows. We first create a workspace - a place for configuring information to maintain separate intents, user examples, entities, and dialogues for each application. One workspace corresponds to one bot. Next, we create Intents. Watson Conversation makes use of multiple conditioned responses to distinguish between similar intents. For example, instead of building specific intents for locations of different places, it creates a general intent “location” and adds an entity to capture the response, like the “location- bedroom” - to the right, near the stairs, “location-kitchen”- to the left. The third step is entity establishment. This involves grouping entities that might trigger a similar response in the dialog. 
The dialog flow thus generated after specifying the intents and entities goes through testing, and is then embedded into an application. It is then connected with other services by using the conversation API. Staples, an office supply retailing firm, uses Watson Conversation in their "Easy Systems" to simplify the customer's shopping experience.
CXP Designer and Aspect NLU
Aspect Customer Experience Platform is an application lifecycle management tool to build text and voice-based applications such as chatbots. It provides deployment options across multiple communication channels like text, voice, mobile web and social media networks. The Aspect CXP typically includes a CXP designer to build chatbots and the inbuilt Aspect NLU to provide advanced natural language capabilities. CXP designer works by creating dialog objects to provide a menu of options for the frontend as well as the backend. Menu items for the frontend are used to create intents and modules within those intents. The developer can then modify labels (of those intents and modules) manually or use the Aspect NLU to disambiguate similar questions for successful extraction of meaning and intent. The Aspect NLU includes tools for spelling correction, linguistic lexicons such as nouns, verbs etc. and options for detecting and extracting common data types such as date, time, numbers, etc. It also allows a developer to modify the meaning extraction to suit their needs, if they want to. CXP designer also allows skipping of certain steps in chatbots. For instance, if the user has already provided their tracking id for a particular package, the chatbot will skip the prompt asking them for the tracking id again. With Aspect CXP, developers can create and deploy complex chatbots. Radisson Blu Edwardian, a hotel in London, has collaborated with Aspect Software to build an SMS-based, AI virtual host.
Conclusion
Another popular chatbot development platform worth mentioning is Facebook Messenger, with over 100,000 monthly active bots, but without cross-platform deployment features. The above bot frameworks are typically used by developers to build chatbots from scratch and require some programming skills. However, there has been a rise in automated bot development tools of late. Some of these include Chatfuel and Motion AI, and they typically involve drag-and-drop functionality. With such tools, beginners and non-programmers can create and deploy chatbots within a few minutes. But they lack the extended functionality supported by typical code-based frameworks, such as the flexibility to store data, produce analytics or incorporate customized AI tasks. Every chatbot development system, whether framework or tool, serves a different purpose. Choosing the right one depends on the type of application to build, organizational needs, and the developer's expertise.

NewSQL: What the hype is all about

Amey Varangaonkar
06 Nov 2017
6 min read
First, there was data. Data became database. Then came SQL. Next came NoSQL. And now comes NewSQL.
NewSQL Origins
For decades, the relational database, or SQL, was the reigning data management standard in enterprises all over the world. With the advent of Big Data and cloud-based storage rose the need for a faster, more flexible and scalable data management system, which didn't necessarily comply with SQL's ACID guarantees. This was popularly dubbed NoSQL, and databases like MongoDB, Neo4j, and others gained prominence in no time. We can attribute the emergence and eventual adoption of NoSQL databases to a couple of very important factors. The high costs and lack of flexibility of the traditional relational databases drove many SQL users away. Also, NoSQL databases are mostly open source, and their enterprise versions are comparatively cheaper too. They are schema-less, meaning they can be used to manage unstructured data effectively. In addition, they can scale well horizontally - i.e. you could add more machines to increase computing power and use it to handle high volumes of data. All these features of NoSQL come with an important tradeoff, however - these systems can't simultaneously ensure total consistency. Of late, there has been a rise in another type of database system, with the aim of combining 'the best of both worlds'. Popularly dubbed 'NewSQL', this system promises to combine the relational data model of SQL with the scalability and speed of NoSQL.
NewSQL - The dark horse in the databases race
NewSQL is 'SQL on steroids', say many. This is mainly because all NewSQL systems start with the relational data model and the SQL query language, but also incorporate the features that have led to the rise of NoSQL - addressing the issues of scalability, flexibility, and high performance. They offer the assurance of ACID transactions as in the relational models. However, what makes them really unique is that they allow the horizontal scaling functionality of NoSQL, and can process large volumes of data with high performance and reliability. This is why businesses really like the concept of NewSQL - the performance of NoSQL and the reliability and consistency of the SQL model, all packed in one. To understand what the hype surrounding NewSQL is all about, it's worth comparing NewSQL database systems with the traditional SQL and NoSQL database systems, and seeing where they stand out:
Characteristic | Relational (SQL) | NoSQL | NewSQL
ACID compliance | Yes | No | Yes
OLTP/OLAP support | Yes | No | Yes
Rigid schema structure | Yes | No | In some cases
Support for unstructured data | No | Yes | In some cases
Performance with large data | Moderate | Fast | Very fast
Performance overhead | Huge | Moderate | Minimal
Support from community | Very high | High | Low
As we can see from the table above, NewSQL really comes through as the best when you're dealing with larger datasets and want to lower performance overheads. To give you a practical example, consider an organization that has to work with a large number of short transactions, access a limited amount of data, but executes those queries repeatedly. For such an organization, a NewSQL database system would be a perfect fit. These features are leading to the gradual growth of NewSQL systems. However, it will take some time for more industries to adopt them.
Not all NewSQL databases are created equal
Today, one has a host of NewSQL solutions to choose from. Some popular solutions are Clustrix, MemSQL, VoltDB and CockroachDB.
Cloud Spanner, the latest NewSQL offering by Google, became generally available in February 2017 - indicating Google’s interest in the NewSQL domain and the value a NewSQL database can offer to their existing cloud offerings. It is important to understand that there are significant differences among these various NewSQL solutions. As such you should choose a NewSQL solution carefully after evaluating your organization’s data requirements and problems. As this article on Dataconomy points out, while some databases handle transactional workloads well, they do not offer the benefit of native clustering - SAP HANA is one such example. NuoDB focuses on cloud deployments, but its overall throughput is found to be rather sub-par. MemSQL is a suitable choice when it comes to clustered analytics but falls short when it comes to consistency. Thus, the choice of the database purely depends on the task you want to do, and what trade-offs you are ready to allow without letting it affect your workflow too much. DBAs and Programmers in the NewSQL world Regardless of which database system an enterprise adopts, the role of DBAs will continue to be important going forward. Core database administration and maintenance tasks such as backup, recovery, replication, etc. will need to be taken care of. The major challenge for the NewSQL DBAs will be in choosing and then customizing the right database solution that fits the organizational requirements. Some degree of capacity planning and overall database administration skills might also have to be recalibrated. Likewise, NewSQL database programmers may find themselves dealing with data manipulation and querying tasks similar to those faced while working with traditional database systems. But NewSQL programmers will be doing these tasks at a much larger, or shall we say, at a more ‘distributed’ scale. In conclusion When it comes to solving a particular problem related to data management, it’s often said that 80% of the solution comes down to selecting the right tool, and 20% is about understanding the problem at hand! In order to choose the right database system for your organization, you must ask yourself these two questions: What is the nature of the data you will work with? What are you willing to trade-off? In other words, how important are factors such as the scalability and performance of the database system? For example, if you primarily work with mostly transactional data with a priority on high performance and high scalability, then NewSQL databases might fit your bill just perfectly. If you’re going to work with volatile data, NewSQL might help you there as well, however, there are better NoSQL solutions to tackle your data problem. As we have seen earlier, NewSQL databases have been designed to combine the advantages and power of both relational and NoSQL systems. It is important to know that NewSQL databases are not designed to replace either NoSQL or SQL relational models. They are rather intentionally-built alternatives for data processing, which mask the flaws and shortcomings of both relational and nonrelational database systems. The ultimate goal of NewSQL is to deliver a high performance, highly available solution to handle modern data, without compromising on data consistency and high-speed transaction capabilities.


When, why and how to use Graph analytics for your big data

Sunith Shetty
20 Dec 2017
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform real-time streaming analytics on big data using machine learning algorithms and power of Java. [/box] From the article given below, you will learn why graph analytics is a favourable choice in order to analyze complex datasets. Graph analytics Vs Relational Databases The biggest advantage to using graphs is you can analyze these graphs and use them for analyzing complex datasets. You might ask what is so special about graph analytics that we can’t do by relational databases. Let’s try to understand this using an example, suppose we want to analyze your friends network on Facebook and pull information about your friends such as their name, their birth date, their recent likes, and so on. If Facebook had a relational database, then this would mean firing a query on some table using the foreign key of the user requesting this info. From the perspective of relational database, this first level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram). The query to get this becomes more and more complicated from a relational database perspective but this is a trivial task on a graph or graphical database (such as Neo4j). Graphs are extremely good on operations where you want to pull information from one end of the node to another, where the other node lies after a lot of joins and hops. As such, graph analytics is good for certain use cases (but not for all use cases, relational database are still good on many other use cases): As you can see, the preceding diagram depicts a huge social network (though the preceding diagram might just be depicting a network of a few friends only). The dots represent actual people in a social network. So if somebody asks to pick one user on the left-most side of the diagram and see and follow host connections to the right-most side and pull the friends at the say 10th level or more, this is something very difficult to do in a normal relational database and doing it and maintaining it could easily go out of hand. There are four particular use cases where graph analytics is extremely useful and used frequently (though there are plenty more use cases too): Path analytics: As the name suggests, this analytics approach is used to figure out the paths as you traverse along the nodes of a graph. There are many fields where this can be used—simplest being road networks and figuring out details such as shortest path between cities, or in flight analytics to figure out the shortest time taking flight or direct flights. Connectivity analytics: As the name suggests, this approach outlines how the nodes within a graph are connected to each other. So using this you can figure out how many edges are flowing into a node and how many are flowing out of the node. This kind of information is very useful in analysis. For example, in a social network if there is a person who receives just one message but gives out say ten messages within his network then this person can be used to market his favorite products as he is very good in responding to messages. Community Analytics: Some graphs on big data are huge. But within these huge graphs there might be nodes that are very close to each other and are almost stacked in a cluster of their own. 
This is useful information, as based on this you can extract communities from your data. For example, in a social network, if there are people who are part of some community, say marathon runners, then they can be clubbed into a single community and further tracked.
Centrality analytics: This kind of analytical approach is useful in finding central nodes in a network or graph. This is useful in figuring out sources that are single-handedly connected to many other sources. It is helpful in figuring out influential people in a social network, or a central computer in a computer network.
From the perspective of this article, we will be covering some of these use cases in our sample case studies, and for this we will be using a library on Apache Spark called GraphFrames.
GraphFrames
The GraphX library is advanced and performs well on massive graphs, but, unfortunately, it's currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for dataframe (now dataset) based graphs. It contains a lot of methods that are direct wrappers over the underlying GraphX methods. As such, it provides similar functionality to GraphX, except that GraphX acts on the Spark RDD and GraphFrames works on the dataframe, so GraphFrames is more user friendly (as dataframes are simpler to use). All the advantages of firing Spark SQL queries, joining datasets and filtering queries are supported on this. To understand GraphFrames and representing massive big data graphs, we will take small baby steps first by building some simple programs using GraphFrames before building full-fledged case studies.
Building a graph using GraphFrames
Consider that you have a simple graph as shown next. This graph depicts four people Kai, John, Tina, and Alex and the relations they share, whether they follow each other or are friends. We will now try to represent this basic graph using the GraphFrame library on top of Apache Spark and in the meantime, we will also start learning the GraphFrame API. Since GraphFrames is a module on top of Spark, let's first build the Spark configuration and Spark SQL context for brevity:
SparkConf conf = ...
JavaSparkContext sc = ...
SQLContext sqlContext = ...
We will now build the JavaRDD object that will contain the data for our vertices, or the people Kai, John, Alex, and Tina in this small network. We will create some sample data using the RowFactory class of the Spark API and provide the attributes (ID of the person, and their name and age) that we need per row of the data:
JavaRDD<Row> verRow = sc.parallelize(Arrays.asList(
    RowFactory.create(101L, "Kai", 27),
    RowFactory.create(201L, "John", 45),
    RowFactory.create(301L, "Alex", 32),
    RowFactory.create(401L, "Tina", 23)));
Next we will define the structure, or schema, of the attributes used to build the data.
The ID of the person is of type long, the name of the person is a string, and the age of the person is an integer, as shown next in the code:
List<StructField> verFields = new ArrayList<StructField>();
verFields.add(DataTypes.createStructField("id", DataTypes.LongType, true));
verFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
verFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
Now, let's build the sample data for the relations between these people, which can later be represented as the edges of the graph. This relationship data will have the IDs of the persons that are connected together and the type of relationship they share (that is, friends or followers). Again we will use the Spark-provided RowFactory, build some sample data per row, and create the JavaRDD with this data:
JavaRDD<Row> edgRow = sc.parallelize(Arrays.asList(
    RowFactory.create(101L, 301L, "Friends"),
    RowFactory.create(101L, 401L, "Friends"),
    RowFactory.create(401L, 201L, "Follow"),
    RowFactory.create(301L, 201L, "Follow"),
    RowFactory.create(201L, 101L, "Follow")));
Again, define the schema of the attributes added as part of the edges earlier. This schema is later used in building the dataset for the edges. The attributes passed are the source ID of the node, the destination ID of the other node, as well as the relationType, which is a string:
List<StructField> EdgFields = new ArrayList<StructField>();
EdgFields.add(DataTypes.createStructField("src", DataTypes.LongType, true));
EdgFields.add(DataTypes.createStructField("dst", DataTypes.LongType, true));
EdgFields.add(DataTypes.createStructField("relationType", DataTypes.StringType, true));
Using the schemas that we have defined for the vertices and edges, let's now build the actual datasets for the vertices and the edges. For this, first create the StructType objects that hold the schema details for the vertices and the edges data, and using these structures and the actual data we will next build the dataset of the vertices (verDF) and the dataset for the edges (edgDF):
StructType verSchema = DataTypes.createStructType(verFields);
StructType edgSchema = DataTypes.createStructType(EdgFields);
Dataset<Row> verDF = sqlContext.createDataFrame(verRow, verSchema);
Dataset<Row> edgDF = sqlContext.createDataFrame(edgRow, edgSchema);
Finally, we will now use the vertices and the edges datasets and pass them as parameters to the GraphFrame constructor to build the GraphFrame instance:
GraphFrame g = new GraphFrame(verDF, edgDF);
The time has now come to see some mild analytics on the graph we just created. Let's first visualize our data for the graph; let's see the data on the vertices. For this, we will invoke the vertices method on the GraphFrame instance and invoke the standard show method on the generated vertices dataset (GraphFrame would generate a new dataset when the vertices method is invoked).
g.vertices().show();
This would print the output as follows:
Let's also see the data on the edges:
g.edges().show();
This would print the output as follows:
Let's also see the number of edges and the number of vertices:
System.out.println("Number of Vertices : " + g.vertices().count());
System.out.println("Number of Edges : " + g.edges().count());
This would print the result as follows:
Number of Vertices : 4
Number of Edges : 5
GraphFrames has a handy method to find all the in-degrees (and similarly the out-degrees or degrees):
g.inDegrees().show();
This would print the in-degrees of all the vertices as shown next:
Finally, let's see one more small thing on this simple graph. As GraphFrames work on datasets, all the handy dataset methods such as filtering, map, and so on can be applied to them. We will use the filter method and run it on the vertices dataset to figure out the people in the graph with age greater than thirty:
g.vertices().filter("age > 30").show();
This would print the result as follows:
From this post, we learned about graph analytics. We saw how graphs can be built from massive big datasets in order to derive quick insights. You will understand when to choose graph analytics over a relational database based on the growing challenges in your organization. To know more about preparing and refining big data and performing smart data analytics using machine learning algorithms, you can refer to the book Big Data Analytics with Java.


GDPR is pushing the Chief Data Officer role center stage

Aaron Lazar
13 Apr 2018
9 min read
Gartner predicted that by 2020, 90% of large organizations in regulated industries will have a Chief Data Officer role. With the recent heat around Facebook, Mark Zuckerberg, and the fast-approaching GDPR compliance deadline, it's quite likely that 2018 will be the year of the Chief Data Officer. This article was first published in October 2017 and has been updated to keep up with the latest trends in GDPR.
In 2014, around 400 CEOs and top business execs were asked how they recognize data as a corporate asset. They responded with a mixed set of reactions and viewed the worth of data in their organization in varied ways. Now, in 2018, these reactions have drastically changed – more and more organizations have realized the importance of Data-as-an-Asset. More importantly, the European Union (EU) has made it mandatory that General Data Protection Regulation (GDPR) compliance be sought by the 25th of May, 2018.
The primary reason for creating the Chief Data Officer role was to bridge the gap between functional management and IT teams in an organization. But now, it looks like the CDO will also be primarily focusing on setting up and driving GDPR compliance, to avoid a fine of up to €20 million or 4% of the annual global turnover of the previous year, whichever is higher.
We're going to spend a few minutes breaking down the Chief Data Officer role for you, revealing several interesting insights along the way. Let's start with the obvious question.
What might the Chief Data Officer's responsibilities be?
Like other C-suite execs, a Chief Data Officer is expected to have a well-blended mix of technical know-how and business acumen. Their role is very diverse, and its scope can sometimes be hard to define. Here are some of the key responsibilities of a Chief Data Officer:
Data Policies and GDPR Compliance
Data security is one of the most important elements that any business must consider. It needs to comply with the regulatory standards and requirements of the country where it operates. A Chief Data Officer is responsible for ensuring the compliance of policies across all branches of the business and the associated compliance requirement taxonomies, on a global level.
What is GDPR?
If you're thinking there were no data protection laws before the General Data Protection Regulation 2016/679, that's not true. The major difference, however, is that GDPR focuses more on customer data privacy and protection. GDPR requirements will change the way organizations store, process and protect their customers' personal data. They will need solutions for assessing, implementing and maintaining GDPR compliance, and that's where Chief Data Officers fit in.
Using data to gain a competitive edge
Chief Data Officers need to have sound knowledge of the business' customers, the markets where the business operates, and strong analytical skills to leverage the right data, at the right time, in the right place. This would eventually give the business an edge over its competitors in the market. For example, a ferry service could use data to identify the rates that customers would be willing to pay at a certain time of the day.
Setting in motion the best practices of data governance
Organizations span across the globe these days and often employees from different parts of the world work on the same data. This can often result in data moving through unconnected systems, and ending up as inefficient or disjointed pieces of business information.
A Chief Data Officer needs to ensure that this information is aggregated and maintained in such a way that clear information ownership across the organization is established. Architecting future-proof data solutions A Chief Data Officer often acts like a Data Architect. They will sometimes take on responsibility for planning, designing, and building Big Data systems and ensuring successful integration with other systems in an organization. Designing systems that can provide answers to the user’s problems now and in the future is vital. Chief Data Officers are often found asking themselves key questions like how to generate data with maximum reusability, while also making sure it’s as accurate and relevant as possible. Defining Information Management tools Different business units across the globe tend to use different tools, technologies, and systems to work on, store, and share information on an enterprise level. This greatly affects a company’s ability to access and leverage data for effective decision making and various other duties. A Chief Data Officer is responsible for establishing data-oriented standards across the business and for driving all arms of the business to comply with the standards and embrace change, to ensure the integrity of data. Spotting new opportunities A Chief Data Officer is responsible for spotting new opportunities where the business can venture into through careful analysis of data and past records. For example, a motor company would leverage certain sales information to make an informed decision on what age group to target with their new SUV in-the-making, to maximize sales. These are just the tip of the iceberg when it comes to a Chief Data Officer’s responsibilities. Responsibilities go hand-in-hand with skill and traits required to execute those responsibilities. Below are some key capabilities sought after in a CDO. Key Chief Data Officer Skills The right person for the job is expected to possess impeccable leadership and C-Suite level communication skills, as well as strong business acumen. They are expected to have strong knowledge of GDPR software tools and solutions to enable the organizations hiring them to swiftly transform and adopt the new regulations. They are also expected to possess knowledge of IT architecture, including a familiarity with leading architectural standards such as TOGAF or the Zachman framework. They need to be experienced in driving data governance as well as data quality and integrity, while also possessing a strong knowledge of data analytics, visualization, and storytelling. Familiar with Big Data solutions like Hadoop, MapReduce, and HBase is a plus. How much do Chief Data Officers earn?  Now, let’s take a look at what kind of compensation a Chief Data Officer is likely to be offered. To tell you the truth, the answer to this question is still a bit hazy, but it’s sure to pick up speed with the recent developments in the regulatory and legal areas related to data. About a year ago, a blog post from careeraddict revealed that the salary for a CDO in the US was around $112,000 annually. A job listing seen on Indeed quoted $200,000 as the annual salary. Indeed shows 7 jobs posted for a CDO in the last 15 days. We took the estimated salaries of CDOs and compared them with those of CIOs and CTOs in the same company. It turns out that most were on par, with a few CDO compensations falling slightly short of the CIO and CTO salary. These are just basic salary figures. Bonuses add on the side, amounting up to 50% in some cases. 
However, please note that these salary figures vary heavily based on the type of organization and the industry.
Do businesses even need a Chief Data Officer?
One might argue that some of the skills expected of a Chief Data Officer would also be held by the CIO or the Chief Digital Officer of the organization, or the Data Protection Officer (if they have one). Then why have a Chief Data Officer at all and incur a significant extra cost to the company? With the rapid change in tech and the rate at which data is generated, used, and discarded, most data pointers point in the direction of having a separate Chief Data Officer, working alongside the CIO. It's critical to have a clearly defined need for both roles to co-exist. Blurring the boundaries of the two roles can be detrimental and organizations must, therefore, be painstakingly mindful of the defined KRAs. The organisation should clearly define the two roles to keep the business structure running smoothly.
A Chief Data Officer's main focus will be on the latest data-centric technological innovations and their compliance with the new standards, while also boosting customer engagement, privacy and, in turn, loyalty and the business's competitive advantage. The CIO, on the other hand, focuses on improving the bottom line by owning business productivity metrics, cost-cutting initiatives, making IT investments etc. – i.e., an inward facing data-management and architecture role. The CIO is the person who is therefore responsible for leading digital initiatives at a board level.
In addition to managing data and governing information, if the CIO's responsibilities were to also include implementing analytics in fresh ways to generate value for the business, it is going to be a tall order. To put it simply, it is more practical for the CIO to own the systems and the CDO to oversee all the bits and bytes that flow through these systems. Moreover, in several cases, the Chief Data Officer will act as a liaison between the business and IT. Thus, Chief Data Officers and CIOs both need to work together and support each other for a better functioning business.
The bottom line: a Chief Data Officer is essential
For an organization dealing with a lot of data, a Chief Data Officer is a must. Failure to have one on board can result in being fined €10 million or 2% of the organization's worldwide turnover (depending on which is higher). Here are the criteria for an organization to need dedicated personnel managing data protection. The organization's core activities should:
Have data processing operations which require regular and systematic monitoring of data subjects on a large scale or monitoring of individuals
Be processing a large scale of special categories of data (i.e. sensitive data such as health, religion, race, sexual orientation etc.)
Have data processing carried out by a public authority or a body processing personal data, except for courts operating in their judicial capacity
Apart from this mandate, a Chief Data Officer can add immense value by aligning data-driven insights with the organization's vision and goals. A CDO can bridge the gap between the CMO and the CIO by focusing on meeting customer requirements through data-driven products. For those in data and insights centric roles such as data scientists, data engineers, data analysts and others, the CDO is a natural destination for their career progression journey. The Chief Data Officer role is highly attractive in terms of the scope of responsibilities, the capabilities and, of course, the pay.
Certification courses like this one are popping up to help individuals shape themselves for the role. All-in-all, this new C-suite position in most organizations, is the perfect pivot between old and new, bridging silos, and making a future where data privacy is intact.

How to create a strong data science project portfolio that lands you a job

Aaron Lazar
13 Feb 2018
8 min read
Okay, you’re probably here because you’ve got just a few months to graduate and the projects section of your resume is blank. Or you’re just an inquisitive little nerd scraping the WWW for ways to crack that dream job. Either way, you’re not alone and there are ten thousand others trying to build a great Data Science portfolio to land them a good job. Look no further, we’ll try our best to help you on how to make a portfolio that catches the recruiter’s eye! David “Trent” Salazar‘s portfolio is a great example of a wholesome one and Sajal Sharma’s, is a good example of how one can display their Data Science Portfolios on a platform like Github. Companies are on the lookout for employees who can add value to the business. To showcase this on your resume effectively, the first step is to understand the different ways in which you can add value. 4 things you need to show in a data science portfolio Data science can be broken down into 4 broad areas: Obtaining insights from data and presenting them to the business leaders Designing an application that directly benefits the customer Designing an application or system that directly benefits other teams in the organisation Sharing expertise on data science with other teams You’ll need to ensure that your portfolio portrays all or at least most of the above, in order to easily make it through a job selection. So let’s see what we can do to make a great portfolio. Demonstrate that you know what you're doing So the idea is to show the recruiter that you’re capable of performing the critical aspects of Data Science, i.e. import a data set, clean the data, extract useful information from the data using various techniques, and finally visualise the findings and communicate them. Apart from the technical skills, there are a few soft skills that are expected as well. For instance, the ability to communicate and collaborate with others, the ability to reason and take the initiative when required. If your project is actually able to communicate these things, you’re in! Stay focused and be specific You might know a lot, but rather than throwing all your skills, projects and knowledge in the employer’s face, it’s always better to be focused on doing something and doing it right. Just as you’d do in your resume, keeping things short and sweet, you can implement this while building your portfolio too. Always remember, the interviewer is looking for specific skills. Research the data science job market Find 5-6 jobs, probably from Linkedin or Indeed, that interest you and go through their descriptions thoroughly. Understand what kind of skills the employer is looking for. For example, it could be classification, machine learning, statistical modeling or regression. Pick up the tools that are required for the job - for example, Python, R, TensorFlow, Hadoop, or whatever might get the job done. If you don’t know how to use that tool, you’ll want to skill-up as you work your way through the projects. Also, identify the kind of data that they would like you to be working on, like text or numerical, etc. Now, once you have this information at hand, start building your project around these skills and tools. Be a problem solver Working on projects that are not actual ‘problems’ that you’re solving, won’t stand out in your portfolio. The closer your projects are to the real-world, the easier it will be for the recruiter to make their decision to choose you. This will also showcase your analytical skills and how you’ve applied data science to solve a prevailing problem. 
Put at least 3 diverse projects in your data science portfolio

A nice way to create a portfolio is to list 3 good projects that are diverse in nature. Here are some interesting project types to get you started:

Data cleaning and wrangling

Data cleaning is one of the most critical tasks a data scientist performs. By taking a group of diverse data sets, consolidating them and making sense of them, you're giving the recruiter confidence that you know how to prep data for analysis. For example, you can take Twitter or WhatsApp data and clean it for analysis. The process is pretty simple: find a "dirty" data set, spot an interesting angle to approach the data from, clean it up, perform your analysis, and finally present your findings.

Data storytelling

Storytelling showcases not only your ability to draw insight from raw data, but also how well you can convey those insights to others and persuade them. For example, you could use data from the bus system in your country to identify which stops incur the most delays - something that could then be fixed by changing routes. Make sure your analysis is descriptive and that your code and logic can be followed. Here's what you do: find a good dataset, explore it and spot correlations, then visualise it before you start writing up your narrative. Tackle the data from various angles and pick the most interesting one - if it's interesting to you, it will most probably be interesting to anyone else reviewing it. Break down and explain each step, and each code snippet, as if you were describing it to a friend. The idea is to teach the reviewer something new as you run through the analysis.

End to end data science

If you're more into machine learning or algorithm writing, you should do an end-to-end data science project: one that takes in data, processes it, and finally learns from it, every step of the way. For example, you could pick up fuel pricing data for your city, or stock market data; the data needs to be dynamic and updated regularly. The trick here is to keep the code simple so that it's easy to set up and run. First identify a good topic. Understand that you won't be handed a single tidy dataset - you'll need to import and parse the data yourself and bring it together into one dataset. Next, get the training and test data ready to make predictions. Document your code and your findings, and you're good to go. A minimal sketch of the train/test step follows below.
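To make the end-to-end idea concrete, here is a minimal, hypothetical sketch of the "prepare training and test data, fit a model, make predictions" step using scikit-learn. The fuel-price CSV and its column names are invented for illustration only; a real project would also include the data collection and parsing stages described above.

```python
# A minimal end-to-end modelling step: split, fit, predict, evaluate.
# "fuel_prices.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Assume the earlier import/parse stage has produced one consolidated dataset
df = pd.read_csv("fuel_prices.csv")

# Features and target (purely illustrative column names)
X = df[["crude_oil_price", "exchange_rate", "month"]]
y = df["pump_price"]

# Hold out a test set so you can honestly report how the model generalises
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a simple, easy-to-explain baseline model first
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on unseen data and report a single, easy-to-read metric
predictions = model.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, predictions))
```

Keeping the model this simple is deliberate: a reviewer should be able to clone the project, run one script, and see a result within minutes.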
Prove you have the data science skill set

If you want to get that job, you've got to have the appropriate tools to get the job done. Here's a list of some of the most popular tools to skill up on:

Data science languages

There are a number of key languages in data science that are essential. It might seem obvious, but making sure they're on your resume and demonstrated in your portfolio is incredibly important. Include things like:
Python
R
Java
Scala
SQL

Big Data tools

If you're applying for big data roles, demonstrating your experience with the key technologies is a must. It not only proves you have the skills, but also shows that you have an awareness of what tools can be used to build a big data solution or project. You'll need:
Hadoop
Spark
Hive

Machine learning frameworks

With machine learning so in demand, if you can prove you've used a number of machine learning frameworks, you've already done a lot to impress. Remember, many organizations won't actually know as much about machine learning as you think; in fact, they might even be hiring you with a view to building out this capability. Remember to include:
TensorFlow
Caffe2
Keras
PyTorch

Data visualisation tools

Data visualisation is a crucial component of any data science project. If you can visualise and communicate data effectively, you're immediately demonstrating that you can collaborate with others and make your insights accessible and useful to the wider business. Include tools like these in your resume and portfolio:
D3.js
Excel charts
Tableau
ggplot2

So there you have it - you know what to do to build a decent data science portfolio. It's also really worth attending competitions and challenges: they will not only keep your skills up to date and well oiled, but also give you a broader picture of what people are actually working on and which tools they're using to solve problems.

6 Key Areas to focus on while transitioning to a Data Scientist role

Amey Varangaonkar
21 Dec 2017
6 min read
[box type="note" align="" class="" width=""]The following article is an excerpt taken from the book Statistics for Data Science, authored by James D. Miller. The book dives into the different statistical approaches to discover hidden insights and patterns from different kinds of data.[/box] Being a data scientist is undoubtedly a very lucrative career prospect. In fact, it is one of the highest paying jobs in the world right now. That said, transitioning from a data developer role to a data scientist role needs careful planning and a clear understanding of what the role is all about. In this interesting article, the author highlights six key skills must give special attention to, during this transition. Let's start by taking a moment to state what I consider to be a few generally accepted facts about transitioning to a data scientist. We'll reaffirm these beliefs as we continue through this book: Academia: Data scientists are not all from one academic background. They are not all computer science or statistics/mathematics majors. They do not all possess an advanced degree (in fact, you can use statistics and data science with a bachelor's degree or even less). It's not magic-based: Data scientists can use machine learning and other accepted statistical methods to identify insights from data, not magic. They are not all tech or computer geeks: You don't need years of programming experience or expensive statistical software to be effective. You don't need to be experienced to get started. You can start today, right now. (Well, you already did when you bought this book!) Okay, having made the previous declarations, let's also be realistic. As always, there is an entry-point for everything in life, and, to give credit where it is due, the more credentials you can acquire to begin out with, the better off you will most likely be. Nonetheless, (as we'll see later in this chapter), there is absolutely no valid reason why you cannot begin understanding, using, and being productive with data science and statistics immediately. [box type="info" align="" class="" width=""]As with any profession, certifications and degrees carry the weight that may open the doors, while experience, as always, might be considered the best teacher. There are, however, no fake data scientists but only those with currently more desire than practical experience.[/box] If you are seriously interested in not only understanding statistics and data science but eventually working as a full-time data scientist, you should consider the following common themes (you're likely to find in job postings for data scientists) as areas to focus on: Education Common fields of study here are Mathematics and Statistics, followed by Computer Science and Engineering (also Economics and Operations research). Once more, there is no strict requirement to have an advanced or even related degree. In addition, typically, the idea of a degree or an equivalent experience will also apply here. Technology You will hear SAS and R (actually, you will hear quite a lot about R) as well as Python, Hadoop, and SQL mentioned as key or preferable for a data scientist to be comfortable with, but tools and technologies change all the time so, as mentioned several times throughout this chapter, data developers can begin to be productive as soon as they understand the objectives of data science and various statistical mythologies without having to learn a new tool or language. 
[box type="info" align="" class="" width=""]Basic business skills such as Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool are assumed pretty much everywhere and don't really count as an advantage, but experience with programming languages (such as Java, PERL, or C++) or databases (such as MySQL, NoSQL, Oracle, and so on.) does help![/box] Data The ability to understand data and deal with the challenges specific to the various types of data, such as unstructured, machine-generated, and big data (including organizing and structuring large datasets). [box type="info" align="" class="" width=""]Unstructured data is a key area of interest in statistics and for a data scientist. It is usually described as data having no redefined model defined for it or is not organized in a predefined manner. Unstructured information is characteristically text-heavy but may also contain dates, numbers, and various other facts as well.[/box] Intellectual Curiosity I love this. This is perhaps well defined as a character trait that comes in handy (if not required) if you want to be a data scientist. This means that you have a continuing need to know more than the basics or want to go beyond the common knowledge about a topic (you don't need a degree on the wall for this!) Business Acumen To be a data developer or a data scientist you need a deep understanding of the industry you're working in, and you also need to know what business problems your organization needs to unravel. In terms of data science, being able to discern which problems are the most important to solve is critical in addition to identifying new ways the business should be leveraging its data. Communication Skills All companies look for individuals who can clearly and fluently translate their findings to a non-technical team, such as the marketing or sales departments. As a data scientist, one must be able to enable the business to make decisions by arming them with quantified insights in addition to understanding the needs of their non-technical colleagues to add value and be successful. This article paints a much clearer picture on why soft skills play an important role in becoming a better data scientist. So why should you, a data developer, endeavor to think like (or more like) a data scientist? Specifically, what might be the advantages of thinking like a data scientist? The following are just a few notions supporting the effort: Developing a better approach to understanding data Using statistical thinking during the process of program or database designing Adding to your personal toolbox Increased marketability If you found this article to be useful, make sure you check out the book Statistics for Data Science, which includes a comprehensive list of tips and tricks to becoming a successful data scientist by mastering the basic and not-so-basic concepts of statistics.  