Quoted from DBLM: Ok, I can't sleep, so I will post the first part of the primer on Big Data and Analytics. In its simplest form, you need 3 components to do anything analytical, as represented by the diagram below: you need data in one or more databases (1), you need analytical tools that will access the data and do things with it (2), and you need a place to run all of this (3).
Databases: These take many forms. Prevalent databases are things like Oracle and DB2, which are great for storing backend data and running queries. With the rise of analytics came data stores tailored specifically for analytical workloads. These started with data appliances like Netezza and Teradata, and have evolved into systems like Snowflake, Greenplum, and others. This does not take into account mainframe databases, which are incredibly prevalent. You also have specialized databases, like graph and NoSQL stores, for other types of data and functions.
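To make the transactional-vs-analytical split concrete, here is a minimal sketch using SQLite purely for illustration; the table and column names are made up for this example, and real analytical stores handle the same pattern across billions of rows.

```python
# Minimal sketch of the two query patterns, using SQLite purely for
# illustration; table and column names are made up for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "widget", 120.0), ("west", "widget", 95.5), ("east", "gadget", 240.0)],
)

# Transactional pattern: fetch a handful of rows by key -- what Oracle/DB2
# backends are tuned for.
row = conn.execute(
    "SELECT amount FROM sales WHERE region = ? AND product = ?", ("east", "widget")
).fetchone()

# Analytical pattern: scan and aggregate the whole table -- the kind of
# workload appliances and warehouses like Netezza, Teradata, or Snowflake
# are built to run at scale.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)
```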
Analytics: Runs a wide gamut, but is generally broken down into descriptive analytics and predictive analytics. Descriptive analytics is typically where you are running statistical functions, finding answers, and the like. Despite all of the buzz, most analytics are descriptive in nature! Tools here include things like SAS, SPSS (IBM), and the like. You also have your Business Intelligence tools here, which handle visualization, reporting, and light analytics. These include things like Tableau (Salesforce), BusinessObjects (SAP), Cognos (IBM), Qlik, etc. Predictive analytics tools cover a wide range, but what differentiates them from descriptive tools is that they take inputs, run models many times, and try to predict outcomes. I will go into more detail later, but this covers tools like Palantir, C3.AI, Databricks, DataRobot, Watson, UiPath, and a whole lot more. You also have a lot of open source tools that people use here, including Python, TensorFlow, etc. Most analytics firms combine these approaches based upon what they are trying to do. As you build, train, and run models, you are moving data back and forth to the database(s).
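Here is a toy illustration of the descriptive-vs-predictive split; the numbers are made up, and any real project would lean on heavier tooling (SAS, SPSS, scikit-learn, TensorFlow, etc.) rather than raw NumPy.

```python
# Toy illustration of descriptive vs. predictive analytics; the sales
# figures below are invented for this example.
import numpy as np

monthly_sales = np.array([110.0, 125.0, 133.0, 151.0, 160.0, 178.0])

# Descriptive analytics: summarize what already happened.
print("mean:", monthly_sales.mean())
print("std dev:", monthly_sales.std())
print("best month:", monthly_sales.max())

# Predictive analytics: fit a model to the history and project forward.
months = np.arange(len(monthly_sales))
slope, intercept = np.polyfit(months, monthly_sales, deg=1)
next_month = slope * len(monthly_sales) + intercept
print("projected next month:", round(next_month, 1))
```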
Place to run all of this: This used to be on premise, in racks or special-built, refrigerator-sized appliances. With the cloud, some of this moved to cloud providers (AWS, Azure, Google, etc.). Depending on the work you are doing or your security requirements, moving large volumes of data to the cloud can be very expensive or outright prohibited. Legitimately, if you have large volumes of data, it is more cost effective to ship the drives to the cloud provider or have them send an 18-wheeler to transport your data. That is why you are seeing a lot of hybrid approaches, where some data is in the cloud and some is on premise. The cloud providers also have proprietary tools, which is good if you are all-in on AWS, for example, but bad if you are trying to use multiple clouds and/or don't want vendor lock-in. The two big cost components of running things in the cloud are the cost of the application itself (say, Snowflake) plus the cost of the compute environment to run everything. iceman44, this is what your customer is talking about in trying to reduce compute costs, because both of these components can add up fast.
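A back-of-envelope sketch of how those two cost components (plus data movement) stack up is below. Every rate in it is a made-up placeholder, not a quote from any provider; the point is only that the lines add up independently and quickly.

```python
# Back-of-envelope sketch of the cloud cost components mentioned above:
# application charges, the compute environment underneath, and data transfer.
# All rates are hypothetical placeholders, not real provider pricing.
compute_hours_per_month = 8 * 22 * 40        # 8 hrs/day, 22 days, 40-node cluster
compute_rate_per_hour = 2.50                 # hypothetical $/node-hour
application_credits = 1_500                  # hypothetical warehouse credits used
credit_price = 3.00                          # hypothetical $/credit
data_moved_tb = 50
transfer_rate_per_tb = 90.00                 # hypothetical $/TB moved in/out

compute_cost = compute_hours_per_month * compute_rate_per_hour
application_cost = application_credits * credit_price
transfer_cost = data_moved_tb * transfer_rate_per_tb

total = compute_cost + application_cost + transfer_cost
print(f"compute:     ${compute_cost:>10,.2f}")
print(f"application: ${application_cost:>10,.2f}")
print(f"transfer:    ${transfer_cost:>10,.2f}")
print(f"total/month: ${total:>10,.2f}")
```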
Cloud is not always cheaper! In fact, it is often more expensive for analytics workloads than running them on bare metal. However, you can reduce overhead and skirt organizational politics (typically, infrastructure teams are separate from the business teams, etc.). Going to the cloud has a lot of facets.
Ok, this is enough of part 1 for tonight, and it will set the groundwork for getting into more detail in these areas in the AM. But to give you food for thought: it is one thing to get an answer; it is completely another to operationalize an answer. This is a big stumbling area for AI/ML as things stand. We will get more into this later.
[quoted image]