YSU Big Data Final

Creator
Seonglae Cho
Created
2024 Apr 24 12:57
Edited
2024 Jun 20 0:36
Exam format: multiple choice (up to 30 questions) and short answer (a few words or sentences, up to 5 questions).

Driving Big Data with Hadoop Tools and Technologies

SQOOP

When structured data grows too large for an RDBMS to handle, it is transferred to HDFS with a tool called SQOOP.
Organizational data stored in relational databases is extracted and loaded into Hadoop with SQOOP for further processing. SQOOP can also be used to move data from relational databases to HBase.

Flume

Flume collects data from streaming sources such as sensors, social media, and web server log files, and moves it into HDFS for processing. Flume has a flexible architecture that captures data from multiple sources and processes it in parallel.
  • Apache Avro is an open-source data serialization framework.
  • Apache Pig, Apache Mahout, Apache Oozie

Apache Hive

Hive is a tool to process structured data in the Hadoop environment.
Components: Metastore, Hive Query Language (HQL), JDBC/ODBC, Compiler, Parser, Plan executor

Big Data Analytics

  • A data warehouse, also termed an enterprise data warehouse, is a repository for the data that organizations and business enterprises collect.
  • Business intelligence is the process of analyzing data and producing output that helps organizations and end users make decisions.
  • Prescriptive analytics provides decision support to benefit from the outcome of the analysis.
  • Diagnostic analytics is a form of analytics that enables users to understand what happened and why it happened, so that corrective action can be taken if something went wrong.
  • Descriptive analytics describes, summarizes, and visualizes massive amounts of raw data into a form that is interpretable by end users.
  • Big data analytics is the science of examining or analyzing large data sets with a variety of data types.
  • OLTP is used to process and manage transaction-oriented applications.
  • RTAP (Real-Time Analytics Platform)
  • Online analytical processing (OLAP) systems are used to process data analysis queries and perform effective analysis on massive amounts of data.
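The contrast between OLTP and OLAP workloads in the list above can be sketched with Python's built-in `sqlite3` module. This is only an illustration on an in-memory database; the `orders` table, its columns, and the sample rows are hypothetical, and real OLAP systems run on dedicated analytical stores rather than SQLite.

```python
import sqlite3

def run_demo():
    # In-memory database with a hypothetical "orders" table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )

    # OLTP-style work: many small writes, each committed as its own transaction.
    new_orders = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 20.0)]
    for region, amount in new_orders:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (region, amount) VALUES (?, ?)",
                (region, amount),
            )

    # OLAP-style work: one read-only query that aggregates over all rows.
    rows = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows

summary = run_demo()
print(summary)
```

The inserts touch one row at a time, while the final query scans the whole table to produce per-region totals, which is the kind of query an OLAP system is built to answer at scale.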
What are the types of semantic analysis?
  1. Natural Language Processing
  2. Text analytics
  3. Sentiment analysis
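As a rough illustration of sentiment analysis, the sketch below scores a sentence by counting words from two small word lists. The word lists and the `sentiment` function are hypothetical and far simpler than what production tools use; they only show the basic idea of mapping text to a polarity label.

```python
import re

# Hypothetical, tiny word lists; real sentiment lexicons are much larger.
POSITIVE = {"good", "great", "fast", "love"}
NEGATIVE = {"bad", "slow", "poor", "hate"}

def sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["I love the fast delivery", "Slow and poor service", "The package arrived"]:
    print(sentence, "->", sentiment(sentence))
```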

Big Data Analytics with Machine Learning

  • Divisive clustering is a clustering technique that starts with one giant cluster and divides it into smaller clusters.
  • Hierarchical clustering is a clustering technique that results in the development of a tree-like structure.
  • Hierarchical clustering produces a series of partitions: either smaller clusters are merged step by step into a single cluster or, conversely, a single large cluster is iteratively divided into smaller clusters.
    • Agglomerative clustering is done by merging several smaller clusters into a single larger cluster from the bottom up.
    • Divisive clustering is done by dividing a single large cluster into smaller clusters.
  • Partitional clustering is the method of partitioning a data set into a set of clusters.
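The bottom-up (agglomerative) procedure described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the points are one-dimensional, the distance between clusters is single linkage (distance between their closest members), and the `agglomerative` function and sample data are hypothetical.

```python
def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []                       # history of (cluster_a, cluster_b, distance)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters), merges

points = [1.0, 2.0, 9.0, 10.0, 25.0]
clusters, merges = agglomerative(points, 2)
print(clusters)
for a, b, d in merges:
    print("merged", a, "and", b, "at distance", d)
```

Divisive clustering would run in the opposite direction, starting from one cluster that holds all points and splitting it repeatedly. The recorded merge history is the information a dendrogram draws.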

Big Data Visualization

  • Once hierarchical clustering is complete, the results are visualized with a graph or tree diagram called a dendrogram.
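To make the idea of a dendrogram concrete, the sketch below prints a merge hierarchy as an indented text tree. The nested-tuple representation `(left, right, distance)`, the `render` function, and the sample hierarchy are all hypothetical; plotting libraries draw the same structure graphically.

```python
def render(node, depth=0):
    """Return the lines of an indented text tree for a merge hierarchy."""
    indent = "  " * depth
    if not isinstance(node, tuple):  # leaf: the label of a single data point
        return [f"{indent}{node}"]
    left, right, distance = node     # internal node: two children and merge distance
    lines = [f"{indent}+ merge at distance {distance}"]
    lines += render(left, depth + 1)
    lines += render(right, depth + 1)
    return lines

# Hypothetical hierarchy: a and b merge first, c and d merge first,
# then the two pairs merge at a larger distance.
tree = (("a", "b", 1.0), ("c", "d", 1.0), 7.0)
print("\n".join(render(tree)))
```

Reading the tree from the leaves upward gives the agglomerative order of merges; reading it from the root downward gives the divisive order of splits.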

Conventional Data Visualization Techniques

Tableau: Line Chart, Bar Chart, Pie Chart, Scatter Chart, Bubble Plot
