YSU Big Data Midterm

Creator

Seonglae Cho

Created

2024 Mar 14 4:11

Editor

Seonglae Cho

Edited

2024 May 7 5:5

Refs

average 74 something

LearnUs Quiz

No online search is allowed

Closed book

Format

Multiple choice: up to 30 questions

Short answers (a few words/sentences): Up to 5 questions

Introduction to the World of Big Data

2.5 quintillion (2 500 000 000 000 000 000, or 2.5 x 10^18) bytes of data are generated every day

Of the data available today, 80 percent has been generated in the last few years

Semi-structured data, such as XML

Semi-structured data has a structure but does not fit into a relational database

Big Data Life Cycle*

Data Generation
Data Aggregation

The process of collecting the raw data and transmitting it to storage

Data Preprocessing

Data Integration: combining data from different sources to give the end user a unified view
Data Cleaning: spotting or identifying errors, then correcting or deleting the erroneous data
Data Reduction: reducing the dimensions and attributes of the data
Data Transformation: consolidating the data into an appropriate format
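
The four preprocessing steps can be illustrated with a minimal Python sketch. The records, field names, and helper functions below are hypothetical and chosen only for illustration; real pipelines would typically use a data-processing framework.

```python
# Hypothetical records from two sources; field names are assumptions.
source_a = [
    {"id": 1, "name": "alice", "age": "34", "note": "x"},
    {"id": 2, "name": "bob", "age": "", "note": "y"},      # missing age
]
source_b = [
    {"id": 3, "name": "carol", "age": "29", "note": "z"},
    {"id": 1, "name": "alice", "age": "34", "note": "x"},  # duplicate of id 1
]

def integrate(*sources):
    """Data integration: combine records from several sources into one list."""
    return [record for source in sources for record in source]

def clean(records):
    """Data cleaning: drop duplicate ids and records with a missing age."""
    seen, result = set(), []
    for record in records:
        if record["id"] in seen or not record["age"]:
            continue
        seen.add(record["id"])
        result.append(record)
    return result

def reduce_attributes(records, keep=("id", "name", "age")):
    """Data reduction: keep only the attributes needed for analysis."""
    return [{key: record[key] for key in keep} for record in records]

def transform(records):
    """Data transformation: normalize names and convert age to an integer."""
    return [
        {**record, "name": record["name"].title(), "age": int(record["age"])}
        for record in records
    ]

prepared = transform(reduce_attributes(clean(integrate(source_a, source_b))))
print(prepared)
```

Here cleaning deletes bad records; correcting them (for example, imputing a missing age) is the other option named in the notes.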

Big Data Analytics
Visualizing Big Data

Challenges Faced by Big Data Technology

Heterogeneity and Incompleteness
Volume and Velocity of the Data
Data Storage
Data Privacy

3 Vs of big data.

Velocity
Volume
Variety

commodity hardware

Commodity hardware is low-cost, low-performance, low-specification functional hardware with no distinctive features

Big Data Storage Concepts

Hadoop

Data Warehouse

Cluster Computing

Clusters may be classified into two major types:

High availability ← Replication
Load balancing ← sharding

Load balancing aims to optimize the use of resources, minimize response time, maximize throughput, and avoid overloading any single resource
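
One way to picture the load-balancing goal is a dispatcher that always sends the next request to the least-loaded node. This is only a sketch under simple assumptions (every request costs the same, node names are made up); production load balancers use richer policies.

```python
import heapq

def assign_requests(num_requests, node_names):
    """Send each request to the node that currently holds the fewest requests."""
    # Heap entries are (current_load, node_name); the lightest node pops first.
    heap = [(0, name) for name in node_names]
    heapq.heapify(heap)
    loads = {name: 0 for name in node_names}
    for _ in range(num_requests):
        load, name = heapq.heappop(heap)
        loads[name] = load + 1
        heapq.heappush(heap, (load + 1, name))
    return loads

# Hypothetical three-node cluster receiving ten equal-cost requests.
loads = assign_requests(10, ["node-a", "node-b", "node-c"])
print(loads)
```

No node ends up with more than one request above any other, which is the "avoid overload on any one resource" property in miniature.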

Cluster Structure

Symmetric clusters
Asymmetric clusters with Head Node

Distributed Computing

The computer cluster architecture emerged as a result of Distributed systems

Sharding: partitioning very large data sets into smaller, more manageable chunks (shards)
Replication: creating copies of the same set of data across multiple servers
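
Sharding and replication can be combined, as in the following minimal sketch. The server names, the replication factor, and the hash-then-modulo placement rule are assumptions made for illustration, not the scheme of any particular database.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2", "server-3"]  # hypothetical cluster
REPLICATION_FACTOR = 2                                       # assumed copy count

def shard_for(key, num_shards=len(SERVERS)):
    """Sharding: map a key to one shard using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def replicas_for(key):
    """Replication: place copies on the primary server and the next ones in order."""
    primary = shard_for(key)
    return [SERVERS[(primary + i) % len(SERVERS)] for i in range(REPLICATION_FACTOR)]

placement = {key: replicas_for(key) for key in ["user:1", "user:2", "user:3"]}
for key, servers in placement.items():
    print(key, "->", servers)
```

Sharding spreads different keys over different servers (load balancing), while replication keeps each key on more than one server (high availability).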

Master-Slave Model: one centralized device known as the master controls one or more devices known as slaves (Asymmetric clusters)
Peer-to-Peer Model: all nodes are at the same level (Symmetric clusters)

Sharding and replication

RDBMS
NoSQL
BASE Property
NewSQL: provides the scalable performance of NoSQL systems combined with the ACID guarantees of an RDBMS
Scaling-up (vertical scalability)
Scaling-out (horizontal scalability)
Clusters adopt a failover mechanism to eliminate service interruptions: the process of automatically switching to a redundant node upon abnormal termination or failure
Switchover, by contrast, requires human intervention.
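
The difference between failover and switchover can be sketched as follows. The Cluster class and its method names are hypothetical; the point is only that failover is triggered by the failure itself, while switchover is an explicit operator call.

```python
class Cluster:
    """Toy cluster with one active node and redundant standby nodes."""

    def __init__(self, nodes):
        self.healthy = {name: True for name in nodes}
        self.active = nodes[0]

    def mark_failed(self, name):
        """Record a failure; if the active node failed, fail over automatically."""
        self.healthy[name] = False
        if name == self.active:
            self._failover()

    def _failover(self):
        # Failover: promote the first healthy standby without human intervention.
        for name, is_healthy in self.healthy.items():
            if is_healthy:
                self.active = name
                return
        raise RuntimeError("no healthy node left")

    def switchover(self, name):
        """Switchover: an operator deliberately moves the active role."""
        if not self.healthy.get(name, False):
            raise ValueError(f"{name} is not a healthy node")
        self.active = name

cluster = Cluster(["node-1", "node-2", "node-3"])
cluster.mark_failed("node-1")   # automatic failover
print(cluster.active)           # node-2
cluster.switchover("node-3")    # manual switchover
print(cluster.active)           # node-3
```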

NoSQL Database

CAP Theorem: Consistency, Availability, Partition tolerance

ACID (atomicity, consistency, isolation, and durability)

BASE Property (Basically available, Soft state, Eventual consistency)
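
Eventual consistency, the "E" in BASE, can be pictured with a toy store that acknowledges a write after updating a single replica and propagates it to the others later. The class and method names are hypothetical, and the explicit sync() call stands in for background replication.

```python
class EventuallyConsistentStore:
    """Toy replicated store: writes reach one replica first, the rest on sync()."""

    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # writes not yet propagated to every replica

    def write(self, key, value):
        # Basically available: acknowledge after updating one replica only.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Soft state: the answer depends on which replica serves the read.
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Eventual consistency: all replicas converge once pending writes are applied.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read("x", replica_index=2)   # None: replica 2 has not seen the write
store.sync()
fresh = store.read("x", replica_index=2)   # 1: replicas have converged
print(stale, fresh)
```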

NoSQL

Key-value (simplest, efficient): DynamoDB, Azure Table

Column-oriented (data as columns instead of rows): Apache Cassandra

Document-oriented (encapsulate data in documents): CouchDB

Graph (nodes and their relationships): Neo4j
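
The four NoSQL families listed above (commonly named key-value, column-oriented, document, and graph stores) differ mainly in how the data is shaped. The sketch below models each shape with plain Python structures; the sample users and field names are made up, and none of this reflects the actual API of the products mentioned.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"user:1": json.dumps({"name": "Alice", "city": "Seoul"})}

# Column-oriented: values are grouped by column rather than by row.
column_store = {
    "name": {"user:1": "Alice", "user:2": "Bob"},
    "city": {"user:1": "Seoul", "user:2": "Busan"},
}

# Document: each record is a self-contained, possibly nested document.
document_store = {
    "user:1": {"name": "Alice", "address": {"city": "Seoul"}, "tags": ["student"]},
}

# Graph: nodes plus explicit relationships between them.
nodes = {"user:1": {"name": "Alice"}, "user:2": {"name": "Bob"}}
edges = [("user:1", "FOLLOWS", "user:2")]

def followed_by(user_id):
    """Return the ids of the nodes that user_id follows."""
    return [dst for src, rel, dst in edges if src == user_id and rel == "FOLLOWS"]

print(json.loads(key_value["user:1"])["name"])
print(column_store["city"]["user:2"])
print(document_store["user:1"]["address"]["city"])
print(followed_by("user:1"))
```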

Insert: inserting a document creates a new collection if it does not already exist

Capped collection: a fixed-size collection in which the oldest entries are automatically overwritten once the specified maximum size (size) is reached
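
The overwrite behaviour of a capped collection can be imitated with a bounded deque. This is a simplified stand-in: the CappedCollection class below is hypothetical and limits the number of documents, whereas a real capped collection is primarily bounded by a size in bytes.

```python
from collections import deque

class CappedCollection:
    """Toy capped collection that keeps only the newest max_documents entries."""

    def __init__(self, max_documents):
        # deque(maxlen=...) silently discards the oldest item when full.
        self._docs = deque(maxlen=max_documents)

    def insert(self, doc):
        self._docs.append(doc)

    def find(self):
        """Return documents in insertion order, oldest first."""
        return list(self._docs)

log = CappedCollection(max_documents=3)
for i in range(5):
    log.insert({"event": i})

print(log.find())  # only events 2, 3 and 4 remain
```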

Processing, Management Concepts, and Cloud Computing

Data processing

Centralized Processing

Distributed Processing

Batch Processing

Real-time Processing

Parallel Computing

: sharing of resources accessed from heterogeneous environment

Cloud Challenges

Cloud Services: run on a distributed network and use virtualized resources (pay-as-you-go)

Cloud Storage

Driving Big Data with Hadoop Tools and Technologies (~ HBASE)

Hadoop

Hadoop Ecosystem

The Hadoop framework has 4 core components: Hadoop Common, HDFS, Hadoop YARN, and Hadoop MapReduce
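
Of these components, MapReduce is the processing model. Its map, shuffle, and reduce phases can be sketched in a single Python process with the classic word-count example. This is not Hadoop's API; the function names are illustrative, and a real job would run the phases in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["big data big cluster", "data node"])
print(counts)
```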

Hadoop 1.0 Limitations

The rack is a storage area where multiple DataNodes are put together

Hadoop 2.0

HBase: a column-oriented NoSQL database that is horizontally scalable and built on top of HDFS

Automatic failover, auto sharding, horizontal scalability, column-oriented

Fault tolerance is the ability of the system to work without interruption in case of system hardware or software failure

HDFS is designed to work on commodity hardware to make it cost effective

A slave node in Hadoop has the DataNode and TaskTracker. A master node has the NameNode and JobTracker.
