average 74 something
- LearnUs Quiz
- No online search is allowed
- Closed book
- Format
- Multiple choices: Up to 30 questions
- Short answers (a few words/sentences): Up to 5 questions
Introduction to the World of Big Data
- 2.5 quintillion (2 500 000 000 000 000 000, or 2.5 × 10^18) bytes of data are generated every day
- Of the data available today, 80 percent has been generated in the last few years
- Semi-Structured Data like xml
- Semi-structured data has a structure but does not fit into a relational database
- Big Data Life Cycle*
- Data Generation
- Data Aggregation
- process of collecting the raw data, transmitting the data to a storage
- Data Preprocessing
- Data Integration involves combining data from different sources to give the end
- Data Cleaning: spotting or identifying the error, then correcting the error or deleting the erroneous data
- Data Reduction: reducing the dimensions and attributes of the data
- Data Transformation: consolidating the data into an appropriate format
- Big Data Analytics
- Visualizing Big Data
- Challenges Faced by Big Data Technology
- Heterogeneity and Incompleteness
- Volume and Velocity of the Data
- Data Storage
- Data Privacy
- 3 Vs of big data.
- Velocity
- Volume
- Variety
- commodity hardware
- Commodity hardware is low-cost, low-performance, low-specification functional hardware with no distinctive features
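The four Data Preprocessing steps listed above (integration, cleaning, reduction, transformation) can be sketched in plain Python. This is only an illustration: the record fields (`user`, `age`, `city`), the cleaning rules, and all function names are assumptions made up for the example, not something from the course material.

```python
def integrate(*sources):
    """Data integration: combine records from different sources into one list."""
    return [dict(record) for source in sources for record in source]

def clean(records):
    """Data cleaning: spot errors, then correct them or delete the record."""
    cleaned = []
    for record in records:
        if record.get("age") is None:        # error spotted: missing value
            continue                         # delete the erroneous record
        if isinstance(record["age"], str):   # error spotted: wrong type
            record = {**record, "age": int(record["age"])}  # correct it
        cleaned.append(record)
    return cleaned

def reduce_attributes(records, keep):
    """Data reduction: keep only the attributes that are needed."""
    return [{key: record[key] for key in keep} for record in records]

def transform(records):
    """Data transformation: consolidate into one format (lowercase name, age)."""
    return [(record["user"].lower(), record["age"]) for record in records]

def preprocess(*sources):
    records = integrate(*sources)
    records = clean(records)
    records = reduce_attributes(records, keep=("user", "age"))
    return transform(records)

# Hypothetical input from two sources
crm = [{"user": "Ann", "age": "31", "city": "Seoul"},
       {"user": "Bob", "age": None, "city": "Busan"}]
web = [{"user": "Cho", "age": 27, "city": "Daegu"}]
print(preprocess(crm, web))
```

Bob is dropped during cleaning because his age is missing, Ann's age is corrected from a string to an integer, and `city` disappears in the reduction step.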
Big Data Storage Concepts
- Cluster Computing
- the clusters may be classified into two major types
- High availability ← Replication
- Load balancing ← sharding
- optimize the use of resources, minimize response time, maximize throughput, and avoid overload on any one of the resources
- Cluster Structure
- Symmetric clusters
- Asymmetric clusters with Head Node
- Distributed Computing
- The computer cluster architecture emerged as a result of Distributed systems
- Sharding: partitioning very large data sets into smaller pieces (shards)
- Replication: creating copies of the same set of data across multiple servers
- Master-Slave Model: one centralized device known as the master controls one or more devices known as slaves (Asymmetric clusters)
- Peer-to-Peer Model: Symmetric clusters (same level)
- Sharding and replication
- RDBMS
- NoSQL BASE Property
- NewSQL: provides the scalable performance of NoSQL systems combined with the ACID guarantees of an RDBMS
- Scaling-up (vertical scalability)
- Scaling-out (horizontal scalability)
- A cluster adopts a failover mechanism to eliminate service interruptions (failover: the process of switching to a redundant node upon abnormal termination or failure)
- Switchover, unlike failover, requires human intervention.
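Sharding, replication, and failover from the notes above can be combined in one small sketch. Everything here is a toy model: the class `ShardedStore`, its methods, and the hash-based shard choice are assumptions for illustration and do not correspond to any particular database.

```python
import hashlib

class ShardedStore:
    """Toy key-value store: keys are sharded, each shard is replicated."""

    def __init__(self, num_shards=3, replicas=2):
        self.num_shards = num_shards
        # shards[s][r] is replica r of shard s
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]
        self.down = set()  # (shard, replica) pairs marked as failed

    def shard_for(self, key):
        # Sharding: a stable hash decides which shard owns the key
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_shards

    def put(self, key, value):
        # Replication: write the same data to every replica of the shard
        # (simplification: writes ignore the failed state)
        for replica in self.shards[self.shard_for(key)]:
            replica[key] = value

    def fail(self, shard, replica):
        self.down.add((shard, replica))

    def get(self, key):
        # Failover: skip failed replicas and read from the next healthy one
        shard = self.shard_for(key)
        for index, replica in enumerate(self.shards[shard]):
            if (shard, index) not in self.down:
                return replica[key]
        raise RuntimeError(f"all replicas of shard {shard} are down")

store = ShardedStore()
store.put("user:1", "Ann")
store.fail(store.shard_for("user:1"), 0)  # first replica fails
print(store.get("user:1"))                # still served by the second replica
```

Sharding spreads keys across shards for load balancing, while replication is what makes the read survive a node failure, matching the "High availability ← Replication" and "Load balancing ← sharding" pairing in the notes.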
NoSQL Database
- CAP Theorem: Consistency, Availability, Partition tolerance

- ACID Property (atomicity, consistency, isolation, and durability)
- BASE Property (Basically available, Soft state, Eventual consistency)
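- Key-value store (simplest, efficient): DynamoDB, Azure Table
- Column-oriented store (data as columns instead of rows): Apache Cassandra
- Document-oriented store (encapsulates data in documents): CouchDB
- Graph store (nodes and the relationships between them): Neo4j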
- Insert creates a new collection if it does not already exist
- Capped collection: older entries are automatically overwritten when the specified maximum size (size) is reached
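The overwrite behaviour of a capped collection can be imitated in plain Python with a bounded deque. This is an analogy only: the class `CappedCollection` and its methods are hypothetical, and the cap here is a number of documents, whereas a real capped collection (for example in MongoDB) is limited primarily by size in bytes.

```python
from collections import deque

class CappedCollection:
    """Toy capped collection: fixed capacity, oldest entries dropped first."""

    def __init__(self, max_docs):
        # deque(maxlen=...) discards the oldest item when a new one is
        # appended to a full deque
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        self._docs.append(doc)

    def find(self):
        # Documents come back in insertion order
        return list(self._docs)

log = CappedCollection(max_docs=3)
for n in range(1, 6):
    log.insert({"n": n})
print(log.find())  # only the three most recent documents remain
```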
Processing, Management Concepts, and Cloud Computing
- Data processing
- centralized processing
- Distributed Processing
- Batch Processing
- Real-time Processing
- Parallel Computing
- : sharing of resources accessed from heterogeneous environment
- Cloud Challenges
- Cloud Services: delivered over a distributed network using virtualized resources (pay-as-you-go)
- Cloud Storage
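The difference between Batch Processing and Real-time Processing in the list above can be shown with the same computation done both ways. A minimal sketch, assuming the task is an average over sensor readings; the function and class names are invented for the example.

```python
def batch_average(stored_readings):
    """Batch processing: the data is collected first, then processed at once."""
    return sum(stored_readings) / len(stored_readings)

class RunningAverage:
    """Real-time processing: the result is updated as each value arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to keep the earlier values
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

readings = [10, 20, 30, 40]

print(batch_average(readings))        # one answer, after all data is stored

stream = RunningAverage()
for value in readings:
    print(stream.update(value))       # an up-to-date answer after every event
```

Both approaches end at the same value; the batch version answers once with high latency, while the streaming version has an answer available after every event.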
Driving Big Data with Hadoop Tools and Technologies (~ HBASE)
- Hadoop Ecosystem
- 4 core components of the Hadoop framework: Hadoop Common, HDFS, Hadoop YARN, Hadoop MapReduce
- Hadoop 1.0 Limitations
- The rack is a storage area where multiple DataNodes are put together
- Hadoop 2.0
- HBase: column-oriented NoSQL database that is horizontally scalable and built on top of HDFS
- Automatic failover, auto sharding, horizontal scalability, column-oriented
Fault tolerance is the ability of the system to work without interruption in case of system hardware or software failure
HDFS is designed to work on commodity hardware to make it cost effective
A slave node in Hadoop has the DataNode and TaskTracker. A master node has the NameNode and the JobTracker.
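The MapReduce component listed in the Hadoop framework follows a map, shuffle, reduce pattern, which can be sketched with a word count in plain Python. This is not the Hadoop API: it runs in a single process, and the function names are made up for the illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map calls would run in parallel on the nodes that hold the data blocks, and the shuffle would move the pairs across the network to the reducers.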
Seonglae Cho