Two main components of HADOOP
- HDFS (Hadoop Distributed File System) used for data storage
- MapReduce for high-performance parallel data processing
Goal: massively parallelise access to and processing of huge datasets in a distributed environment
1. Parallel processing
- data processed in parallel - fast
2. Data Locality
- processing data where it is stored instead of moving data to centralised server
Two functions of MapReduce
1. Map function
- performs actions like filtering and grouping
- outputs results for its shard as key-value pairs
2. Reduce function
- aggregates and summarises the result produced by map function
Describe MapReduce process
1. runs a Map function at data nodes and writes output as <key,value> pairs to local disk
2. each Map function uses only the data in its block
3. intermediate results are sent to the correct Reducer such that all <key,value> results with the same key are received by the same Reducer
4. Reducer aggregates and summarises intermediate results to create the final result
5. final result is stored on HDFS (and thus replicated)
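The steps above can be sketched in plain Python as a word-count job (illustrative only; `map_fn`, `shuffle` and `reduce_fn` are made-up names, not the Hadoop API):

```python
from collections import defaultdict

# Map function: runs once per block, emits <key, value> pairs.
def map_fn(block):
    for word in block.split():
        yield (word, 1)

# Shuffle: route all pairs with the same key to the same reducer.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reduce function: aggregate and summarise the values for each key.
def reduce_fn(key, values):
    return (key, sum(values))

blocks = ["big data big", "data lake"]   # one "block" per data node
mapped = [pair for b in blocks for pair in map_fn(b)]
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'lake': 1}
```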
Hadoop Distributed File System
- stores different types of large datasets(structured, unstructured and semi structured)
- stores data across various nodes
- designed for fast processing of large files (high throughput) and processing files without user interaction ("batch processing")
- users can only read(access), delete, create or extend a file
- canNOT change anything in a file that already exists; new blocks are added to the end of an existing file (append-only)
1. Name nodes
- manages entire cluster(metadata)
- mastermind
- maps a file to a list of data nodes having parts of the file
- SecondaryNameNode: standby, in case NameNode fails
2. Data nodes
- stores actual data
- maps a block id of file to physical location on its disk
3. HDFS Clients
- runs in your node
- interacts with HDFS to access data - open/close/read/write,delete,create
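The client-facing semantics can be sketched in pure Python (illustrative only, not the real HDFS client API): files can be created, read and appended to, but never modified in place:

```python
class AppendOnlyFile:
    """Mimics HDFS file semantics: create, read, append -- no in-place edits."""
    def __init__(self):
        self.blocks = []          # an HDFS file is a sequence of immutable blocks

    def append(self, data):
        self.blocks.append(data)  # a new block goes at the end of the file

    def read(self):
        return "".join(self.blocks)

f = AppendOnlyFile()
f.append("hello ")
f.append("world")
print(f.read())   # hello world
# There is deliberately no write_at(offset) method: existing blocks are immutable.
```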
- HDFS is scalable(can handle huge datasets)
- HDFS is elastic(easy to add more storage ie more nodes)
- HDFS is for big data and high throughput
- Very fast batch processing
- Name node keeps in memory the locations of all blocks of all files
- Data nodes tell Name node that they are still functioning(heartbeats) every 3 seconds + block status reports
- Name node uses this to ensure there are 3 replicas of all blocks in the network at all times
- Name Node detects if node comes up again & deletes extra copies
- Name Node keeps log on disk of all writes to all files
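The heartbeat and re-replication bookkeeping can be sketched like this (illustrative names and data structures; the real NameNode logic is far more involved):

```python
REPLICATION_FACTOR = 3

# NameNode view: block id -> set of data nodes holding a replica
block_locations = {"blk_1": {"dn1", "dn2", "dn3"},
                   "blk_2": {"dn1", "dn3", "dn4"}}

def handle_missed_heartbeat(dead_node, live_nodes):
    """Remove a dead node's replicas and re-replicate under-replicated blocks."""
    repaired = []
    for block, nodes in block_locations.items():
        nodes.discard(dead_node)
        if len(nodes) < REPLICATION_FACTOR:
            # pick any live node that doesn't already hold a replica
            target = next(n for n in live_nodes if n not in nodes)
            nodes.add(target)
            repaired.append((block, target))
    return repaired

# dn1 stops sending heartbeats -> its blocks drop below 3 replicas
repaired = handle_missed_heartbeat("dn1", ["dn2", "dn3", "dn4", "dn5"])
```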
- pipeline ensures that block is added to all replicas as efficiently as possible
- as chunk of data arrives at data node, it is sent to next DN immediately so data can be written to disk at same time
- think of a line of workers passing bricks from the bottom to the top of a staircase
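The write pipeline can be sketched as chunks flowing through a chain of data nodes (illustrative sketch; real HDFS pipelines also handle acknowledgements and failures):

```python
def pipeline_write(chunks, data_nodes):
    """Each chunk is forwarded to the next node as soon as it arrives,
    so all replicas are written almost simultaneously."""
    disks = {dn: [] for dn in data_nodes}
    for chunk in chunks:
        for dn in data_nodes:        # chunk passes along the chain dn1 -> dn2 -> dn3
            disks[dn].append(chunk)  # each node writes to disk, then forwards
    return disks

replicas = pipeline_write(["c1", "c2", "c3"], ["dn1", "dn2", "dn3"])
# every replica ends up holding the same block: ['c1', 'c2', 'c3']
```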
YARN (Yet Another Resource Negotiator)
- performs all processing activities by allocating resources and scheduling tasks
- More flexible resource management
Components
1. Resource Manager
- allocates resources across the cluster and schedules applications
2. Node Manager
- launches containers and monitors resource utilisation in each container on its node
- container: collection of all the resources necessary to run an application: CPU cores, memory, disk space
- Client asks resource manager to find a node to run app
- resource manager finds suitable node manager
- Node manager launches container
- container can contain multiple CPU cores and disks
- Node Manager can request additional containers
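The launch flow above can be sketched as follows (hypothetical classes, not the YARN API):

```python
class NodeManager:
    def __init__(self, name, free_cores):
        self.name, self.free_cores = name, free_cores
        self.containers = []

    def launch_container(self, cores):
        self.free_cores -= cores                       # reserve the resources
        container = {"node": self.name, "cores": cores}
        self.containers.append(container)
        return container

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def run_app(self, cores_needed):
        # find a suitable node manager with enough free resources
        nm = next(nm for nm in self.node_managers
                  if nm.free_cores >= cores_needed)
        return nm.launch_container(cores_needed)

rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 8)])
container = rm.run_app(cores_needed=4)   # nm1 is too small, so nm2 launches it
```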
HBASE
- denormalize
- automatic horizontal scaling
- high availability
- avoid (most) blocks
- open-source, non-relational distributed database
- supports all types of data
- HBASE is a column-family store that uses HDFS for storing file blocks, which are actually column families
- each row has all its columns on the same DataNode, but arranged column family by column family on that node
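A column-family layout can be sketched with nested dicts (illustrative only, not the HBase API): rows in the same table may have different columns, grouped by family:

```python
# table: row key -> column family -> {column qualifier: value}
users = {
    "row1": {"info":  {"name": "Ada", "email": "ada@example.com"},
             "stats": {"logins": 12}},
    # row2 has a different set of columns -- only the families are fixed
    "row2": {"info":  {"name": "Grace"},
             "stats": {"logins": 3, "last_ip": "10.0.0.7"}},
}

def get(table, row, family, qualifier):
    """Look up a single cell by row key, column family and qualifier."""
    return table[row][family][qualifier]

print(get(users, "row2", "stats", "last_ip"))  # 10.0.0.7
```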
HBASE vs Relational Database
1. HBASE - only column families are defined; columns can differ from one row to the next vs RDBMS - all rows have exactly the same structure
2. HBASE - built for huge, wide tables; easily scalable vs RDBMS - narrower, smaller tables
3. HBASE - no notion of transactions vs RDBMS - ACID transactions always
4. HBASE - avoid joins and normalisation vs RDBMS - data is normalised
5. HBASE - good for structured, semi-structured or unstructured data vs RDBMS - good for structured data
HIVE
- data warehousing component which analyses data sets in a distributed environment using an SQL-like interface
- Hive Query Language(HQL)
- Like SQL but no joins
MAHOUT
- provides environments for creating machine learning applications
- performs recommendation (collaborative filtering), clustering and classification
SPARK
- in-memory data processing
- executes in-memory computations to increase the speed of data processing
- framework for real-time data analytics in a distributed computing environment
- 100x faster than MapReduce
- General purpose - "one size fits all" solution
- supports multiple languages (Scala, Java, Python, R)
- built-in libraries for streaming and machine learning
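Spark's in-memory, chained transformations can be mimicked in plain Python (a toy sketch of the RDD idea, not PySpark itself): intermediate data stays in memory between steps instead of being written to disk after each stage as in MapReduce:

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are lazy, data stays in memory."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, fn):
        return MiniRDD(self.data, self.ops + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self.data, self.ops + (("filter", fn),))

    def collect(self):   # nothing runs until an action is called
        out = list(self.data)
        for kind, fn in self.ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```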