
Two main components of HADOOP

  1. HDFS (Hadoop Distributed File System) used for data storage
  2. MapReduce for high-performance parallel data processing

Goal: to massively parallelise access to, and processing of, huge datasets in a distributed environment

1. Parallel processing

  • data processed in parallel - fast

2. Data Locality

  • processing data where it is stored instead of moving data to a centralised server

Two functions of MapReduce 

1. Map function

  • performs actions like filtering and grouping
  • outputs results for its shard as key-value pairs

2. Reduce function

  • aggregates and summarises the results produced by the Map function

Describe MapReduce process

  1. runs a Map function at the data nodes and writes output as <key,value> pairs to local disk
  2. each Map function uses only the data in its block
  3. intermediate results are sent to the correct Reducer such that all <key,value> results with the same key are received by the same Reducer
  4. the Reducer aggregates and summarises the intermediate results to create the final result
  5. the final result is stored on HDFS (and thus replicated)
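
The process above can be sketched in plain Python as a word count, the classic MapReduce example. The shuffle is simulated with a dictionary; in real Hadoop it happens over the network between nodes (the input strings are made-up stand-ins for file blocks):

```python
from collections import defaultdict

# Hypothetical input: each "block" is the slice of the file one Map task sees.
blocks = [
    "the quick brown fox",
    "the lazy dog",
]

def map_fn(block):
    """Map: runs where the block is stored, emits <key, value> pairs."""
    for word in block.split():
        yield (word, 1)

# Shuffle: route every pair to the reducer responsible for its key,
# so all values for one key land at the same Reducer.
shuffled = defaultdict(list)
for block in blocks:
    for key, value in map_fn(block):
        shuffled[key].append(value)

def reduce_fn(key, values):
    """Reduce: aggregates the intermediate results for one key."""
    return (key, sum(values))

result = dict(reduce_fn(k, v) for k, v in shuffled.items())
print(result)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Note that each call to `map_fn` only ever touches its own block, which is what makes data locality possible.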

Hadoop Distributed File System

  • stores different types of large datasets (structured, unstructured and semi-structured)
  • stores data across various nodes
  • designed for fast processing of large files (high throughput) and for processing files without user interaction ("batch processing")
  • users can only read (access), delete, create or extend a file
  • canNOT change anything in a file that already exists; new blocks are added to the end of an existing file
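
A toy sketch of this append-only file model (the class and the one-block-per-append granularity are simplifications; real HDFS uses fixed-size blocks, e.g. 128 MB):

```python
class AppendOnlyFile:
    """Toy model of HDFS file semantics: create, read, extend, delete --
    but never modify bytes that are already written."""
    def __init__(self):
        self.blocks = []          # existing blocks are immutable
    def append(self, data):
        self.blocks.append(data)  # new blocks go at the end only
    def read(self):
        return "".join(self.blocks)

f = AppendOnlyFile()
f.append("hello ")
f.append("world")
assert f.read() == "hello world"
# There is deliberately no write_at(offset) method: in-place updates
# are not part of the HDFS file model.
```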

1. Name nodes

  • manages the entire cluster (metadata)
  • the "mastermind" of the cluster
  • maps a file to the list of data nodes holding parts of the file
  • SecondaryNameNode: standby, in case the NameNode fails

2. Data nodes

  • stores actual data
  • maps a block id of a file to a physical location on its disk

3. HDFS Clients

  • runs on the user's node
  • interacts with HDFS to access data - open/close/read/write/delete/create

  • HDFS is scalable (can handle huge datasets)
  • HDFS is elastic (easy to add more storage, i.e. more nodes)
  • HDFS is for big data and high throughput
  • very fast batch processing

Replication and fault tolerance

  • the Name node keeps in memory the locations of all blocks of all files
  • Data nodes tell the Name node that they are still functioning ("heartbeats") every 3 seconds, plus block status reports
  • the Name node uses this to ensure there are 3 replicas of every block in the network at all times
  • the Name node detects if a node comes up again & deletes extra copies
  • the Name node keeps a log on disk of all writes to all files
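
The heartbeat/re-replication bookkeeping can be sketched like this (node and block names are made up; the 3-replica target follows the notes above):

```python
# Toy NameNode bookkeeping: block -> set of DataNodes holding a replica.
REPLICATION = 3
block_locations = {"blk_1": {"dn1", "dn2", "dn3"}}
live_nodes = {"dn1", "dn2", "dn3", "dn4"}

def on_heartbeat_timeout(dead_node):
    """When a DataNode misses heartbeats, re-replicate its blocks."""
    live_nodes.discard(dead_node)
    for block, holders in block_locations.items():
        holders.discard(dead_node)
        while len(holders) < REPLICATION:
            candidates = live_nodes - holders
            if not candidates:
                break               # cluster too small to re-replicate
            holders.add(sorted(candidates)[0])  # pick a new target node

on_heartbeat_timeout("dn2")
print(block_locations)  # blk_1 now on dn1, dn3 and the new replica dn4
```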

Replication pipeline

  • the pipeline ensures that a block is added to all replicas as efficiently as possible
  • as a chunk of data arrives at a data node, it is sent to the next data node immediately, so data can be written to disk at the same time
  • think of a line of workers passing bricks from the bottom to the top of a staircase
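
A minimal sketch of this bucket-brigade idea (sequential here for simplicity; in real HDFS each node writes its chunk to disk and forwards it concurrently):

```python
# Toy write pipeline: each node stores a chunk and forwards it to the
# next replica, like workers passing bricks up a staircase.
def pipeline_write(chunks, nodes):
    disks = {node: [] for node in nodes}
    for chunk in chunks:               # a chunk enters the pipeline...
        for node in nodes:             # ...and flows node -> node
            disks[node].append(chunk)  # each node stores it, then forwards
    return disks

disks = pipeline_write(["c1", "c2", "c3"], ["dn1", "dn2", "dn3"])
assert all(d == ["c1", "c2", "c3"] for d in disks.values())
```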

YARN (Yet Another Resource Negotiator)

  • performs all processing activities by allocating resources and scheduling tasks
  • more flexible resource management

Components

  1. Resource Manager
  2. Node Manager

1. manages containers and monitors resource utilisation in each container

  • container - collection of all the resources necessary to run an application: CPU cores, memory, disk space

2. launching a container

  1. the client asks the Resource Manager to find a node to run the app
  2. the Resource Manager finds a suitable Node Manager
  3. the Node Manager launches the container
  4. a container can contain multiple CPU cores and disks
  5. the Node Manager can request additional containers
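
The launch steps above can be sketched with made-up class names (this is not the real YARN API, just the handshake from the notes):

```python
class NodeManager:
    """Holds resources on one node and launches containers on it."""
    def __init__(self, name, free_cores):
        self.name, self.free_cores = name, free_cores
        self.containers = []
    def launch_container(self, cores):
        assert cores <= self.free_cores
        self.free_cores -= cores
        self.containers.append({"cores": cores})
        return self.containers[-1]

class ResourceManager:
    """Finds a suitable NodeManager for each application."""
    def __init__(self, node_managers):
        self.node_managers = node_managers
    def run_app(self, cores):
        for nm in self.node_managers:          # step 2: find a node
            if nm.free_cores >= cores:
                return nm.launch_container(cores)  # step 3: launch
        raise RuntimeError("no suitable node")

rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 8)])
container = rm.run_app(cores=4)  # step 1: client asks the RM
print(container)  # {'cores': 4} -- launched on nm2, which had enough cores
```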


  • Denormalize
  • Automatic horizontal scaling
  • High availability
  • Avoid (most) blocks

HBASE

  • open, non-relational distributed database
  • supports all types of data
  • HBASE is a column-family store that uses HDFS for storing file blocks, which are actually column families
  • each row has all its columns in the same DataNode, but arranged column family by column family on that node

HBASE vs Relational Database

  1. HBASE - only column families are defined, and columns can differ from one row to the next, vs RDBMS - all rows have the exact same structure
  2. HBASE - built for huge, wide tables, easily scalable, vs RDBMS - narrower, smaller tables
  3. HBASE - no notion of a transaction, vs RDBMS - ACID transactions always
  4. HBASE - avoid joins and normalisation, vs RDBMS - data is normalised
  5. HBASE - good for structured, semi-structured or unstructured data, vs RDBMS - good for structured data
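
The column-family layout can be illustrated with nested dictionaries (table, family and column names here are invented for illustration):

```python
# Toy HBASE-style rows: the column families are fixed ("info", "stats"),
# but each row may hold different columns inside a family.
row1 = {
    "info":  {"name": "alice", "email": "a@example.com"},
    "stats": {"logins": 12},
}
row2 = {
    "info":  {"name": "bob"},            # no email column -- that's fine
    "stats": {"logins": 3, "likes": 7},  # extra column -- also fine
}
table = {"user#1": row1, "user#2": row2}

# Reading one family of one row touches only that family's data,
# which HBASE stores together on disk.
print(table["user#2"]["info"])  # {'name': 'bob'}
```

A relational table, by contrast, would force both rows into the exact same set of columns.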

HIVE

  • data warehousing component which analyses data sets in a distributed environment using an SQL-like interface
  • Hive Query Language (HQL)
  • like SQL but no joins

MAHOUT

  • provides environments for creating machine learning applications
  • provides recommenders (collaborative filtering), clustering and classification

SPARK

  • in-memory data processing
  • executes in-memory computations to increase the speed of data processing
  • framework for real-time data analytics in a distributed computing environment
  • up to 100x faster than MapReduce
  • general purpose - "one size fits all" solution
  • support for multiple languages
  • built-in libraries for streaming and machine learning
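
Spark's style of chained, in-memory transformations can be imitated in plain Python (a sketch of the idea only, not the real PySpark API):

```python
# MiniRDD mimics Spark's dataset abstraction: each transformation
# produces a new dataset held in memory, never spilled to disk.
class MiniRDD:
    def __init__(self, data):
        self.data = list(data)   # kept in memory between steps
    def map(self, fn):
        return MiniRDD(fn(x) for x in self.data)
    def filter(self, fn):
        return MiniRDD(x for x in self.data if fn(x))
    def reduce(self, fn):
        result = self.data[0]
        for x in self.data[1:]:
            result = fn(result, x)
        return result

total = (MiniRDD(range(10))
         .map(lambda x: x * x)
         .filter(lambda x: x % 2 == 0)
         .reduce(lambda a, b: a + b))
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```

In MapReduce, each of those three steps would be a separate job writing its intermediate result to HDFS; keeping the data in memory between steps is where Spark's speedup comes from.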

Which of the following types of HDFS nodes stores all the metadata about a file system?

The NameNode stores only the metadata of HDFS - the directory tree of all files in the file system - and tracks the files across the cluster. The NameNode does not store the actual data or the dataset. The data itself is stored in the DataNodes.

What is the Hadoop Distributed File System HDFS designed to handle?

HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. This open-source framework works by rapidly transferring data between nodes. It's often used by companies that need to handle and store big data.

Which of the following is a disadvantage of the hierarchical data model?

The major disadvantage of hierarchical databases is their inflexible nature. The one-to-many structure is not ideal for complex structures, as it cannot describe relationships in which each child node has multiple parent nodes.

Which model is the representation of the database as "seen" by the DBMS?

The internal model is the representation of a database as "seen" by the DBMS. It requires the designer to match the conceptual model's characteristics and constraints to those of the selected implementation model: a representation of the internal model using the database constructs supported by the chosen database.