Introduction to Big Data and Apache Hadoop

Big Data

Big Data refers to data sets that are too large or complex to be collected, stored, and processed with traditional data-processing tools.

Three basic characteristics of Big Data:

Volume - Size of the data

Velocity - Speed at which data is generated

Variety - Various types of data, i.e. structured, semi-structured, and unstructured

Apache Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Two main components of Apache Hadoop: 

1. Hadoop Distributed File System (HDFS) - Scalable Distributed Storage Component

2. MapReduce - Distributed Computing Framework
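To make the MapReduce model concrete, below is a minimal single-process word-count sketch in plain Python. It is illustrative only: a real Hadoop job distributes the map, shuffle, and reduce phases across cluster nodes and reads its input from HDFS rather than from an in-memory list.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

The key idea is that the map and reduce functions are independent per record and per key, which is what lets Hadoop run them in parallel across many machines.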

Reference Architecture for Apache Hadoop and Apache Spark Project

In general, a reference architecture for a Hadoop/Spark project has the following layers, depending on the project requirements.

1. Data Source - Sensors, Web Applications, APIs, Databases, Web Logs, etc.

2. Ingestion/Message Layer - Kafka, Spark Streaming, Flume, etc.

3.1. Hadoop/Spark Cluster: Storage Layer - HDFS, S3, NoSQL databases, etc.

3.2. Hadoop/Spark Cluster: Processing Layer - Hive, Pig, MapReduce, Spark, etc.

4. Machine Learning / Data Analytics Layer - Spark ML, Python machine-learning libraries (e.g. scikit-learn), etc.

5. Visualization Layer - Reporting tools like Tableau, Python Visualization Packages
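The layered flow above can be sketched as a toy pipeline collapsed into a single process. All function names and sample records here are hypothetical stand-ins for illustration; in a real deployment each step maps to one of the tools listed for that layer (Kafka, HDFS, Spark, and so on).

```python
import json

def ingest():
    # 2. Ingestion/message layer (stands in for Kafka/Flume):
    # raw JSON records arriving from a data source such as sensors.
    return ['{"sensor": "s1", "temp": 21}', '{"sensor": "s2", "temp": 25}']

def store(records):
    # 3.1 Storage layer (stands in for HDFS/S3): parse and persist records.
    return [json.loads(r) for r in records]

def process(rows):
    # 3.2 Processing layer (stands in for Spark/Hive): extract the field of interest.
    return [row["temp"] for row in rows]

def analyze(temps):
    # 4. Analytics layer (stands in for Spark ML): compute a simple statistic.
    return sum(temps) / len(temps)

avg_temp = analyze(process(store(ingest())))
print(avg_temp)  # 23.0 -- this value would feed the visualization layer
```

The point of the sketch is the one-directional flow: each layer consumes the previous layer's output, which is why the layers can be scaled and swapped independently.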

Happy Learning !!!
