Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
Its elegant design makes loading data easy and efficient.

It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with tunable reliability mechanisms for failover and recovery.

Flume ingests streaming data from multiple sources into Hadoop for storage and analysis.

Flume uses channel-based transactions to guarantee reliable message delivery.
When a message moves from one agent to another, two transactions are started:
one on the agent that delivers the event and one on the agent that receives it.
Together they ensure guaranteed delivery.
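The handoff is bounded by the channel that sits between source and sink. A minimal sketch of the relevant channel settings, using the standard Flume memory channel (the agent name agent1 and channel name ch1 are invented for illustration):

    agent1.channels = ch1
    agent1.channels.ch1.type = memory
    # capacity: maximum number of events the channel can hold at once
    agent1.channels.ch1.capacity = 10000
    # transactionCapacity: maximum number of events a source can put,
    # or a sink can take, in a single transaction
    agent1.channels.ch1.transactionCapacity = 100

If a sink's transaction fails, its events are rolled back into the channel and retried, which is what makes the delivery guarantee possible.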

Flume is a framework for populating Hadoop with data.
Agents are deployed throughout an organization's IT infrastructure (inside web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.
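In a typical tiered layout, an agent on each web server forwards events over Avro RPC to a second-tier collector agent. A hedged sketch of both tiers (the host name, port, log path, and agent names are invented for illustration):

    # First tier, on each web server: tail the access log, forward via Avro
    web.sources = access
    web.channels = ch1
    web.sinks = toCollector

    web.sources.access.type = exec
    web.sources.access.command = tail -F /var/log/httpd/access_log
    web.sources.access.channels = ch1

    web.channels.ch1.type = memory

    web.sinks.toCollector.type = avro
    web.sinks.toCollector.hostname = collector.example.com
    web.sinks.toCollector.port = 4545
    web.sinks.toCollector.channel = ch1

    # Second tier, on the collector: receive the Avro stream
    collector.sources = fromWeb
    collector.sources.fromWeb.type = avro
    collector.sources.fromWeb.bind = 0.0.0.0
    collector.sources.fromWeb.port = 4545
    collector.sources.fromWeb.channels = ch1

The collector's own channel and sink (typically HDFS, as discussed below) are configured the same way.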
How Apache Flume Works

We have many servers and systems:
web applications,
operating systems,
network devices,
all generating huge amounts of data and logs. Suppose we want to move this data into HDFS.

Using traditional approaches, we face problems such as:
high delay,
handling encryption and the various file formats,
low throughput,
limited scalability,
no built-in failover or load balancing.

Apache Flume solves these problems.
We can install a Flume agent on each node to capture all the events,
add filters to drop unwanted data,
and attach metadata to each event, such as a timestamp, hostname, or UUID.
Events can be encrypted on disk, compressed, and replicated,
and then forwarded to the next-hop Flume agent.
Along the way, events can be stored in memory for performance
or on disk for durability.
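A hedged sketch of how interceptors attach such metadata and filter events, and of the memory-versus-file channel trade-off. The regex_filter, timestamp, and host interceptor types are Flume built-ins, and the UUID interceptor ships with Flume's morphline integration; the agent, component, and path names are invented for illustration:

    # Interceptors run on the source, in the order listed
    agent1.sources.src1.interceptors = filter ts host uuid
    # Drop unwanted events (excludeEvents = true drops matching events)
    agent1.sources.src1.interceptors.filter.type = regex_filter
    agent1.sources.src1.interceptors.filter.regex = DEBUG
    agent1.sources.src1.interceptors.filter.excludeEvents = true
    # Stamp each event with its ingestion time and origin host
    agent1.sources.src1.interceptors.ts.type = timestamp
    agent1.sources.src1.interceptors.host.type = host
    # Give each event a unique id
    agent1.sources.src1.interceptors.uuid.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

    # Memory channel: fast, but events are lost if the agent dies
    agent1.channels.mem1.type = memory

    # File channel: events survive an agent or machine restart
    agent1.channels.file1.type = file
    agent1.channels.file1.checkpointDir = /var/flume/checkpoint
    agent1.channels.file1.dataDirs = /var/flume/data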

Finally, events can be delivered to HDFS in different file formats, such as JSON or Avro.
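For example, here is a sketch of an HDFS sink writing compressed Avro container files. The hdfs sink and its properties are standard Flume; the path and component names are illustrative, and the %Y/%m/%d escapes require a timestamp header such as the one added by the timestamp interceptor above:

    agent1.sinks.hdfs1.type = hdfs
    agent1.sinks.hdfs1.channel = file1
    agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events/%Y/%m/%d
    # Write a plain data stream instead of the default SequenceFile
    agent1.sinks.hdfs1.hdfs.fileType = DataStream
    # Serialize events as Avro container files, compressed with Snappy
    agent1.sinks.hdfs1.serializer = avro_event
    agent1.sinks.hdfs1.serializer.compressionCodec = snappy
    # Roll files every 10 minutes or 128 MB, whichever comes first
    agent1.sinks.hdfs1.hdfs.rollInterval = 600
    agent1.sinks.hdfs1.hdfs.rollSize = 134217728
    agent1.sinks.hdfs1.hdfs.rollCount = 0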

Apache Flume is:

Distributed:
agents are installed on many machines.

Scalable:
add more machines to transfer more events.

Reliable:
durable storage, failure handling and replication.

Manageable:
easy to install, configure, reconfigure, and run.

There are various destinations for Flume, such as HBase and HDFS.

Flume Agent:
responsible for transferring events.
Runs in a JVM.
Consists of sources, channels, and sinks.

Source: HTTP, JMS, RPC, NetCat, Exec, Spooling Directory.
A source collects events and forwards them to a channel.

Channel:
buffers incoming events until they are extracted by sinks.
The channel type is a trade-off between durability and throughput: Memory, File, or JDBC.

Sink:
removes events from the channel and forwards them to the next destination,
such as HBase, HDFS, another Flume agent, a file, or a logger.
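Putting sources, channels, and sinks together: below is a minimal single-node configuration along the lines of the classic example in the Flume user guide (a NetCat source, a memory channel, and a logger sink; the agent name a1 and file name example.conf are arbitrary):

    # example.conf: a single-node Flume agent
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: listen on a TCP port, one event per line of input
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # Channel: buffer events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Sink: log each event (useful for testing)
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1

The agent can then be started with the flume-ng launcher and exercised by telnetting to port 44444:

    bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console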

We cannot use Flume for:

Very large events:
an event larger than the memory or disk available on an agent's machine.

Infrequent bulk loads.