apache flume kafka #0

Supports: trusty


Uses a Kafka source, memory channel, and Avro sink in Apache Flume to ingest messages published to a Kafka topic.


Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications. Learn more at flume.apache.org.

This charm provides a Flume agent designed to ingest messages published to a Kafka topic and send them to the apache-flume-hdfs agent for storage in the shared filesystem (HDFS) of a connected Hadoop cluster. This leverages the KafkaSource jar packaged with Flume. Learn more about the Flume Kafka Source.


This charm is intended to be deployed via the apache-ingestion-kafka bundle:

juju quickstart apache-ingestion-kafka

This will deploy the Apache Hadoop platform with Apache Flume and Apache Kafka communicating with the cluster via the apache-hadoop-plugin charm.


No Kafka topic is configured by default. Set the kafka_topic option to an existing Kafka topic as follows:

juju set flume-kafka kafka_topic='<topic_name>'

If you don't have a Kafka topic, you may create one (and verify successful creation) with:

juju action do kafka/0 create-topic topic=<topic_name> \
 partitions=1 replication=1
juju action fetch <id>  # <-- id from above command
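
The id can also be captured in a script rather than copied by hand. The following is a minimal sketch, not part of the charm: the helper name parse_action_id is hypothetical, and it assumes that "juju action do" prints a line of the form "Action queued with id: <id>". Check the output of your Juju version before relying on it.

```shell
# Extract the action id from "juju action do" output read on stdin.
# Assumption: the output contains a line like
#   Action queued with id: <id>
parse_action_id() {
  sed -n 's/^Action queued with id: *//p' | head -n 1
}

# Usage sketch (needs a deployed kafka unit; <topic_name> is a placeholder):
#   id=$(juju action do kafka/0 create-topic topic=<topic_name> \
#        partitions=1 replication=1 | parse_action_id)
#   juju action fetch "$id"
```

If the output format differs, the helper prints nothing, so an empty id is a signal to fall back to copying the id manually.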

Once the Flume agents start, messages will start flowing into HDFS in year-month-day directories here: /user/flume/flume-kafka/%y-%m-%d.
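
To anticipate which directory a given day's events land in, the path pattern can be expanded locally. This is a rough sketch under stated assumptions: it requires GNU date, it assumes the %y, %m, and %d escapes carry the same meaning as in date(1) (two-digit year, month, day), and the function name flume_kafka_dir is hypothetical. If the directories on your cluster are named differently, trust the output of hdfs dfs -ls instead.

```shell
# Print the expected HDFS directory for a date (default: today).
# Assumption: %y-%m-%d in the path pattern expands as in date(1).
flume_kafka_dir() {
  if [ -n "${1:-}" ]; then
    # GNU date: -d parses the supplied date string.
    day=$(date -d "$1" +%y-%m-%d)
  else
    day=$(date +%y-%m-%d)
  fi
  printf '/user/flume/flume-kafka/%s\n' "$day"
}

# Usage sketch:
#   hdfs dfs -ls "$(flume_kafka_dir)"
```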


A Kafka topic is required for this test. Topic creation is covered in the Configuration section above. Generate Kafka messages with the write-topic action:

juju action do kafka/0 write-topic topic=<topic_name> data="This is a test"

To verify these messages are being stored into HDFS, SSH to the flume-hdfs unit, locate an event, and cat it:

juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/flume-kafka  # <-- find a date
hdfs dfs -ls /user/flume/flume-kafka/yyyy-mm-dd  # <-- find an event
hdfs dfs -cat /user/flume/flume-kafka/yyyy-mm-dd/FlumeData.[id]
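
The manual check above can be condensed into a single count of stored lines that contain the test message. This is a sketch only: the function name count_matching_events is hypothetical, it must run on the flume-hdfs unit where the hdfs client is available, and it assumes events are stored as plain text, one per line.

```shell
# Count stored lines under /user/flume/flume-kafka that contain a string.
# Assumption: events are written as plain text, one event per line.
count_matching_events() {
  pattern="$1"
  # The glob is quoted so that hdfs, not the local shell, expands it.
  hdfs dfs -cat '/user/flume/flume-kafka/*/FlumeData.*' \
    | grep -c -F -- "$pattern"
}

# Usage sketch:
#   count_matching_events "This is a test"
```

A count of 0 right after writing to the topic does not necessarily indicate a failure; events may still be in flight between the agents.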

(string) The maximum number of events stored in the channel.
(string) The maximum number of events the channel will take from a source or give to a sink per transaction.
(string) The HDFS subdirectory under /user/flume where events will be stored.
(string) The maximum number of messages written to the channel in a single batch.
(string) The Kafka topic to watch for messages.
(string) URL from which to fetch resources (e.g., Flume binaries) instead of S3.