apache-flume-rabbitmq (revision 7)

Supports: trusty


Uses a RabbitMQ source, memory channel, and Avro sink in Apache Flume to ingest messages published to a RabbitMQ topic.


Flume is a distributed, reliable, and highly-available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability, failover, and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. Learn more at flume.apache.org.

This charm provides a Flume agent designed to ingest messages published to a RabbitMQ queue and send them to the apache-flume-hdfs agent for storage in the shared filesystem (HDFS) of a connected Hadoop cluster. This utilizes a RabbitMQ-Flume Plugin.
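The topology described above (RabbitMQ source, memory channel, Avro sink) can be sketched as a Flume agent definition. This is illustrative only: the charm renders the real configuration itself, the agent and component names (a1, rmq, mem, avro) are placeholders, and the fully qualified source class and its property names depend on the particular build of the RabbitMQ-Flume plugin.

```properties
# Illustrative agent layout; not the configuration the charm actually writes.
a1.sources = rmq
a1.channels = mem
a1.sinks = avro

# RabbitMQ source provided by the RabbitMQ-Flume plugin; the class name and
# connection property names vary by plugin build, so these are placeholders.
a1.sources.rmq.type = <rabbitmq_source_class>
a1.sources.rmq.hostname = <rabbitmq_host>
a1.sources.rmq.queuename = <queue_name>
a1.sources.rmq.channels = mem

# In-memory buffer between source and sink (standard Flume memory channel).
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100

# Avro sink forwarding events to the flume-hdfs agent.
a1.sinks.avro.type = avro
a1.sinks.avro.hostname = <flume_hdfs_host>
a1.sinks.avro.port = <avro_port>
a1.sinks.avro.channel = mem
```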


This charm leverages our pluggable Hadoop model with the hadoop-plugin interface. A base Apache Hadoop cluster is required. The suggested deployment method is to use the apache-ingestion-flume-rabbitmq bundle.

Bundle Deployment

This will deploy the Apache Hadoop platform with a pair of Apache Flume agents that facilitate communication between RabbitMQ and HDFS:

juju quickstart u/bigdata-dev/apache-ingestion-flume-rabbitmq

Manual Deployment

You may manually deploy the recommended environment as follows:

juju deploy apache-hadoop-hdfs-master hdfs-master
juju deploy apache-hadoop-yarn-master yarn-master
juju deploy apache-hadoop-compute-slave compute-slave
juju deploy apache-hadoop-plugin plugin
juju deploy apache-flume-hdfs flume-hdfs
juju deploy rabbitmq-server rabbitmq

juju add-relation yarn-master hdfs-master
juju add-relation compute-slave yarn-master
juju add-relation compute-slave hdfs-master
juju add-relation plugin yarn-master
juju add-relation plugin hdfs-master
juju add-relation flume-hdfs plugin

Continue manual deployment by colocating the flume-rabbitmq charm on the rabbitmq unit:

RABBIT_MACHINE_ID=$(juju status rabbitmq --format tabular | grep "rabbitmq/" | awk '{ print $5 }')
juju deploy --to ${RABBIT_MACHINE_ID} apache-flume-rabbitmq flume-rabbitmq
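The machine-ID extraction above simply matches the rabbitmq unit line in the tabular status output and prints the machine column. A self-contained sketch of that same parsing against a sample status line (the sample values and column layout are illustrative; the real layout depends on your Juju version):

```shell
# Hypothetical unit line from `juju status rabbitmq --format tabular`;
# treat the columns here as illustrative, not authoritative.
sample='rabbitmq/0   started   idle   1.25.6   2   10.0.0.4   5672/tcp'

# Same pipeline as the deployment step: keep the unit line, print column 5,
# which in this sample holds the machine ID.
RABBIT_MACHINE_ID=$(printf '%s\n' "$sample" | grep "rabbitmq/" | awk '{ print $5 }')
echo "$RABBIT_MACHINE_ID"
```

If the extracted value looks wrong on your deployment, inspect the tabular output by hand and adjust the column number accordingly.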

Finally, complete manual deployment by relating the flume-rabbitmq charm to both flume-hdfs and rabbitmq:

juju add-relation flume-rabbitmq rabbitmq
juju add-relation flume-rabbitmq flume-hdfs


When flume-hdfs receives data, it is stored in a /user/flume/<event_dir> HDFS subdirectory (configured by the connected Flume charm). The <event_dir> subdirectory is set to flume-rabbitmq by default for this charm. You can quickly verify the data written to HDFS using the command line. SSH to the flume-hdfs unit, locate an event, and cat it:

juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/flume-rabbitmq               # <-- find a date
hdfs dfs -ls /user/flume/flume-rabbitmq/<yyyy-mm-dd>  # <-- find an event
hdfs dfs -cat /user/flume/flume-rabbitmq/<yyyy-mm-dd>/FlumeData.<id>

This process works well for data serialized in text format (the default). For data serialized in Avro format, you'll need to copy the file locally and use the dfs -text command. For example, replace the dfs -cat command above with the following to view files stored in Avro format:

hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id> /home/ubuntu/myFile.txt
hdfs dfs -text file:///home/ubuntu/myFile.txt

Configure the environment

The RabbitMQ queue and virtual host where messages are published are unset by default. Set them to an existing queue name and virtual host as follows:

juju set flume-rabbitmq rabbitmq_queuename='<queue_name>' rabbitmq_vhost='<vhost_name>'

If you have changed the access credentials for the RabbitMQ management interface, you can also specify them with:

juju set flume-rabbitmq rabbitmq_username='<user_name>' rabbitmq_password='<user_password>'

Test the deployment

Generate RabbitMQ messages on the flume-rabbitmq unit with the producer script:

juju set flume-rabbitmq rabbitmq_queuename='logs'
juju ssh flume-rabbitmq/0
cd /var/lib/juju/agents/unit-rabbitmq-0/charm/scripts
while read -r line; do ./t1/send_log.py info "$line"; done < /var/log/syslog

Note that if you did not colocate your Flume agent with RabbitMQ, you'll need to update this script with the private IP address of the RabbitMQ server.
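The producer loop feeds each syslog line to the script as one message; quoting "$line" keeps each line intact as a single argument. A self-contained sketch of the same pattern, with echo standing in for the hypothetical send_log.py and a temporary file standing in for /var/log/syslog:

```shell
# Stand-in input instead of /var/log/syslog.
printf 'line one\nline two\n' > /tmp/sample.log

# Same shape as the producer loop: one message per input line.
# `read -r` avoids backslash mangling; quoting "$line" preserves whitespace.
count=0
while read -r line; do
  echo "publish: info $line"
  count=$((count + 1))
done < /tmp/sample.log
echo "sent $count messages"
```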

To verify these messages are being stored into HDFS, SSH to the flume-hdfs unit, locate an event, and cat it:

juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/flume-rabbitmq  # <-- find a date
hdfs dfs -ls /user/flume/flume-rabbitmq/<yyyy-mm-dd>  # <-- find an event
hdfs dfs -cat /user/flume/flume-rabbitmq/<yyyy-mm-dd>/FlumeData.<id>

Configuration options

- (string) The maximum number of events stored in the channel.
- (string) The maximum number of events the channel will take from a source or give to a sink per transaction.
- (string) The HDFS subdirectory under /user/flume where events will be stored.
- (string) RabbitMQ exchange for the source (empty by default; may eventually be provided over the relation).
- rabbitmq_password (string): RabbitMQ password used to connect to the queue.
- rabbitmq_queuename (string): Queue to connect to on the RabbitMQ server.
- rabbitmq_username (string): RabbitMQ user for the source (may eventually be provided over the relation).
- rabbitmq_vhost (string): RabbitMQ virtual host containing the queue.
- (string) URL from which to fetch resources (e.g., Hadoop binaries) instead of Launchpad.