Apache Flume RabbitMQ
Description
Uses a RabbitMQ source, memory channel, and Avro sink in Apache Flume to ingest messages published to a RabbitMQ queue.
Tags: applications, bigdata, apache
Overview
Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability, failover, and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. Learn more at flume.apache.org.
This charm provides a Flume agent designed to ingest messages published to a RabbitMQ queue and send them to the apache-flume-hdfs agent for storage in the shared filesystem (HDFS) of a connected Hadoop cluster. This utilizes the RabbitMQ-Flume Plugin.
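For orientation, the agent this charm manages is conceptually equivalent to the Flume properties sketch below. This is a minimal illustration, not the charm's actual template: the agent and component names, the source class (which depends on the RabbitMQ-Flume plugin build), and the host/port values are all assumptions.
# A minimal sketch only: names, the source class, and endpoints below are
# illustrative assumptions, not the charm's rendered configuration.
cat <<'EOF' > /tmp/flume-rabbitmq-example.conf
a1.sources  = rmq
a1.channels = mem
a1.sinks    = avro

# RabbitMQ source provided by the RabbitMQ-Flume plugin (class name varies by build)
a1.sources.rmq.type      = org.apache.flume.source.rabbitmq.RabbitMQSource
a1.sources.rmq.hostname  = localhost
a1.sources.rmq.queuename = logs
a1.sources.rmq.channels  = mem

# Memory channel; capacities correspond to the channel_* charm options below
a1.channels.mem.type                = memory
a1.channels.mem.capacity            = 1000
a1.channels.mem.transactionCapacity = 100

# Avro sink pointing at the avro source of the flume-hdfs agent
a1.sinks.avro.type     = avro
a1.sinks.avro.hostname = <flume-hdfs-unit-ip>
a1.sinks.avro.port     = 4141
a1.sinks.avro.channel  = mem
EOF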
Deployment
This charm leverages our pluggable Hadoop model with the hadoop-plugin interface. A base Apache Hadoop cluster is required. The suggested deployment method is to use the apache-ingestion-flume-rabbitmq bundle.
Bundle Deployment
This will deploy the Apache Hadoop platform with a pair of Apache Flume agents that facilitate communication between RabbitMQ and HDFS:
juju quickstart u/bigdata-dev/apache-ingestion-flume-rabbitmq
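The bundle can take some time to settle; you can watch the units come up with:
watch juju status --format tabular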
Manual Deployment
You may manually deploy the recommended environment as follows:
juju deploy apache-hadoop-hdfs-master hdfs-master
juju deploy apache-hadoop-yarn-master yarn-master
juju deploy apache-hadoop-compute-slave compute-slave
juju deploy apache-hadoop-plugin plugin
juju deploy apache-flume-hdfs flume-hdfs
juju deploy rabbitmq-server rabbitmq
juju add-relation yarn-master hdfs-master
juju add-relation compute-slave yarn-master
juju add-relation compute-slave hdfs-master
juju add-relation plugin yarn-master
juju add-relation plugin hdfs-master
juju add-relation flume-hdfs plugin
Continue manual deployment by colocating the flume-rabbitmq charm on the rabbitmq unit:
# Determine the machine that hosts the rabbitmq unit, then deploy alongside it
RABBIT_MACHINE_ID=$(juju status rabbitmq --format tabular | grep "rabbitmq/" | awk '{ print $5 }')
juju deploy --to ${RABBIT_MACHINE_ID} apache-flume-rabbitmq flume-rabbitmq
Finally, complete manual deployment by relating the flume-rabbitmq charm to both flume-hdfs and rabbitmq:
juju add-relation flume-rabbitmq rabbitmq
juju add-relation flume-rabbitmq flume-hdfs
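With all relations in place, the agents will restart as needed. As a quick, non-charm-specific sanity check, you can confirm that a Flume process is running on the colocated unit:
juju ssh flume-rabbitmq/0 'ps aux | grep [f]lume'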
Usage
When flume-hdfs receives data, it is stored in a /user/flume/<event_dir> HDFS subdirectory (configured by the connected Flume charm). The <event_dir> subdirectory is set to flume-rabbitmq by default for this charm. You can quickly verify the data written to HDFS using the command line. SSH to the flume-hdfs unit, locate an event, and cat it:
juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/flume-rabbitmq # <-- find a date
hdfs dfs -ls /user/flume/flume-rabbitmq/<yyyy-mm-dd> # <-- find an event
hdfs dfs -cat /user/flume/flume-rabbitmq/<yyyy-mm-dd>/FlumeData.<id>
This process works well for data serialized in text format (the default). For data serialized in avro format, you'll need to copy the file locally and use the dfs -text command. For example, replace the dfs -cat command from above with the following to view files stored in avro format:
hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id> /home/ubuntu/myFile.txt
hdfs dfs -text file:///home/ubuntu/myFile.txt
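Depending on your Hadoop version, dfs -text may also be able to read the file in place, skipping the local copy:
hdfs dfs -text /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>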
Configure the environment
By default, the RabbitMQ queue and virtual host from which messages are consumed are not set for your environment. Point them at an existing RabbitMQ queue and virtual host as follows:
juju set flume-rabbitmq rabbitmq_queuename='<queue_name>' rabbitmq_virtualhost='<vhost_name>'
If you have changed the default credentials on the RabbitMQ server, you can also specify them:
juju set flume-rabbitmq rabbitmq_username='<user_name>' rabbitmq_password='<user_password>'
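You can confirm the service's current configuration at any time with juju get:
juju get flume-rabbitmq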
Test the deployment
Generate RabbitMQ messages on the flume-rabbitmq unit with the producer script:
juju set flume-rabbitmq rabbitmq_queuename='logs'
juju ssh flume-rabbitmq/0
cd /var/lib/juju/agents/unit-rabbitmq-0/charm/scripts
while read line; do ./t1/send_log.py info "$line"; done < /var/log/syslog
Note that if you did not colocate your Flume agent with RabbitMQ, you'll need to update this script with the private IP address of the RabbitMQ server.
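As an optional check on the RabbitMQ side, you can confirm that messages are reaching the queue (add -p '<vhost_name>' if you configured a virtual host):
juju ssh rabbitmq/0 'sudo rabbitmqctl list_queues name messages'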
To verify these messages are being stored in HDFS, SSH to the flume-hdfs unit, locate an event, and cat it:
juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/flume-rabbitmq # <-- find a date
hdfs dfs -ls /user/flume/flume-rabbitmq/<yyyy-mm-dd> # <-- find an event
hdfs dfs -cat /user/flume/flume-rabbitmq/<yyyy-mm-dd>/FlumeData.<id>
Contact Information
Help
- Apache Flume home page
- Apache Flume bug tracker
- Apache Flume mailing lists
- #juju on irc.freenode.net
Configuration
- channel_capacity (string): The maximum number of events stored in the channel. Default: 1000
- channel_transaction_capacity (string): The maximum number of events the channel will take from a source or give to a sink per transaction. Default: 100
- event_dir (string): The HDFS subdirectory under /user/flume where events will be stored. Default: flume-rabbitmq
- rabbitmq_exchangename (string): RabbitMQ exchange for the source (empty by default; may be provided over the relation).
- rabbitmq_password (string): RabbitMQ password used to connect to the queue. Default: guest
- rabbitmq_queuename (string): Queue to connect to on the RabbitMQ server. Default: rabbitmq
- rabbitmq_username (string): RabbitMQ user for the source (may be provided over the relation). Default: guest
- rabbitmq_virtualhost (string): RabbitMQ virtual host used when connecting to the queue.
- resources_mirror (string): URL from which to fetch resources (e.g., Hadoop binaries) instead of Launchpad.