apache-flume-hdfs #7
Description
Collect, aggregate, and move large amounts of data into HDFS.
Tags: apache, big_data, hadoop
Overview
Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online
analytic application. Learn more at flume.apache.org.
This charm provides a Flume agent designed to ingest events into the shared
filesystem (HDFS) of a connected Hadoop cluster. It is meant to relate to
other Flume agents such as apache-flume-syslog and apache-flume-twitter.
Deploying
This charm requires Juju 2.0 or greater. If Juju is not yet set up, please
follow the getting-started instructions prior to deploying this charm.
This charm is intended to be deployed via one of the Apache Bigtop bundles.
For example:
juju deploy hadoop-processing
This will deploy an Apache Bigtop Hadoop cluster. More information about this
deployment can be found in the bundle readme.
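The bundle may take several minutes to settle. As a general check (standard Juju usage, not specific to this charm), you can watch progress until all units report as ready:
juju status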
Now add Flume-HDFS and relate it to the cluster via the hadoop-plugin:
juju deploy apache-flume-hdfs flume-hdfs
juju add-relation flume-hdfs plugin
The deployment at this stage isn't very exciting, as the flume-hdfs service
is waiting for other Flume agents to connect and send data. You'll probably
want to check out apache-flume-syslog or apache-flume-kafka to provide
additional functionality for this deployment.
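As a minimal sketch, assuming the apache-flume-syslog charm is available and supports a direct relation to flume-hdfs (check that charm's readme for the exact deployment and relation names), connecting a syslog agent might look like:
juju deploy apache-flume-syslog flume-syslog
juju add-relation flume-syslog flume-hdfs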
When flume-hdfs receives data, it is stored in a /user/flume/<event_dir>
HDFS subdirectory (configured by the connected Flume charm). You can quickly
verify the data written to HDFS using the command line. SSH to the flume-hdfs
unit, locate an event, and cat it:
juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/<event_dir> # <-- find a date
hdfs dfs -ls /user/flume/<event_dir>/<yyyy-mm-dd> # <-- find an event
hdfs dfs -cat /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>
This process works well for data serialized in text format (the default).
For data serialized in avro format, you'll need to copy the file locally
and use the dfs -text command. For example, replace the dfs -cat command
from above with the following to view files stored in avro format:
hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id> /home/ubuntu/myFile.txt
hdfs dfs -text file:///home/ubuntu/myFile.txt
Network-Restricted Environments
Charms can be deployed in environments with limited network access. To deploy
in such an environment, configure a Juju model with appropriate proxy and/or
mirror options. See Configuring Models for more information.
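As a minimal sketch, assuming an internal proxy at http://squid.internal:3128 (a placeholder address), the relevant model settings might be applied like this:
juju model-config http-proxy=http://squid.internal:3128
juju model-config https-proxy=http://squid.internal:3128
juju model-config apt-http-proxy=http://squid.internal:3128
The resources_mirror option listed under Configuration below can likewise point this charm at an internal mirror instead of S3.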
Configuration
- channel_capacity (string): The maximum number of events stored in the channel. Default: 1000
- channel_transaction_capacity (string): The maximum number of events the channel will take from a source or give to a sink per transaction. Default: 100
- dfs_replication (int): The DFS replication value. The default (3) is the same default as the Namenode charm, but it may be overridden for this application. Default: 3
- protocol (string): Ingestion protocol for the agent source. Currently only 'avro' is supported. Default: avro
- resources_mirror (string): URL from which to fetch resources (e.g., Flume binaries) instead of S3.
- roll_count (int): Number of events written to a file before it is rolled. A value of 0 (the default) means never roll based on number of events.
- roll_interval (int): Number of seconds to wait before rolling the current file. The default rolls the file after 5 minutes. A value of 0 means never roll based on a time interval. Default: 300
- roll_size (string): File size to trigger a roll, in bytes. The default rolls the file once it reaches 10 MB. A value of 0 means never roll based on file size. Default: 10000000
- sink_compression (string): Compression codec for the agent sink. An empty value writes events to HDFS uncompressed. Specify 'snappy' to compress written events using the snappy codec.
- sink_serializer (string): Serializer used when the sink writes to HDFS. Either 'avro_event' or 'text' is supported. Default: text
- source_port (int): Port on which the agent source listens. Default: 4141
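Options can be set at deploy time or changed later on a running application. As an illustrative sketch (the values are examples, not recommendations), recent Juju 2.x releases accept:
juju config flume-hdfs roll_interval=600 sink_compression=snappy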