apache flume twitter #22

Supports: trusty


Uses a Twitter source, memory channel, and Avro sink in Apache Flume
to ingest Twitter data.


Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online
analytic application. Learn more at flume.apache.org.

This charm provides a Flume agent designed to process tweets from the Twitter
Streaming API and send them to the apache-flume-hdfs agent for storage in
the shared filesystem (HDFS) of a connected Hadoop cluster. This leverages the
TwitterSource jar packaged with Flume. Learn more about the
1% firehose.


The Twitter Streaming API requires developer credentials, which you'll need to
configure for this charm. These come from a Twitter developer account; create
one if you don't already have it.

Create a secret.yaml file with your Twitter developer credentials:

    twitter_access_token: 'YOUR_TOKEN'
    twitter_access_token_secret: 'YOUR_TOKEN_SECRET'
    twitter_consumer_key: 'YOUR_CONSUMER_KEY'
    twitter_consumer_secret: 'YOUR_CONSUMER_SECRET'
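
Before deploying, it can help to confirm that the placeholder values were
actually replaced. The following is a minimal sketch, assuming the file layout
shown above; `check_secrets` is a hypothetical helper and not part of the
charm, and the check is deliberately crude (it only looks for the `YOUR_`
prefix and for the four key names).

```shell
# Hypothetical pre-flight check for the credentials file (not part of the charm).
# Usage: check_secrets secret.yaml
check_secrets() {
    file="$1"
    # Fail if any placeholder value from the template is still present.
    if grep -q "YOUR_" "$file"; then
        echo "placeholder values remain in $file" >&2
        return 1
    fi
    # Fail if any of the four credential keys is missing.
    for key in twitter_access_token twitter_access_token_secret \
               twitter_consumer_key twitter_consumer_secret; do
        if ! grep -q "^[[:space:]]*${key}:" "$file"; then
            echo "missing key: $key" >&2
            return 1
        fi
    done
    return 0
}
```

Running `check_secrets secret.yaml` prints nothing and returns 0 when all four
keys are present and no placeholder remains.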


This charm leverages our pluggable Hadoop model with the hadoop-plugin
interface. This means that you will need to deploy a base Apache Hadoop cluster
to run Flume. The suggested deployment method is to use the
apache-ingestion-flume bundle. This will deploy the Apache Hadoop platform with
a single Apache Flume unit that communicates with the cluster by relating to the
apache-hadoop-plugin subordinate charm:

    juju quickstart u/bigdata-dev/apache-ingestion-flume

Alternatively, you may manually deploy the recommended environment as follows:

    juju deploy apache-hadoop-hdfs-master hdfs-master
    juju deploy apache-hadoop-yarn-master yarn-master
    juju deploy apache-hadoop-compute-slave compute-slave
    juju deploy apache-hadoop-plugin plugin
    juju deploy apache-flume-hdfs flume-hdfs

    juju add-relation yarn-master hdfs-master
    juju add-relation compute-slave yarn-master
    juju add-relation compute-slave hdfs-master
    juju add-relation plugin yarn-master
    juju add-relation plugin hdfs-master
    juju add-relation flume-hdfs plugin
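
If you prefer to script the manual steps, the commands above can be wrapped so
they are printed for review before anything is executed. This is only a sketch:
the `run` and `deploy_base` helpers and the `DRY_RUN` variable are hypothetical
names introduced here, and the juju commands are exactly the ones listed above.

```shell
# Hypothetical wrapper: print each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Replays the manual deployment; stops at the first failing command.
deploy_base() {
    run juju deploy apache-hadoop-hdfs-master hdfs-master &&
    run juju deploy apache-hadoop-yarn-master yarn-master &&
    run juju deploy apache-hadoop-compute-slave compute-slave &&
    run juju deploy apache-hadoop-plugin plugin &&
    run juju deploy apache-flume-hdfs flume-hdfs &&
    run juju add-relation yarn-master hdfs-master &&
    run juju add-relation compute-slave yarn-master &&
    run juju add-relation compute-slave hdfs-master &&
    run juju add-relation plugin yarn-master &&
    run juju add-relation plugin hdfs-master &&
    run juju add-relation flume-hdfs plugin
}
```

Calling `deploy_base` on its own only prints the eleven commands; calling it as
`DRY_RUN=0 deploy_base` executes them against the current Juju environment.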

Now that the base environment has been deployed (either via quickstart or
manually), you are ready to add the apache-flume-twitter charm and
relate it to the flume-hdfs agent:

    juju deploy apache-flume-twitter flume-twitter --config=secret.yaml
    juju add-relation flume-twitter flume-hdfs

That's it! Once the Flume agents start, tweets will start flowing into
HDFS via the flume-twitter and flume-hdfs charms. Flume may include
multiple events in each file written to HDFS. This is configurable with various
options on the flume-hdfs charm. See the descriptions of the roll_* options on
the apache-flume-hdfs charm store page for more details.

Flume will write files to HDFS in the following location:
/user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>. The <event_dir>
subdirectory is configurable and set to flume-twitter by default for this
charm.
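
To make the path layout concrete, here is a small sketch that assembles the
directory for a given day. `flume_event_dir` is a hypothetical helper, and it
assumes the date directory matches your local clock, which may not hold if the
cluster runs in a different time zone.

```shell
# Build the HDFS directory for a given event_dir and day.
# Both arguments are optional: event_dir defaults to flume-twitter (the
# default named in the text) and day defaults to today's local date.
flume_event_dir() {
    event_dir="${1:-flume-twitter}"
    day="${2:-$(date +%Y-%m-%d)}"
    printf '/user/flume/%s/%s\n' "$event_dir" "$day"
}
```

For example, `flume_event_dir flume-twitter 2016-01-15` prints
`/user/flume/flume-twitter/2016-01-15`.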

Test the deployment

To verify this charm is working as intended, SSH to the flume-hdfs unit and
locate an event:

    juju ssh flume-hdfs/0
    hdfs dfs -ls /user/flume/<event_dir>               # <-- find a date
    hdfs dfs -ls /user/flume/<event_dir>/<yyyy-mm-dd>  # <-- find an event

Since our tweets are serialized in Avro format, you'll need to copy the file
locally and use the hdfs dfs -text command to view it:

    hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>.avro /home/ubuntu/myFile.txt
    hdfs dfs -text file:///home/ubuntu/myFile.txt
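
When a date directory holds many events, picking one by eye gets tedious. The
sketch below filters `hdfs dfs -ls` output down to file paths. It assumes
regular files are listed on lines starting with `-` and that the path is the
last whitespace-separated field (so paths must not contain spaces);
`list_event_files` is a hypothetical helper.

```shell
# Print only the file paths from `hdfs dfs -ls` output read on stdin.
# Directory entries (lines starting with "d") and the "Found N items"
# header are skipped.
list_event_files() {
    awk '/^-/ { print $NF }'
}

# Example (run on the flume-hdfs unit, with the placeholders filled in):
#   hdfs dfs -ls /user/flume/<event_dir>/<yyyy-mm-dd> | list_event_files | tail -n 1
```

The path it prints can then be passed to the -copyToLocal command above.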

You may not recognize the body of the tweet if it's not in a language you
understand (remember, this is a 1% firehose from tweets all over the world).
You may have to try a few different events before you find a tweet worth
reading. Happy hunting!

Contact Information



(string) The maximum number of events stored in the channel.
(string) The maximum number of events the channel will take from a source or give to a sink per transaction.
(string) The HDFS subdirectory under /user/flume where events will be stored.
(string) URL from which to fetch resources (e.g., Hadoop binaries) instead of Launchpad.
(string) OAuth access token from your Twitter developer account.
(string) OAuth access token secret from your Twitter developer account.
(string) OAuth consumer key from your Twitter developer account.
(string) OAuth consumer secret from your Twitter developer account.
(int) Maximum number of milliseconds to wait before closing a batch.
(int) Maximum number of Twitter messages to put in a single batch.
(string) The application to use for this Flume source. Defaults to the TwitterSource bundled with Flume.