sparkler #10
Description
A web crawler is a bot program that fetches resources from the web to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a web crawler that draws on recent advances in distributed computing and information retrieval by combining several Apache projects: Spark, Kafka, Lucene/Solr, Tika, and Felix. It is an extensible, highly scalable, high-performance crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.
- Tags: big_data
Usage
Sparkler depends on Java and Solr, and optionally on Spark. To deploy it, run:
juju deploy openjdk java
juju deploy cs:~spiculecharms/apache-solr solr
juju deploy cs:~spiculecharms/sparkler
juju add-relation solr sparkler
juju add-relation java sparkler
juju add-relation solr java
Scale out Usage
Scale-out is not currently supported.
Known Limitations and Issues
Documentation is currently poor.
Contact Information
Contact the developers here:
Upstream Project Name
Configuration
- crawldb-uri (string): Override the auto-detected crawldb URI.
- fetcher-server-delay (string): Delay (in milliseconds) between two fetch requests for the same host. Default: 1000
- generate-top-groups (string): Generates the Top Groups. Default: 256
- generate-topn (string): Generates the top N URLs for fetching. Default: 1000
- kafka-enable (boolean): Enable Kafka dump.
- kafka-listeners (string): Override the Kafka listeners.
- kafka-topic (string): The Kafka topic. Default: sparkler_%s
- plugins-bundle-directory (string): Plugins bundle directory. Configured through Maven. Default: ${project.parent.basedir}${file.separator}${project.bundles.directory}
- spark-master (string): Override the auto-detected Spark URI.
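The options above can be changed after deployment. The following is a minimal sketch, not taken from the charm's own documentation: it assumes the application was deployed under the name `sparkler` (as in the Usage section) and a Juju 2.x client that provides `juju config --file`. The file name `sparkler-config.yaml` and the values shown are illustrative assumptions only.

```shell
#!/bin/sh
# Hypothetical file name; any writable path would do.
CONFIG_FILE="sparkler-config.yaml"

# Write a config file keyed by the application name ("sparkler" is assumed).
# Both options are declared as strings by the charm, so the values are quoted.
# The values are illustrative, not recommendations.
cat > "$CONFIG_FILE" <<'EOF'
sparkler:
  fetcher-server-delay: "2000"
  generate-topn: "500"
EOF

# Apply the file only if a juju client is present on this machine.
if command -v juju >/dev/null 2>&1; then
  juju config sparkler --file "$CONFIG_FILE"
else
  echo "juju client not found; wrote $CONFIG_FILE without applying it"
fi
```

A single option can also be set directly, for example `juju config sparkler fetcher-server-delay=2000`, which avoids the intermediate file when only one value changes.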