This is the import.io application running as a command-line crawler. Once provided with a relation and a configuration, it will
crawl a site according to that configuration and push the results into the related application.
For more information on the command-line crawler see our support page here:- http://support.import.io/knowledgebase/articles/325728
This charm sets up a machine to run the import.io application as a command-line crawler. Use this charm to crawl your target sites and push the data directly into your target application.
The target application needs to be something that can accept JSON documents posted over HTTP. Currently the only supported application is elasticsearch, which just works(tm).
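As a rough illustration of what "JSON documents posted over HTTP" means for a target, the sketch below builds (but does not send) one such request with the Python standard library. The `/<index>/<type>` URL shape follows elasticsearch's document index API; the host `localhost:9200`, the index and type names, and the sample document are assumptions made for the example, and this is not necessarily how the charm itself posts data.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, document):
    """Build a POST request carrying one JSON document.

    Hypothetical helper: the URL layout is an assumption based on
    elasticsearch's index API, not taken from the charm.
    """
    url = "{}/{}/{}".format(base_url.rstrip("/"), index, doc_type)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values, for illustration only.
request = build_index_request(
    "http://localhost:9200", "mycrawl", "product",
    {"name": "Example item", "price": 9.99},
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Any data store that accepts a request of this kind could in principle act as a target, which is why other stores may work even though only elasticsearch is supported today.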
For more details of the import.io command-line crawler functionality please read:-
For more details of the import.io command-line crawler settings please read:-
Deploy the charm by doing this:
juju deploy importio
Currently you also need elasticsearch running, related to the charm:
juju deploy elasticsearch
juju add-relation importio elasticsearch
Known Limitations and Issues
Currently the only target we stream JSON documents into is elasticsearch; in theory, other data stores would work as well.
The configuration does not ship with defaults for most settings. The easiest way to set them is:-
juju set importio --config /path/to/config.yaml
with a yaml file like so:-
connectorGuid:
startUrls:
maxDepth:
crawlTemplate:
dataTemplate:
connections:
pause:
apiKey:
userGuid:
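Because most settings have no defaults, it can help to check that none are missing before running `juju set`. The following is a minimal sketch that renders such a file from a Python dictionary and refuses to do so if a key from the listing above is absent. The `render_config` helper and every value shown are placeholders invented for the example (the template values in particular are not guaranteed to be valid template syntax), and the flat `key: value` layout simply mirrors the listing above.

```python
import json

# The nine keys shown in the listing above.
REQUIRED_KEYS = [
    "connectorGuid", "startUrls", "maxDepth", "crawlTemplate",
    "dataTemplate", "connections", "pause", "apiKey", "userGuid",
]


def render_config(settings):
    """Return YAML text with one `key: value` line per required setting."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    # json.dumps output (quoted strings, bare integers) is also valid YAML.
    return "".join(
        "{}: {}\n".format(key, json.dumps(settings[key]))
        for key in REQUIRED_KEYS
    )


# Placeholder values only; substitute your own.
example = {
    "connectorGuid": "YOUR-CONNECTOR-GUID",
    "startUrls": "http://www.example.com/",
    "maxDepth": 3,
    "crawlTemplate": "www.example.com/",
    "dataTemplate": "www.example.com/products/",
    "connections": 2,
    "pause": 1,
    "apiKey": "YOUR-API-KEY",
    "userGuid": "YOUR-USER-GUID",
}
print(render_config(example), end="")
```

Writing the returned text to a file gives something that can be passed to the `juju set importio --config` command shown above.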
If you have any problems with this charm, or ideas for improvements, please contact us at:- firstname.lastname@example.org or http://support.import.io/
- better support for multi-result crawls
- better support for re-crawling, using paths for _id mapping.
- apiKey (string) This is your API key; you can create one after logging into this page:- http://import.io/data/account/
- connections (int) The number of pages the crawler will attempt to visit at the same time. The higher you set this number, the faster you will get data. WARNING: We do not recommend using any value higher than 5 if you are not crawling your own domain, as you may be blocked by the owner of the site.
- connectorGuid (string) Get the connector guid from a Crawler you have already set up, from the 'my data' page:- http://import.io/data/mine/
- (string) This is a name you nominate for the crawl and is used as the elasticsearch index name
- crawlTemplate (string) Sets the URL pattern of the parts of the site you want to crawl. For example, if you were only interested in crawling the beauty section at boots, you would set it to: www.boots.com/beauty. This is helpful because the fewer unnecessary places your crawler has to travel looking for data, the more efficient it will be at returning it.
- (string) This is a name you nominate for the data and is used as the elasticsearch type name
- dataTemplate (string) This is the URL pattern of your example pages. The crawler will try to extract data from any page that matches that pattern. For more details on the syntax of this template see this page:- http://support.import.io/knowledgebase/articles/247574-advanced-crawler-options
- maxDepth (int) This is the maximum number of clicks from the start URL the crawler will travel to find data. By default it is set to 10 (the maximum allowed) to enable you to get all the data. However, the fewer clicks the crawler needs to travel, the quicker your data will be returned, so if possible it is a good idea to set this to a lower number.
- pause (int) Indicates how long the crawler will wait (in seconds) before moving from one page to the next. The smaller you set this number, the faster data will be returned. WARNING: We do not recommend setting it to zero, as you may be blocked by the owner of the site.
- startUrls (string) By default, the crawler will start from the pages you gave as examples. However, it is sometimes more efficient to start from somewhere more central to the site (like the homepage).
- userGuid (string) This is your import.io user id; you can find it after logging into this page:- http://import.io/data/account/
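To make the interaction between the start URLs, the two URL patterns and the maximum depth more concrete, here is a toy model of a depth-limited crawl over an in-memory link graph. It is only a sketch of the idea described above, not import.io's implementation: the `toy_crawl` function, the example URLs and the use of shell-style wildcards (Python's `fnmatchcase`) in place of the real template syntax are all assumptions.

```python
from collections import deque
from fnmatch import fnmatchcase


def toy_crawl(links, start_urls, crawl_pattern, data_pattern, max_depth):
    """Breadth-first walk over `links` (a dict: URL -> list of linked URLs).

    Returns, in visit order, the URLs matching `data_pattern` that are
    reachable within `max_depth` clicks of a start URL, following only
    links that match `crawl_pattern`.
    """
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    data_pages = []
    while queue:
        url, depth = queue.popleft()
        if fnmatchcase(url, data_pattern):
            data_pages.append(url)  # a page we would extract data from
        if depth == max_depth:
            continue  # click budget used up; do not follow links further
        for target in links.get(url, []):
            if target not in seen and fnmatchcase(target, crawl_pattern):
                seen.add(target)
                queue.append((target, depth + 1))
    return data_pages


# Hypothetical site structure, for illustration only.
LINKS = {
    "www.example.com/": ["www.example.com/beauty", "www.example.com/toys"],
    "www.example.com/beauty": [
        "www.example.com/beauty/item-1",
        "www.example.com/beauty/item-2",
    ],
    "www.example.com/toys": ["www.example.com/toys/item-9"],
    "www.example.com/beauty/item-1": ["www.example.com/beauty/item-1/reviews"],
}

found = toy_crawl(
    LINKS,
    start_urls=["www.example.com/beauty"],
    crawl_pattern="www.example.com/beauty*",
    data_pattern="www.example.com/beauty/item-?",
    max_depth=1,
)
print(found)
```

In this model, narrowing the crawl pattern keeps the walk out of the toys section entirely, and lowering the depth limit cuts the number of pages visited, which mirrors the efficiency advice given for those two settings.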