NCollector Studio - Documentation

Documentation for the NCollector Studio main application

Home screen

After starting NCollector Studio, you will be presented with the home screen. From here you can perform common tasks like starting new projects or managing previously run projects.

Screenshot main window

Library

On the left-hand side of the home screen you will see a list of previously run projects (loaded from the configured projects folder).

Managing libraries in NCollector Studio

When a project is selected, a number of options become available.

Wizard

When creating a new job, or loading a previously saved one, you can easily configure the project to your liking with a few clicks in the user-friendly wizard. The wizard is started by clicking the appropriate option on the home screen.

Choose what kind of project you want to start; the different project types are summarized in the introduction chapter.

Project Setup

On the initial page it is recommended that you specify a name for your project. A default download folder is suggested and should be used unless you have good reasons not to. Finish the first page by specifying a source address and click Next.

Wizard screenshot setting the initial project setup

Click the "Log In"-button next to the address if the given website requires you to log on to access it. A web browser will open, and you'll be able to fill in then authentication details in the same way as when you're logging into the site from your regular browser.

The "Add More"-button will allow you to add more urls for the project.

Link Restrictions

One of the most powerful mechanisms for filtering which files get downloaded is the set of options found on the "Link Restrictions"-page in the wizard. From here you can limit the crawler to scanning the website only down to a given depth, or make it ignore all files outside the domain of the starting URL(s).

Set the URL-restrictions first. The following choices are available:

Wizard screenshot setting link restrictions

"Domain only" will ignore files from other domains, but include files from sub-domains related to the start-addresses. If this option is chosen when e.g. www.example.com is used, the spider will ignore links from www.mysite.com but include links from forum.example.com.

"Sub-domain only" will ignore links from all other domains and sub-domains.

"Folder only" will allow only links from the specified folder or lower. All links pointing to higher-levels folders or other domains will be ignored. If http://example.com/images/index.htm is the start-address, this setting will cause the link http://example.com/forum to be ignored, but http://example.com/images/brazil/1.jpg to pass the test.

When the "unrestricted"-option is selected, all links will pass through. Be careful with this, and make sure some other restrictions are in place to prevent the whole internet from being crawled.

A great way to restrict your project in a natural manner is to limit it to a given depth. You can do this by changing the maximum number of levels: 0 (zero) implies unrestricted, 1 means only the initial page will be included, 2 means NCollector will include all pages that the initial page links to, and so on.

Illustrating the way levels work in the application
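As a rough illustration of the level counter, consider the sketch below. It is only a sketch: the get_links helper is hypothetical and the code is not NCollector's own crawler.

    # Sketch of how a maximum-levels setting limits crawl depth (illustrative only).
    from collections import deque

    def crawl(start_url, max_levels, get_links):
        # get_links(url) is a hypothetical helper returning the links found on a page.
        queue, visited = deque([(start_url, 1)]), set()    # the start page is level 1
        while queue:
            url, level = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            if max_levels and level >= max_levels:         # 0 means unrestricted
                continue                                   # last level: do not follow links further
            for link in get_links(url):
                queue.append((link, level + 1))
        return visited                                     # max_levels=1 -> only the start page,
                                                           # max_levels=2 -> start page + its links, ...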

Respect robots.txt

The crawler will respect the rules published by the server (in its robots.txt file) about which links may be followed. It's considered good behavior to enable this.
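For the curious, the check is conceptually the same as the one offered by Python's standard library; when the option is enabled, NCollector does something equivalent internally.

    # What "respect robots.txt" means, expressed with Python's standard library.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()                                              # download and parse the server's rules
    print(rp.can_fetch("*", "http://www.example.com/private/page.htm"))  # False if disallowed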

Use project Url-restrictions for embedded links

The default NCollector behavior is to download and eventually parse all types of content embedded in a webpage (e.g. javascript and css), regardless of what the "URL Restrictions" option has been set to. Checking this will apply the same rules to embedded content as to the rest of the links. This will probably lead to fewer links being downloaded, but at the same time increases the risk of some wanted content being left out.

Maximum Pages

Indicates how many HTML pages to scan for links before stopping. After the indicated number of pages has been reached, NCollector will continue downloading the links already in the queue, but will not add any more links.
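A short sketch of this behaviour, using a hypothetical scan_page helper (illustrative only):

    # After max_pages pages have been scanned, queued links are still downloaded,
    # but no new links are added to the queue (illustrative only).
    def crawl_with_page_limit(start_links, max_pages, scan_page):
        queue, pages_scanned = list(start_links), 0
        while queue:
            url = queue.pop(0)
            links = scan_page(url)             # hypothetical helper: download page, return its links
            pages_scanned += 1
            if pages_scanned < max_pages:
                queue.extend(links)            # still below the limit: keep adding links
            # otherwise: finish what is already queued, but add nothing new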

Keywords

The result of a job can be tailored even further using keyword filters on links. The predefined keyword shortcuts make it easy to perform common tasks like removing advertisements, preventing logout and removing thumbnails. Use the "Custom Keywords"-button to specify advanced keywords.

Wizard screenshot setting keyword filter

Custom Keywords

Targeted filtering on URLs can be achieved using custom keywords. Specify the keyword, whether it is inclusive or exclusive, and what kind of links to apply the rule to. In the example we're filtering out any link with the word "banner" in it.
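A minimal sketch of the inclusive/exclusive idea; the exact matching NCollector performs may differ.

    # Illustrative keyword rule applied to a URL (not NCollector's actual code).
    def keyword_allows(url, keyword, exclusive):
        if exclusive:
            return keyword not in url           # exclusive: drop links containing the keyword
        return keyword in url                    # inclusive: keep only links containing the keyword

    # The example above: filter out any link containing "banner".
    print(keyword_allows("http://ads.example.com/banner_top.gif", "banner", exclusive=True))  # False -> skipped
    print(keyword_allows("http://www.example.com/photo1.jpg", "banner", exclusive=True))      # True  -> kept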

Extensions (FileRip mode only)

The most important step when performing a file-rip is of course to specify what kinds of files you want to download. This is done by simply placing a check-mark next to the wanted file types.

Wizard screenshot setting file type filters

Custom Extensions

Extensions can also be typed in manually by clicking the "Custom Extensions"-button. When the predefined file-rip settings are too limiting, we recommend trying out "Custom Extensions", which lets you specify any imaginable file extension. Just type the extension in the textbox and click "Add". Repeat the process for as many extensions as you want.
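How an extension filter might test a link, sketched in a few lines (illustrative only):

    # Compare a URL's file extension against the chosen set of extensions (illustrative only).
    from urllib.parse import urlparse

    def matches_extensions(url, extensions):
        path = urlparse(url).path                # e.g. "/images/brazil/1.jpg"
        ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
        return ext in extensions

    chosen = {"jpg", "png", "pdf"}               # predefined check-marks or custom extensions
    print(matches_extensions("http://example.com/images/brazil/1.jpg", chosen))  # True
    print(matches_extensions("http://example.com/forum/index.htm", chosen))      # False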

Enable Quick-Rip

This mode can significantly shorten the time it takes to complete a job. This is achieved by letting the crawler take some shortcuts. The quick-rip option consists of several sub-options, which can be turned on or off individually. Also note that the default settings can be tuned in the preferences dialog.

Downloaded Items Restrictions (FileRip mode only)

Items downloaded in FileRip mode can be restricted on file size or image dimensions (provided that at least one image extension was selected in the previous step). Items that do not meet the restrictions will be discarded.

Wizard screenshot setting downloaded item restrictions

In this example the downloaded items must be at least 15 kilobytes in size, and the photos must be at least 400 pixels wide and 300 pixels tall. No upper limits have been set.
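Expressed as a sketch with the same numbers; NCollector applies these rules internally, and the function below is only for illustration.

    # Illustrative check mirroring the example: at least 15 KB, images at least 400x300 pixels.
    def item_passes(size_bytes, width=None, height=None,
                    min_bytes=15 * 1024, min_width=400, min_height=300):
        if size_bytes < min_bytes:
            return False                         # file too small -> discarded
        if width is not None and height is not None:   # only images carry dimensions
            if width < min_width or height < min_height:
                return False                     # image too small -> discarded
        return True

    print(item_passes(20_000, 800, 600))   # True: large enough in every respect
    print(item_passes(20_000, 350, 600))   # False: narrower than 400 pixels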

Start the project

After walking through the steps in the wizard, click the "Start"-button and the job will begin. If the project was configured correctly, a lot of matching files will be downloaded.

Remember that you can always stop a job in progress, and reconfigure it on the fly.

Settings

NCollector Studio can be fine-tuned to your preferences in the settings screen made available by the settings button in the top right corner of the application.

Application screenshot of the options dialog

Folders

The default folders used by NCollector Studio.

Application screenshot with target download and project folder settings

NCollector Downloads Folder: The folder in which NCollector Studio puts the files downloaded during a job.

NCollector Projects Folder: The folder that projects are saved to and loaded from. Project files have the *.wrp extension.

Connection

Proxy configuration and other connection specific settings.

Application screenshot with connection settings

Maximum Simultaneous Connections: The number of parallel downloads active at the same time. Increase the number to increase the speed of NCollector.

Download Timeout Seconds: The number of seconds before terminating a download when the web-server is not responding.

User Agent: The web-browser to emulate when negotiating with the web-server.

Your ISP (Internet service provider) or employer may require you to connect to the internet through a proxy-server; it can be configured using these options.

Proxy Host: The address of the proxy server (either IP-address or name)

Proxy Server Port: The port of the proxy-server. Usually 8080 or 80.

Proxy User Name: A username is not always required. Check with your ISP.

Proxy Password: Use in conjunction with the user-name.
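Conceptually, the four fields combine into a single proxy address, as in this standard-library sketch. The values are examples only; NCollector handles this for you once the settings are saved.

    # How the proxy host, port, user name and password fit together (example values).
    import urllib.request

    proxy_url = "http://{user}:{password}@{host}:{port}".format(
        user="alice", password="secret", host="proxy.example.com", port=8080)

    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}))
    # opener.open("http://www.example.com/") would now go through the configured proxy.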

Other

Other settings.

Application screenshot with other specific settings

Enable log-module in user-interface: Uncheck this to prevent log-messages from being sent to the "Log-module" in the main-screen of NCollector. Some processing power can be saved by doing this. Logging to file will happen regardless of this setting.

Automatically report errors to Calluna Software: Unhandled errors (bugs) will be reported to us automatically when this is checked. We use the information to improve the quality and stability of NCollector Studio. Please note that only anonymous data are reported; we can in no way identify you when this option is turned on.

From this section you can manage the default settings for the Quick-Rip option available for FileRip jobs.

Extract only links matching extension(s) at last level: When checked, only links with any of the chosen extensions (e.g. .JPG) will be added to the queue when the crawler is at the last level. This might increase performance, but can also cause some files not to be downloaded in certain cases.

Scan only HTML for links (ignore CSS and Javascript): Jobs will finish more quickly since fewer files are examined for links by the crawler.

Treat last-level links as embedded if matching extension(s): This is a good choice if you're having trouble deciding how many levels to scan for links. With this option selected, direct links to chosen extensions, e.g. <a href="image.jpg">, will be downloaded even if the project depth would otherwise exclude them.