NCollector Studio

Home-screen

After starting NCollector Studio, you will see the home-screen. From here you can perform common tasks like starting new jobs, loading previously saved projects and navigating between the modules. The three shortcuts in the main-area let you start a new job with a single click.

Screenshot main window

Toolbar

In the upper left-hand corner of the main-screen you will find the toolbar.

The toolbar in NCollector Studio

From the toolbar the following tasks (from left to right) can be performed:

Modules

Use the module-selector in the upper right-hand corner to switch between the main-modules.

The modules available in NCollector Studio

The "General"-module will show information about the current-job as well as a list of the active downloads.

The "Files"-module presents a list of all files downloaded during a “File Rip” project. You can then open the webpage on which the files were found, view downloaded images in the previewer, or open them in the associated program on your computer. You can also delete unwanted files on the fly. NB! The files-module is only applicable to “File Rip” projects.

Important information recorded during a job is available in the "Log"-module. The list will update continuously as the job progresses. The log-file can also be opened using the "Tools, Show Log File"-entry in the main-menu. Logging can be turned completely off in the preferences-menu to reduce memory and CPU-consumption a bit.

Wizard

When creating a new job, or loading a previously saved job, you can easily configure the project to your liking with a few clicks in the user-friendly wizard. The wizard can be started either from the new-menu or from one of the shortcut-icons on the home-screen.

Choose what kind of project you want to start. Let’s summarize a little from the introduction-chapter:

Project Setup

On the initial page it’s recommended that you specify a name for your project. A default download-folder is suggested, and should be used unless you have good reasons not to. Finish off the first page by specifying a source-address and click next.

Wizard screenshot setting the initial project setup

Click the "Log In"-button next to the address if the given website requires you to log on to access it. A web browser will open, and you'll be able to fill in the authentication details in the same way as when you're logging into the site from your regular browser.

The "Add More"-button will show the advanced options for this page.

Link Restrictions

Among the most powerful mechanisms for filtering which files get downloaded are the options found on the “Link Restrictions”-page in the wizard. From here you can limit the crawler to scan the website only to a given depth, or to ignore all files outside the domain of the starting-URL(s).

Set the URL-restrictions first. The following choices are available:

Wizard screenshot setting keyword filter

"Domain only" will ignore files from other domains, but include files from sub-domains related to the start-addresses. If this option is chosen when e.g. www.example.com is used, the spider will ignore links from www.mysite.com but include links from forum.example.com.

"Sub-domain only" will ignore links from all other domains and sub-domains.

"Folder only" will allow only links from the specified folder or lower. All links pointing to higher-level folders or other domains will be ignored. If http://example.com/images/index.htm is the start-address, this setting will cause the link http://example.com/forum to be ignored, but http://example.com/images/brazil/1.jpg to pass the test.

When the "unrestricted"-option is selected, all links will pass through. Be careful with this, and make sure some other restrictions are in place to prevent the whole internet from being crawled.
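
The four URL-restriction choices above can be sketched in a few lines of Python. This is only an illustration of the rules as described in this manual, not NCollector's actual implementation; the function names and mode strings are made up for the example, and "domain" is approximated naively as the last two labels of the host name (which is wrong for suffixes such as .co.uk).

```python
from urllib.parse import urlsplit


def _host(url):
    """Return the lower-cased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def _base_domain(host):
    # Naive assumption: the domain is the last two labels of the host.
    return ".".join(host.split(".")[-2:])


def _folder(url):
    """Return the folder part of the URL path, including the trailing slash."""
    path = urlsplit(url).path or "/"
    return path[: path.rfind("/") + 1]


def link_allowed(start_url, link, mode):
    """Decide whether a link passes the chosen URL restriction (hypothetical)."""
    if mode == "unrestricted":
        return True
    start_host, link_host = _host(start_url), _host(link)
    if mode == "domain":
        return _base_domain(start_host) == _base_domain(link_host)
    if mode == "subdomain":
        return start_host == link_host
    if mode == "folder":
        same_host = start_host == link_host
        return same_host and urlsplit(link).path.startswith(_folder(start_url))
    raise ValueError(f"unknown mode: {mode}")
```

With www.example.com as the start-address, the "domain" mode accepts forum.example.com and rejects www.mysite.com, matching the examples given above.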

A great way to restrict your project in a natural manner is to limit it to a given depth. You can do this by changing the maximum number of levels. 0 (zero) implies unrestricted, 1 means only the initial page will be included, 2 means NCollector will include all pages that the initial page links to and so on.

Illustrating the way levels work in the application
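
The way levels work can also be illustrated with a small breadth-first sketch. It assumes a hypothetical in-memory link map (page name to the pages it links to) instead of a real website, and it is not taken from NCollector itself; it only mirrors the rule that 0 means unrestricted, 1 means the initial page only, and so on.

```python
from collections import deque


def pages_within_levels(start, links, max_levels):
    """Return the pages reached from `start`, honoring the level limit.

    `links` maps a page to the pages it links to (made-up test data).
    max_levels == 0 means unrestricted; 1 means only the initial page.
    """
    seen = {start}
    order = []
    queue = deque([(start, 1)])  # the initial page is level 1
    while queue:
        page, level = queue.popleft()
        order.append(page)
        if max_levels and level >= max_levels:
            continue  # deepest allowed level: do not follow its links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


site = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
print(pages_within_levels("a", site, 1))  # ['a']
print(pages_within_levels("a", site, 2))  # ['a', 'b', 'c']
```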

Keywords

The result of a job can be tailored even more using the keyword-filters on links. By using the predefined keyword-shortcuts you can easily perform some common tasks like removing advertisements, preventing logout and removing thumbnails. Use the "Custom Keywords"-button to specify advanced keywords.

Wizard screenshot setting keyword filter

Extensions (File-rip mode only)

The most important step when performing a file-rip is of course to specify what kind of files you want to download. Simply put a check-mark next to the wanted file-types.

Wizard screenshot setting file type filters

Extensions can also be typed in manually by clicking the "Custom Extensions"-button. The "Advanced settings"-button will be described later in this manual.
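
Conceptually, selecting extensions amounts to keeping only the links whose path ends in one of the wanted file-types. The sketch below shows one way such a filter could look; the function names are hypothetical, and the choice to ignore case and query strings is an assumption rather than documented NCollector behavior.

```python
import posixpath
from urllib.parse import urlsplit


def has_wanted_extension(url, wanted):
    """True if the URL path ends in one of the wanted extensions."""
    path = urlsplit(url).path  # drops any ?query or #fragment
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    return extension in wanted


def select_files(urls, wanted):
    """Keep only the URLs matching the chosen extensions (e.g. 'jpg')."""
    normalized = {e.lower().lstrip(".") for e in wanted}
    return [u for u in urls if has_wanted_extension(u, normalized)]
```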

Downloaded Items Restrictions (FileRip mode only)

Items downloaded in FileRip mode can be restricted on file-size or image-dimensions (provided that at least one image-extension was selected in the previous step). Items will be discarded when the restrictions are not met.

Wizard screenshot setting downloaded item restrictions

In this example the downloaded items must be at least 15 kilobytes large, and the photos must be at least 400 pixels wide and 300 pixels tall. No upper limits have been set.
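
The restrictions in this example can be expressed as a simple check. The sketch below is illustrative only: the function name and parameters are invented, one kilobyte is assumed to be 1024 bytes, and the dimension test is assumed to apply only when the item is an image.

```python
def item_passes(size_bytes, width=None, height=None,
                min_kb=15, min_width=400, min_height=300):
    """Return True if a downloaded item meets the minimum restrictions.

    width/height are None for non-image items, in which case only the
    file-size restriction is checked (an assumption for this sketch).
    """
    if size_bytes < min_kb * 1024:
        return False
    if width is not None and height is not None:
        return width >= min_width and height >= min_height
    return True


print(item_passes(20 * 1024, 800, 600))  # True
print(item_passes(20 * 1024, 320, 240))  # False: image too small
```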

Start the project

After walking through the steps in the wizard, click the “Start”-button and the job will start. If the project was configured correctly, matching files will begin to download.

Remember that you can always stop a job in progress, and reconfigure it on the fly.

Wizard - Advanced options

Each of the pages in the wizard has some additional configuration options, which are recommended for experienced users only.

Project Setup

Wizard screenshot setting project initial settings

Add More

Add multiple start-addresses for the crawler to follow. Type the address in the "source address"-textfield and click the "add"-button to add it to the list. The "remove"-button removes a selected address from the list.

Link Restrictions

Wizard screenshot setting link restrictions

Respect robots.txt

The crawler will respect rules set by the server regarding which links to follow or not. It's considered good behavior to enable this.
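
To see what respecting robots.txt means in practice, the following sketch uses Python's standard urllib.robotparser module on a made-up robots.txt. The user-agent name "ExampleCrawler" and the rules are assumptions for illustration; how NCollector evaluates these rules internally is not described in this manual.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules published by a server.
rules = parse_robots("User-agent: *\nDisallow: /private/\n")

print(rules.can_fetch("ExampleCrawler", "http://example.com/images/1.jpg"))
print(rules.can_fetch("ExampleCrawler", "http://example.com/private/a.html"))
```

The first link is allowed and the second is not, so a crawler that respects the rules would skip everything under /private/.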

Use project Url-restrictions for embedded links

The default NCollector behavior is to download, and where applicable parse, all types of content embedded in a webpage (e.g. JavaScript and CSS) regardless of what the "URL Restrictions" have been set to. Checking this will apply the same rules to embedded content as to the rest of the links. This will probably lead to fewer links being downloaded, but at the same time increase the risk of some wanted content being left out.

Maximum Pages

Indicates how many HTML-pages to scan for links before stopping. After the indicated number of pages has been reached, NCollector will continue downloading the links already in queue, but skip adding any more links.
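
The behavior described above (keep downloading what is already queued, but stop adding new links) can be sketched as follows. The link map is made-up test data in which every item is treated as an HTML-page, and the function is a simplified illustration rather than NCollector's real queue logic.

```python
from collections import deque


def crawl_with_page_limit(start, links, max_pages):
    """Download queued items, but scan at most `max_pages` pages for links."""
    queue = deque([start])
    seen = {start}
    downloaded = []
    scanned = 0
    while queue:
        page = queue.popleft()
        downloaded.append(page)  # queued items are always downloaded
        if scanned >= max_pages:
            continue  # limit reached: do not add any more links
        scanned += 1
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return downloaded


site = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
print(crawl_with_page_limit("a", site, 1))  # ['a', 'b', 'c']
```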

Wizard screenshot setting keyword filter

Link Keyword Filters

Custom Keywords

Targeted filtering on URLs can be achieved using custom keywords. Specify the keyword, whether it's inclusive or exclusive, and what kind of links the rule applies to. In the example we're filtering out any links containing the word "banner".
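
A keyword rule of this kind can be pictured as a substring test on the URL. The sketch below assumes that an exclusive keyword rejects any link containing it, and that when inclusive keywords exist a link must contain at least one of them; these semantics, the function name and the case-insensitive matching are assumptions for illustration, and the "kind of links" setting is left out.

```python
def passes_keywords(url, rules):
    """Apply (keyword, inclusive) rules to a URL.

    rules: list of (keyword, inclusive) pairs; inclusive=False means
    the keyword is exclusive.
    """
    lowered = url.lower()
    include = [k.lower() for k, inclusive in rules if inclusive]
    exclude = [k.lower() for k, inclusive in rules if not inclusive]
    if any(k in lowered for k in exclude):
        return False
    return not include or any(k in lowered for k in include)


banner_rule = [("banner", False)]  # exclusive keyword, as in the example
print(passes_keywords("http://example.com/ads/banner_1.gif", banner_rule))  # False
print(passes_keywords("http://example.com/images/1.jpg", banner_rule))      # True
```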

FileRip Settings

Custom Extensions

When the predefined filerip-settings are too limiting, we recommend trying "Custom Extensions", which allows you to specify any file-extension. Just type the extension in the textbox and click "Add" to continue. Repeat the process for as many extensions as you want.

Wizard screenshot setting custom extensions

Enable Quick-Rip

This mode can significantly shorten the time it takes to complete a job. This is achieved by letting the crawler take some shortcuts. The quick-rip option consists of several sub-options, which can be turned on or off individually. Also note that the default settings can be tuned in the preferences-dialog.

Report

After a job has finished or been terminated, an HTML-report will be opened in your web browser. The report will show vital information and statistics about the job, and allow you to dive further into the results.

Screenshot of the html report

In offline browse-mode the report will serve as an entry-point to the first offline-browsable page(s).

After a file rip-job, the report will contain a list of the files downloaded. In case the job contained any image-extensions, the gallery function will be available, allowing you to easily view the images in the form of a slideshow.

Options-dialog

Application screenshot of the options dialog

NCollector Studio can be fine-tuned to your preferences in the options-dialog available from the main-menu.

Connection

Maximum Simultaneous Connections: The number of parallel downloads being active at the same time. Increase the number to increase the speed of NCollector.

Download Timeout Seconds: The number of seconds before terminating a download when the web-server is not responding.

User Agent: The web-browser to emulate when negotiating with the web-server.
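
For readers who want to picture what these three settings control, here is a minimal sketch using only the Python standard library. The constant values, the user-agent string and the function names are made up for the example; it shows the general idea of a capped pool of parallel downloads with a timeout, not how NCollector Studio is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

MAX_CONNECTIONS = 4   # "Maximum Simultaneous Connections" (example value)
TIMEOUT_SECONDS = 30  # "Download Timeout Seconds" (example value)
USER_AGENT = "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"  # hypothetical


def build_request(url):
    """Create a request that announces the configured user agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url):
    """Download one URL, giving up if the server stays silent too long."""
    with urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as response:
        return response.read()


def download_all(urls):
    """Download the URLs with at most MAX_CONNECTIONS running in parallel."""
    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
        return list(pool.map(fetch, urls))
```

A higher connection count lets more downloads overlap, which is why raising it tends to speed up a job, at the cost of putting more load on the web-server.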

Folders

Application screenshot with target download and project folder settings

NCollector Downloads Folder: The folder in which NCollector Studio puts the files downloaded during a job.

NCollector Projects Folder: The folder that projects are saved to and loaded from. Project files have the *.wrp extension.

Other

Application screenshot with other specific settings

Enable log-module in user-interface: Uncheck this to prevent log-messages from being sent to the "Log-module" in the main-screen of NCollector. Some processing power can be saved by doing this. Logging to file will happen regardless of this setting.

Automatically report errors to Calluna Software: Unhandled errors (bugs) will be automatically reported to us when this is checked. We use the information to improve the quality and stability of NCollector Studio. Please note that only anonymous data is reported. We can in NO way identify you when this option is turned on.

Proxy

A proxy-server can be configured using this option. Often your ISP (Internet service provider) or employer requires you to connect to the internet through a proxy-server.

Application screenshot with proxy specific settings

Proxy Host: The address of the proxy server (either IP-address or name)

Proxy Server Port: The port of the proxy-server. Usually 8080 or 80.

Proxy User Name: A username is not always required. Check with your ISP.

Proxy Password: Use in conjunction with the user-name.
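
The four proxy fields combine into a single proxy address of the form http://user:password@host:port. The sketch below shows how such an address could be assembled and handed to Python's standard urllib; the host names and credentials are placeholders, and the helper names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import ProxyHandler, build_opener


def proxy_url(host, port, user=None, password=None):
    """Combine host, port and optional credentials into a proxy URL."""
    credentials = ""
    if user:
        credentials = quote(user, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"http://{credentials}{host}:{port}"


def opener_for(proxy):
    """Return an opener that sends HTTP and HTTPS traffic via the proxy."""
    return build_opener(ProxyHandler({"http": proxy, "https": proxy}))


print(proxy_url("proxy.example.com", 8080))
print(proxy_url("10.0.0.1", 80, "alice", "s3cret"))
```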

Quick Rip

Application screenshot with quick rip specific settings

From this section you can manage the default settings for the Quick Rip option available for FileRip-jobs.

Extract only links matching extension(s) at last level: When checked, only links with any of the chosen extensions (e.g. .JPG) will be added to the queue when the crawler is at the last level. This might increase performance, but can also cause some files not to be downloaded in certain cases.

Scan only HTML for links (ignore CSS and Javascript): Job will finish quicker since fewer files are examined for links by the crawler.

Treat last-level links as embedded if matching extension(s): This is a good choice if you're having trouble setting how many levels to scan for links. With this option selected direct links to chosen extensions e.g. <a href="image.jpg"> will be downloaded even if the project depth otherwise would ignore it.

Report

Application screenshot with report specific settings

Insert NCollector Signature in Translated Pages: This option is applicable only in Offline Browse Mode, and will result in a clickable link "Ripped by NCollector Studio - Calluna Software" being inserted as a header in all translated pages.

Show HTML Report When Job Finishes: An HTML-report with a summary of the job will be shown when the job is finished (or gets terminated).