Console Application

Finding the console application

The console mode gives access to the full potential of NCollector Studio. It has been optimized for maximum performance and flexibility, and is recommended for experienced users only.

The console can be started either by clicking the “NCollector Studio Console” shortcut in the Start menu, or by locating it manually from the command line.

After clicking the shortcut you are taken to the folder in which the program is installed. From there you can run the NCollector Studio command-line tool using the options described in the following chapters.

The console application help command

Please note that project files (*.wrp) created in the NCollector Studio UI can also be loaded from the command-line and vice versa.

Command line options

An overview of the available command-line options can be displayed at any time by typing the command:

ncconsole.exe /?

Options applicable to all jobs:

Each option is listed below as parameter, description, and example.

-mirror

Downloads all links found, including HTML-pages.

-mirror

-offlinebrowse

Downloads all links found, including HTML-pages. Translates links to local relative paths after download completes.

-offlinebrowse

-filerip

Downloads only links with a given extension.

-filerip

-url

The address to start ripping from. Can be used multiple times. Must include the protocol prefix (http:// or https://).

-url=http://NCollector Studio.org

-domainonly

The crawler only scans and downloads links from the domains specified as start addresses, including their subdomains.

-domainonly

-subdomainonly

The crawler only scans and downloads links from the exact subdomains specified as start addresses. Other subdomains of the same domain are not included.

-subdomainonly

-folderonly

The crawler does not download links located above the folder specified in the start addresses.

-folderonly

-levels

The number of levels the crawler scans for links. A level can be thought of as the number of links that must be followed to reach a page when browsing in a web browser. Default is one level.

-levels=2

-maxpages

The number of pages to scan for links. Default is unlimited.

-maxpages=100

-userobotstxt

Tells NCollector Studio to follow the rules specified in the server's robots.txt file. This option might cause fewer links to be found and may slow down ripping.

-userobotstxt

-name

The name of the project. When a name is specified, the downloaded files are put in a folder with this name. When the same project is run multiple times, the caching function also prevents files that have not changed from being downloaded again.

-name=MyRipProject

-save

Will save a project-template to disk for reuse later on. All settings will be saved when this option is specified.

-save=MyRipProject.wrp

-load

Instructs NCollector Studio to load a previously saved project template. No other options can be used together with this option.

-load=MyRipProject.wrp

-noreport

Will instruct NCollector Studio to skip the generation of the post-session HTML-report.

-noreport

-keyword

Instructs NCollector Studio to include or exclude links with given keywords in the address. The keyword filter supports a number of options which should precede the keyword option. The options are:

include|exclude

complete|withoutfilename|onlyfilename

downloaded|spidered|all

See below for detailed descriptions.

-keyword=ferrari

/include

Links with the given keyword should be included. The default.

-keyword=ferrari /include

/exclude

Links with the given keyword should be excluded.

-keyword=ferrari /exclude

/complete

The complete URL should be checked when parsing for the given keyword. The default.

-keyword=ferrari /complete

/withoutfilename

The URL excluding the filename should be checked when parsing for the given keyword.

-keyword=ferrari /withoutfilename

/onlyfilename

Only the filename should be checked when parsing for the given keyword.

-keyword=ferrari /onlyfilename

/downloaded

Only downloaded files should be checked for the keyword. The default.

-keyword=ferrari /downloaded

/spidered

Only spidered files should be checked for the keyword.

-keyword=ferrari /spidered

/all

All files should be checked for the keyword.

-keyword=ferrari /all
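The three URL scopes of the keyword filter (/complete, /withoutfilename, /onlyfilename) can be pictured with a small Python sketch. It is an illustration only, not NCollector Studio's actual implementation: the function names are hypothetical, and the case-insensitive comparison is an assumption that the text above does not confirm.

```python
from urllib.parse import urlsplit


def keyword_target(url: str, scope: str = "complete") -> str:
    """Return the part of `url` that the keyword would be compared against."""
    if scope == "complete":
        return url
    parts = urlsplit(url)
    # Split the path at its last slash: folder part and filename part.
    folder, _, filename = parts.path.rpartition("/")
    if scope == "onlyfilename":
        return filename
    if scope == "withoutfilename":
        return f"{parts.scheme}://{parts.netloc}{folder}/"
    raise ValueError(f"unknown scope: {scope}")


def link_passes(url: str, keyword: str,
                scope: str = "complete", mode: str = "include") -> bool:
    """Apply an include or exclude keyword filter to a single link."""
    if mode not in ("include", "exclude"):
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: matching ignores upper/lower case.
    found = keyword.lower() in keyword_target(url, scope).lower()
    return found if mode == "include" else not found
```

With this sketch, the link http://example.com/cars/ferrari.jpg passes "-keyword=ferrari /onlyfilename" but not "-keyword=ferrari /withoutfilename", because the keyword appears only in the filename.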

Options applicable in Filerip-mode only:

Each option is listed below as parameter, description, and example.

-extension

File type to rip. Only applicable in filerip-mode. Can be used multiple times. At least one is mandatory.

-extension=.jpg

-minsize

The minimum size in kilobytes of the files downloaded in file rip mode.

-minsize=100

-maxsize

The maximum size in kilobytes of the files downloaded in file rip mode.

-maxsize=1000

-minwidth

The minimum width in pixels of images downloaded in file rip mode.

-minwidth=640

-minheight

The minimum height in pixels of images downloaded in file rip mode.

-minheight=480

-maxwidth

The maximum width in pixels of images downloaded in file rip mode.

-maxwidth=1024

-maxheight

The maximum height in pixels of images downloaded in file rip mode.

-maxheight=768
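How the filerip filters combine can be sketched in Python as follows. This is a hedged illustration rather than the program's real logic: the class and helper names are hypothetical, the bounds are assumed to be inclusive, and the width and height checks are assumed to be skipped for files whose dimensions are unknown (for example non-image files).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def _within(value, low, high) -> bool:
    """True if value lies in [low, high]; missing bounds or values always pass."""
    if value is None:
        return True
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


@dataclass
class FileRipFilter:
    extensions: Tuple[str, ...]        # e.g. (".jpg",), at least one is required
    minsize_kb: Optional[int] = None   # mirrors -minsize (kilobytes)
    maxsize_kb: Optional[int] = None   # mirrors -maxsize (kilobytes)
    minwidth: Optional[int] = None     # mirrors -minwidth (pixels)
    maxwidth: Optional[int] = None     # mirrors -maxwidth (pixels)
    minheight: Optional[int] = None    # mirrors -minheight (pixels)
    maxheight: Optional[int] = None    # mirrors -maxheight (pixels)

    def accepts(self, filename: str, size_kb: int,
                width: Optional[int] = None,
                height: Optional[int] = None) -> bool:
        """Decide whether a file would be kept under these filters."""
        wanted = tuple(ext.lower() for ext in self.extensions)
        if not filename.lower().endswith(wanted):
            return False
        return (_within(size_kb, self.minsize_kb, self.maxsize_kb)
                and _within(width, self.minwidth, self.maxwidth)
                and _within(height, self.minheight, self.maxheight))
```

For example, FileRipFilter((".jpg",), minwidth=640, minheight=480) would keep an 800 x 600 JPEG and reject a 320 x 240 one.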

Examples

Rip JPEG-images with a given minimum dimension

Scenario: We want to rip all JPEG-images with a minimum width of 300 pixels and minimum height of 200 pixels. The spider should search for links 2 levels deep from the start-address.

Command-line:
ncconsole.exe -filerip -url=http://calluna-software.com -levels=2 -extension=.jpg -minwidth=300 -minheight=200
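The effect of the levels setting in the command above can be pictured as a breadth-first walk over links that stops at a fixed depth. The Python sketch below runs on a small in-memory link map instead of a real website. It assumes that the start address counts as level 0 and that each followed link adds one level; the manual does not state exactly how NCollector Studio counts, so treat this as an illustration only. The function name is hypothetical.

```python
from collections import deque
from typing import Dict, List


def crawl(links: Dict[str, List[str]], start: str, levels: int = 1) -> Dict[str, int]:
    """Return {page: depth} for every page reachable within `levels` link hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] >= levels:
            continue  # links on this page would exceed the level limit
        for target in links.get(page, []):
            if target not in depth:  # visit each page only once
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical site: the start page links to two galleries, one of which links deeper.
site = {
    "start": ["gallery1", "gallery2"],
    "gallery1": ["photo1"],
    "photo1": ["fullsize1"],
}
print(crawl(site, "start", levels=2))
```

With levels=2 the page "photo1" is reached but "fullsize1", three links away from the start address, is not.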

Rip all MPEG-movies with a given minimum and maximum file size

Scenario: We want to rip all MPEG-movies with a minimum size of 1 MB and a maximum size of 10 MB (sizes are given in kilobytes on the command line). The spider should scan at most 100 pages for links.

Command-line:
ncconsole.exe -filerip -url=http://movies.com -maxpages=100 -extension=.mpg -minsize=1000 -maxsize=10000

Download a site, two levels deep, for offline-browsing

Scenario: We want to be able to browse a site without an internet-connection. We want to download two levels, and restrict the spider to only follow links within the specified subdomain.

Command-line:
ncconsole.exe -offlinebrowse -url=http://galleries.photo.com -levels=2 -subdomainonly
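The difference between -subdomainonly and -domainonly can be sketched in Python as a comparison of host names. This is an illustration under stated assumptions, not the crawler's actual code: the function names are hypothetical, and the base domain is taken naively as the last two labels of the host name, which is wrong for suffixes such as .co.uk (a real implementation would need a public suffix list).

```python
from urllib.parse import urlsplit


def host(url: str) -> str:
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def base_domain(hostname: str) -> str:
    """Naive base domain: the last two labels, e.g. 'photo.com'."""
    return ".".join(hostname.split(".")[-2:])


def in_scope(link: str, start: str, mode: str) -> bool:
    """Decide whether `link` would be followed for a given start address."""
    if mode == "subdomainonly":
        # Same host only: other subdomains are out of scope.
        return host(link) == host(start)
    if mode == "domainonly":
        # Any subdomain of the same base domain is in scope.
        return base_domain(host(link)) == base_domain(host(start))
    raise ValueError(f"unknown mode: {mode}")
```

For the start address http://galleries.photo.com, a link to http://www.photo.com would be in scope with "domainonly" but out of scope with "subdomainonly".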

Download a site, maximum 1000 pages or 9 levels, and save the project for reuse

Scenario: We want to make a copy of the web pages and files on the servers in a domain. Links should not be translated. The project should be saved so that we can easily start it again another time.

Command-line:
ncconsole.exe -mirror -url=http://www.myblogs.com -maxpages=1000 -levels=9 -domainonly -save=myblogs.wrp -name=myblogs

Start a job from an already saved project-template

Scenario: We want to run the job in the previous example again.

Command-line:
ncconsole.exe -load=myblogs.wrp
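When jobs like the two above are started from a script, the command line can be assembled programmatically. The Python sketch below is a minimal example that uses only options documented in this chapter. The helper names are hypothetical, ncconsole.exe is assumed to be in the current folder or on the PATH, and the meaning of the program's exit code is not documented here, so the sketch simply returns it unchanged.

```python
import subprocess
from typing import Iterable, List


def build_command(mode: str, urls: Iterable[str],
                  flags: Iterable[str] = (), **values) -> List[str]:
    """Assemble an argument list such as ['ncconsole.exe', '-mirror', ...]."""
    args = ["ncconsole.exe", f"-{mode}"]
    args += [f"-url={url}" for url in urls]                 # -url may be repeated
    args += [f"-{flag}" for flag in flags]                  # switches without a value
    args += [f"-{key}={val}" for key, val in values.items()]  # -key=value options
    return args


def load_command(project_file: str) -> List[str]:
    """-load cannot be combined with other options, so it stands alone."""
    return ["ncconsole.exe", f"-load={project_file}"]


def run(args: List[str]) -> int:
    """Start the console application and return its exit code unchanged."""
    return subprocess.run(args, check=False).returncode


# The mirror job from the example above, followed by the command to reload it.
first_run = build_command("mirror", ["http://www.myblogs.com"],
                          flags=["domainonly"],
                          maxpages=1000, levels=9,
                          save="myblogs.wrp", name="myblogs")
later_run = load_command("myblogs.wrp")
```

Passing the arguments as a list avoids quoting problems when a URL or file name contains spaces.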