Harvesting configuration

Adding an harvesting source

In order to add an harvesting source, you need to log into CKAN web interface as an administrator. You have to type the address for the harvesting page, since there is no direct link for it

http://YOUR_SITE/harvest
../_images/harvest1.png

Press the “Add Harvest Source” button and this page will open:

../_images/harvest2.png

In order to add properly a CSW source you’ll have to set:

  • The CSW URL endpoint
  • A title/name for this source, for your reference
  • The harvest type, i.e. “PBZ CSW Server” in “Source type”
  • The update frequency
  • The configuration in JSON format.
  • The owner Organization: all the dataset harvested from this source will be assigned to that Organization.

The JSON configuration allows these parameters:

  • cql: a CQL filter for harvesting only a subset of the records in the remote node
  • default_tags: all dataset harvested from this source will have these tags attached.
  • default_extras: all dataset harvested from this source will have these key/value pairs attached. The value may have tokens in the form {token}, where the allowed tokens are:
    • harvest_source_url
    • harvest_source_title
    • harvest_job_id
    • harvest_object_id
  • version: geonetwork version. Currently only 2.6, and 2.10 are accepted. This is needed to find out which services are available on the GeoNetowrk instance. For instance, in order to extract the categories associated to a metadata in version 2.6, the only way to find it is to get a MEF of the metadata and to parse the info.xml file packed into it.
  • private_datasets: If True sets as private the dataset. A dataset can be private if belongs to an organization Furthermore these other parameters will be read:
  • group_mapping: it’s a dict that associates category names to group names. If this parameter is defined, GN will be queried about the categories of the metadata, a matching group for each category will be searched in the provided mapping and will then be associated to the metadata if one or more are found
  • harvest_iso_categories: requires ‘group_mapping’ defined. If True allow to harvest metadata using ISO categories instead GeoNetwork internal categories.
  • ckan_locales_mapping: provides a mapping between CKAN locales and localed in CSW (used for the gmd:LocalisedCharacterString element).
  • default_license: with this property you can specify the default license to use for the CKAN’s dataset if none useLimitation has been found into the metadata. Below an example:

Note

cql filtering has been added with this commit.

Note

default_tags and default_extras will be available only if, when installing ckanext-spatial, these commits have been included manually:

https://github.com/ckan/ckanext-spatial/pull/58

Below the configuration used for the Provincia di Bolzano Harvester:

{
     "private_datasets": "False",
     "version": "2.6",
     "harvest_iso_categories": "True",
     "default_extras": {
          "geocat_layer_url": "http://geocatalogo.retecivica.bz.it/geokatalog/#!home&layer="
     },
     "organisation_mapping": [
          {"key": "pab-foreste" ,"value": "Ufficio 32.3 - Ufficio Pianificazione forestale"},
          {"key": "comune-bolzano-sistema-informativo-territoriale" ,"value": "Ufficio Sistema Informativo Territoriale, Comune di Bolzano"},
          {"key": "pab-agenzia-provinciale-per-l-ambiente" ,"value": "Ripartizione 29 - Agenzia provinciale per l'ambiente"},
          {"key": "pab-servizio-strade" ,"value": "Ripartizione 12 - Servizio Strade"},
          {"key": "pab-protezione-antincendi-e-civile" ,"value": "Ripartizione 26 - Protezione antincendi e civile"},
          {"key": "pab-natura-paesaggio-e-sviluppo-del-territorio" ,"value": "Ripartizione 28 - Natura, paesaggio e sviluppo del territorio"},
          {"key": "pab-opere-idrauliche" ,"value": "Ripartizione 30 - Opere idrauliche"},
          {"key": "pab-libro-fondiario-catasto-fondiario-e-urbano" ,"value": "Ripartizione 41 - Libro fondiario, catasto fondiario e urbano"},
          {"key": "pab-foreste" ,"value": "Ripartizione 32 - Foreste"},
          {"key": "comune-merano-urbanistica-ed-edilizia-privata" ,"value": "Ufficio urbanistica ed edilizia privata del comune di Merano"},
          {"key": "pab-agricoltura" ,"value": "Ripartizione 31 - agricoltura"},
          {"key": "pab-dipartimento-economia-innovazione-e-europa" ,"value": "Dipartimento Economia, Innovazione e Europa"},
          {"key": "pab-edilizia-e-servizio-tecnico" ,"value": "Ripartizione 11 - Edilizia e servizio tecnico"},
          {"key": "pab-beni-culturali" ,"value": "Ripartizione 13 - Beni culturali"},
          {"key": "pab-economia" ,"value": "Ripartizione 35 - Economia"},
          {"key": "eurac" ,"value": "Istituto di Telerilevamento Applicato, EURAC"},
          {"key": "pab-natura-paesaggio-e-sviluppo-del-territorio" ,"value": "Cartografia provinciale e coordinamento geodati - Ripartizione 28"},
          {"key": "provincia-autonoma-di-bolzano-alto-adige-astat" ,"value": "Istituto Provinciale di Statistica ASTAT"},
          {"key": "pab-agenzia-provinciale-per-l-ambiente" ,"value": "29. Agenzia provinciale per l'ambiente"}
     ],
     "group_mapping": {
          "farming": "farming",
          "utilitiesCommunication": "boundaries",
          "transportation": "boundaries",
          "inlandWaters": "environment",
          "geoscientificInformation": "geoscientificinformation",
          "environment": "environment",
          "climatologyMeteorologyAtmosphere": "climatologymeteorologyatmosphere",
          "planningCadastre": "boundaries",
          "imageryBaseMapsEarthCover": "boundaries",
          "elevation": "boundaries",
          "boundaries": "boundaries",
          "structure": "boundaries",
          "location": "boundaries",
          "economy": "economy",
          "society": "economy",
          "biota": "environment",
          "intelligenceMilitary": "boundaries",
          "oceans": "environment",
          "health": "health"
     },
     "ckan_locales_mapping":{
          "ita": "it",
          "ger": "de"
     },
     "default_license": "cc-zero"
}

Running the harvesters

In order to have an harvester to fetch the remote datasets, it should be run.

At each run

  • new datasets will be locally added,
  • remotely updated ones will be updated locally as well, and
  • old datasets that cannot be found anymore on the remote site will be locally deleted.

You can configure the update frequency in this dropdown menu in the harvesting configuration page.

../_images/ckan_harvest_scheduling.png

A background script (installed at paragraph Script harvesting) that runs every 15 minutes will look for any new reharvesting request and run the related harvester if needed.

The choice always will try and run the selected harvester every time the 15 minutes script runs. If the previous harvesting run is not finished yet, a new one will not be launched.

The harvesters with frequancy choices Daily, Weekly, … will be examined by the script run every 15 minutes. If the script finds that the last run was over a day ago (or one week, two weeks, etc, according to the choice), the harvester will be run again.

Harvester with the manual choice will never be run automatically: the harvester run can be triggered by pressing the “Reharvest” button in the harvester administration page: When the “reharvest” button is pressed, the harvesting will run within the next 15 minutes.

Frequency customizations

You may have noted that the harvesting script that has been configured during the CKAN installation have been slightly customized.

We added the line containing the harvester job-all, an additional command that will force harvesting from all the configured sources. This means that all harvesting sources will be handled as if they have the “Always” frequency set.

If you need some more fine-grained frequency, you should set up your own script, and tell cron to run it whenever you need.

Tip

For next instructions you may need some knowledge of the CKAN command line interface

E.g. Let’s say you want to run the Provincia di Bolzano harvester every days.

  1. Find the code of the harvest source you want to run at your own frequency

    Log into the hub machine as user ckan and issue the commands:

    . /usr/lib/ckan/default/bin/activate
    paster --plugin=ckanext-harvest harvester sources --config=/etc/ckan/default/production.ini
    

    Last command will run the sources command on the harvester plugin. You will get the list of the configured sources:

    [...]
    
    Source id: 874fefea-2378-44a3-8c63-3b1503e3d747
          url: http://geoservices.retecivica.bz.it/geonetwork/srv/ita/csw
         type: PBZ CSW Server
       active: True
    frequency: MANUAL
         jobs: 31
    
    There are 1 active harvest sources
    

    We see that the code for the Provincia di Bolzano harvester is 874fefea-2378-44a3-8c63-3b1503e3d747.

  2. Create a script for running your harvester

    Create a script /usr/lib/ckan/run_pbz_harvester.sh (in the same file system location of the previous one) with this content:

    . /usr/lib/ckan/default/bin/activate
    paster --plugin=ckanext-harvest harvester job 874fefea-2378-44a3-8c63-3b1503e3d747 --config=/etc/ckan/default/production.ini
    

    and make it executable

    chmod +x /usr/lib/ckan/run_pbz_harvester.sh
    
  3. Tell cron to run your script

    Now we have to tell cron to run the script every 6 hours. Open the crontab in editing mode with crontab -e and add the line

    0 8 * * * /usr/lib/ckan/run_pbz_harvester.sh
    
  4. Tell CKAN not to do anything else

    Set the Provincia di Bolzano harvester update frequency to “manual”, so that CKAN will not trigger the harvesting on its own. Now the harvesting will be run by the cronjob every day. Should you need an “extra” run of the harvester you may press the “reharvest” button whenever you like, and it will run within the next 15 minutes as usual.