Revision 1 (Herve Caumont, 2013-06-18 14:49) → Revision 2/4 (Herve Caumont, 2013-06-19 18:05)

h1. BEAM Arithm tutorial 

 

 {{>toc}} 

 

 h2. Introduction 

 

 BEAM is an open-source toolbox and development platform for viewing, analysing and processing of remote sensing raster data. Originally developed to facilitate the utilisation of image data from Envisat's optical instruments, BEAM now supports a growing number of other raster data formats such as GeoTIFF and NetCDF as well as data formats of other EO sensors such as MODIS, AVHRR, AVNIR, PRISM and CHRIS/Proba. Various data and algorithms are supported by dedicated extension plug-ins. 

 

 BEAM Graph Processing Tool (gpt) is a tool used to execute BEAM raster data operators in batch-mode. The operators can be used stand-alone or combined as a directed acyclic graph (DAG). Processing graphs are represented using XML. 

 

 Our tutorial uses the BandMaths operator and the Level 3 Binning Processor applied to Envisat MERIS Level 1 Reduced Resolution products to create an application to represent algal blooms. 

 

 > Definition (source Wikipedia): An algal bloom is a rapid increase or accumulation in the population of algae (typically microscopic) in an aquatic system. Algal blooms may occur in freshwater as well as marine environments. Typically, only one or a small number of phytoplankton species are involved, and some blooms may be recognized by discoloration of the water resulting from the high density of pigmented cells. Although there is no officially recognized threshold level, algae can be considered to be blooming at concentrations of hundreds to thousands of cells per milliliter, depending on the severity. Algal bloom concentrations may reach millions of cells per milliliter. Algal blooms are often green, but they can also be other colors such as yellow-brown or red, depending on the species of algae. 

 

 h2. The application 

 

 As introduced above, our application uses the *BandMaths* operator and the *Level 3 Binning* processor. 

 

 h3. The BandMaths Operator 

 

 The *BandMaths* operator can be used to create a product with multiple bands based on mathematical expressions. All products specified as sources must have the same width and height; otherwise the operator will fail. The geo-coding information and metadata for the target product are taken from the first source product.   

   

 In our application we will apply the mathematical expression below to all input MERIS Level 1 Reduced Resolution products to detect the algal blooms: 

 

 <pre> 
 
 l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570) 
 
 </pre> 
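 To make the nesting of the two conditional operators easier to follow, here is a minimal shell sketch that evaluates the same expression for a single pixel. It is only an illustration: the function name @algal_index@ and the sample radiances are made up, and @awk@ stands in for the BEAM expression evaluator. Read from left to right, an invalid pixel gives 0, a pixel with radiance_13 above 15 gives 0, and any other pixel gives 100 plus the amount by which radiance_9 exceeds a straight line drawn between radiance_8 and radiance_10.

```shell
#!/bin/bash
# Illustrative only: evaluates the tutorial expression for one pixel.
# Arguments: invalid flag (0 or 1), radiance_8, radiance_9, radiance_10, radiance_13
algal_index() {
  awk -v invalid="$1" -v r8="$2" -v r9="$3" -v r10="$4" -v r13="$5" 'BEGIN {
    if (invalid != 0)  value = 0    # l1_flags.INVALID ? 0 : ...
    else if (r13 > 15) value = 0    # radiance_13 > 15 ? 0 : ...
    else value = 100 + r9 - (r8 + (r10 - r8) * 27.524 / 72.570)
    printf "%.3f\n", value
  }'
}

algal_index 0 10 12 10 5    # valid pixel: prints 102.000
algal_index 0 10 12 10 20   # radiance_13 above 15: prints 0.000
algal_index 1 10 12 10 5    # flagged invalid: prints 0.000
```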

 

 h3. The Level 3 Binning Processor 

 

 The term binning refers to the process of distributing the contributions of Level 2 pixels in satellite coordinates to a fixed Level 3 grid using a geographic reference system. In most cases a sinusoidal projection is used to realize a Level 3 grid comprising a fixed number of equal area bins with global coverage. This is, for example, true for the SeaWiFS Level 3 products. 

 

 As long as the area of an input pixel is small compared to the area of a bin, a simple binning is sufficient. In this case, the geodetic center coordinate of the Level 2 pixel is used to find the bin in the Level 3 grid whose area is intersected by this point. If the area of the contributing pixel is equal to or even larger than the bin area, this simple binning will produce composites with insufficient accuracy, and visual artefacts such as Moiré effects will dominate the resulting datasets. 
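 The simple binning rule can be sketched in a few lines of shell. This is not the BEAM implementation: it assumes a SeaWiFS-style sinusoidal grid in which the globe is cut into latitude rows of equal height and each row holds a number of bins proportional to the cosine of its central latitude. The function name @find_bin@ and the grid size of 180 rows are hypothetical.

```shell
#!/bin/bash
# Illustrative only: map a pixel centre (lat, lon in degrees) to the row and
# column of the bin containing it, and print the number of bins in that row.
# Assumption: "rows" latitude rows of equal height; the number of bins in a
# row is 2 * rows * cos(central latitude), rounded to the nearest integer.
find_bin() {
  awk -v lat="$1" -v lon="$2" -v rows="$3" 'BEGIN {
    pi = atan2(0, -1)
    row = int((lat + 90) * rows / 180)
    if (row >= rows) row = rows - 1              # lat = +90 falls in the last row
    centre = -90 + (row + 0.5) * 180 / rows      # central latitude of the row
    bins = int(2 * rows * cos(centre * pi / 180) + 0.5)
    if (bins < 1) bins = 1
    col = int((lon + 180) * bins / 360)
    if (col >= bins) col = bins - 1              # lon = +180 falls in the last bin
    print row, col, bins
  }'
}

find_bin 0 0 180      # equator, Greenwich meridian: prints 90 180 360
find_bin 60 10 180    # higher latitude: fewer bins in the row
```

 Each row covers the same latitude span but holds fewer bins towards the poles, which is what keeps the bin areas roughly equal.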

 

 h3. The application workflow 

 

 Our application can be described as an activity diagram in which the BandMaths operator is applied to all input MERIS Level 1 products and its outputs are used as inputs to the Level 3 Binning processor. Since each BandMaths run is independent of the others, the MERIS Level 1 products can be processed in parallel. The Level 3 Binning processor instead needs all the outputs to increment the values of the bins and generate the Level 3 product. 
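 The shape of this workflow (many independent steps followed by a single aggregation) can be mimicked with plain shell background jobs. The sketch below has nothing to do with how CIOP or Hadoop schedule work; the functions @process_product@, @aggregate@ and @run_workflow@ are made-up stand-ins for the BandMaths step, the binning step and the workflow itself.

```shell
#!/bin/bash
# Illustrative only: two stand-in functions mimic the shape of the workflow.
# process_product plays the role of the BandMaths step (one call per product),
# aggregate plays the role of the binning step (one call over all outputs).
process_product() {
  local product=$1 outdir=$2
  echo "processed $product" > "$outdir/$product.out"
}

aggregate() {
  local outdir=$1
  # the aggregation sees every intermediate result at once
  cat "$outdir"/*.out | wc -l | tr -d ' '
}

run_workflow() {
  local workdir product
  workdir=$(mktemp -d)
  for product in "$@"; do
    process_product "$product" "$workdir" &   # independent, so run in background
  done
  wait                                        # barrier: all products must be done
  aggregate "$workdir"
  rm -rf "$workdir"
}

run_workflow product_a product_b product_c    # prints 3
```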

 

 h2. BeamArithm implementation 
 
 
 
 h3. Tutorial approach 
 
 
 
 The goal of this tutorial is to get you acquainted with CIOP as an environment to implement scientific applications.  
  
 The aim is to analyse the implemented application rather than have you install software, edit files, copy data, etc. All these steps have already been done! 
 
 
 
 h3. Tutorial requirements 
 
 
 
 You need access to a running sandbox. The procedure to start a sandbox is outside the scope of this tutorial. 
 
 
 
 h3. Tutorial files and artifacts installation on the sandbox 
 
 Log on to your sandbox. List the available tutorials with: 
 
 
 
 <pre> 
 
 [user@sb ~]$ ciop-tutorial list  
  
 </pre> 
 
 
 
 This will list the available tutorials: 
 
 
 
 <pre> 
 
 ... 
 
 beam-arithm 
 
 ... 
 
 </pre> 

 

 Get the tutorial description with: 

 

 <pre> 
 
 [user@sb ~]$ ciop-tutorial info    beam-arithm 
 
 </pre> 
 
 
 
 This displays the tutorial information: 
 
 
 
 <pre> 
 
 TBW 
 
 </pre> 
 
 
 
 Install the tutorial: 
 
 
 
 <pre> 
 
 [user@sb ~]$ ciop-tutorial install beam-arithm 
 
 </pre> 
 
 
 
 This will take a few minutes. Once the installation is concluded, you will get the BeamArithm application ready to run. 
 
 
 
 > Tip: check the [[ciop-tutorial]] CLI reference (UPCOMING) 

  

 h3. Execute the BeamArithm processing steps one by one 
 
 
 
 CIOP allows you to process the nodes of the workflow independently. 
 
 
 
 > It may sound obvious, but to run the second node of the workflow, the first node must have run successfully at least once. 
 
 
 
 List the nodes of the workflow: 
 
 
 
 <pre> 
 
 [user@sb ~]$ ciop-simjob -n 
 
 </pre> 
 
 
 
 This will output: 
 
 
 
 <pre> 
 
 node_expression 
 
 node_binning 
 
 </pre> 
 
 
 
 where *node_expression* is the _BandMaths operator_ and the *node_binning* is the _Level 3 Binning Processor_. 

 

 > Tip: check the [[ciop-simjob]] CLI reference  
 
  
 
 Execute the *node_expression* workflow node: 
 
 
 
 <pre> 
 
 [user@sb ~]$ ciop-simjob node_expression 
 
 </pre> 
 
 
 
 The CIOP framework will take the MERIS Level 1 products and execute the BEAM BandMaths operator taking advantage of the Hadoop Map/Reduce cluster. 
 
 The output of the command above provides you with a tracking URL. Open it in your favorite browser.  
 
  
 
 > Tip: if you execute a node more than once, do not forget to use the flag -f to remove the results of the previous execution 
 
 
 
 After a few minutes, the generated outputs will be listed as HDFS resources. You can inspect the results in the HDFS mount point of your sandbox with: 
 
 
 
 <pre> 
 
 [user@sb ~]$ ls -l /share/tmp/sandbox/node_expression/data 
 
 </pre> 
 
 
 
 > Tip: remember that CIOP relies on the Hadoop HDFS distributed storage to manage input and output data. More information about the sandbox: [[Understanding the sandbox]] 
 
 
 
 Now, execute the *node_binning* node: 
 
 
 
 <pre> 
 
 [user@sb ~]$ ciop-simjob node_binning 
 
 </pre> 
 
 
 
 As with *node_expression*, the list of generated products is shown after a few minutes. 

 

 h3. Execute the BeamArithm processing workflow 
 
 
 
 While executing single nodes can be very practical for debugging the application's individual processing steps, CIOP also allows you to process the entire workflow automatically. To do so, run the command: 
 
 
 
 <pre> 
 
 [user@sb ~]$ ciop-simwf 
 
 </pre> 

 

 This will display an ASCII status output of the application workflow execution.  

  

 > Tip: check the [[ciop-simwf]] CLI reference  

  

 > Tip: each workflow run has a unique identifier and the results of a run are never overwritten when executing the workflow again. 

 

 After a few minutes, the same outputs are generated and available as HDFS resources. As with the single node execution, these resources can be accessed on the HDFS mount point. 
 
 To do so, you need the run identifier. Obtain it with the command: 

 

 <pre> 
 
 [user@sb ~]$ ciop-simwf -l 
 
 </pre>  

  

 You should have a single run identifier. Use it to list the generated results: 

 

 <pre> 
 
 [user@sb ~]$ ls -l /share/tmp/sandbox/run/<run identifier>/node_binning/data 
 
 </pre> 

 

 This will list the same results as the single *node_binning* execution. 

 

 Now that you have seen CIOP manage the BeamArithm application, we will go through all files composing the application. 

 

 h2. BeamArithm CIOP application breakdown 

 

 h3. The application descriptor file: application.xml 

 

 Each CIOP application is described with an application descriptor file. This file is always named application.xml and is found in the /application file system in your sandbox. 

 

 This file contains two main sections:  
  
 * a section where the application job templates are described 
 
 * a section with the workflow definition combining the job templates as a DAG 

 

 > Tip: check the DAG definition here: [[CIOP terminology and definitions]] 

 

 > Tip: learn about the application descriptor file here: [[Understanding the sandbox]] 
 
 
 
 h4. Job templates 

 

 The listing below shows the job templates section of the application descriptor file. 

 

 <pre><code class="xml"> 
 
 <jobTemplates> 
		 
		 <!-- BEAM BandMaths operator job template    --> 
		 
		 <jobTemplate id="expression"> 
			 
			 <streamingExecutable>/application/expression/run</streamingExecutable> 
			 
			 <defaultParameters> 						
				 						
				 <parameter id="expression">l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570)</parameter> 
			 
			 </defaultParameters> 
		 
		 </jobTemplate> 
		 
		 <!-- BEAM Level 3 processor job template    --> 
		 
		 <jobTemplate id="binning"> 
			 
			 <streamingExecutable>/application/binning/run</streamingExecutable> 
			 
			 <defaultParameters> 						
				 						
				 <parameter id="cellsize">9.28</parameter> 
				 
				 <parameter id="bandname">out</parameter> 
				 
				 <parameter id="bitmask">l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570)</parameter> 
				 
				 <parameter id="bbox">-180,-90,180,90</parameter> 
				 
				 <parameter id="algorithm">Minimum/Maximum</parameter> 
				 
				 <parameter id="outputname">binned</parameter> 
				 
				 <parameter id="resampling">binning</parameter> 
				 
				 <parameter id="palette">#MCI_Palette 
 
 color0=0,0,0 
 
 color1=0,0,154 
 
 color2=54,99,250 
 
 color3=110,201,136 
 
 color4=166,245,8 
 
 color5=222,224,0 
 
 color6=234,136,0 
 
 color7=245,47,0 
 
 color8=255,255,255 
 
 numPoints=9 
 
 sample0=98.19878118960284 
 
 sample1=98.64947122314665 
 
 sample2=99.10016125669047 
 
 sample3=99.5508512902343 
 
 sample4=100.0015413237781 
 
 sample5=100.4522313573219 
 
 sample6=100.90292139086574 
 
 sample7=101.35361142440956 
 
 sample8=101.80430145795337</parameter> 
				 
				 <parameter id="band">1</parameter> 
				 
				 <parameter id="tailor">true</parameter> 
			 
			 </defaultParameters> 
			 
			 <defaultJobconf> 
		        	 
		        	 <property id="ciop.job.max.tasks">1</property> 
		         
		         </defaultJobconf> 
		 
		 </jobTemplate> 
	 
	 </jobTemplates> 
 
 </code></pre> 

 

 > Tip: check the validity of the application descriptor file with [[ciop-appcheck]] 

 

 > Tip: learn more about the application descriptor file here: [[Understanding the sandbox]] 

 

 Each *job template* has a mandatory element that defines the streaming executable. 

 

 Example: the streaming executable for the job template *expression* is: 

 

 <pre><code class="xml"> 
 
 <streamingExecutable>/application/expression/run</streamingExecutable> 
 
 </code></pre> 

 

 > Tip: do not forget to _chmod_ the streaming executable with executable rights, e.g. @chmod 755 /application/expression/run@ 
 
 Both job templates, expression and binning, define a set of default parameters. 

 

 Example: the job template expression defines a default expression for the parameter *expression*: 

 

 <pre><code class="xml"> 
 
 <defaultParameters> 						
	 						
	 <parameter id="expression">l1_flags.INVALID?0:radiance_13>15?0:100+radiance_9-(radiance_8+(radiance_10-radiance_8)*27.524/72.570)</parameter> 
 
 </defaultParameters> 
 
 </code></pre> 

 

 The job template *binning* defines the default job configuration. 

 

 As explained above, the job template *binning* does a temporal and spatial aggregation of the *expression* job outputs. The *binning* job will thus be a single job instance (the *expression* job instead exploits the parallelism offered by CIOP). 
 
 To express such a job configuration we've added the XML tags: 

 

 <pre><code class="xml"> 
 
 <defaultJobconf> 
	 
	 <property id="ciop.job.max.tasks">1</property> 
 
 </defaultJobconf> 
 
 </code></pre> 

 

 > Tip: for a list of possible job default properties read [[Application descriptor]] 

 

 h4. Streaming executable for expression job template 

 

 It is important to keep in mind that job input and output are text references (e.g. to data).  
  
 Indeed, when a job processes its input, it actually reads those references line by line: the child job reads via _stdin_ the outputs of the parent job (or any other source, e.g. catalogued data). 

 

 > Tip: if you need to combine results of a parent job with other values (e.g. bounding boxes to process over a set of input products) you will have to add a simple job that combines the outputs and values 
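 The line-by-line contract can be tried outside CIOP with a few lines of shell. The sketch below is only an illustration of the idea: the function @handle_references@, the sample paths and the @published/@ prefix are made up, and no CIOP command is involved.

```shell
#!/bin/bash
# Illustrative only: a job reads one reference per line on stdin and writes
# one reference per line on stdout, which becomes the input of the next job.
# The reference strings and the "published/" prefix are made up.
handle_references() {
  local reference
  while IFS= read -r reference; do
    [ -z "$reference" ] && continue          # ignore empty lines
    # ... real work on the referenced data would happen here ...
    echo "published/$(basename "$reference").tgz"
  done
}

# the output of a parent job piped into a child job
printf '%s\n' /data/product_1.N1 /data/product_2.N1 | handle_references
```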

 

 The job template *expression* streaming executable is a Bourne Again SHell (bash) script: 

 

 <pre> 
 
 #!/bin/bash 

 

 # source the ciop functions (e.g. ciop-log) 
 
 source ${ciop_job_include} 

 

 export BEAM_HOME=$_CIOP_APPLICATION_PATH/share/beam-4.11 
 
 export PATH=$BEAM_HOME/bin:$PATH 

 

 # define the exit codes 
 
 SUCCESS=0 
 
 ERR_NOINPUT=1 
 
 ERR_BEAM=2 
 
 ERR_NOPARAMS=5 

 

 # add a trap to exit gracefully 
 
 function cleanExit () 
 
 { 
    
    local retval=$? 
    
    local msg="" 
    
    case "$retval" in 
      
      $SUCCESS)        msg="Processing successfully concluded";; 
      
      $ERR_NOPARAMS) msg="Expression not defined";; 
      
      $ERR_BEAM)      msg="Beam failed to process product $inputfile (Java returned $res).";; 
      
      *)               msg="Unknown error";; 
    
    esac 
    
    [ "$retval" != "0" ] && ciop-log "ERROR" "Error $retval - $msg, processing aborted" || ciop-log "INFO" "$msg" 
    
    exit $retval 
 
 } 
 
 trap cleanExit EXIT 

 

 # create the output folder to store the output products 
 
 mkdir -p $TMPDIR/output 
 
 export OUTPUTDIR=$TMPDIR/output 

 

 # retrieve the parameters value from workflow or job default value 
 
 expression="`ciop-getparam expression`" 

 

 # run a check on the expression value, it can't be empty 
 
 [ -z "$expression" ] && exit $ERR_NOPARAMS 


 


 # loop and process all MERIS products 
 
 while read inputfile  
  
 do 
	 
	 # report activity in log 
	 
	 ciop-log "INFO" "Retrieving $inputfile from storage" 

	 

	 # retrieve the remote MERIS product to the local temporary folder 
	 
	 retrieved=`ciop-copy -o $TMPDIR $inputfile` 
	
	 
	
	 # check if the file was retrieved 
	 
	 [ "$?" == "0" -a -e "$retrieved" ] || exit $ERR_NOINPUT 
	
	 
	
	 # report activity 
	 
	 ciop-log "INFO" "Retrieved `basename $retrieved`, moving on to expression" 
	 
	 outputname=`basename $retrieved` 

	 

	 BEAM_REQUEST=$TMPDIR/beam_request.xml 
 
 cat << EOF > $BEAM_REQUEST 
 
 <?xml version="1.0" encoding="UTF-8"?> 
 
 <graph> 
   
   <version>1.0</version> 
   
   <node id="1"> 
     
     <operator>Read</operator> 
       
       <parameters> 
         
         <file>$retrieved</file> 
       
       </parameters> 
   
   </node> 
   
   <node id="2"> 
     
     <operator>BandMaths</operator> 
     
     <sources> 
       
       <source>1</source> 
     
     </sources> 
     
     <parameters> 
       
       <targetBands> 
         
         <targetBand> 
           
           <name>out</name> 
           
           <expression>$expression</expression> 
           
           <description>Processed Band</description> 
           
           <type>float32</type> 
         
         </targetBand> 
       
       </targetBands> 
     
     </parameters> 
   
   </node> 
   
   <node id="write"> 
     
     <operator>Write</operator> 
     
     <sources> 
        
        <source>2</source> 
     
     </sources> 
     
     <parameters> 
       
       <file>$OUTPUTDIR/$outputname</file> 
    
    </parameters> 
   
   </node> 
 
 </graph> 
 
 EOF 
    
    gpt.sh $BEAM_REQUEST &> /dev/null 
    
    res=$? 
    
    [ $res != 0 ] && exit $ERR_BEAM 

	 

	 cd $OUTPUTDIR 
	
	 
	
	 outputname=`echo $(basename $retrieved)`.dim 
	 
	 outputfolder=`echo $(basename $retrieved)`.data 

	 

	 tar cfz $outputname.tgz $outputname $outputfolder &> /dev/null 
	 
	 cd - &> /dev/null 
	
	 
	
	 ciop-log "INFO" "Publishing $outputname and $outputfolder" 
	 
	 ciop-publish $OUTPUTDIR/$outputname.tgz 
	
	 	
	
	 # cleanup 
	 
	 rm -fr $retrieved $OUTPUTDIR/$outputname $OUTPUTDIR/$outputfolder $OUTPUTDIR/$outputname.tgz  

  

 done 

 

 exit 0 
 
 </pre> 

 

 The first line tells Linux to use the bash interpreter to run this script.  

  

 > Tip: always set the interpreter, there is no other way to tell CIOP how to execute the streaming executable 

 

 The next block is mandatory, as it loads the CIOP functions (ciop-log, ciop-getparam, etc.) needed to write the streaming executable script: 

 

 <pre> 
 
 # source the ciop functions (e.g. ciop-log) 
 
 source ${ciop_job_include} 
 
 </pre> 

 

 After that, we set a few environment variables needed to have BEAM working: 

 

 <pre> 
 
 export BEAM_HOME=$_CIOP_APPLICATION_PATH/share/beam-4.11 
 
 export PATH=$BEAM_HOME/bin:$PATH 
 
 </pre> 

 

 After that, we set up the error handling. Although this block is not mandatory, it is good practice to set clear error codes and use a _trap_ function: 

 

 <pre> 
 
 # define the exit codes 
 
 SUCCESS=0 
 
 ERR_NOINPUT=1 
 
 ERR_BEAM=2 
 
 ERR_NOPARAMS=5 

 

 # add a trap to exit gracefully 
 
 function cleanExit () 
 
 { 
    
    local retval=$? 
    
    local msg="" 
    
    case "$retval" in 
      
      $SUCCESS)        msg="Processing successfully concluded";; 
      
      $ERR_NOPARAMS) msg="Expression not defined";; 
      
      $ERR_BEAM)      msg="Beam failed to process product $inputfile (Java returned $res).";; 
      
      *)               msg="Unknown error";; 
    
    esac 
    
    [ "$retval" != "0" ] && ciop-log "ERROR" "Error $retval - $msg, processing aborted" || ciop-log "INFO" "$msg" 
    
    exit $retval 
 
 } 
 
 trap cleanExit EXIT 
 
 </pre> 
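 The same exit-code idea can be tried outside CIOP. The sketch below is an illustration under stated assumptions: @echo@ stands in for ciop-log, the helper names @exit_message@ and @run_with_report@ are hypothetical, and the message for ERR_NOINPUT is invented for the example.

```shell
#!/bin/bash
# Illustrative only: exit codes plus an EXIT trap, with echo standing in
# for ciop-log so that the sketch runs outside CIOP.
SUCCESS=0
ERR_NOINPUT=1
ERR_BEAM=2
ERR_NOPARAMS=5

# translate an exit code into a message (the ERR_NOINPUT text is an assumption)
exit_message() {
  case "$1" in
    "$SUCCESS")      echo "Processing successfully concluded" ;;
    "$ERR_NOINPUT")  echo "Input product could not be retrieved" ;;
    "$ERR_BEAM")     echo "BEAM failed to process the product" ;;
    "$ERR_NOPARAMS") echo "Expression not defined" ;;
    *)               echo "Unknown error" ;;
  esac
}

# run a command in a subshell whose EXIT trap reports how it ended
run_with_report() (
  trap 'code=$?; echo "exit $code: $(exit_message "$code")" >&2; exit "$code"' EXIT
  "$@"
  exit "$?"
)

run_with_report true                       # reports exit 0 on stderr
run_with_report sh -c 'exit 5' || true     # reports exit 5 on stderr
```

 Because the trap is attached to EXIT, the report is written whichever @exit@ statement ends the run, and the original exit code is passed on to the caller.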

 

 The CIOP framework provides a temporary location unique to the job/parameter execution (very important if more than one processing node is used).  
  
 In it, we'll define where our results will be written: 

 

 <pre> 
 
 # create the output folder to store the output products 
 
 mkdir -p $TMPDIR/output 
 
 export OUTPUTDIR=$TMPDIR/output 
 
 </pre> 

 

 Then, we read the processing parameters using ciop-getparam and do a simple check on the value (it cannot be empty): 

 

 <pre> 
 
 # retrieve the parameters value from workflow or job default value 
 
 expression="`ciop-getparam expression`" 

 

 # run a check on the expression value, it can't be empty 
 
 [ -z "$expression" ] && exit $ERR_NOPARAMS 
 
 </pre> 

 

 At this point we loop over the input MERIS Level 1 products and copy them locally to the TMPDIR location: 

 

 <pre> 
 
 # loop and process all MERIS products 
 
 while read inputfile  
  
 do 
	 
	 # report activity in log 
	 
	 ciop-log "INFO" "Retrieving $inputfile from storage" 

	 

	 # retrieve the remote MERIS product to the local temporary folder 
	 
	 retrieved=`ciop-copy -o $TMPDIR $inputfile` 
	
	 
	
	 # check if the file was retrieved 
	 
	 [ "$?" == "0" -a -e "$retrieved" ] || exit $ERR_NOINPUT 
	
	 
	
	 ... 
 
 done 	
 	
 </pre> 
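 The retrieve-then-check pattern can be exercised without CIOP. In the sketch below @cp@ stands in for ciop-copy, and the functions @fetch_to@ and @retrieve_all@ are hypothetical; it is one possible way to write the check, not the way the tutorial script does it.

```shell
#!/bin/bash
# Illustrative only: cp stands in for ciop-copy so the sketch runs anywhere.
# fetch_to copies a file into a directory and prints the local path.
fetch_to() {
  local destination=$1 source=$2
  cp "$source" "$destination/" 2>/dev/null || return 1
  echo "$destination/$(basename "$source")"
}

# read references from stdin, stop at the first one that cannot be retrieved
retrieve_all() {
  local workdir=$1 inputfile retrieved
  while IFS= read -r inputfile; do
    # test the status of the assignment directly, then check the file exists
    if ! retrieved=$(fetch_to "$workdir" "$inputfile") || [ ! -e "$retrieved" ]; then
      echo "could not retrieve $inputfile" >&2
      return 1
    fi
    echo "retrieved $(basename "$retrieved")"
  done
}

# usage with a fake product in a temporary folder
source_dir=$(mktemp -d)
work_dir=$(mktemp -d)
echo "data" > "$source_dir/product.N1"
printf '%s\n' "$source_dir/product.N1" | retrieve_all "$work_dir"
rm -rf "$source_dir" "$work_dir"
```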

 

 > Tip: always report activity using ciop-log; if you don't, CIOP will kill the process once the walltime is reached 

 

 We finally apply the BandMaths operator to the retrieved MERIS Level 1 product: 

 

 <pre> 
 
 # loop and process all MERIS products 
 
 while read inputfile  
  
 do 
	 
	 ... 

	 

	 # report activity 
	 
	 ciop-log "INFO" "Retrieved `basename $retrieved`, moving on to expression" 
	 
	 outputname=`basename $retrieved` 

	 

	 BEAM_REQUEST=$TMPDIR/beam_request.xml 
 
 cat << EOF > $BEAM_REQUEST 
 
 <?xml version="1.0" encoding="UTF-8"?> 
 
 <graph> 
   
   <version>1.0</version> 
   
   <node id="1"> 
     
     <operator>Read</operator> 
       
       <parameters> 
         
         <file>$retrieved</file> 
       
       </parameters> 
   
   </node> 
   
   <node id="2"> 
     
     <operator>BandMaths</operator> 
     
     <sources> 
       
       <source>1</source> 
     
     </sources> 
     
     <parameters> 
       
       <targetBands> 
         
         <targetBand> 
           
           <name>out</name> 
           
           <expression>$expression</expression> 
           
           <description>Processed Band</description> 
           
           <type>float32</type> 
         
         </targetBand> 
       
       </targetBands> 
     
     </parameters> 
   
   </node> 
   
   <node id="write"> 
     
     <operator>Write</operator> 
     
     <sources> 
        
        <source>2</source> 
     
     </sources> 
     
     <parameters> 
       
       <file>$OUTPUTDIR/$outputname</file> 
    
    </parameters> 
   
   </node> 
 
 </graph> 
 
 EOF 
    
    gpt.sh $BEAM_REQUEST &> /dev/null 
    
    res=$? 
    
    [ $res != 0 ] && exit $ERR_BEAM 

    

    ... 
   
 
   
 done 
 
 </pre> 
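 Since the graph file is written with a here-document, the value of the expression is pasted into the XML as-is. The default expression only contains @>@, which XML tolerates in element content, but an expression containing @<@ or @&@ would no longer give well-formed XML. A minimal sketch of one way to guard against that follows; the helpers @xml_escape@ and @write_expression_element@ and the sample expression are made up, and only a fragment of a graph file is written.

```shell
#!/bin/bash
# Illustrative only: escape the characters that are special in XML text
# before pasting a value into a here-document.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# write a fragment of a graph file with the escaped expression
write_expression_element() {
  local expression=$1 target=$2
  cat << EOF > "$target"
<expression>$(xml_escape "$expression")</expression>
EOF
}

fragment=$(mktemp)
write_expression_element 'radiance_13 < 15 && radiance_9 > 0' "$fragment"
cat "$fragment"
rm -f "$fragment"
```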

 

 At this stage the produced results are packaged and published to the CIOP distributed filesystem with ciop-publish, which makes them available to the *binning* job: 

 

 <pre> 
 
 # loop and process all MERIS products 
 
 while read inputfile  
  
 do 
	 
	 ... 
	
	 
	
	 cd $OUTPUTDIR 
	
	 
	
	 outputname=`echo $(basename $retrieved)`.dim 
	 
	 outputfolder=`echo $(basename $retrieved)`.data 

	 

	 tar cfz $outputname.tgz $outputname $outputfolder &> /dev/null 
	 
	 cd - &> /dev/null 
	
	 
	
	 ciop-log "INFO" "Publishing $outputname and $outputfolder" 
	 
	 ciop-publish $OUTPUTDIR/$outputname.tgz 
	
	 	
	
	 # cleanup 
	 
	 rm -fr $retrieved $OUTPUTDIR/$outputname $OUTPUTDIR/$outputfolder $OUTPUTDIR/$outputname.tgz  
  
 done 
 
 </pre> 
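 The packaging step can also be written without changing directory, using the @-C@ option of tar. The sketch below is an alternative under stated assumptions, not the code installed with the tutorial: the function @package_product@ and the fake product files are made up, and the publishing itself is left out.

```shell
#!/bin/bash
# Illustrative only: package a .dim file and its .data folder into one archive
# without changing directory; the names and paths are made up for the example.
package_product() {
  local outputdir=$1 name=$2
  tar -czf "$outputdir/$name.tgz" -C "$outputdir" "$name.dim" "$name.data" || return 1
  echo "$outputdir/$name.tgz"      # print the archive path, ready to be published
}

# build a fake product, package it and list the archive content
workdir=$(mktemp -d)
mkdir "$workdir/product.data"
echo "header" > "$workdir/product.dim"
echo "pixels" > "$workdir/product.data/out.img"
archive=$(package_product "$workdir" product)
tar -tzf "$archive"
rm -rf "$workdir"
```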

 

 > Tip: ciop-publish does more than a simple copy of data: it also "echoes" the destination URL, and these strings are used as input for the *binning* job 


 


 This concludes the tutorial.