Wednesday, August 31, 2011

Hello World With Java & Pentaho

This is a simple Java program that loads the transformation "first_transformation.ktr" and executes it.



Create a simple Test.java file with the following content, then compile and run it:
 
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

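/**
 * Runs the Kettle transformation stored in first_transformation.ktr.
 */
public class Test
{
    public static void main( String[] args )
    {
        try {
            // Initialize the Kettle environment; call this once before using the API
            KettleEnvironment.init();
            // Load the transformation definition (path is relative to the working directory)
            TransMeta metaData = new TransMeta("first_transformation.ktr");
            Trans trans = new Trans( metaData );
            // Start the transformation; null means no command-line arguments
            trans.execute( null );
            // Block until all steps have finished
            trans.waitUntilFinished();
            if ( trans.getErrors() > 0 ) {
                System.err.println( "Error executing transformation" );
            }
        } catch( KettleException e ) {
            e.printStackTrace();
        }
    }
}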


This is the simple transformation, created using the Spoon tool.
filename: first_transformation.ktr

<?xml version="1.0" encoding="UTF-8"?>
<transformation>
  <info>
    <name>first_transformation</name>
    <description/>
    <extended_description/>
    <trans_version/>
    <trans_type>Normal</trans_type>
    <directory>&#47;</directory>
    <parameters>
    </parameters>
    <log>
<trans-log-table><connection/>
<schema/>
<table/>
<size_limit_lines/>
<interval/>
<timeout_days/>
<field><id>ID_BATCH</id><enabled>Y</enabled><name>ID_BATCH</name></field><field><id>CHANNEL_ID</id><enabled>Y</enabled><name>CHANNEL_ID</name></field><field><id>TRANSNAME</id><enabled>Y</enabled><name>TRANSNAME</name></field><field><id>STATUS</id><enabled>Y</enabled><name>STATUS</name></field><field><id>LINES_READ</id><enabled>Y</enabled><name>LINES_READ</name><subject/></field><field><id>LINES_WRITTEN</id><enabled>Y</enabled><name>LINES_WRITTEN</name><subject/></field><field><id>LINES_UPDATED</id><enabled>Y</enabled><name>LINES_UPDATED</name><subject/></field><field><id>LINES_INPUT</id><enabled>Y</enabled><name>LINES_INPUT</name><subject/></field><field><id>LINES_OUTPUT</id><enabled>Y</enabled><name>LINES_OUTPUT</name><subject/></field><field><id>LINES_REJECTED</id><enabled>Y</enabled><name>LINES_REJECTED</name><subject/></field><field><id>ERRORS</id><enabled>Y</enabled><name>ERRORS</name></field><field><id>STARTDATE</id><enabled>Y</enabled><name>STARTDATE</name></field><field><id>ENDDATE</id><enabled>Y</enabled><name>ENDDATE</name></field><field><id>LOGDATE</id><enabled>Y</enabled><name>LOGDATE</name></field><field><id>DEPDATE</id><enabled>Y</enabled><name>DEPDATE</name></field><field><id>REPLAYDATE</id><enabled>Y</enabled><name>REPLAYDATE</name></field><field><id>LOG_FIELD</id><enabled>Y</enabled><name>LOG_FIELD</name></field></trans-log-table>
<perf-log-table><connection/>
<schema/>
<table/>
<interval/>
<timeout_days/>
<field><id>ID_BATCH</id><enabled>Y</enabled><name>ID_BATCH</name></field><field><id>SEQ_NR</id><enabled>Y</enabled><name>SEQ_NR</name></field><field><id>LOGDATE</id><enabled>Y</enabled><name>LOGDATE</name></field><field><id>TRANSNAME</id><enabled>Y</enabled><name>TRANSNAME</name></field><field><id>STEPNAME</id><enabled>Y</enabled><name>STEPNAME</name></field><field><id>STEP_COPY</id><enabled>Y</enabled><name>STEP_COPY</name></field><field><id>LINES_READ</id><enabled>Y</enabled><name>LINES_READ</name></field><field><id>LINES_WRITTEN</id><enabled>Y</enabled><name>LINES_WRITTEN</name></field><field><id>LINES_UPDATED</id><enabled>Y</enabled><name>LINES_UPDATED</name></field><field><id>LINES_INPUT</id><enabled>Y</enabled><name>LINES_INPUT</name></field><field><id>LINES_OUTPUT</id><enabled>Y</enabled><name>LINES_OUTPUT</name></field><field><id>LINES_REJECTED</id><enabled>Y</enabled><name>LINES_REJECTED</name></field><field><id>ERRORS</id><enabled>Y</enabled><name>ERRORS</name></field><field><id>INPUT_BUFFER_ROWS</id><enabled>Y</enabled><name>INPUT_BUFFER_ROWS</name></field><field><id>OUTPUT_BUFFER_ROWS</id><enabled>Y</enabled><name>OUTPUT_BUFFER_ROWS</name></field></perf-log-table>
<channel-log-table><connection/>
<schema/>
<table/>
<timeout_days/>
<field><id>ID_BATCH</id><enabled>Y</enabled><name>ID_BATCH</name></field><field><id>CHANNEL_ID</id><enabled>Y</enabled><name>CHANNEL_ID</name></field><field><id>LOG_DATE</id><enabled>Y</enabled><name>LOG_DATE</name></field><field><id>LOGGING_OBJECT_TYPE</id><enabled>Y</enabled><name>LOGGING_OBJECT_TYPE</name></field><field><id>OBJECT_NAME</id><enabled>Y</enabled><name>OBJECT_NAME</name></field><field><id>OBJECT_COPY</id><enabled>Y</enabled><name>OBJECT_COPY</name></field><field><id>REPOSITORY_DIRECTORY</id><enabled>Y</enabled><name>REPOSITORY_DIRECTORY</name></field><field><id>FILENAME</id><enabled>Y</enabled><name>FILENAME</name></field><field><id>OBJECT_ID</id><enabled>Y</enabled><name>OBJECT_ID</name></field><field><id>OBJECT_REVISION</id><enabled>Y</enabled><name>OBJECT_REVISION</name></field><field><id>PARENT_CHANNEL_ID</id><enabled>Y</enabled><name>PARENT_CHANNEL_ID</name></field><field><id>ROOT_CHANNEL_ID</id><enabled>Y</enabled><name>ROOT_CHANNEL_ID</name></field></channel-log-table>
<step-log-table><connection/>
<schema/>
<table/>
<timeout_days/>
<field><id>ID_BATCH</id><enabled>Y</enabled><name>ID_BATCH</name></field><field><id>CHANNEL_ID</id><enabled>Y</enabled><name>CHANNEL_ID</name></field><field><id>LOG_DATE</id><enabled>Y</enabled><name>LOG_DATE</name></field><field><id>TRANSNAME</id><enabled>Y</enabled><name>TRANSNAME</name></field><field><id>STEPNAME</id><enabled>Y</enabled><name>STEPNAME</name></field><field><id>STEP_COPY</id><enabled>Y</enabled><name>STEP_COPY</name></field><field><id>LINES_READ</id><enabled>Y</enabled><name>LINES_READ</name></field><field><id>LINES_WRITTEN</id><enabled>Y</enabled><name>LINES_WRITTEN</name></field><field><id>LINES_UPDATED</id><enabled>Y</enabled><name>LINES_UPDATED</name></field><field><id>LINES_INPUT</id><enabled>Y</enabled><name>LINES_INPUT</name></field><field><id>LINES_OUTPUT</id><enabled>Y</enabled><name>LINES_OUTPUT</name></field><field><id>LINES_REJECTED</id><enabled>Y</enabled><name>LINES_REJECTED</name></field><field><id>ERRORS</id><enabled>Y</enabled><name>ERRORS</name></field><field><id>LOG_FIELD</id><enabled>N</enabled><name>LOG_FIELD</name></field></step-log-table>
    </log>
    <maxdate>
      <connection/>
      <table/>
      <field/>
      <offset>0.0</offset>
      <maxdiff>0.0</maxdiff>
    </maxdate>
    <size_rowset>10000</size_rowset>
    <sleep_time_empty>50</sleep_time_empty>
    <sleep_time_full>50</sleep_time_full>
    <unique_connections>N</unique_connections>
    <feedback_shown>Y</feedback_shown>
    <feedback_size>50000</feedback_size>
    <using_thread_priorities>Y</using_thread_priorities>
    <shared_objects_file/>
    <capture_step_performance>N</capture_step_performance>
    <step_performance_capturing_delay>1000</step_performance_capturing_delay>
    <step_performance_capturing_size_limit>100</step_performance_capturing_size_limit>
    <dependencies>
    </dependencies>
    <partitionschemas>
    </partitionschemas>
    <slaveservers>
    </slaveservers>
    <clusterschemas>
    </clusterschemas>
  <modified_user>-</modified_user>
  <modified_date>2011&#47;08&#47;31 19:03:08.937</modified_date>
  </info>
  <notepads>
  </notepads>
  <order>
  <hop> <from>Generate Rows</from><to>Write to log</to><enabled>Y</enabled> </hop>  </order>
  <step>
    <name>Generate Rows</name>
    <type>RowGenerator</type>
    <description/>
    <distribute>Y</distribute>
    <copies>1</copies>
         <partitioning>
           <method>none</method>
           <schema_name/>
           </partitioning>
    <fields>
      <field>
        <name>Test</name>
        <type>String</type>
        <format/>
        <currency/>
        <decimal/>
        <group/>
        <nullif>Hello World!</nullif>
        <length>-1</length>
        <precision>-1</precision>
      </field>
    </fields>
    <limit>10</limit>
     <cluster_schema/>
 <remotesteps>   <input>   </input>   <output>   </output> </remotesteps>    <GUI>
      <xloc>123</xloc>
      <yloc>213</yloc>
      <draw>Y</draw>
      </GUI>
    </step>

  <step>
    <name>Write to log</name>
    <type>WriteToLog</type>
    <description/>
    <distribute>Y</distribute>
    <copies>1</copies>
         <partitioning>
           <method>none</method>
           <schema_name/>
           </partitioning>
      <loglevel>log_level_basic</loglevel>
      <displayHeader>Y</displayHeader>
    <fields>
      <field>
        <name>Test</name>
        </field>
      </fields>
     <cluster_schema/>
 <remotesteps>   <input>   </input>   <output>   </output> </remotesteps>    <GUI>
      <xloc>331</xloc>
      <yloc>212</yloc>
      <draw>Y</draw>
      </GUI>
    </step>

  <step_error_handling>
  </step_error_handling>
   <slave-step-copy-partition-distribution>
</slave-step-copy-partition-distribution>
   <slave_transformation>N</slave_transformation>
</transformation>
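
Since the .ktr file is plain XML, one quick way to see which steps and hops it contains, without starting Kettle, is to read it with the JDK's own XML parser. The sketch below is only an illustration: KtrSummary and its methods are made-up names, not part of the Kettle API, and it assumes the layout shown above (step elements directly under transformation, hop elements under order).

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Kettle: inspects a .ktr file with the JDK XML parser.
final class KtrSummary {

    // Parses the XML; failures are rethrown unchecked to keep callers short.
    static Document parse(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse transformation XML", e);
        }
    }

    // Text of the first direct child element with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n instanceof Element && ((Element) n).getTagName().equals(tag)) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    // "name (type)" for each <step> directly under the root element.
    static List<String> steps(Document doc) {
        List<String> result = new ArrayList<String>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n instanceof Element && ((Element) n).getTagName().equals("step")) {
                Element step = (Element) n;
                result.add(childText(step, "name") + " (" + childText(step, "type") + ")");
            }
        }
        return result;
    }

    // "from -> to" for each <hop> element in the document.
    static List<String> hops(Document doc) {
        List<String> result = new ArrayList<String>();
        NodeList hopNodes = doc.getElementsByTagName("hop");
        for (int i = 0; i < hopNodes.getLength(); i++) {
            Element hop = (Element) hopNodes.item(i);
            result.add(childText(hop, "from") + " -> " + childText(hop, "to"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        File file = new File(args.length > 0 ? args[0] : "first_transformation.ktr");
        if (!file.isFile()) {
            System.out.println("File not found: " + file);
            return;
        }
        InputStream in = new FileInputStream(file);
        try {
            Document doc = parse(in);
            System.out.println("Steps: " + steps(doc));
            System.out.println("Hops:  " + hops(doc));
        } finally {
            in.close();
        }
    }
}
```

Run against the first_transformation.ktr above, this should list the two steps, Generate Rows (RowGenerator) and Write to log (WriteToLog), and the single hop between them.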

-------




Assuming all the dependent jars are on the classpath, the above program should produce the following output.

INFO  31-08 19:14:46,992 - first_transformation - Dispatching started for transformation [first_transformation]
INFO  31-08 19:14:47,024 - first_transformation - This transformation can be replayed with replay date: 2011/08/31 19:14:47
INFO  31-08 19:14:47,039 - Generate Rows - Finished processing (I=0, O=0, R=0, W=10, U=0, E=0)
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 1------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 2------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 3------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 4------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 5------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 6------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 7------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 8------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 9------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log -
------------> Linenr 10------------------------------
Test = Hello World!

====================
INFO  31-08 19:14:47,039 - Write to log - Finished processing (I=0, O=0, R=10, W=10, U=0, E=0)
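
The counters in the "Finished processing" lines summarize what each step did. In Kettle's log format, I and O count rows read from and written to external sources such as files or databases, R and W count rows read from the previous step and written to the next step, U counts updates and E counts errors. So here Generate Rows wrote 10 rows (W=10), and Write to log read those 10 rows and passed them on (R=10, W=10). To check these numbers from a script or test, a minimal sketch like the following could pull them out of a log line with a regular expression. StepMetrics is a hypothetical helper name, and the pattern assumes the exact log format shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Kettle: extracts step counters from a log line.
final class StepMetrics {

    // Matches e.g. "Finished processing (I=0, O=0, R=10, W=10, U=0, E=0)".
    private static final Pattern COUNTERS = Pattern.compile(
        "Finished processing \\(I=(\\d+), O=(\\d+), R=(\\d+), W=(\\d+), U=(\\d+), E=(\\d+)\\)");

    private static final String[] KEYS = { "I", "O", "R", "W", "U", "E" };

    // Returns the six counters keyed by letter, or an empty map if the line has none.
    static Map<String, Long> parse(String logLine) {
        Map<String, Long> counters = new LinkedHashMap<String, Long>();
        Matcher m = COUNTERS.matcher(logLine);
        if (m.find()) {
            for (int i = 0; i < KEYS.length; i++) {
                counters.put(KEYS[i], Long.parseLong(m.group(i + 1)));
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        String line = "INFO  31-08 19:14:47,039 - Write to log - "
            + "Finished processing (I=0, O=0, R=10, W=10, U=0, E=0)";
        // Prints {I=0, O=0, R=10, W=10, U=0, E=0}
        System.out.println(parse(line));
    }
}
```

A non-zero E value in any of these lines is a quick sign that a step failed, in addition to the trans.getErrors() check in Test.java.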



The POM file used to build and run this example is:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ameeth.poc</groupId>
    <artifactId>pentaho</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>pentaho</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <pentaho.kettle.version>4.0.1-GA</pentaho.kettle.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        <dependency>
            <groupId>pentaho.kettle</groupId>
            <artifactId>kettle-core</artifactId>
            <version>${pentaho.kettle.version}</version>
        </dependency>
        <dependency>
            <groupId>pentaho.kettle</groupId>
            <artifactId>kettle-db</artifactId>
            <version>${pentaho.kettle.version}</version>
        </dependency>
        <dependency>
            <groupId>commons-vfs</groupId>
            <artifactId>commons-vfs</artifactId>
            <version>1.0</version>
        </dependency>
        <dependency>
            <groupId>pentaho.kettle</groupId>
            <artifactId>kettle-engine</artifactId>
            <version>${pentaho.kettle.version}</version>
        </dependency>
        <dependency>
            <groupId>pentaho.kettle</groupId>
            <artifactId>kettle-ui-swt</artifactId>
            <version>${pentaho.kettle.version}</version>
        </dependency>
        <dependency>
            <groupId>pentaho-library</groupId>
            <artifactId>libformula</artifactId>
            <version>1.1.7</version>
            <exclusions>
                <exclusion>
                    <groupId>commons-logging</groupId>
                    <artifactId>commons-logging-api</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.codehaus.janino</groupId>
            <artifactId>janino</artifactId>
            <version>2.5.16</version>
        </dependency>
        <dependency>
            <groupId>rhino</groupId>
            <artifactId>js</artifactId>
            <version>1.7R2</version>
        </dependency>
        <dependency>
            <groupId>javax.mail</groupId>
            <artifactId>mail</artifactId>
            <version>1.4.1</version>
        </dependency>
    </dependencies>
</project>
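
Many problems with this example come down to a jar missing from the classpath. A small check such as the following sketch can confirm up front that the needed classes can be found. ClasspathCheck is a hypothetical helper, not part of Kettle; the class names in main are the ones imported by Test.java plus the log4j Appender class that Kettle's initialization depends on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Kettle: reports which classes cannot be loaded.
final class ClasspathCheck {

    // Returns the subset of classNames that is not visible on the current classpath.
    static List<String> missing(List<String> classNames) {
        List<String> notFound = new ArrayList<String>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList(
            "org.pentaho.di.core.KettleEnvironment",
            "org.pentaho.di.trans.Trans",
            "org.pentaho.di.trans.TransMeta",
            "org.apache.log4j.Appender");
        List<String> notFound = missing(required);
        if (notFound.isEmpty()) {
            System.out.println("All required classes found.");
        } else {
            System.out.println("Missing from classpath: " + notFound);
        }
    }
}
```

Run it with the same -cp value you intend to use for Test; anything it lists as missing points to a jar that still needs to be added.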

28 comments:

  1. Thanks a LOT! I am a Java newbie and office clerk, trying to ease my work with "automated synchronization", read "Mulesoft.org ESB" and other weekend-projects. ;)

    Your post (thanks for linking it in the pentaho wiki! Made it SO much easier to find as "reliable" with search engines) goes very well together with http://pentahodev.blogspot.com/2009/08/developdebug-kettle-plugin-in-eclipse.html

    Here is my command line command on Windows7:
    C:\foo>"C:\Program Files (x86)\Java\jdk1.7.0\bin\javac.exe" -cp .;lib\kettle-engine.jar;lib\kettle-core.jar;libext\*;libext\pentaho\*;libext\commons\*;lib\kettle-db.jar Test.java

    I copied all referenced folders/files from a download of Data%20Integration/4.2.0-stable/pdi-ce-4.2.0-stable.zip (not the source version, although I did experiment with it).

    Then
    C:\foo>"C:\Program Files (x86)\Java\jdk1.7.0\bin\java.exe" -cp .;lib\kettle-engine.jar;lib\kettle-core.jar;libext\*;libext\pentaho\*;libext\commons\*;lib\kettle-db.jar Test
    spat out the log lines! *hoooray*

    Two weeks! Without your post and that other one it would have been im-pos-si-ble!

  2. I am trying but got following error,

    C:\Documents and Settings\vj\Desktop\software\data-integration>java -cp .;lib\kettle-engine.jar;lib\kettle-core.jar;libext\*;libext\pentaho\*;libext\commons\*;lib\kettle-db.jar "C:\projects\test\test-kettle\com\testme\TestMe"
    Exception in thread "main" java.lang.NoClassDefFoundError: C:\projects\test\test-kettle\com\testme\TestMe
    Caused by: java.lang.ClassNotFoundException: C:\projects\test\test-kettle\com\testme\TestMe
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    Could not find the main class: C:\projects\test\test-kettle\com\testme\TestMe. Program will exit.

  3. It's not able to find the main class: "Could not find the main class: C:\projects\test\test-kettle\com\testme\TestMe. Program will exit." Please check how to run a Java program.

  4. Replies
    1. Answering my own question: I downloaded it from here. The stable version at this moment is 4.3.0.
      http://repository.pentaho.org/artifactory/pentaho/

  5. What about running a job saved in the enterprise repository from a Java class?

  6. Hi,

    This example gives me an error on this line:

    KettleEnvironment.init();

    The error is the following:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Appender
    at org.pentaho.di.core.KettleEnvironment.init(KettleEnvironment.java:69)
    at org.pentaho.di.core.KettleEnvironment.init(KettleEnvironment.java:53)
    at teste.Teste.main(Teste.java:24)
    Caused by: java.lang.ClassNotFoundException: org.apache.log4j.Appender
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 3 more
    Java Result: 1

    How do I resolve that?

    Replies
    1. log4j is already included in the path. Let me know how you are running the application.

  7. Good morning, Ameeth.

    First, have a great 2013.
    I wonder if you have ever needed to create a Java applet that calls a job created in Kettle?
    When I run a job, it does nothing; it just starts and then gets lost.
    If you can answer, thank you.

    Gelson

  8. Exception in thread "main" java.lang.NoSuchMethodError: org.pentaho.di.i18n.BaseMessages.getString(Ljava/lang/Class;Ljava/lang/String;[Ljava/lang/String;)Ljava/lang/String;
    at org.pentaho.di.core.logging.LogLevel.(LogLevel.java:39)
    at org.pentaho.di.core.logging.DefaultLogLevel.(DefaultLogLevel.java:36)
    at org.pentaho.di.core.logging.DefaultLogLevel.getInstance(DefaultLogLevel.java:41)
    at org.pentaho.di.core.logging.DefaultLogLevel.getLogLevel(DefaultLogLevel.java:50)
    at org.pentaho.di.core.logging.LogChannel.(LogChannel.java:40)
    at org.pentaho.di.core.logging.LogChannel.(LogChannel.java:28)
    at org.pentaho.di.core.plugins.BasePluginType.(BasePluginType.java:71)
    at org.pentaho.di.core.plugins.BasePluginType.(BasePluginType.java:82)
    at org.pentaho.di.core.plugins.StepPluginType.(StepPluginType.java:76)
    at org.pentaho.di.core.plugins.StepPluginType.getInstance(StepPluginType.java:82)
    at org.pentaho.di.core.KettleEnvironment.init(KettleEnvironment.java:83)
    at com.penta.practice.App.main(App.java:17)

    Replies
    1. This exception can be caused by a library version mismatch. Are you using a Maven build?

  9. Unable to load class for step/plugin with id . Check if the plugin is available in the plugins subdirectory of the Kettle distribution.
    I get this error when I try to run a transformation from Java code. Thanks in advance!

    Replies
    1. Please post a complete stack trace

  10. Say "first_transformation.ktr" has many parameters. How do I pass parameter values along with the transformation call?
    -- Chandrajit Samanta
    chandrajit.samanta@gmail.com

    Replies
    1. you can call trans.setParameterValue(key,value)

      and then trans.activateParameters();

  11. This comment has been removed by the author.

  12. Hi Ameeth, I need to talk to you about Pentaho. I also live in Pune; 9158552080 is my contact number. Please contact me as soon as possible. I need your help with Pentaho since I am working with that tool.

  13. Hi, I am trying to learn how to run Pentaho from a Java application. I hope you can assist me. Where can I get the packages for:

    org.pentaho.di.core.KettleEnvironment;
    org.pentaho.di.core.exception.KettleException;
    org.pentaho.di.trans.Trans;
    org.pentaho.di.trans.TransMeta;

    ? Error message upon trying to run the program: Test.java:1: error: package org.pentaho.di.core does not exist


    Thank you for this blog.

  14. Hi,

    import org.pentaho.di.core.KettleEnvironment;
    Which jar file is required for the above import? I couldn't find any API for it.
    I am using Java code in Eclipse to run the .ktr. Please help if you have any idea about this.

    Thanks
    Tabrez

  15. If you are familiar with Maven, the post includes the content of the pom.xml file, which has all the required dependencies.

  16. I am not familiar with Maven; I have only heard the name. So what do I need to do?
    Is a plugin enough, or do I need to fully install the Maven tool?

    thanks
    Tabrez

    Replies
    1. I did a quick dependency check; below is the list of libraries, with their versions, required for the example to work correctly.

      log4j:log4j:jar:1.2.16
      pentaho.kettle:kettle-core:jar:4.0.1-GA
      pentaho.kettle:kettle-db:jar:4.0.1-GA
      commons-vfs:commons-vfs:jar:1.0
      commons-logging:commons-logging:jar:1.0.4
      pentaho.kettle:kettle-engine:jar:4.0.1-GA
      pentaho.kettle:kettle-ui-swt:jar:4.0.1-GA
      pentaho-library:libformula:jar:1.1.7
      pentaho-library:libbase:jar:1.1.6
      org.codehaus.janino:janino:jar:2.5.16
      rhino:js:jar:1.7R2
      javax.mail:mail:jar:1.4.1
      javax.activation:activation:jar:1.1

      Either get some understanding of Maven and use it, or download these libraries in the given versions.
      Hope this helps.

  17. Hi Ameeth, I would like to thank you for the great post. I would like to generate a report by passing an HDFS path programmatically using the Kettle API. Is it possible?

    Thanks in advance
    Manasa

    Replies
    1. Check the latest version of Pentaho; it supports HDFS integration.

      http://wiki.pentaho.com/display/BAD/Hadoop

  18. Hi Ameeth ,
    I have created a Maven Java project and added the Kettle dependency,
    but whenever I execute the .ktr file I get the following error:

    Unable to load class for step/plugin with id [ConcatFields]. Check if the plugin is available in the plugins subdirectory of the Kettle distribution.

    So please suggest how to fix it.

    Thanks in advance
    Pratik

  19. The sample code doesn't report any errors when the transformation has wrong DB credentials. Please help.

  20. Hi Amith

    Can you please let me know how to store a .ktr file in a MySQL DB? I have done all the setup with Maven and added all the dependencies. However, I'm not sure how to get the Connection object using the Kettle API. Kindly provide a sample program that gets the Connection object and executes SQL statements.
    Thanks
    Sirisha

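
A note on the "Could not find the main class" error in comments 2 and 3: the java launcher expects a fully qualified class name, with the directory that contains the package root on the classpath, not a file path to the class. The following sketch shows how the class name relates to the location of the .class file. MainClassName is a hypothetical helper, and it assumes TestMe is declared in package com.testme with C:\projects\test\test-kettle as the package root, which the comment does not confirm.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derives the name to pass to `java` from a .class file location.
final class MainClassName {

    // classesRoot is the directory that goes on the classpath;
    // classFile is the compiled class somewhere below it.
    static String fromPath(Path classesRoot, Path classFile) {
        Path relative = classesRoot.relativize(classFile);
        StringBuilder name = new StringBuilder();
        for (Path part : relative) {
            if (name.length() > 0) {
                name.append('.');
            }
            name.append(part.toString());
        }
        String result = name.toString();
        // Drop the file extension, which is not part of the class name
        if (result.endsWith(".class")) {
            result = result.substring(0, result.length() - ".class".length());
        }
        return result;
    }

    public static void main(String[] args) {
        Path root = Paths.get("test-kettle");
        Path classFile = Paths.get("test-kettle", "com", "testme", "TestMe.class");
        // Prints com.testme.TestMe
        System.out.println(fromPath(root, classFile));
    }
}
```

Under those assumptions, the command from comment 2 would add C:\projects\test\test-kettle to the -cp list and end with com.testme.TestMe instead of the quoted file path.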