Processing large XML files in the SOA Suite

Emiel Paasschens, November 27, 2015

Read large XML files in chunks

Introduction

At my current project, end users upload XML files to be processed in the Oracle SOA Suite. The XML files contain information about employers and their employees. Because an employer can have hundreds or even thousands of employees, these XML files can be quite large.
Processing such large XML files consumes a lot of memory and can become a bottleneck, especially when multiple end users upload large XML files at the same time. It can even cause a server to crash with an OutOfMemory error.
The best way to solve this is to read and process these large XML files in chunks, so that XML fragments are read and processed instead of the complete XML file.
My colleague Aldo Schaap already did and described this for CSV files in his blog “Processing large files through SOA Suite using Synchronous File Read”. I gratefully used his blog to do the same for XML processing. However, a few things are slightly different when reading XML instead of CSV, which is the reason for this blog.
Another reason is that I ran into a second problem, which I will describe later on in this blog. To solve it, I have to ‘pre-transform’ the XML file: the XML file needs to be transformed before it is read by the SOA Suite. To achieve this I used the pre-processing features of the file adapter with a custom (Java) valve. This pre- and post-processing is described in the blog “SOA Suite File Adapter Pre and Post processing using Valves and Pipelines” by my colleague Lucas Jellema.
The combination of these two blogs provided me with the solution to my problem.

Problem Description

Back to my problem. The large XML files that have to be parsed contain one ‘Message’ element as root. This root element contains one or more employers with some basic employer information, and each employer can contain multiple employee elements, up to thousands, with employee and employment information. In the real use case the XML structure contains Dutch element names and is very specific to the business problem. For the purpose of this blog, I’ve reduced the problem to a basic XML structure with English names and some basic sample data. XSD source:

<schema attributeFormDefault="unqualified" elementFormDefault="qualified"
  targetNamespace="http://www.amis.nl/chunkreadxml"
  xmlns:tns="http://www.amis.nl/chunkreadxml"
  xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="Message">
    <complexType>
      <sequence>
        <element name="MsgProperties" type="tns:tMsgProperties" minOccurs="0"/>
        <element name="Employer" type="tns:tEmployer" minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
      <attribute name="test" type="boolean"/>
    </complexType>
  </element>
  <complexType name="tMsgProperties">
    <sequence>
      <element name="MsgId" type="string"/>
      <element name="SenderId" type="string" minOccurs="0"/>
    </sequence>
  </complexType>
  <complexType name="tEmployer">
    <sequence>
      <element name="Name" type="string"/>
      <element name="EmployerNr" type="int"/>
      <element name="Address">
        <complexType>
          <sequence>
            <element name="Street" type="string"/>
            <element name="PostalCode" type="string"/>
            <element name="City" type="string"/>
            <element name="CountryCode" type="string"/>
          </sequence>
        </complexType>
      </element>
      <element name="Employee" type="tns:tEmployee" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
  </complexType>
  <complexType name="tEmployee">
    <sequence>
      <element name="EmployeeNr" type="string"/>
      <element name="DOB" type="date"/>
      <element name="FamilyName" type="string"/>
      <element name="Initials" type="string"/>
      <element name="Gender" type="string"/>
      <element name="Nat" type="string"/>
      <element name="EmploymentDate" type="date"/>
    </sequence>
  </complexType>
</schema>

Structure (XSD):

My test data as ‘large XML file’:

<?xml version="1.0" encoding="UTF-8"?>
<Message xmlns="http://www.amis.nl/chunkreadxml" test="true">
  <MsgProperties>
    <MsgId>10</MsgId>
    <SenderId>Empa</SenderId>
  </MsgProperties>
  <Employer>
    <Name>AMIS Services BV</Name>
    <EmployerNr>12345678</EmployerNr>
    <Address>
      <Street>Edisonbaan 15</Street>
      <PostalCode>3439 MN</PostalCode>
      <City>Nieuwegein</City>
      <CountryCode>NL</CountryCode>
    </Address>
    <Employee>
      <EmployeeNr>230</EmployeeNr>
      <DOB>1973-08-13</DOB>
      <FamilyName>Paasschens</FamilyName>
      <Initials>E.P.</Initials>
      <Gender>M</Gender>
      <Nat>NL</Nat>
      <EmploymentDate>2006-07-01</EmploymentDate>
    </Employee>
    <Employee>
      <EmployeeNr>311</EmployeeNr>
      <DOB>1988-01-28</DOB>
      <FamilyName>van der Kleij</FamilyName>
      <Initials>E.</Initials>
      <Gender>F</Gender>
      <Nat>NL</Nat>
      <EmploymentDate>2010-03-01</EmploymentDate>
    </Employee>
    <Employee>
      <EmployeeNr>315</EmployeeNr>
      <DOB>1962-03-31</DOB>
      <FamilyName>Uijtewaal</FamilyName>
      <Initials>P.</Initials>
      <Gender>M</Gender>
      <Nat>NL</Nat>
      <EmploymentDate>2010-09-01</EmploymentDate>
    </Employee>
  </Employer>
  <Employer>
    <Name>Oracle NV</Name>
    <EmployerNr>87654321</EmployerNr>
    <Address>
      <Street>250 Oracle Pkwy</Street>
      <PostalCode>CA 94065</PostalCode>
      <City>Redwood City</City>
      <CountryCode>US</CountryCode>
    </Address>
    <Employee>
      <EmployeeNr>1</EmployeeNr>
      <DOB>1944-08-07</DOB>
      <FamilyName>Ellison</FamilyName>
      <Initials>L.J.</Initials>
      <Gender>M</Gender>
      <Nat>US</Nat>
      <EmploymentDate>1977-06-16</EmploymentDate>
    </Employee>
    <Employee>
      <EmployeeNr>2</EmployeeNr>
      <DOB>1957-01-01</DOB>
      <FamilyName>Hurd</FamilyName>
      <Initials>M.V.</Initials>
      <Gender>M</Gender>
      <Nat>US</Nat>
      <EmploymentDate>2010-09-06</EmploymentDate>
    </Employee>
    <Employee>
      <EmployeeNr>3</EmployeeNr>
      <DOB>1961-12-07</DOB>
      <FamilyName>Catz</FamilyName>
      <Initials>S.A.</Initials>
      <Gender>F</Gender>
      <Nat>US</Nat>
      <EmploymentDate>1999-04-01</EmploymentDate>
    </Employee>
  </Employer>
</Message>

Chunk reading with BPEL and the JCA File Adapter

Because the XML file needs to be processed chunk by chunk, a BPEL process with a loop is used. Each iteration reads the next chunk from the file and processes this XML snippet. This continues until the end of the XML file is reached.

First, an XML file reader has to be configured for synchronous read as an External Reference. Drag a File Adapter from the Component Palette to the External References swim lane and configure it as described in the blog by Aldo:
Give it an appropriate name, e.g. ‘SynchReadXML’.

Choose ‘Define from operation and schema (specified later)’

Choose ‘Synchronous Read File’ and enter an appropriate name, e.g. ‘SynchReadXML’.

For the directory, choose ‘Logical Name’ and enter an appropriate name, e.g. ‘CHUNKED_FILES_DIR’.

Enter a dummy file name. We will overwrite this in the read call in the BPEL.

Select the XSD file and the root element:

And finally finish the wizard.

Now the ‘magic’ starts. Open the newly created jca file, in our case SynchReadXML_file.jca, and:

Change the implementation class from “oracle.tip.adapter.file.outbound.FileReadInteractionSpec” to “oracle.tip.adapter.file.outbound.ChunkedInteractionSpec”.
Add the property ChunkSize with value 1 for now (later you’ll see what it stands for).

The file should now look like this:

For this test project I use the name of the XML file to be read as the input string and simply an “OK” string as output. For flexibility it is always handy to have a Mediator as the composite entrance.
So drag a Mediator into the composite, give it an appropriate name, apply the “Synchronous Interface” template and check the “Create Composite Service with SOAP Bindings” checkbox, with the singleString as input and output.

Now we need the BPEL process in which we’re going to read the XML bit by bit; not literally, of course :-), but chunk by chunk, or XML fragment by XML fragment. Drag a BPEL process into the composite, select the BPEL 2.0 Specification, enter an appropriate name, choose the template “Synchronous BPEL Process”, uncheck the “Expose as a SOAP service” checkbox and select the singleString as input and output:

Wire them together with the connection points.

Now open the BPEL process and add the following variables (name, type, initialization):

isEOF, boolean, initialize with false
filename, string, no initialization needed
lineNumber, string, initialize with empty string! (is different from reading csv)
columnNumber, string, initialize with empty string! (is different from reading csv)
isMessageRejected, string, no initialization needed
rejectionReason, string, no initialization needed
noDataFound, string, no initialization needed

In source code:

The only difference from reading a CSV file as described in the blog of Aldo, is that the lineNumber and columnNumber variables must be initialized with an empty string, otherwise it’s not going to work!

Drag an Assign activity into the BPEL process (between receive and reply), name it ‘AssignFilename’ and assign the input variable to the filename variable.

Drag a While loop in the BPEL process (between AssignFilename and reply) and loop while not EOF.

Drag an Invoke activity inside the While activity, give it an appropriate name, invoke the file adapter and create both the input and output variables (with the green + icon).

Open the Properties tab of the Invoke and add ‘To property’ jca.file.FileName with the filename variable as value.

Now go to source mode and add the missing To and From properties (they are not available in the wizard).

Drag an If activity just below the Invoke, but still inside the While loop and check if variable noDataFound contains the string ‘true’.

Label in the BPEL flow the if branch with “NoDataFound” and the else branch with “DataFound”.

Drag an Assign activity in the if branch and assign true to variable isEOF, so the while loop will end.

We now put an Empty activity in the else branch, because that is sufficient for the purpose of this blog. Give the Empty activity the name “ProcessXMLfragment”, because this is the place where the XML fragments should be processed; in this showcase, one employee at a time. In our real business case we have to invoke another web service which can handle only one employment relation (one employer and one employee) per request.
My advice is to invoke a separate BPEL process to do the processing, so there is a clear separation between chunk reading and processing the data.
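The loop built in the steps above can be pictured in plain Java. This is only a toy model and not adapter code: `ReadResult`, `read` and `run` are hypothetical names, and the simulated read simply hands out one element per call and reports "no data" once the list is exhausted, comparable to a ChunkSize of 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkLoopSketch {

  /** Result of one simulated read (hypothetical shape, not an Oracle type). */
  public static final class ReadResult {
    final String payload;       // the fragment that was read, or null
    final String lineNumber;    // position to pass into the next read
    final boolean noDataFound;  // true when nothing was left to read

    ReadResult(String payload, String lineNumber, boolean noDataFound) {
      this.payload = payload;
      this.lineNumber = lineNumber;
      this.noDataFound = noDataFound;
    }
  }

  /** Simulated synchronous read: one element per call. */
  public static ReadResult read(List<String> file, String lineNumber) {
    // An empty string means "start at the beginning".
    int index = lineNumber.isEmpty() ? 0 : Integer.parseInt(lineNumber);
    if (index >= file.size()) {
      return new ReadResult(null, lineNumber, true);
    }
    return new ReadResult(file.get(index), String.valueOf(index + 1), false);
  }

  /** Mirrors the While / Invoke / If structure; returns the number of reads. */
  public static int run(List<String> file, List<String> processed) {
    String lineNumber = "";   // initialised with an empty string
    boolean isEOF = false;
    int reads = 0;
    while (!isEOF) {
      ReadResult result = read(file, lineNumber);
      reads++;
      if (result.noDataFound) {
        isEOF = true;                    // "NoDataFound" branch
      } else {
        processed.add(result.payload);   // "DataFound" branch: ProcessXMLfragment
        lineNumber = result.lineNumber;  // feed the position into the next read
      }
    }
    return reads;
  }

  public static void main(String[] args) {
    List<String> processed = new ArrayList<>();
    int reads = run(Arrays.asList("MsgProperties", "Employer", "Employer"), processed);
    System.out.println(processed + " in " + reads + " reads");
  }
}
```

The point of the sketch is the control flow: the position returned by one read is the input of the next, and the loop ends on the first read that finds nothing.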

Finally we assign an OK string as output with an Assign activity just before the reply.
The flow should now look like this:
(Note that the Catch activities, which normally implement the exception handling, are left out.)

We’re almost ready to deploy and test the composite. Two more little things need to be done.
The first one is that, in the Mediator, we have to map the request to the BPEL input, and likewise the BPEL output to the response. Because it’s only a single string, I do this with a direct assign instead of invoking an XSLT mapping.

Now we only have to create a config plan in which we specify the physical directory from which the XML files are read: right-click on the composite name (composite.xml) and create a config plan. In the created config plan you will find a ‘reference’ element with the name of the jca file adapter, in our case “SynchReadXML”. You will see that a placeholder has already been created. Enter the physical directory of the runtime environment where the XML file to be read is located.

Finally you can deploy the composite and test it to find out what happens!

After deployment, you can test the composite in Enterprise Manager with the Test page.
On the Request tab, enter the name of the XML file containing the test data and press the button “Test Web Service”. The composite nicely returns “OK” as output string in the Response message.
Now open the ‘Flow Trace’ window to inspect what happened:

Apparently the file has been read in four chunks. To see exactly what has been read, click on the BPEL process. In the Audit trail, expand the “payload” items to investigate what data was read each time:

As you can see, the SOA Suite reads one child element of the root element each time. It nicely returns the correct LineNumber and ColumnNumber for the next read, until the from property IsEOF is set by the File Adapter, meaning the end of the file has been reached.
This will not solve my problem, because the employer is the child element of the root, and that’s the one which can contain thousands of employee elements as child elements. So when the employer element is read, all the employee elements it contains are read as well, while I want to chunk-read the employees instead of the employers!
Now remember that in the jca file, the ChunkSize setting is set to 1. What happens when we set this to 2?
After changing the jca file, redeploy and test it. The result is that there are only two read actions:

After inspecting the payloads, it turns out that two child elements of the root element are read each time. The first time it reads both the MsgProperties element and the first Employer element. The second time only the second Employer element is read. Because this is the last child element of the root element, the from property IsEOF is also correctly set to true.
So that’s not going to solve my problem, but it’s nice to know that the jca adapter can read a specified number of elements in one read. We’re going to use this later on for a little performance tuning.
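To make the observed grouping concrete, here is a small plain-Java analogue that batches the direct children of the root element with StAX. It is a sketch of the idea only, not how the file adapter is implemented; the class name `RootChildChunks` and the shortened test document are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RootChildChunks {

  /** Groups the local names of the root's direct children into batches of chunkSize. */
  public static List<List<String>> chunk(String xml, int chunkSize) {
    try {
      XMLStreamReader reader =
          XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
      List<List<String>> chunks = new ArrayList<>();
      List<String> current = new ArrayList<>();
      int depth = 0;
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          depth++;
          if (depth == 2) {                 // direct child of the root element
            current.add(reader.getLocalName());
            if (current.size() == chunkSize) {
              chunks.add(current);
              current = new ArrayList<>();
            }
          }
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          depth--;
        }
      }
      if (!current.isEmpty()) {
        chunks.add(current);                // last, partially filled batch
      }
      reader.close();
      return chunks;
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<Message><MsgProperties/><Employer><Employee/><Employee/></Employer>"
               + "<Employer><Employee/></Employer></Message>";
    System.out.println(chunk(xml, 1));
    System.out.println(chunk(xml, 2));
  }
}
```

Note that the nested Employee elements never appear as separate items: they travel along with their Employer, which is exactly the limitation described above.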

Pre-transform the XML before reading

As already mentioned in the introduction of this blog, the solution is to change the XML before it is read by the SOA Suite. This can be achieved with a custom valve, as described in the blog by Lucas. In this situation, I chose to relocate the employee elements: from inside the employer element to the root element, directly after the employer element in which they were initially located. In this way we still know the context of the employee elements: they belong to the preceding employer element.
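The effect of this relocation can be tried out in isolation with a few lines of SAX code. The following is a minimal stand-alone sketch using only the JDK; `RelocateEmployees` and `RelocateFilter` are illustrative names, the input is a shortened document, and the filter assumes unprefixed element names as in the sample data.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class RelocateEmployees {

  /** Closes Employer before its first Employee and drops the original closing tag. */
  static class RelocateFilter extends XMLFilterImpl {
    private boolean employerClosedEarly = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
      if (!employerClosedEarly && "Employee".equals(localName)) {
        employerClosedEarly = true;
        super.endElement(uri, "Employer", "Employer");  // synthetic </Employer>
      }
      super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
      if (employerClosedEarly && "Employer".equals(localName)) {
        employerClosedEarly = false;                    // swallow the real </Employer>
      } else {
        super.endElement(uri, localName, qName);
      }
    }
  }

  /** Runs the filter through an identity transformation and returns the result. */
  public static String relocate(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      RelocateFilter filter = new RelocateFilter();
      filter.setParent(factory.newSAXParser().getXMLReader());
      Transformer identity = TransformerFactory.newInstance().newTransformer();
      identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      identity.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                         new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Transformation failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(relocate("<Message><Employer><Name>A</Name>"
        + "<Employee>1</Employee><Employee>2</Employee></Employer></Message>"));
  }
}
```

After the transformation the Employee elements are siblings of their Employer, so a reader that works on root children sees them one by one.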

To do this, the following actions have to be done:

Create a custom valve and deploy it on the runtime environment
Configure the valve in the project to be used by the jca file
Change the XSD of the XML file being read accordingly
Change the BPEL logic to adjust to the XML changes

1. Create a custom valve and deploy it on the runtime environment

In the custom Java valve the XML is transformed with a SAX transformation. I’ve also tried a StAX transformation, which should consume less memory, but I can’t get it to work. It does work when I test it locally, in the main method with a local file, but it doesn’t work with the JCA file adapter. I have no idea why. Maybe there is a reader who can help me out… The code is still in the source and can be activated with the composite property “useStAX” (true/false). Nevertheless, it works with the SAX transformation. The Java code:

package nl.amis.fileadaptervalves;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import java.util.logging.Logger;

import javax.xml.namespace.QName;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.stream.util.StreamReaderDelegate;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stax.StAXResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

import oracle.tip.pc.services.pipeline.AbstractValve;
import oracle.tip.pc.services.pipeline.InputStreamContext;
import oracle.tip.pc.services.pipeline.PipelineException;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;


public class EmployerValve extends AbstractValve {

  private static final String USE_STAX_PROPERTY_NAME = "useStAX";
  private static final String TMP_FILE_PREFIX = "fileAdapterValve";
  private static final String TMP_FILE_SUFFIX = ".tmp";
  private static final String TAG_Employee = "Employee";
  private static final String TAG_Employer = "Employer";

  private File file = null;
  private final Logger logger = Logger.getLogger(this.getClass().getName());

  public EmployerValve() {
    super();
  }

  public InputStreamContext execute(InputStreamContext inputStreamContext) throws PipelineException, IOException {
    // Get the input stream that is passed to the Valve

    logger.finest("================START FileAdapterAbzendValve================");
    String s = (String)getPipeline().getPipelineContext().getProperty(USE_STAX_PROPERTY_NAME);
    boolean useStAX = !"false".equalsIgnoreCase(s); //defaults to true
    logger.finest("useStAX=" + useStAX);

    //modify xml being read
    InputStream originalInputStream = inputStreamContext.getInputStream();
    InputStream newInputStream = transform(originalInputStream, useStAX);
    inputStreamContext.closeStream();
    inputStreamContext.setInputStream(newInputStream);
    logger.finest("================END FileAdapterAbzendValve================");
    return inputStreamContext;
  }

  private InputStream transform(final InputStream input, final boolean useStax) throws PipelineException, IOException {
    final InputStream is;
    final OutputStream out;

    file = File.createTempFile(TMP_FILE_PREFIX, TMP_FILE_SUFFIX);
    out = new FileOutputStream(file);
    logger.finest("tempfile=" + file.getAbsoluteFile());
    try {
      final Source src;
      final Result res;
      if (useStax){

        final XMLStreamReader xmlStreamReader = XMLInputFactory.newFactory().createXMLStreamReader(input, "UTF-8");
        final StreamReaderDelegate streamReaderDelegate = new StreamReaderDelegate(xmlStreamReader){
          private boolean startEmployer = false;
          private boolean endingEmployer = false;

          @Override
          public int next() throws XMLStreamException{
            int event;
            if (endingEmployer){
              event = START_ELEMENT; //end of the faked Employer end tag, continue with the Employee start tag
                                     //on which the parent reader is already positioned.
              logger.finest("StAX: end faking ending element " + TAG_Employer + ", so now returning start element");
            }
            else{
              event = getParent().next();
            }
            endingEmployer = false;
            if (!startEmployer && event == START_ELEMENT){
              if (TAG_Employer.equals(getParent().getLocalName())){
                startEmployer = true;
                logger.finest("StAX: starting element " + TAG_Employer + " found");
              }
            }
            else if (startEmployer && event == START_ELEMENT){
              if (TAG_Employee.equals(getParent().getLocalName())){
                startEmployer = false;
                event = END_ELEMENT;
                endingEmployer = true; //we're going into the state of ending the Employer node
                logger.finest("StAX: starting element " + TAG_Employee + " found, start faking end element " + TAG_Employer);
              }
            }
            else if (event == END_ELEMENT){
              if (TAG_Employer.equals(getParent().getLocalName())){
                event = getParent().next(); //goto next node, so this node is skipped.
                logger.finest("StAX: skip ending element " + TAG_Employer);
              }
            }
            return event;
          }

          @Override
          public String getLocalName(){
            if (endingEmployer){ //we're in the state of ending the Employer node (the parent reader is positioned on the Employee node)
              return TAG_Employer;
            }
            else{
              return getParent().getLocalName();
            }
          }

          @Override
          public QName getName(){
            QName qName = getParent().getName(); //replaced below when we're in the state of ending the Employer node (the parent reader is positioned on the Employee node)
            if (endingEmployer){
             qName = new QName(qName.getNamespaceURI(), TAG_Employer, qName.getPrefix());
            }
            return qName;
          }

          @Override
          public int getEventType(){
            if (endingEmployer){ //we're in the state of ending the Employer node (the parent reader is positioned on the Employee node)
             return END_ELEMENT;
            }
            else{
              return getParent().getEventType();
            }
          }

        };

        src = new StAXSource(streamReaderDelegate);
        final XMLStreamWriter xmlStreamWriter = XMLOutputFactory.newFactory().createXMLStreamWriter(out, "UTF-8");
        res = new StAXResult(xmlStreamWriter);
      }
      else{
        final XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
          private boolean found = false;

          @Override
          public void startElement(final String uri, final String localName, final String qName, final Attributes atts) throws SAXException {
            if (!(found) && TAG_Employee.equals(qName)) {
              found = true;
              logger.finest("SAX: Insert closing tag " + TAG_Employer);
              super.endElement(uri, TAG_Employer, TAG_Employer);
            }
            super.startElement(uri, localName, qName, atts);
          }

          @Override
          public void endElement(final String uri, final String localName, final String qName) throws SAXException {
            if (found && TAG_Employer.equals(qName)) {
              //System.out.println("uri:" + uri + " localName:" + localName + " qName:" + qName);
              //delete this closing tag
              found = false;
              logger.finest("SAX: Skip closing tag " + TAG_Employer);
            } else {
              super.endElement(uri, localName, qName);
            }
          }
        };
        src = new SAXSource(xr, new InputSource(input));
        res = new StreamResult(out);
      }
      TransformerFactory.newInstance().newTransformer().transform(src, res);
      logger.finest("Transformation done!");

    } catch (SAXException e) {
      e.printStackTrace();
    } catch (XMLStreamException e) {
      e.printStackTrace();
    } catch (TransformerConfigurationException e) {
      e.printStackTrace();
    } catch (TransformerException e) {
      e.printStackTrace();
    }
    out.flush();
    out.close();
    is = new FileInputStream(file);
    return is;
  }

  public void test(final boolean useStAX ) throws IOException, ParserConfigurationException, SAXException, PipelineException, InterruptedException {
    long start = System.currentTimeMillis();

    File file = new File("Employers.xml");
    if (!file.exists()){
      System.out.print("File does not exist!");
    }
    else{
      FileInputStream in = new FileInputStream(file);
      InputStream result = transform(in, useStAX);
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      byte[] buffer = new byte[2048];
      int bytesRead;
      while ((bytesRead = result.read(buffer)) != -1) {
        bos.write(buffer, 0, bytesRead);
      }
      String s = new String(bos.toByteArray(), "UTF-8");
      in.close();
      result.close();
      System.out.println(">>" + s + "<<");

      System.out.println("duration: " + (System.currentTimeMillis() - start) + "ms");
    }
  }


  public void finalize(InputStreamContext inputStreamContext) {
    try {
      cleanup();
    } catch (PipelineException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  public void cleanup() throws PipelineException, IOException {
    if (file != null && file.exists()){
      file.delete();
      file = null;
    }
  }

  public static void main(String[] args) throws IOException, ParserConfigurationException, SAXException, PipelineException, InterruptedException {
    EmployerValve valve = new EmployerValve();
    valve.test(false);
    valve.finalize(null);
    valve.test(true);
    valve.finalize(null);
  }

}

See Lucas’ blog for how to create the jar and install it in the runtime environment. Don’t forget to restart your environment (including the admin server).

2. Configure the valve in the project to be used by the jca file

Lucas’ blog also explains how to configure the valve. In short, we need to add a pipeline XML file to our project in which the EmployerValve is configured:

This EmployerPipeline has to be attached to the file adapter jca file with the property PipelineFile:

3. Change the XSD of the XML file being read accordingly

Because the Employer valve changes the XML before it’s read by the SOA Suite, we have to change the XSD in the project so that it matches the changed XML file. The Message element, the root element, now contains a sequence of an optional MsgProperties element followed by one or more sequences of an Employer element followed by zero or more Employee elements:

<element name="Message">
  <complexType>
    <sequence>
      <element name="MsgProperties" type="tns:tMsgProperties" minOccurs="0"/>
      <sequence maxOccurs="unbounded">
        <element name="Employer" type="tns:tEmployer" />
        <element name="Employee" type="tns:tEmployee" minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
    </sequence>
    <attribute name="test" type="boolean"/>
  </complexType>
</element>
...

If you want to try the StAX transformation, which for some reason doesn’t work in my environment, the property useStAX has to be added to the composite.xml:
You can change this setting to ‘true’ in Enterprise Manager to activate the StAX transformation: go to the Composite page, click on the ‘SynchReadXML’ JCA Adapter in the ‘Services and References’ section and then switch to the Properties tab.

4. Change the BPEL logic to adjust to the XML changes

Finally, we have to adjust the BPEL logic to branch on the element that was read. Based on the type of element, determined with an expression such as local-name($InvokeSynchReadXML_SynchReadXML_OutputVariable.body/*[1]) = 'MsgProperties', we decide what to do. Assume that for processing one Employee element, the information of both the MsgProperties element and the Employer element is needed as well. So when the MsgProperties element is read, we store the information in a MsgProperties variable, and the same applies to an Employer element (in an Employer variable). When an Employee element is read, we can use these variables to complete the data for processing an Employee. The flow now looks like this:
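Outside BPEL, the same branching can be sketched in a few lines of Java. This is an illustration under simplifying assumptions: the `Element` class and the string-based "request" are hypothetical stand-ins for the real XML fragments and the real service call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ElementDispatchSketch {

  /** One element as read from the file: local name plus a payload (hypothetical shape). */
  public static final class Element {
    final String localName;
    final String value;

    public Element(String localName, String value) {
      this.localName = localName;
      this.value = value;
    }
  }

  /** Combines each Employee with the most recently seen MsgProperties and Employer. */
  public static List<String> process(List<Element> elementsInFileOrder) {
    List<String> requests = new ArrayList<>();
    String msgProperties = null;   // plays the role of the MsgProperties variable
    String employer = null;        // plays the role of the Employer variable
    for (Element e : elementsInFileOrder) {
      switch (e.localName) {
        case "MsgProperties":
          msgProperties = e.value;
          break;
        case "Employer":
          employer = e.value;
          break;
        case "Employee":
          requests.add(msgProperties + "/" + employer + "/" + e.value);
          break;
        default:
          throw new IllegalArgumentException("Unexpected element: " + e.localName);
      }
    }
    return requests;
  }

  public static void main(String[] args) {
    List<Element> elements = Arrays.asList(
        new Element("MsgProperties", "10"),
        new Element("Employer", "12345678"),
        new Element("Employee", "230"),
        new Element("Employee", "311"),
        new Element("Employer", "87654321"),
        new Element("Employee", "1"));
    System.out.println(process(elements));
  }
}
```

Each Employee ends up paired with the Employer that precedes it in the file, which is why the relocation keeps the employees directly behind their employer.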

After deploying and testing with the test data, we see in the Flow Trace that, as expected, 10 read actions have been performed: 1 MsgProperties element, 2 Employer elements, 6 Employee elements and a last attempt that detects that EOF has been reached:

When inspecting the SynchReadXMLProcess, it turns out that the switch on element type works well, as does the assignment of the variables!

Tuning for performance

Purely from a functional point of view this works well for now. The problem of reading XML files so large that they may cause an OutOfMemory error and crash the server is solved. But from a performance perspective this is still not the best situation. Reading only one child element of the root at a time can take quite some time when the large XML file contains thousands of employees. So it would be better to read more elements in one read. This means that the BPEL has to change as well, by looping over the individual elements in the array of elements that is read each time. We can use the same logic of switching on the type of element within the loop. This results in a flow like this:

To test whether this tuned logic still works correctly, I’ve set the property ChunkSize in the jca adapter to 4.
After invoking the composite with the test data, the Flow Trace shows that, as expected, 3 reads have been done (two reads of 4 elements and a final read of the 1 remaining element).
Inspecting the SynchReadXMLProcess proves that iterating over the elements also works well:
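One detail of the nested loop deserves attention: the employer context has to survive from one read to the next, because an Employee at the start of a chunk may belong to an Employer that arrived in the previous chunk. A minimal Java sketch of that structure follows; the `split` helper merely stands in for the chunked reads, and the "localName:value" encoding is a hypothetical simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkedDispatchSketch {

  /** Splits the elements into consecutive chunks of at most chunkSize (stand-in for the reads). */
  public static List<List<String>> split(List<String> elements, int chunkSize) {
    List<List<String>> chunks = new ArrayList<>();
    for (int i = 0; i < elements.size(); i += chunkSize) {
      chunks.add(elements.subList(i, Math.min(i + chunkSize, elements.size())));
    }
    return chunks;
  }

  /** Outer loop over chunks, inner loop over elements; the context survives chunk boundaries. */
  public static List<String> process(List<List<String>> chunks) {
    List<String> requests = new ArrayList<>();
    String employer = null;                      // declared outside both loops on purpose
    for (List<String> chunk : chunks) {
      for (String element : chunk) {
        String[] parts = element.split(":", 2);  // "localName:value" (hypothetical encoding)
        if ("Employer".equals(parts[0])) {
          employer = parts[1];
        } else if ("Employee".equals(parts[0])) {
          requests.add(employer + "/" + parts[1]);
        }                                         // MsgProperties handling omitted for brevity
      }
    }
    return requests;
  }

  public static void main(String[] args) {
    List<String> elements = Arrays.asList(
        "MsgProperties:10", "Employer:12345678", "Employee:230", "Employee:311",
        "Employee:315", "Employer:87654321", "Employee:1", "Employee:2", "Employee:3");
    List<List<String>> chunks = split(elements, 4);
    System.out.println(chunks.size() + " chunks -> " + process(chunks));
  }
}
```

With the nine elements of the test data and a chunk size of 4, the last chunk holds a single Employee that still needs the Employer from the chunk before it.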

In our real business case, we’ve set the ChunkSize to 50. This value is specific to our business case, because it depends on your runtime environment, how complex the XML is, how large the XML files are and the processing of the data itself. If, in our business case, we want to improve the performance even more, we have to change the logic from processing one employee at a time into processing multiple employees at once.

Conclusion

Using the chunk reading features of the file adapter in combination with its pre-processing features makes it possible to process large XML files in fragments. This can prevent OutOfMemory problems and makes further performance tuning possible.
This is ideal for batch processes or, as in our business case, an asynchronous process where the end user uploads a file and immediately gets only the response that the file has been received and will be processed (the feedback from processing the file, including errors, is stored in a database and shown in a dedicated screen).

Resources

The resources are available for download:


About The Author

Emiel Paasschens

Emiel is a Solutions Architect and a Java & Integration consultant in The Netherlands.

13 Comments

Ariel October 31, 2017
Hi, I tried it with chunksize > 1, but every time the last chunk is missed and I get no_data_found=true.
For example: if chunksize=2 and the file has 7 records, 3 calls are OK (2 records each time); the 4th time I should get only 1 record, but I get no_data_found. What am I doing wrong?
- Emiel Paasschens October 31, 2017
  Hi Ariel,
  It’s difficult to see without source code, etc. I can only guess…
  Have you changed the implementation class from “oracle.tip.adapter.file.outbound.FileReadInteractionSpec” to “oracle.tip.adapter.file.outbound.ChunkedInteractionSpec” in the jca file?
  Did you specify a correct xsd as data format in the fileadapter configuration, so it ‘knows’ it has to read xml data (and not a text/csv file)?
  You should be able to see what’s going on in the trace of the instance (the SOA Suite must run in Development mode).
  Hopefully this helps you out.
  Success!
  - Mark January 3, 2018
    Hi, I had the same issue and found there was an Oracle patch to resolve the problem. This was on SOA 12.1.3.0.
    - Björn Schulz February 17, 2018
      Hey Mark,
      I’m facing the same problem but I’m running on SOA 12.2.1 already.
      Can you post the Patch which solved your problem?
      Thank you so much!
      Best Björn
    - Denmark February 26, 2019
      Hi Mark,
      I am encountering this issue as well. Would you know the patch applied?
      Hi Emiel,
      This is the scenario:
      We implemented chunk size of 10 for example. If we have 35 records to process, the last 5 will not be read since NoDataFound is returning TRUE.
      Thanks!
Anurag Gupta July 23, 2016
I faced a similar situation and I used the debatching feature: preprocess, debatching, post processing. You wrote a valve; what we did was write a Python script to break the chain, pass the parts to the file adapter and then rebuild the chain. Debatching is meant for processing large files, whereas I think sync read is meant for use cases where file data has to be processed mid-process. Nevertheless, your workaround is effective. I wish I could work in your organization.
Lolke Dijkstra April 25, 2016
It was a nice read.
Actually.. this is a typical case for which we designed LDX+. The steps to process such a large file comprise:
– generate code from XSD using LDX+,
– implement the MessageProcessor (the processor is actually already there, it just needs to be altered to do what you need)
The runtime configuration allows you to configure the processing: children may be detached from their parent elements by simple configuration.
Because it is such a typical case, I did the steps above and created the project in Eclipse.
Here is a bitbucket link: https://bitbucket.org/lolkedijkstra/ldx-samples
Aurimas Lacitis January 17, 2016
Hello,
Thank You for the great post.
We have a similar case and plan to use the solution, You provided. I wonder about the performance and the sizing parameters, if You tried the solution in production environment. What is the maximum file size and the maximum number of concurrent requests, that could be processed in this way? Of course it depends on the hardware, but maybe You have any numbers?
Thank You.
- Emiel Paasschens January 19, 2016
  Hi Aurimas,
  No, I don’t have any numbers. But even if I had some, they wouldn’t be of any value to you. They’re very dependent on the complexity of the XML, the BPEL processing, hardware and infrastructure settings (weblogic, SOA settings in em, SOA Db, etc).
Emiel Paasschens January 14, 2016
Hi Sunny,
It still depends on your environment. If the SOA Suite is available, then I would recommend doing it as described in this blog, where you replace the Empty “Process Employee” with an invoke to another BPEL process which does the look-ups, validates and writes to a single employee file.
Sunny January 12, 2016
Good one! I have a question here. Suppose you need to read a file with 30000 repeating records (like an employee list with 30000 employee records) and, after processing (like validation of some fields and lookups), you need to write an EmployeeResponseList similar to the EmployeeRequest, with the employee Id for each, as one single output file. We need to use OSB as first choice, but feel free to recommend your suggestions.
Emiel Paasschens January 5, 2016
Hi Sameer,
It all depends on your specific situation, the context and your environment: how often (frequency), how the action (transformation) is triggered, what is done with the output file, whether there already is a SOA environment in place or ODI, etc.
The simplest solution is to write a little Java program with a SAX (or StAX) transformation, just like the pre-transform in the valve as described in the article (you can use the Java code as an example).
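
As a rough illustration of such a StAX transformation, the sketch below streams an XML document from a Reader to a Writer and changes the text of one field on the way, so the whole file never has to be held in memory. It is not the valve code from the article. The class name, the element name "Name" and the upper-casing rule are assumptions made up for this example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical example class: copies XML event by event and upper-cases
// the text content of every element with the given local name.
class StaxFieldTransform {

    static void transform(Reader in, Writer out, String fieldName) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Uploaded files are untrusted input, so switch off DTDs and external entities.
        inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        inputFactory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLEventReader reader = inputFactory.createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        boolean inField = false;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(fieldName)) {
                inField = true;
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(fieldName)) {
                inField = false;
            }
            if (inField && event.isCharacters()) {
                // The "transformation": replace the text event by an upper-cased copy.
                writer.add(events.createCharacters(event.asCharacters().getData().toUpperCase()));
            } else {
                // Everything else is copied unchanged.
                writer.add(event);
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data; for real files pass a FileReader and a FileWriter instead.
        String xml = "<Employees><Employee><Id>a1</Id><Name>alice</Name></Employee></Employees>";
        StringWriter out = new StringWriter();
        transform(new StringReader(xml), out, "Name");
        System.out.println(out);
    }
}
```

Because input and output are both streams, the same approach writes one single output file regardless of the size of the input.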
Sameer January 4, 2016
Nice article..but what if I need to write a single output file for all those chunk reads? Do you recommend using ODI in such case? I have a scenario where I need to read a large XML file > 20 MB, apply transformation on few fields and then write a single XML output file. Thanks.