Recently, I was working on a project where we had to rewrite a VB application in Java. The application would read invoice numbers from a database based on an XML config (a range of invoices, invoices for a date range, and so on), and for each invoice it would create and write separate XML and Excel files, with filenames matching the invoice numbers. Sounds simple enough, right? So why am I talking about Spring Batch? Here's the interesting part of the app: for every Invoice there is a BillingOrder (a one-to-one relationship), every BillingOrder has a 1-to-N relationship with Shipping, every Shipping has another 1-to-N relationship with Product, and, believe me, every Product has a UserDetail. Now, what was so special (or lack thereof) about this VB application was that for each invoice it read, it created a new XML file, wrote the invoice, and closed it, and it did that for n invoices. For each BillingOrder it opened (not created) the matching invoice file, appended data to it, and closed it, again for exactly n BillingOrders. You see where I am going with this?
Now you would immediately acknowledge that Spring Batch is the way to go. Problem solved? I wish. The problem was further compounded by the fact that we had to turn this module in on a very short deadline, and while the client would appreciate it if we used Spring Batch, it really was supposed to be just a rewrite. But being the heroes that we were (and still are ;)), we just had to use Spring Batch. So here's our little story below.
If you read the Spring Batch guide (oh, we used the 2.1.8 version), you will find workable examples of reading and writing in chunks. For each reader there is a writer. I am not saying you have to read once and write once; that would defeat the point of a batch application, now wouldn't it? You can read 1..100 or n items from a source (DB, text, or XML), typically populating a domain object for each, and when your chunk threshold is reached (1..100 or n items) your writer writes them out to DB, text, or XML and, most importantly (at least to me), drops the references to those 1..100 or n objects so they can be garbage collected. GC happens when it does. Anyway, for our application we had these options:
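That chunk cycle can be sketched in plain Java. This is not Spring Batch code, just a toy model of the contract it follows: `Reader` and `Writer` here stand in for `ItemReader`/`ItemWriter`, and the chunk size of 3 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkLoopSketch {

    // Stand-in for an ItemReader: returns the next item, or null when exhausted.
    interface Reader<T> { T read(); }

    // Stand-in for an ItemWriter: receives one full chunk at a time.
    interface Writer<T> { void write(List<T> chunk); }

    // The chunk cycle: buffer items until the threshold is hit, write the
    // chunk, then drop the references so the objects are eligible for GC.
    static <T> void run(Reader<T> reader, Writer<T> writer, int chunkSize) {
        List<T> chunk = new ArrayList<>();
        T item;
        while ((item = reader.read()) != null) {
            chunk.add(item);
            if (chunk.size() == chunkSize) {
                writer.write(chunk);
                chunk = new ArrayList<>(); // release the written objects
            }
        }
        if (!chunk.isEmpty()) {
            writer.write(chunk); // final, partial chunk
        }
    }

    public static void main(String[] args) {
        // 7 fake invoice numbers read in chunks of 3: writes of size 3, 3, 1.
        List<String> source = List.of("INV1", "INV2", "INV3", "INV4", "INV5", "INV6", "INV7");
        int[] index = {0};
        Reader<String> reader = () -> index[0] < source.size() ? source.get(index[0]++) : null;
        Writer<String> writer = chunk -> System.out.println("wrote " + chunk.size() + ": " + chunk);
        run(reader, writer, 3);
    }
}
```

Note the null at the end: that is the same end-of-data signal the real framework relies on, which matters again later.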
- Option 1. Using domain objects as dumb DTOs. In the job's first step, read x invoices from the DB and send them over (the framework does that for you) to the writer, which writes them out one by one, creating new invoice XML files named after the invoice numbers. Use a step listener to pass the invoice number list to the next step. In that next step, read exactly x BillingOrders from another table in the DB and send them over to another writer (remember, I said 1 reader : 1 writer), which opens each invoice file and appends the BillingOrder values as child nodes. The next step would then pass BillingOrder ids through another step listener so the Shipping details can be written. Perfect? Well, it gets tricky here. Shipping is a 1-to-N relationship, so whether or not you send the Shipping details in sorted order (the <Shipping> elements for BillingOrder 1 first, then those for BillingOrder 2, and so on), you will still have to open the XML files shippingList.size() times. You might tweak the code a little to reduce this, but still: for x invoices you are opening at least x * 5 XML files. Good solution? Hardly.
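To put a number on the file-churn in Option 1, here is a back-of-the-envelope sketch (plain Java, not framework code; the five levels come from the Invoice/BillingOrder/Shipping/Product/UserDetail hierarchy above) counting the open/append/close cycles in the best case:

```java
public class Option1Sketch {

    // Counts the open/append/close cycles Option 1 performs: one per invoice
    // per hierarchy level, even assuming only ONE child row at each level.
    static int countFileOpens(int invoiceCount) {
        int opens = 0;

        // Step 1: the invoice writer creates (opens) one new file per invoice.
        opens += invoiceCount;

        // Steps 2..5: each level's writer, fed ids by a step listener,
        // reopens every invoice file to append its child nodes.
        String[] appendingLevels = {"BillingOrder", "Shipping", "Product", "UserDetail"};
        for (String level : appendingLevels) {
            opens += invoiceCount; // best case: a single row per parent
        }
        return opens;
    }

    public static void main(String[] args) {
        // 100 invoices -> at least 500 open/close cycles; the 1-to-N levels
        // (Shipping, Product, UserDetail) only push this number higher.
        System.out.println(countFileOpens(100) + " file opens for 100 invoices");
    }
}
```

Even this floor of x * 5 opens ignores the 1-to-N fan-out, which is exactly the shippingList.size() problem described above.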
- Option 2. The object-oriented way. The right way. Instead of using our domain objects as dumb Data Transfer Objects, we implement the relationships inside them. Invoice now has a BillingOrder billing field (why not the other way around? I will explain below). BillingOrder has a List<Shipping> shippingList (use a Set if you want to guarantee uniqueness), Shipping has a List<Product> productList, and Product has a UserDetail userDetail. Our invoiceReader changes too. First of all, we need only one reader, because we want only one writer instance associated with one XML output file. We don't use any step listeners either; consequently, there is only one step in this job. What we do is read all the invoices from the DB into a list of Invoice objects, then loop through that list, populating the billingOrder field of each Invoice (this is why BillingOrder lives inside Invoice: the Invoice has to know about its BillingOrder). Inside that loop, another loop populates each Invoice's BillingOrder with its Shipping list, and so on down the hierarchy. You will end up with multiple nested loops, but at the end you can send 1, or 100, or n Invoice objects, with all the relationships intact, to the writer. The writer then creates/opens exactly one XML file per complete invoice. If this looks like a memory issue, you can reduce the chunk threshold.
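The Option 2 domain graph and the nested population loops look roughly like this. All class names match the hierarchy described above, but the field names and the two stand-in lookup methods (findBillingOrder, findShippings) are illustrative, not the project's real DB code; the sketch stops two levels deep for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class InvoiceGraphSketch {

    static class UserDetail { String name; }
    static class Product { UserDetail userDetail = new UserDetail(); }
    static class Shipping { List<Product> productList = new ArrayList<>(); }
    static class BillingOrder { List<Shipping> shippingList = new ArrayList<>(); }
    static class Invoice {
        String number;
        BillingOrder billing; // Invoice knows about its BillingOrder, not vice versa
        Invoice(String number) { this.number = number; }
    }

    // Stand-ins for the DB lookups the real reader would perform.
    static BillingOrder findBillingOrder(String invoiceNumber) { return new BillingOrder(); }
    static List<Shipping> findShippings(BillingOrder b) { return List.of(new Shipping(), new Shipping()); }

    // The nested population loops: each Invoice leaves here with its whole
    // subtree attached, so the writer touches one file per invoice.
    static List<Invoice> assemble(List<String> invoiceNumbers) {
        List<Invoice> invoices = new ArrayList<>();
        for (String number : invoiceNumbers) {
            Invoice invoice = new Invoice(number);
            invoice.billing = findBillingOrder(number);
            for (Shipping shipping : findShippings(invoice.billing)) {
                invoice.billing.shippingList.add(shipping);
                // ...the same pattern continues for Product and UserDetail
            }
            invoices.add(invoice);
        }
        return invoices;
    }

    public static void main(String[] args) {
        List<Invoice> graph = assemble(List.of("1001", "1002"));
        System.out.println(graph.size() + " complete invoices, first has "
                + graph.get(0).billing.shippingList.size() + " shippings attached");
    }
}
```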
If you've read this thoroughly, you might have one question. Why didn't Jason Kidd shoot this well early in his career? Kidding. You would be asking, "Where does the config file get read?" What you need to understand about steps is that there is exactly one instance of the reader and exactly one instance of the writer per step execution. For example, say there are 500 rows of invoices and you set the chunk threshold to 100. When the step starts, a new instance of the reader (InvoiceReader) is created. After 100 items have been read and sent for writing, the same invoiceReader instance is used again and the next hundred are sent; how else would it know to process the next 100? This happens five times, and on the sixth read a null is returned (not an exception; you really do need to return null) to mark the end of the step. Since only one instance of the reader (and writer) exists, you can create an instance-level variable in the reader, like boolean isFirstRead, and check it on each read. The code to read the config file goes inside that if condition, so in the subsequent cycles the config file is not read again, and you do not start over from the top.
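The single-instance reader trick above can be sketched like this (again plain Java mirroring the ItemReader.read() contract, not Spring Batch itself; loadConfigAndSelectRows is a hypothetical stand-in for our XML-config parsing and DB selection):

```java
import java.util.Iterator;
import java.util.List;

public class OneShotConfigReader {

    // Because ONE reader instance serves the whole step, this flag lets us
    // load the config exactly once, on the very first read() call.
    private boolean isFirstRead = true;
    private Iterator<String> invoiceIterator;
    private final List<String> allRows;

    OneShotConfigReader(List<String> allRows) { this.allRows = allRows; }

    // Stand-in for the real work: parse the XML config, decide which invoice
    // range or date range applies, and select the matching rows from the DB.
    private List<String> loadConfigAndSelectRows() {
        return allRows;
    }

    // Mirrors ItemReader.read(): one item per call, null marks end of step.
    public String read() {
        if (isFirstRead) {
            isFirstRead = false; // this branch never runs again
            invoiceIterator = loadConfigAndSelectRows().iterator();
        }
        return invoiceIterator.hasNext() ? invoiceIterator.next() : null;
    }

    public static void main(String[] args) {
        OneShotConfigReader reader = new OneShotConfigReader(List.of("1001", "1002"));
        String item;
        while ((item = reader.read()) != null) {
            System.out.println("read " + item);
        }
        System.out.println("end of step (read() returned null)");
    }
}
```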
So that, as they say, is that. I wanted to write pseudo code to explain the loops, but I kept on writing and didn't realize that I wasn't.