Spring Batch - Chunk processing
One of the great advantages of Spring Batch is its Chunk-oriented processing.
This time, let's take a look at what Chunk-oriented processing is.
What is a Chunk?
In Spring Batch, a Chunk refers to the number of rows processed between each commit when working through a data set. In other words, Chunk-oriented processing means reading data one item at a time, accumulating the items into a Chunk, and then executing the transaction in chunk units.
The transaction is the important part here.
Because the transaction is executed per chunk, a failure rolls back only the current chunk, and everything committed in earlier transactions stays applied.
Since chunk-oriented processing ultimately means the data is processed in chunk units, it can be pictured as follows.
Note that the figure in the official documentation only shows the flow of individual items; the picture above differs slightly because it also covers the chunk unit.
- The Reader reads one item of data.
- The Processor processes the item that was read.
- The processed items are collected in a separate buffer; once a chunk's worth has accumulated, they are handed to the Writer, which stores them in one batch.
The key thing to remember is that the Reader and Processor handle items one at a time, while the Writer processes an entire Chunk at once.
If you express Chunk-oriented processing in Java code, it looks like this:
for (int i = 0; i < totalSize; i += chunkSize) { // advance one chunk at a time
    List<Object> items = new ArrayList<>();
    for (int j = 0; j < chunkSize; j++) {
        Object item = itemReader.read();                    // read one item
        Object processedItem = itemProcessor.process(item); // process one item
        items.add(processedItem);
    }
    itemWriter.write(items); // write the whole chunk in one batch
}
Does the meaning of processing by chunkSize make sense now?
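For reference, the chunk size is the value you declare when building a chunk-based step. Below is a minimal sketch assuming Spring Batch 4's StepBuilderFactory inside a @Configuration class; the step name and the reader/processor/writer beans are illustrative placeholders.

@Bean
public Step exampleStep() {
    return stepBuilderFactory.get("exampleStep")
            .<Object, Object>chunk(10) // chunkSize = 10: commit every 10 items
            .reader(itemReader())
            .processor(itemProcessor())
            .writer(itemWriter())
            .build();
}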
Now, let's see how Chunk-oriented processing actually works by looking at Spring Batch's internal code.
A Peek at ChunkOrientedTasklet
The class that covers the entire logic of Chunk-oriented processing is ChunkOrientedTasklet; the class name alone tells you at a glance what it does.
The method to look at closely is execute(); the entire chunk-handling flow lives there.
- chunkProvider.provide() fetches data from the Reader, as many items as the chunk size.
- chunkProcessor.process() processes the data received from the Reader (Processor) and stores it (Writer), as sketched below.
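The sketch below condenses that execute() flow; it is a conceptual outline rather than the actual source, which differs in details (retries, buffering of inputs in the chunk context) across versions.

// Conceptual outline of ChunkOrientedTasklet.execute() (simplified)
Chunk<I> inputs = chunkProvider.provide(contribution);  // Reader: accumulate up to chunk size
chunkProcessor.process(contribution, inputs);           // Processor + Writer for the whole chunk
chunkProvider.postProcess(contribution, inputs);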
If you step into chunkProvider.provide() to see how the data is actually fetched, it looks like this.
read() is called repeatedly until inputs has accumulated ChunkSize items.
If you look inside read(), you'll see it in turn calls ItemReader.read().
In other words, provide()'s job is to look up items one by one via ItemReader.read() and stack them until the chunk size is reached.
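As a rough illustration, that accumulation loop boils down to the hypothetical helper below, which assumes reading stops when the reader returns null; it is a simplification, not the real SimpleChunkProvider source.

// A hypothetical helper showing how provide() fills a chunk (simplified)
private <I> List<I> readChunk(ItemReader<I> itemReader, int chunkSize) throws Exception {
    List<I> inputs = new ArrayList<>();
    for (int i = 0; i < chunkSize; i++) {
        I item = itemReader.read();   // delegates to ItemReader.read(), one item at a time
        if (item == null) {           // null signals that there is no more data
            break;
        }
        inputs.add(item);
    }
    return inputs;
}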
Now let's see how this accumulated data is processed and stored.
A Peek at SimpleChunkProcessor
ChunkProcessor is responsible for the Processor and Writer logic. Since it is an interface, it needs a concrete implementation, and the default is SimpleChunkProcessor.
Looking at this class shows in detail how Spring Batch handles a chunk.
The core logic lives in process(); its code is below.
It takes Chunk<I> inputs as a parameter. This is the data that chunkProvider.provide() accumulated up to the ChunkSize, as we saw earlier.
transform() passes each item of inputs to doProcess() and receives the converted values.
The data processed through transform() is then stored in a batch via write().
write() may save the data or hand it off to an external API; that depends on how the developer implemented the ItemWriter.
Inside transform(), doProcess() is called in a loop, and that method delegates to the ItemProcessor's process().
If there is no ItemProcessor, doProcess() returns the item as-is; otherwise the item is processed by the ItemProcessor's process() and the result is returned.
Finally, the processed data is written by SimpleChunkProcessor's doWrite() call, as shown above.
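The hypothetical helper below condenses that process() flow; it is a conceptual sketch assuming Spring Batch 4's List-based ItemWriter.write() signature, not the actual SimpleChunkProcessor source.

// Hypothetical condensation of SimpleChunkProcessor.process() (simplified)
private <I, O> void processChunk(List<I> inputs,
                                 ItemProcessor<I, O> itemProcessor,
                                 ItemWriter<O> itemWriter) throws Exception {
    List<O> outputs = new ArrayList<>();
    for (I item : inputs) {
        O output = itemProcessor.process(item); // doProcess(): one item at a time
        if (output != null) {                   // a null result filters the item out
            outputs.add(output);
        }
    }
    itemWriter.write(outputs);                  // doWrite(): the whole chunk in one batch
}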
We have now walked through the actual code behind Chunk-oriented processing and seen how it is handled.
Next, let's clear up a common misunderstanding about ChunkSize.
Page Size vs Chunk Size
If you have used Spring Batch before, you have probably used a PagingItemReader a lot. Some PagingItemReader users assume Page Size and Chunk Size mean the same thing, but they do not.
Chunk Size is the unit processed in one transaction, while Page Size is the number of items fetched in one query.
Now let's look at the actual Spring Batch ItemReader code to see how the two differ.
First, look at the read() method of AbstractItemCountingItemStreamItemReader, the parent class of PagingItemReader.
doRead() calls doReadPage() when there is no data left to read or when the page size has been exceeded.
"No data to read" means the very first read. "Page size exceeded" means, for example, that the page size is 10 but the next item to read is the 11th: the current page is exhausted, so doReadPage() is called to fetch the next one.
In other words, data is fetched in page-sized slices. If you think of the paged list view of a bulletin board, it is easy to picture.
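The sketch below captures that branching; it simplifies AbstractPagingItemReader.doRead(), and the real source differs in details (synchronization, counters) between versions.

// Simplified view of the paging read loop (conceptual, not actual source)
protected T doRead() throws Exception {
    if (results == null || current >= getPageSize()) { // first read, or current page exhausted
        doReadPage();                                  // fetch the next page
        page++;
        if (current >= getPageSize()) {
            current = 0;
        }
    }
    int next = current++;
    return next < results.size() ? results.get(next) : null; // null = no more data
}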
From doReadPage() onward, each sub-implementation builds its paging query in its own way. Here we will look at the commonly used JpaPagingItemReader.
It creates a paging query (createQuery()) using the offset and limit derived from the Page Size specified on the Reader, and executes it (query.getResultList()).
The query results are stored in results, and each time read() is called, an item is taken out of results and handed over one by one.
In other words, Page Size is simply the value that sets the page size of the paging query.
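Conceptually, the paging query comes down to the standard JPA offset/limit calls; the snippet below is a simplified illustration of that core, not the actual JpaPagingItemReader source.

// Conceptual core of doReadPage() in a JPA paging reader (simplified)
Query query = entityManager.createQuery(queryString)
        .setFirstResult(getPage() * getPageSize())  // offset = page * pageSize
        .setMaxResults(getPageSize());              // limit  = pageSize
results = query.getResultList();                    // read() then serves these one by one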
What if the two values are different?
If the Page Size is 10 and the Chunk Size is 50, the ItemReader performs 5 page reads for each transaction that processes one Chunk. That is 5 queries for a single transaction, which can become a performance problem. This is why Spring Batch's PagingItemReader carries the following comment at the top of the class:
Setting a fairly large page size and using a commit interval that matches the page size should provide better performance.
Beyond the performance issue, if the two values differ and you are using JPA, the persistence context breaks.
(We have written up that problem before; please refer to that post.)
Although the two values mean different things, because of the issues above it is generally best to set them to the same value, which is the recommended practice.
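For example, you can define a single constant and use it for both values. The sketch below assumes Spring Batch 4's builder APIs; the names (CHUNK_SIZE, the Pay entity, the query string) are illustrative placeholders.

private static final int CHUNK_SIZE = 100; // one value for both Chunk Size and Page Size

@Bean
public Step pagingStep() {
    return stepBuilderFactory.get("pagingStep")
            .<Pay, Pay>chunk(CHUNK_SIZE)   // Chunk Size: the commit interval
            .reader(payPagingReader())
            .writer(items -> System.out.println("written " + items.size() + " items"))
            .build();
}

@Bean
public JpaPagingItemReader<Pay> payPagingReader() {
    return new JpaPagingItemReaderBuilder<Pay>()
            .name("payPagingReader")
            .entityManagerFactory(entityManagerFactory)
            .queryString("SELECT p FROM Pay p")
            .pageSize(CHUNK_SIZE)          // Page Size: rows fetched per query
            .build();
}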