Apache Beam Performance: Python vs. Java on GCP Dataflow
We have Beam data pipelines running on GCP Dataflow written in both Python and Java. Initially we had some simple, straightforward Python Beam jobs that worked very well, so we recently decided to port more of our Java Beam jobs to Python. With more complicated jobs, however, especially ones that require windowing, we noticed that the Python jobs are significantly slower than the equivalent Java jobs: they end up using more CPU and memory and cost much more.
Some sample Python code looks like:
step1 = (
    read_from_pub_sub
    | "MapKey" >> beam.Map(lambda elem: (elem.data[key], elem))
    | "WindowResults"
    >> beam.WindowInto(
        beam.window.SlidingWindows(360, 90),
        allowed_lateness=args.allowed_lateness,
    )
    | "GroupById" >> beam.GroupByKey()
)
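For scale, note that `SlidingWindows(360, 90)` assigns every element to size/period = 4 overlapping windows, so the `GroupByKey` shuffles roughly 4x the input volume. A minimal pure-Python sketch of the sliding-window assignment (no Beam dependency; window starts aligned to multiples of the period, matching the SDK's default zero offset):

```python
def assign_sliding_windows(timestamp, size=360, period=90):
    """Return the [start, end) windows containing `timestamp`,
    mirroring Beam's SlidingWindows assignment with offset 0."""
    # The most recent window start at or before the timestamp.
    last_start = timestamp - (timestamp % period)
    # Walk backwards one period at a time until the window no
    # longer covers the timestamp.
    return [
        (start, start + size)
        for start in range(int(last_start), int(last_start) - size, -period)
    ]

# An element at t=400s lands in size/period = 4 overlapping windows:
print(assign_sliding_windows(400))
# → [(360, 720), (270, 630), (180, 540), (90, 450)]
```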
And the Java code looks like:
PCollection<KV<String, Iterable<DataStructure>>> step1 =
    message
        .apply(
            "MapKey",
            MapElements.into(
                    TypeDescriptors.kvs(
                        TypeDescriptors.strings(), TypeDescriptor.of(DataStructure.class)))
                .via(event -> KV.of(event.key, event)))
        .apply(
            "WindowResults",
            Window.<KV<String, DataStructure>>into(
                    SlidingWindows.of(Duration.standardSeconds(360))
                        .every(Duration.standardSeconds(90)))
                .withAllowedLateness(Duration.standardSeconds(this.allowedLateness))
                .discardingFiredPanes())
        .apply("GroupById", GroupByKey.<String, DataStructure>create());
We noticed Python always uses about 3 times more CPU and memory than Java. We ran some experimental tests with just JSON input and JSON output and saw the same results. We are not sure whether this is simply because Python is, in general, slower than Java, or whether GCP Dataflow executes Beam Python and Java pipelines differently. Any similar experiences, tests, or explanations of why this happens are appreciated.
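For context, here is a minimal sketch of the Dataflow pipeline options for the Python streaming job (the project/region values are placeholders, and the Runner v2 and harness-thread settings are the commonly mentioned tuning knobs for the Python SDK's worker model, not settings we have validated):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region; experiments and thread count are
# illustrative tuning knobs for the Python SDK workers.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    streaming=True,
    experiments=["use_runner_v2"],        # Runner v2 execution engine
    number_of_worker_harness_threads=12,  # threads per SDK worker process
)
```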
from Recent Questions - Stack Overflow https://ift.tt/33RkTrP