Java-based development involves working with a lot of open source software. It is very easy to start using an open source tool, but each one needs careful examination and evaluation. One of the things I have done in the recent past is work with GATE, an open source toolkit for text processing. While I am not an expert in text processing or machine learning, I had to deal with it while designing a larger messaging module in a system.
One of the machine learning systems that we (at qubit) were developing using GATE was too slow. After some investigation we found that GATE controllers took a LOT of time (several seconds to a minute) to initialise. These controllers were reading some XML files, and GATE was then trying to build an FSM from them, I was told. Our task involved invoking these controllers thousands to millions of times across several machines. We had two problems:
1. Controllers took a lot of time to initialise.
2. We were creating a countless number of controllers, one for each invocation, and throwing them away once the task was done.
Let's see why problem 2 above is bigger than it seems at first glance. GATE ships as a standalone tool, and it can also be used in embedded mode (meaning one can use GATE as an independent module/service inside a JVM). In embedded mode one uses controllers to create GATE applications. These controllers come in different types, and a controller internally loads what GATE calls a Processing Resource. Processing resources can be things like named entity extraction rules, which in GATE one can write in JAPE files.
The way GATE (or the subsystem that deals with JAPE rules) uses these rules is to convert each rule into a dynamic Java class at runtime. With each controller creation, GATE creates new dynamic classes! These dynamic classes are all the same (well, I guess) for different objects of the same type. And this is where the problem lies. If one keeps creating and throwing away these objects like normal objects, one runs out of memory. And the memory here is PermGen space, not the heap.
Since the system keeps creating dynamic classes and they are never unloaded, after some time we run out of PermGen space. With several hundred thousand tasks, a machine would run out of PermGen space quickly. We had a memory leak.
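A leak of this kind shows up in the JVM's own class-loading counters: the total number of loaded classes keeps climbing while the number of unloaded classes stays flat. Here is a minimal sketch of reading those counters from inside the process with the standard java.lang.management API. The class name ClassLoadWatch and its method names are made up for illustration and have nothing to do with GATE; the memory pool names it matches on are an assumption, since they vary by JVM version and garbage collector.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of GATE: samples class-loading statistics.
final class ClassLoadWatch {

    /** One-line summary of the JVM's class-loading counters. */
    static String classCounters() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return "loaded=" + bean.getLoadedClassCount()
                + " totalLoaded=" + bean.getTotalLoadedClassCount()
                + " unloaded=" + bean.getUnloadedClassCount();
    }

    /**
     * Used bytes of the pool that holds class metadata, or -1 if none is found.
     * Assumption: the pool name contains "Perm Gen" (older HotSpot JVMs) or
     * "Metaspace" (Java 8 and later); other JVMs may use different names.
     */
    static long classMetadataUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Perm Gen") || name.contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    return usage.getUsed();
                }
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println(classCounters());
        System.out.println("class metadata bytes used: " + classMetadataUsedBytes());
    }
}
```

Logging these two values every few minutes would show totalLoaded growing with each new controller while unloaded stays put, which is the pattern described above.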
BTW, looking at jconsole on a remote AWS instance can be a challenge in itself. I found an excellent solution with a SOCKS proxy here.
For 1, we were limited in our corrective steps: it had to do with the internals of GATE, and we were not ready to commit any time to looking into what was going on inside GATE.
For 2, the immediate short-term fix was to increase the size of the PermGen space by passing a larger value at JVM startup (on HotSpot, the -XX:MaxPermSize option).
But this usually only postpones the inevitable. As a more robust solution, we decided to use an object pool. We decided to create only a few (around 5-10) controller objects, since at any given time we would be running only 5-10 threads concurrently for the tasks. We chose Apache Commons Pool as we wanted to avoid reinventing the wheel. Commons Pool offers fantastic facilities. One such feature is GenericKeyedObjectPool. It acts as a map of pools, which is exactly what we wanted: a pool of objects for each type of GATE controller. Each controller would load its corresponding file at initialisation, and our deployment strategy would throw tasks at N machines, so all the machines would need to have these controller objects in their respective pools. So far so good. Once the pooling mechanism was in place, the PermGen errors vanished, or at least we thought so…
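To make the "map of pools" idea concrete, here is a minimal sketch in plain Java. It is a stand-in that uses only the JDK, not the Commons Pool API and not the code we actually ran: KeyedPool, borrow, giveBack and createdCount are names invented for this illustration, and the factory function plays the role of the expensive controller initialisation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a keyed object pool: one idle queue per key.
final class KeyedPool<K, V> {
    private final Map<K, Deque<V>> idle = new HashMap<>();
    private final Function<K, V> factory; // assumed to be the expensive step
    private int created = 0;

    KeyedPool(Function<K, V> factory) {
        this.factory = factory;
    }

    /** Reuses an idle object for this key if there is one, else creates one. */
    synchronized V borrow(K key) {
        Deque<V> queue = idle.computeIfAbsent(key, k -> new ArrayDeque<>());
        V pooled = queue.pollFirst();
        if (pooled != null) {
            return pooled;
        }
        created++;
        // Simplification: creation happens under the lock, so a slow factory
        // blocks other borrowers. A real pool handles this more carefully.
        return factory.apply(key);
    }

    /** Hands an object back so the next borrower of the same key can reuse it. */
    synchronized void giveBack(K key, V value) {
        idle.computeIfAbsent(key, k -> new ArrayDeque<>()).addFirst(value);
    }

    /** How many objects the factory has produced so far, across all keys. */
    synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        // The keys stand for controller types; the values for controllers.
        KeyedPool<String, Object> pool = new KeyedPool<>(type -> new Object());
        Object first = pool.borrow("ner-rules");
        pool.giveBack("ner-rules", first);
        Object second = pool.borrow("ner-rules"); // reused, nothing new created
        System.out.println(first == second);
        System.out.println(pool.createdCount());
    }
}
```

The point of the keyed variant is that a task asking for one controller type never gets handed a controller that was initialised for another, while each type still gets reused instead of rebuilt.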
Until it came back again! This time after the server had been running for 5-7 days.
We were out of PermGen space again. It was time to burn the midnight oil again. Now this is a situation which shatters a developer. He was handed a tough problem, he thought he understood it well and found a fix, and the fix seemed to work for some time. The doubts that now creep in are: did he actually understand the problem? If yes, did he apply the right fix? If yes, why is this failing now?
It was back to square one. The first task was to reproduce the problem, which I did by running a huge number of tasks in a short time and reducing the PermGen space on one of the servers. Assured that the problem was in fact happening, I started looking into heap dumps to verify that it was those dynamic classes that were being created. This confirmed that new controller objects were still being created, only at a slower pace.
I kept looking into almost every aspect of the task flow to find what was causing this now, when with an object pool in place no more controllers should have been created than the configured pool size.
Finally, while looking at the PermGen graph in jconsole, I saw a pattern. New objects were being created after a set interval of time. And it struck me that the objects were maybe getting evicted from the pool for some reason. Indeed. There are several config options for the object pool. Two of them were:
maxActive and maxIdle
I had configured it in such a way that maxActive was larger than maxIdle; for example, my maxActive was 10 and maxIdle was 5. So at peak times the pool would go ahead and create 10 objects, and at non-peak times 5 of those objects would be evicted. At peak time the next day or the next run, 5 new objects would be created, and so on. So this eviction strategy was causing the leak now. I set maxActive and maxIdle to the same value to solve it.
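The arithmetic is easy to check with a toy simulation. The sketch below is not Commons Pool itself; it only assumes the behaviour described above, namely that a borrow creates a new object when nothing is idle and that a returned object is dropped once maxIdle objects are already idle. IdleCapSimulation and objectsCreated are hypothetical names, and each "peak cycle" borrows maxActive objects at once and then returns them all.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of a pool with an idle cap; counts how many objects get created.
final class IdleCapSimulation {

    static int objectsCreated(int maxActive, int maxIdle, int peakCycles) {
        Deque<Object> idle = new ArrayDeque<>();
        int created = 0;
        for (int cycle = 0; cycle < peakCycles; cycle++) {
            // Peak: maxActive objects are in use at the same time.
            List<Object> borrowed = new ArrayList<>();
            for (int i = 0; i < maxActive; i++) {
                Object pooled = idle.pollFirst();
                if (pooled == null) {
                    pooled = new Object(); // stands for building a controller
                    created++;
                }
                borrowed.add(pooled);
            }
            // Off-peak: everything is returned, but only maxIdle are kept.
            for (Object returned : borrowed) {
                if (idle.size() < maxIdle) {
                    idle.addFirst(returned);
                }
                // Otherwise the object is simply dropped (evicted).
            }
        }
        return created;
    }

    public static void main(String[] args) {
        System.out.println(objectsCreated(10, 5, 3));  // prints 20
        System.out.println(objectsCreated(10, 10, 3)); // prints 10
    }
}
```

With maxActive 10 and maxIdle 5, every peak after the first builds 5 fresh objects, so the count grows without bound as the server keeps running. With both set to 10 it stops at 10, which is what I wanted from the pool in the first place.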
While I was right to use a third-party solution, I failed to fully understand its configuration options and their impact on my specific problem. What looked like an ideal thing to do (keeping only a few active objects) turned out to be the root of a problem.