Example: call ID 1fe19f9aadd645eed3e73baf6f6070c1@141.31.8.62
plays the welcome prompt and then hangs up since the ASR initialization is failing:
2015-08-19 20:08:42,256 DEBUG ~.~.~.~.~.RTPRecogChannel [Thread-1396]: Borrowing recognition engine from object pool...
java.util.NoSuchElementException: Pool exhausted
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1110)
at org.speechforge.cairo.server.recog.RecogInterface.activateRecEngine(RecogInterface.java:68)
at org.speechforge.cairo.server.recog.RTPRecogChannel.recognize(RTPRecogChannel.java:191)
at org.speechforge.cairo.server.recog.MrcpRecogChannel.recognize(MrcpRecogChannel.java:215)
at ...
This issue is found 149 times in the DB:
select count(1) from haleflogs where message like '%java.util.NoSuchElementException: Pool exhausted%';
All the 148 calls found suffering from this problem ended up not completed:
create table tmpC2 as select distinct b.cairoCallId from haleflogs a, callsExt b where message like '%java.util.NoSuchElementException: Pool exhausted%' and a.id between b.minId and b.maxId and a.machineIp = b.cairoMachineIp;
select code, count(1) from completeCalls where callId in (select * from tmpC2) group by code;
The problem is happening throughout extensions:
+-----------+----------+
| extension | count(1) |
+-----------+----------+
| 1010 | 2 |
| 1111 | 32 |
| 2222 | 36 |
| 3333 | 19 |
| 4444 | 22 |
| 7703 | 6 |
| 7704 | 3 |
| 7709 | 25 |
| 7711 | 3 |
+-----------+----------+
select extension, count(1) from completeCalls where callId in (select * from tmpC2) group by extension;
on different Halef instances:
+-------------+----------+
| cairoIp | count(1) |
+-------------+----------+
| 141.31.8.62 | 123 |
| 141.31.8.74 | 25 |
+-------------+----------+
select cairoIP, count(1) from completeCalls where callId in (select * from tmpC2) group by cairoIP;
and consistently over time. We had 71 such calls in August so far of altogether 357 accounting for 20%.
It looks as if the hang-up caused by this issue happens either before or after the first recognition:
+---+----------+
| b | count(1) |
+---+----------+
| 0 | 69 |
| 1 | 79 |
+---+----------+
select cnt-1 b, count(1) from (select callId, count(1) cnt from events where callId in (select * from tmpC2) and event in ('VXML URL', 'Recognition Hypothesis') group by callId) a group by b;
Yesterday, there were 16 production calls:
select count(1) from completeCalls where timestamp > '2015-08-19 10:00:00' and extension = '2222';
out of which 6 were not completed:
select count(1) from completeCalls where timestamp > '2015-08-19 10:00:00' and extension = '2222' and code is null;
all of them associated with the pool being exhausted:
select count(1) from completeCalls where timestamp > '2015-08-19 10:00:00' and extension = '2222' and callId in (select * from tmpC2);
You can select log information of individual calls by running either of the following:
select * from hli where callId = '1fe19f9aadd645eed3e73baf6f6070c1@141.31.8.62';
(displaying only the leading 100 chars of log messages)
select * from hli2 where callId = '1fe19f9aadd645eed3e73baf6f6070c1@141.31.8.62';
(displaying the full message)
+-----------------------------+----------+
| round(timestamp/1000000, 0) | count(1) |
+-----------------------------+----------+
| 20150619 | 1 |
| 20150620 | 2 |
| 20150701 | 35 |
| 20150704 | 2 |
| 20150705 | 1 |
| 20150727 | 3 |
| 20150729 | 17 |
| 20150730 | 2 |
| 20150731 | 14 |
| 20150803 | 2 |
| 20150804 | 4 |
| 20150805 | 6 |
| 20150806 | 14 |
| 20150807 | 3 |
| 20150808 | 1 |
| 20150811 | 2 |
| 20150813 | 5 |
| 20150814 | 5 |
| 20150815 | 2 |
| 20150817 | 15 |
| 20150818 | 5 |
| 20150819 | 7 |
+-----------------------------+----------+
drop table tmpC2;
create table tmpC2 as select distinct b.cairoCallId from haleflogs a, callsExt b where message like '%java.util.NoSuchElementException: Pool exhausted%' and a.id between b.minId and b.maxId and a.machineIp = b.cairoMachineIp;
select round(timestamp/1000000, 0), count(1) from completeCalls where callId in (select * from tmpC2) group by round(timestamp/1000000, 0);
Patrick applied a potential fix. So far, the issue has not appeared again. @Patrick, please explain what you did.
I force the RTPChannel to return a potentially existing engine back to the pool before fetching a new one.
cairo-VM/src/cairo/cairo-server/trunk/src/main/java/org/speechforge/cairo/server/recog/RTPChannel.java
192 _recInterface.returnRecEngine();
193 _recInterface.activateRecEngine(appType, grammarLocation, hotword);
There was a call today with the bug (5c048ecdab8fe4f514e0b1753fa433e0@141.31.8.61), but it was the Kaldi Halef TJR199/201. So, has the issue been fixed for good?
We are currently live on TJR214/215 with Ext 4444 and suffered three NULL calls so far:
drop table tmpI1;
create table tmpI1 (select cairoCallId from cr where date = 20150831 and halefcheck = 'PASSED' and c = 0 and extension = 4444);
It turns out that all of them were caused by the pool exhausted bug:
drop table tmpC2;
create table tmpC2 as select distinct b.cairoCallId from haleflogs a, callsExt b where message like '%java.util.NoSuchElementException: Pool exhausted%' and a.id between b.minId and b.maxId and a.machineIp = b.cairoMachineIp;
select * from tmpI1 where cairoCallId in (select * from tmpC2);
The bug fix was not propagated into TJR214/215 yet. Remarkably, there are no NPEs as of yet:
select * from hli where callId in (select * from tmpI1) and message like '%nullp%';
Huh. Is this bug and https://sourceforge.net/p/halef/tickets/83/ interrelated by any chance?