Monday, February 21, 2005

Aborts rate observations

I just noticed that the abort rate in the system decreases, up to a point, as I increase the offered load (i.e., the number of clients), and then rises again.

With the browsing mix, 120 clients, and 2 slave servers, I got an abort rate of 1.27%. Increasing the number of clients to 150 seemed to yield an abort rate of 0.30%, while increasing it to 250 caused an abort rate of 1.80%.

The observations with 8 slave databases are: 1.32% (500 clients), 1.87% (550 clients), 1.96% (600 clients). The throughput remained the same across the three configurations; only the latency changed a bit (by approximately 100 ms).


I will have to run step-function tests in order to determine the abort rate trend. The axes of change in this situation should be a) the offered load and b) the database size.

Friday, February 18, 2005

Throughput drop when switching from 1 master to master + 1 slave

I've also started looking into the annoying drop in throughput that occurs when switching from a configuration with a single node (the master only) to a configuration with two nodes (1 master + 1 slave).

Using the latest TPC-W PHP code and the fixed client emulator, what happens is (with 100 clients):

master only throughput is 105 WIPS
master + slave throughput is 55 WIPS

Initially, I suspected that the difference was caused by the absence of flushing and page faults in the master-only configuration. However, even running the master + 1 slave configuration with reads going to the master (i.e., the slave is unused, but the master still prepares flushes and sends them) seems to yield the same results.

I also looked into the hypothesis that a heavily loaded slave receives and processes flush packets more slowly than an unloaded one, which would delay the acknowledgement to the master. Since the master cannot commit another transaction before the previous one has completed, subsequent update transactions would have to wait. Unfortunately, this is not the case either: the flushes are processed very fast and the responses are received in a very short time.
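To make the reasoning behind that hypothesis concrete: if update commits are strictly serialized and each one waits for the slave's acknowledgement, the update rate is bounded by the inverse of the per-transaction time. The sketch below is only a back-of-the-envelope model; the function name and the timing values are hypothetical, not measurements from the system.

```python
def max_update_rate(commit_time_s, ack_delay_s):
    """Upper bound on serialized update commits per second.

    Assumes one commit at a time, each waiting for the slave's
    acknowledgement before the next commit can start.
    """
    return 1.0 / (commit_time_s + ack_delay_s)


# Hypothetical numbers: a slow acknowledgement lowers the bound,
# a fast one leaves it close to the master-only case.
slow_ack = max_update_rate(commit_time_s=0.005, ack_delay_s=0.020)
fast_ack = max_update_rate(commit_time_s=0.005, ack_delay_s=0.001)
```

Under this model a delayed acknowledgement would show up directly as a lower commit rate, which is why fast acknowledgements rule the hypothesis out.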

I am planning to look into the scheduler and the proxy source code to find out whether some legacy code could be causing transactions to be delayed for some reason. However, since I removed all the unused stuff, the chance of this being the problem is small.

Client Emulator and TPC-W PHP Code Problems

Client Emulator

At the beginning of this week (Feb 13), I discovered that even with the mainly read-only Browsing mix of the TPC-W benchmark, the slave nodes were unevenly loaded, whereas the master node was not yet saturated with updates (judging by the list of pending update transactions).

Gokul and I investigated and found a problem in the workload-generating loop. Upon return from select, only one socket was processed, even if more than one was ready. Since this is the last step of the loop, control then returns to the send phase, which, by design, sends a new request only once the previous one has been processed. Thus, with the above problem, few or no requests were submitted before control proceeded to the receive phase, which again processed just one socket. This caused the offered load to be really low and uneven.

Fixing this problem seems to generate a more even load.
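The idea of the fix, sketched in Python rather than in the emulator's own code: after a single select call, every socket reported as readable is handled, not just the first one. The names drain_ready and demo are hypothetical, and local socket pairs stand in for the real client connections.

```python
import select
import socket


def drain_ready(sockets, timeout=1.0, bufsize=4096):
    """Wait once in select(), then read from every readable socket."""
    readable, _, _ = select.select(sockets, [], [], timeout)
    responses = {}
    for sock in readable:  # the buggy loop handled only one entry here
        responses[sock] = sock.recv(bufsize)
    return responses


def demo():
    """Return how many of three pending replies one pass picks up."""
    # Three connected socket pairs stand in for client/server connections.
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        for i, (_, server_end) in enumerate(pairs):
            server_end.sendall(b"reply %d" % i)
        client_ends = [client_end for client_end, _ in pairs]
        return len(drain_ready(client_ends))
    finally:
        for client_end, server_end in pairs:
            client_end.close()
            server_end.close()
```

With all ready sockets drained in one pass, the following send phase finds every one of those connections free to issue its next request.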


TPC-W PHP Implementation

Gokul came across code in the PHP implementation of the TPC-W benchmark that depended on obsolete functions for accessing the GET request parameters. This caused execution errors in the refreshCart and deleteShoppingCart queries, which in practice means that they never got executed. Thus, the offered update workload was lower than the one specified by TPC-W.

Fixing this problem in the pages increased the abort rate to a maximum of 2.7% for the shopping mix with 4 slaves.

From the brief run that I made, it turned out that the abort rate increased as the number of slave databases increased from 1 to 4 and then dropped with 8, which does not make sense: 1 slave (1.35%), 2 slaves (2.22%), 4 slaves (2.70%), 8 slaves (0.80%).

Currently, I am looking into this problem.

Thursday, February 17, 2005

My first post

This is my first post - just testing.