This file includes a description of several sample command-line
invocations along with guidance on how to interpret output file
results. 

The following script can be used to compare the performance of
Parallel, G1, and Shenandoah garbage collectors on the same workload
with different memory sizes.  The GC log file from the parallel GC run
is most useful in determining application allocation rates and live
memory usage.  It is more difficult to extract this information from
G1 and Shenandoah because the states of individual objects are in flux
at each moment that the log reports for the concurrent G1 and
Shenandoah collectors are issued. 

#!/bin/sh
for i in 256M 384M 512M 640M 768M 896M 1024M \
         1152M 1280M 1408M 1536M 1664M 1792M 1920M 2048M
do

echo Running Parallel GC with $i heap size
>&2 echo Running Parallel GC with $i heap size

$JAVA_HOME/bin/java \
     -showversion \
     -Xlog:gc:batch.parallel-gc.$i.log \
     -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:-UseBiasedLocking \
     -XX:+DisableExplicitGC -Xms$i -Xmx$i \
     -XX:+UseParallelGC -XX:ParallelGCThreads=8 \
     -jar $EXTREMEM_HOME/extremem.jar \
     -jar src/main/java/extremem.jar \
     -dDictionarySize=50000 -dCustomerThreads=20000 -dReportCSV=true

echo Running G1 GC with $i heap size
>&2 echo Running G1 GC with $i heap size

$JAVA_HOME/bin/java \
     -showversion \
     -Xlog:gc:batch.g1-gc.$i.log \
     -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:-UseBiasedLocking \
     -XX:+DisableExplicitGC -Xms$i -Xmx$i \
     -jar $EXTREMEM_HOME/extremem.jar \
     -dDictionarySize=50000 -dCustomerThreads=20000 -dReportCSV=true

echo Running Shenandoah GC with $i heap size
>&2 echo Running Shenandoah GC with $i heap size

$JAVA_HOME/bin/java \
     -showversion \
     -Xlog:gc:batch.shenandoah-gc.$i.log \
     -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:-UseBiasedLocking \
     -XX:+DisableExplicitGC -Xms$i -Xmx$i \
     -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC \
     -XX:+ShenandoahPacing -XX:ShenandoahPacingMaxDelay=1 \
     -jar $EXTREMEM_HOME/extremem.jar \
     -dDictionarySize=50000 -dCustomerThreads=20000 -dReportCSV=true

done     

Here's a different script that represents a workload configuration
that keeps more memory alive and thus requires larger heap size
configurations of the JVM.  In this code, the first loop runs with
Parallel GC for the purpose of calibrating the workload's allocation
rates and live-memory usage.  The second loop provides comparisons
between G1 and Shenandoah GC on this workload.

#!/bin/sh
for i in 30G 48G
do

echo Running Parallel GC with $i heap size on huge workload
>&2 echo Running Parallel GC with $i heap size on huge workload

$JAVA_HOME/bin/java \
     -showversion \
     -Xlog:gc:batch.calibrate-shen.parallel-gc.$i.log \
     -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:-UseBiasedLocking \
     -XX:+DisableExplicitGC -Xms$i -Xmx$i \
     -XX:+UseParallelGC -XX:ParallelGCThreads=16 \
     -jar $EXTREMEM_HOME/extremem.jar \
     -dDictionarySize=500000 -dNumCustomers=10000 -dNumProducts=200000 \
     -dCustomerReplacementCount=40 -dCustomerThreads=20000 \
     -dServerThreads=100 -dProductReplacementCount=8 \
     -dProductReplacementPeriod=36s \
     -dProductNameLength=25 -dProductDescriptionLength=120 \
     -dProductReviewLength=32 -dSelectionCriteriaCount=8 \
     -dBuyThreshold=0.25 -dSaveForLaterThreshold=0.75 \
     -dBrowsingExpiration=1m -dSimulationDuration=40m \
     -dInitializationDelay=100ms -dReportCSV=true

done     

for i in 30G 32G 48G 64G 96G 112G
do

echo Running G1 GC with $i heap size on huge workload
>&2 echo Running G1 GC with $i heap size on huge workload

$JAVA_HOME/bin/java \
     -showversion \
     -Xlog:gc:batch.show-shen.g1-gc.$i.log \
     -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:-UseBiasedLocking \
     -XX:+DisableExplicitGC -Xms$i -Xmx$i \
     -jar $EXTREMEM_HOME/extremem.jar \
     -dDictionarySize=500000 -dNumCustomers=10000 -dNumProducts=200000  \
     -dCustomerReplacementCount=40 -dCustomerThreads=20000 \
     -dServerThreads=100 -dProductReplacementCount=8 \
     -dProductReplacementPeriod=36s \
     -dProductNameLength=25 -dProductDescriptionLength=120 \
     -dProductReviewLength=32 -dSelectionCriteriaCount=8 \
     -dBuyThreshold=0.25 -dSaveForLaterThreshold=0.75 \
     -dBrowsingExpiration=1m -dSimulationDuration=20m \
     -dInitializationDelay=100ms -dReportCSV=true

echo Running Shenandoah GC with $i heap size on huge workload
>&2 echo Running Shenandoah GC with $i heap size on huge workload

$JAVA_HOME/bin/java \
     -showversion \
     -Xlog:gc:batch.show-shen.shenandoah-gc.$i.log \
     -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:-UseBiasedLocking \
     -XX:+DisableExplicitGC -Xms$i -Xmx$i \
     -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC \
     -XX:+ShenandoahPacing -XX:ShenandoahPacingMaxDelay=1 \
     -jar $EXTREMEM_HOME/extremem.jar \
     -dDictionarySize=500000 -dNumCustomers=10000 -dNumProducts=200000 \
     -dCustomerReplacementCount=40 -dCustomerThreads=20000 \
     -dServerThreads=100 -dProductReplacementCount=8 \
     -dProductReplacementPeriod=36s \
     -dProductNameLength=25 -dProductDescriptionLength=120 \
     -dProductReviewLength=32 -dSelectionCriteriaCount=8 \
     -dBuyThreshold=0.25 -dSaveForLaterThreshold=0.75 \
     -dBrowsingExpiration=1m -dSimulationDuration=20m \
     -dInitializationDelay=100ms -dReportCSV=true
done


The CSV report that is written to standard output independently
reports the timeliness of many different repetitive operations that
comprise the Extremem workload.  Certain of the operations are more
susceptible to randomness in the execution path lengths than others.
For example, in one configuration, the number of products available
for selection (based on lookups of randomly selected keywords within
the data base of all existing products) ranged from 376 to 945.  This
is represented by the following excerpt of standard output:

+-------------------------
|Products available for selection:
|max,945
|min,376
|average,609.597
+-------------------------

Given this, we would expect to see a three-fold variance in
execution times for product selection based on the size of the
selection set alone, independent of GC interference.  Following are
the lines in the output report that describe the times required to
process particular stepss of each "customer interaction".  Preparation
represents the time required to gather up and present to the customer
all of the products that matched the customer inquiry.  The following
excerpt from standard output reports on the times required to perform
this step.  Note that, out of a total of 600,000 customer
interactions, the time required to present product options ranged from
2 microseconds to 2.072.002 seconds.  This is far worse than the
3-fold variation that would have been predicted by differences in
problem size.  The additional variance is due primarily to interference
caused by various garbage collection activities.

+-------------------------
|Timeliness of preparation
|Total Measurement,Min,Max,Mean,Approximate Median
|600000,2,2072002,19753,9216
|Buckets in use,12
|Bucket Start,Bucket End, Bucket Tally
|-256,0,0
|0,1024,6981
|1024,5120,3158
|5120,13312,443628
|13312,29696,113963
|29696,46080,13483
|46080,78848,9486
|78848,144384,3469
|144384,275456,1864
|275456,537600,1345
|537600,1061888,1541
|1061888,2110464,1082
+-------------------------

Following the line of output that reports summary information, the
distribution of measured service response times is reported as a 
weighted histogram.  The intepretation of the above "bucket-list" data
is that:

 6,981 samples required less than 1.024 ms
 3,158 samples required less than 5,120 ms and at least 1.024 ms
 443,628 samples required less than 13.312 ms and at least 5.120 ms
 113,963 samples required less than 29.696 ms and at least 13.312 ms
 13,483 samples required less than 46.080 ms and at least 29.696 ms
 9,486 samples required less than 78.848 ms and at least 46.080 ms
 3,469 samples required less than 144.384 ms and at least 78.848 ms
 1,864 samples required less than 275.456 ms and at least 144.384 ms
 1,345 samples required less than 537.600 ms and at least 275.456 ms
 1,541 samples required less than 1.061888 s and at least 537.600 ms
 1,082 samples required less than 2.110464 s and at least 1.061888 s


Each of the service response times reported by Extremem represents
slightly different mixes of new-object allocation, read access to
previously allocated objects, and mutation of previoiusly allocated
objects.

Product replacement processing is a service that has a more
predictable workload.  Each time product replacement processing is
performed, the same number of randomly selected products are
replaced, as specified on the extremem command line.  Note that even
with product replacement processing, there is some expected variation
in processing effort, since the underlying representation of the
product data base is a balanced binary tree protected by a
multiple-reader single-writer mutual exclusion lock.  If a large
number of customers happen to be looking up products at the moment a
request to remove products arrives, this will force a delay on the
modifying thread while it waits for all the customer threads to
complete their lookup requests.  The standard output file also
provides information regarding contended access to shared data
structures.  For example, the following lines report that the typical
writer of the products data base (i.e. the typical attempt to replace
products) has to perform an average of 1.976 wait operations with the
total number of wait operations for any single writer ranging from 0
to 573.

+-------------------------
|Products concurrency report:
|Total reader requests,1199860
|requiring this many waits,2010665
|average waits,1.6757497
|ranging from,0
|to,783
|
|Total writer requests,160000
|requiring this many waits,316137
|average waits,1.9758563
|ranging from,0
|to,573
+-------------------------

It is sometimes interesting to observe that the same workload
configuration exhibits different concurrency contention under
different garbage collection approaches.  This is a secondary affect
that results when a thread that holds a lock becomes blocked waiting
for garbage collection services to be performed.  This is a form of
priority inversion in that garbage collection, typically running at a
lower priority that application threads, is able to cause high
priority application threads to delay their efforts.

For this same workload configuration, Product replacement processing
timeliness is represented by the following entries in the report
written to standard output:

+-------------------------
|Product replacement processing:
|batches,20000
|total,160000
|min per batch,8
|max per batch,8
|average per batch,8.0
|
|Total Measurement,Min,Max,Mean,Approximate Median
|20000,5,1436893,43450,10496
|Buckets in use,12
|Bucket Start,Bucket End, Bucket Tally
|-256,256,114
|256,2304,18
|2304,6400,1
|6400,14592,13535
|14592,30976,2225
|30976,47360,1037
|47360,80128,1034
|80128,145664,824
|145664,276736,608
|276736,538880,312
|538880,1063168,256
|1063168,2111744,36
+-------------------------

In 20,000 measurements, the time required to replace 8 products ranged
from 5 microseconds to 1.436893 seconds.  Compare the performance with
different GC techniques for a better appreciation of the full direct
and indirect impacts that GC implementation approaches have on service
response times.