Smart Rule Cache Project Log: 2008

Monday, May 19, 2008

Evaluate the performance of hyper rule cache - Cache hit ratio

Hit ratio of hyper rule cache

Trace	Number of pkts	2 actions(%)	100 actions(%)
Trace 3	7439801	99.32883	99.19437
Trace 5	7432650	99.35800	99.24929
Trace 10	7448271	99.36112	99.22063

Evaluate the performance of hyper rule cache - Number of evolving rules

Number of evolving rules of hyper rule cache (100 actions)

Trace	Number of pkts	Avg	Max
Trace 3	7439801	17	24
Trace 5	7432650	16	24
Trace 10	7448271	18	27

Number of evolving rules of hyper rule cache (2 actions)

Trace	Number of pkts	Avg	Max
Trace 3	7439801	14	21
Trace 5	7432650	14	22
Trace 10	7448271	15	23

Evaluate the performance of hyper rule cache - Average cache update time

The following table shows the average cache update time of hyper rule cache on 3 different traces and the rule set(1802 rules) with 2 and 100 actions respectively.

Average cache update time of hyper rule cache

Trace	Number of pkts	2 actions	100 actions
Trace 3	7439801	169μs	219μs
Trace 5	7432650	159μs	187μs
Trace 10	7448271	166μs	20μs

Evaluate the performance of hyper rule cache - experiment setup

We evaluate the performance of hyper rule cache using 3 real traffic traces and a real rule set with 1802 rules, obtained from a tier-1 ISP. We randomly generate 2 and 100 actions respectively for each rule in the rule set.

All the experiments are run in a machine with Dual Core AMD Opteron(tm) Processor 275 CPU (2.2
GHz). The total memory of the machine is 16GB. Hyper rule cache and smart rule cache are run using same parameters, and under same conditions. The sliding window size is set to 1024.

Verify the correctness of Hyper rule cache

To verify the correctness of hyper rule cache, two rule sets and associated trace files are used. Each packet in the trace file is associated with the correct rule that matches it. These files can be found from the following website: http://www.arl.wustl.edu/ hs1/project/.

We randomly generate 2, 4 and 10 actions for each rule in the two rule sets, and run hyper rule cache on these 6 pairs of rule set and traces. The action of each packet that is hit in the rule cache is printed out, and is compared with its real action.

Our experiments show that for each packet hit in rule cache, hyper rule cache give the same action as its real action. The results are summarized in table 3. As can be seen in the table, in all the 6 cases, 100% of packets are correctly classified by hyper rule cache, which proves the correctness of hyper rule cache scheme.

Hyper Rule Cache

A new packet classification scheme, hyper rule cache, is proposed. Hyper rule cache achieves high speed packet classification by caching a small number of evolving rules, which are updated continuously according to the characteristics of incoming traffic.

Hyper rule cache inherits the idea of caching dynamic rules from smart rule cache scheme, with several major modifications.

Firstly, Smart rule cache uses PPDD tree structure to store rules, which consumes huge memory
space especially for larger rule set. Hyper rule cache replaces PPDD with HyperCuts decision tree. HyperCuts tree is a popular structure for packet classification.

Hyper rule cache also proposed an algorithm for checking the HyperCuts tree whether an expanded rule conflicts with the semantics of the original rule set.

Finally, Hyper rule cache modified how the evolving rules are constructed, which makes PPDD tree unnecessary.

Sunday, May 4, 2008

Hypercut coding

We've gotten the hypercut code translated from C++ to C and have integrated it into our code. We are testing the code for errors and making sure that it still gives the correct output.

Thursday, April 24, 2008

taskq and update times

We have taken a look at the cache update time when using the taskq pragma in the code.
There was a slight decrease in the average update time, 7 milliseconds to 4 milliseconds. However, the worst case for update time increased greatly. The original code had a worst case of the update time around 5-6 seconds, the taskq code had worst cases around 40 seconds with two threads and 50 seconds with 4 threads.

The results are not good. We will try again once we get the hypercuts structure working with the code and see if there is any difference then.

Wednesday, April 16, 2008

Hypercut

We found the bottleneck of smart rule cache is the data structure we used to store the rules. We use PPDD(Pruned packet decison diagram) which is tree that grows quickly with the number of rules. Thus, besides multi-threading, we looked at other alternative trees.
We found hypercut is a good one. Hypercut is based on a decision tree structure. It has limited memory usage and fast classification speed. I run the source code of hypercut on several rulesets and corresponding trace files. It works file.
I'm now considering rewriting the source code of smart rule cache with hypercut decision tree.

Thursday, April 10, 2008

initial results using intel compiler

We tested our smart rule cache program (parallelized with taskq) using a rule set of 102 rules. The results are listed below. It shows that using multithreading can reduce the average update time to roughly half of the original one.

	Num of rules	Num of threads	Avg update time (10^(-6)s)	Cumulative hit ratio (%)
original	102		173	99.98
parallel	102	2	101	99.98
parallel	102	4	104	99.98
parallel	102	8	141	99.95

Monday, April 7, 2008

ICC and taskq works

We successfully installed intel complier(ICC) and it works. We compiled several examples using taskq and they work as expected.
We modified our rulecache program using taskq and it improved average update time. For the case with 1802 rules and 4 threads, we parallelized the bottleneck recursive function and the average updated time reduced from 0.007 to 0.004.
We will run it on Niagara to see the difference. We will also try other number of threads. Another direction could be parallelize other functions as well.

Friday, April 4, 2008

ICC and HyperCuts

We are working on getting the program to run using the taskq pragma but to do this we have to use Intel's C compiler (ICC). We are working on installing the ICC compiler and trying to compile the code with it. Once we are able to compile code we will be able to run it using taskq and compare the results we see.

We are also looking into replacing the tree structure used to store the rules with a HyperCut tree. We noticed that the tree is very large and we think switching to a HyperCut might give a significant performance increase.

Monday, March 24, 2008

taskq

We found a new openmp Pragma ''taskq" which works well with recursive functions. The recursive functions are naturally parallelized with the workqueuing model. We tried a few examples and it works fine. We are modifying our programs with taskq.

Tuesday, March 11, 2008

Memory Leak and Niagara

We used Valgrind to check the original code. One memory leak is found so far. We have fixed it. We will keep checking other possible problems.

As to Niagara, we run the original code without multi-threading and the rulecahce update time of one trace file is cut to half. It looks Niagara already did some optimization. We will try other trace files. The results from multi-threaded code is not very good. We will keep looking for fixes and other potential optimizations.

Tuesday, March 4, 2008

Consistency and Memory Errors

We tested the code with the changes we made and the original with a small ruleset and the output from the decision for the packets has been exactly the same.

We are trying now to verify this using the whole ruleset but we are running into problems. On the new machine with a GCC version that supports openmp our program (the original version) eats up all the memory until it crashes. We ran the original code also on the baleen machine and it runs fine. We suspect the different versions of GCC are to blame and we are working with Mike to come up with a solution for this.

Wednesday, February 27, 2008

Check consistency and other opportunities for parallerization

We run the program with 8 threads and the original program without thread on the whole rule set and a whole trace file. For each packet, we checked the rule number that matches the packet. We compared the results from both versions of programs, we found they are exactly the same. We think our program with multiple threads is consistent with the original program.

Yesterday, we discussed with dheeraj about other opportunities for paralleling the code. We believe there are other functions and for loops that can be parallelized. We will look further into this.

Wednesday, February 20, 2008

check consistency

After the meeting today, we checked consistency between code with multiple threads and without thread. We printed out the matched rule number for each packet. After an initial checking for a portion of packets, both multiple thread version and serial execution give same matched rule number for each packet. We are running the code on the whole trace file to further verify this.

Tuesday, February 19, 2008

Experiments with 2 and 8 threads

We looked carefully into the variables and modified the code for parallelizing the recursive function. We figured out the problem we had yesterday on how to set number of threads. The code is working for the following experiments we have carried out.

Experiment 1: We made a smaller ruleset (containing about 100 rules rather than original 1802 to make the program faster) and run our program on a huge trace file (trace3a, about 2G). The number of threads is 8.

Experiment 2: We run the code on the same trace file and same rule set as in experiment 1, the number of threads is 2.

Experiment 3: we tested the same trace file and the same rule set without threads.

We need to look further into the results we got to make sure that the new code is semantically equivalent to the original code.

Monday, February 18, 2008

one step closer

We managed to get the code with parallelization to work using a small ruleset (the regular one times out/uses too much memory on the machine that it is running on).

There are a few current caveats that we have to work out:
1) We have not been able to get the command that sets the number of threads to work yet, right now it just uses the default of 2.
2) We will need to test to make sure that the new code is semantically equivalent to the original code
3) We will need to test the multi-threaded version to see how it performs
4) We may need to make changes that limit the number of threads that are created by the program

Friday, February 15, 2008

more errors

I haven't really been able to troubleshoot the seg fault error that well due to another error that we are getting. After a while the terminal just prints out Killed and we loose the connection to the new computer. I think the session is timing out, we will have to email Mike about this and see if he can help us fix it.

Wednesday, February 13, 2008

We decided to look at two different approaches to the seg error we had. Bryan is going to debug and find the reason why the error occurs. I'm looking at how to transform the recursive function to one without recursion. I'll first try using stack, which seems practical with multiple threads.

Monday, February 11, 2008

first try of recursion parallelization

Mike helped us to have a new machine to test the program with OpenMP. A simple try gave us a segmentation fault error message. We found some articles about this and will try other ways to parallelize the recursion.

Wednesday, February 6, 2008

Recursion

We applied a simple way to do recursion using OpenMP. We tried it on an example and it worked well. We will try it on rulecache.c. We usually remotely log onto a machine to run our programs, but that machine does not have gcc 4.2 installed and thus can not support OpenMP. We have contacted Mike who is responsible for the lab and he will help us on this.

Sunday, February 3, 2008

OpenMP and recursion

We found the bottleneck function is a recursive function. We are reading more to understand how OpenMP works with recursion.

Friday, February 1, 2008

rulecache.c

We have found the second pass of function assignrhl() in rulecache.c, which tries to expand existing RHLs, takes longer time to run. After a careful examination, I found it is function check_conflict() in this section of code that actually runs slowly. We will try to parallelize this function.

Thursday, January 31, 2008

Open MP on IXP

I did some searching and from what I can tell the program should work on the IXP platform with the OpenMP parallelization. I have found no reason why it would not work and several things that hint that it will.

OpenMP is compatible with Intel compilers
Intel seems to have had a part in the development of OpenMP
I found a paper that mentioned OpenMP on the IXP 1200 line and it also said that their work should apply to the IXP 2xxx models

It seems like this platform would work fine for our purposes of running the rulecache code in parallelized form.

Tuesday, January 29, 2008

For loops with OpenMP

Yesterday, we succeeded in parallelizing a serial program. Today, I tried to parallelize loops in programs. I tried one program with a for loop, and another one with two embedded for loops. Both work well. We may start to parallelize loops in rulecache.c soon.

Monday, January 28, 2008

OpenMP GCC 4.2

The installed version of gcc on my machine is 4.1.2. I tried to install gcc 4.2 but the disk quota exceeded. I checked with cs lab and they said gcc 4.2 is currently installed under /s/gcc-4.2.1/bin and /s/gcc-4.2.2/bin.

Bryan and I figured out how to run programs using OpenMP. We succeeded in a simple hello.c program, as shown below.

% /s/gcc-4.2.2/bin/gcc -fopenmp hello.c -o hello.o
% hello.o
Hello World from thread 1
Hello World from thread 0
There are 2 threads

We will continue to work on other examples to get familiar with OpenMP.

Friday, January 25, 2008

OpenMP

After reading up on the OpenMP API I think this would be a good choice for us to use for parallelizing the code. It is cross platform and works with C and C++ on Windows, Unix, and Solaris. The GCC compiler version 4.2 and above have OpenMP support, the file omp.h needs to be included for the OpenMP functions.

Wikipedia Entry

Wednesday, January 23, 2008

Hardware

We are looking at parallelizing the code and Suman wants us to implement the code on server hardware so we are looking into different options with this. One being Sun's Niagara processor which would be nice since it runs Solaris 10 which has OpenMP support that would make parallelizing the code fairly straightforward.

Yadi and I are reading up on the OpenMP documentation and examples.

Wednesday, January 9, 2008

Parallelization

I've been looking into methods for parallelizing the code for the project. So far most of the stuff that I have found has been for C++ or some other language, I am having trouble finding something for C.

For us to be able to parallelize a for loop, we basically just have to make sure that all of the iterations can be performed independently and none depend on the next/previous iteration.

Dheeraj recommended looking at Threading Building Blocks (TBB) which come from Intel.
I started reading through their website and they have some examples and tutorials which seem fairly helpful. I'm going to read up on it some more and see if I can go through some simple examples to get used to the syntax and see if it will be possible to use this for our code. (It says that it is for C++)

Tuesday, January 8, 2008

Looking ahead

I meet with Dheeraj just before break to discuss the code. Suman wants to look at parallelizing the code as much as possible to speed up the update calculation. Dheeraj pointed out the parts of the code that take the longest to run and suggested that I try to parallelize some for loops in that section.

Dheeraj also suggested that I should read over the sigmetrics paper to learn more about the way the rules and modifications to the rules are handled in the code because we want to maintain the same semantic equivalence of the rules when parallelizing the calculation.

I am currently reading through the paper and looking into parallelization techniques.