GSOC 2013 Status: caching plugin week 7

Sorry about the late progress post; our internet at home went down yesterday and only got fixed today.

Not a lot of coding happened this week. One of the reasons for the lack of active work was that my university papers started, and they should finish this week. But I plan to make up for the lost time and still contribute at my normal pace as much as possible.

I did do some reading about HTTP pipelined requests, and I have a branch where HTTP header caching is being implemented. I want to make the cache generic so that arbitrary data can be added. My plan is to also expose an API so that any process can add data to the cache over HTTP, which should open up a lot of possibilities. Having a caching infrastructure for just files may not be a big gain, but it would be very useful for caching dynamic resources.
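
To give an idea of what I mean by generic, a cache entry could end up looking something like this (purely illustrative names, nothing of this exists yet):

#include <stddef.h>
#include <time.h>

/* sketch of a generic cache entry: any blob of bytes under a key, not just a
   file, so headers or dynamic responses could be stored the same way */
struct cache_entry {
    char   *key;       /* e.g. a request path, or a key chosen over the HTTP api */
    void   *data;      /* pointer to the cached bytes (mmap or heap) */
    size_t  len;       /* number of cached bytes */
    time_t  expires;   /* when the entry should be dropped */
};

/* hypothetical entry point the HTTP api (e.g. a PUT to /cache/<key>) would call */
int cache_put(const char *key, const void *data, size_t len, time_t ttl);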

GSOC 2013 Status: caching plugin week 6

The whole caching pipeline was streamlined this week. The cache plugin now handles the lifetime of the connection the right way, so HTTP pipelined requests work transparently. Thanks to Sonny for clarifying to me how it is supposed to be implemented.

Currently all cached files have their fds opened the first time they are accessed, and an mmap is maintained over each fd. Since this week the cache plugin also maintains an on-demand list of pipes which are just pointers into the mapped memory (filled using vmsplice). If a file is already cached, the pipes are used to splice content directly to the request. Before, only one pipe was maintained.
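
For reference, here is a minimal sketch of the vmsplice idea (function and variable names are mine, not the plugin's, and error handling is stripped down):

#define _GNU_SOURCE          /* for vmsplice() */
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* point a fresh pipe at an already mmap'ed file: the pipe ends up holding
   references to the mapped pages rather than a copy of the data;
   assumes len fits within the pipe's capacity */
static int fill_cache_pipe(int pipe_fds[2], void *map, size_t len)
{
    struct iovec iov = { .iov_base = map, .iov_len = len };

    if (pipe(pipe_fds) < 0)
        return -1;

    if (vmsplice(pipe_fds[1], &iov, 1, 0) < 0)
        return -1;

    return 0;
}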

A request pool is also used now to maintain ready-made request structures along with temporary pipe buffers for splicing data to the socket. The same pipes are reused when a request is closed and a new one comes in. So for the lifecycle of a request for a cached file, only a tee followed by a splice is needed to completely handle the request, plus the syscalls for sending the headers. Compare that to the normal scenario, where an open is called, followed by sendfile and then close, along with the syscalls to handle the headers.
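
To make the tee-then-splice path concrete, here is a rough sketch of what serving a cached file boils down to, assuming the cache pipe already holds the whole file as above (again, the names are illustrative, not the actual plugin code):

#define _GNU_SOURCE          /* for tee() and splice() */
#include <fcntl.h>
#include <unistd.h>

/* duplicate the cache pipe's contents into the request's scratch pipe with
   tee (the cache pipe keeps its data for the next request), then splice the
   scratch pipe straight into the client socket */
static ssize_t serve_from_cache(int cache_pipe_rd, int scratch_pipe[2],
                                int sock_fd, size_t len)
{
    ssize_t n = tee(cache_pipe_rd, scratch_pipe[1], len, SPLICE_F_NONBLOCK);
    if (n <= 0)
        return n;

    return splice(scratch_pipe[0], NULL, sock_fd, NULL, n,
                  SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
}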

Currently I am working on caching headers, but that is not streamlined yet for all cases. It should land very soon, so that the whole request lifecycle only ever requires 2 syscalls to send data, provided everything is cached and all the response data fits inside the pipe. Right now the plugin caches all possible files, but I will soon restrict it to the cases where it has a chance to optimize, like files which, including headers, can fit inside a pipe or two, and resort to normal request handling for the others.

Weekly Progress

Work this week was pretty skewed: a lot of reading on some days and a lot of coding on others.

I now maintain thread-specific request handles which keep track of the socket, bytes sent, the cached file and other request metadata. I completely changed the old approach of using stages and now take over in stage30, using read events directly on the sockets.
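
Roughly, each handle in the pool looks something like the following (field names are illustrative; the real struct lives in the plugin sources):

#include <sys/types.h>

/* illustrative per-request state, kept in a thread-local pool */
struct cache_request {
    int    socket;                  /* client socket we watch read events on */
    off_t  bytes_sent;              /* how much of the response has gone out */
    struct cache_file *file;        /* entry in the global file cache, or NULL */
    int    scratch_pipe[2];         /* reusable pipe for tee/splice to the socket */
    struct cache_request *next;     /* free list linkage for the pool */
};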

I also maintain a global file cache hash table keyed by inode. It contains file-related data, including an mmap of the whole file, a custom-sized pipe for storing the initial chunk of the file in the kernel, and other file statistics. Currently I use pthread rwlocks to maintain consistency across threads.
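
In spirit, the table looks something like this (a simplified sketch, not the exact plugin code):

#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>

/* per-file cache state, shared by all threads and looked up by inode */
struct cache_file {
    ino_t    inode;         /* hash key (the real code also keeps the device) */
    void    *map;           /* mmap of the whole file */
    size_t   size;          /* file size in bytes */
    int      pipe_fds[2];   /* pipe pre-filled with the initial chunk of the file */
    unsigned long hits;     /* simple per-file statistics */
    struct cache_file *next;    /* bucket chaining */
};

/* readers can look entries up concurrently; writers take the lock exclusively */
static pthread_rwlock_t cache_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct cache_file *cache_table[4096];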

I started to implement different file-serving implementations. The first one was simply the sendfile implementation which had been there last week. This week I added a basic mmap implementation which takes the mmap from the file cache and just writes it down the socket; it worked, but it was never on par with the sendfile implementation in terms of performance.

The third implementation is an improvement over the second one, where I use the Linux zero-copy APIs to push data directly from the mmap to the socket without (hopefully) copying any bytes. It gifts the bytes over to a kernel buffer (a pipe) using vmsplice and then splices the data directly to the socket. Normally pipes have a maximum size of 16 pages (64K in total), but I change that using the Linux-specific F_SETPIPE_SZ flag to a bigger size, which in turn performs better with large files. This implementation also maintains another (kind of read-only) pipe with the initial file data for every file, to improve performance for small files which completely fit inside a pipe. With this it flushes the initial file data directly from the pipe without even touching the mmaps, and for bigger files it resorts to splicing from the mmaps. The third implementation (although it contains some bugs when used with multiple files, which I think is rather a Monkey bug, but I still have to debug more) is the default implementation, and it falls back to the first implementation if mmap doesn't work.
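
The pipe resizing itself is a single fcntl with the F_SETPIPE_SZ command (Linux 2.6.35+); something like:

#define _GNU_SOURCE          /* for F_SETPIPE_SZ */
#include <fcntl.h>

/* grow a pipe beyond the default 16 pages so large files need fewer splice
   calls; the kernel rounds the request up and, for unprivileged processes,
   caps it at /proc/sys/fs/pipe-max-size. Returns the new capacity or -1. */
static int grow_pipe(int pipe_fd, int bytes)
{
    return fcntl(pipe_fd, F_SETPIPE_SZ, bytes);
}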

In terms of performance the implementation is currently on par with the default Monkey static file server (although I was hoping for more this week), despite the fact that it maintains a global hash table for keeping track of the file caches and does a lot more syscalls compared to a single main sendfile syscall in Monkey. Here are some numbers:

Without cache plugin (# ab -c500 -n10000 http://localhost:2001/webui-aria2/index.html)

Server Software:        Monkey/1.3.0
Server Hostname:        localhost
Server Port:            2001

Document Path:          /webui-aria2/index.html
Document Length:        26510 bytes

Concurrency Level:      500
Time taken for tests:   18.938 seconds
Complete requests:      47132
Failed requests:        0
Write errors:           0
Total transferred:      1265512772 bytes
HTML transferred:       1256433587 bytes
Requests per second:    2488.75 [#/sec] (mean)
Time per request:       200.904 [ms] (mean)
Time per request:       0.402 [ms] (mean, across all concurrent requests)
Transfer rate:          65257.76 [Kbytes/sec] received

With cache plugin (# ab -c500 -n10000 http://localhost:2001/webui-aria2/index.html)

Server Software:        Monkey/1.3.0
Server Hostname:        localhost
Server Port:            2001

Document Path:          /webui-aria2/index.html
Document Length:        26510 bytes

Concurrency Level:      500
Time taken for tests:   19.710 seconds
Complete requests:      50000
Failed requests:        0
Write errors:           0
Total transferred:      1332800000 bytes
HTML transferred:       1325500000 bytes
Requests per second:    2536.79 [#/sec] (mean)
Time per request:       197.100 [ms] (mean)
Time per request:       0.394 [ms] (mean, across all concurrent requests)
Transfer rate:          66035.77 [Kbytes/sec] received

Basically, without the cache plugin you get 2488 req/sec and with the cache you get 2536 req/sec. Not a big boost, but it is basically comparing the kernel sendfile implementation, backed by the file readahead cache, with the plugin's performance.

I already cache small files completely in pipes which reside in kernel memory. The plan for next week is to get the entire file (given that it fits into memory) into pipes residing in the kernel and try to write them directly to sockets, basically maintaining the custom cache directly in the kernel, which I guess can't be swapped out, whereas memory mappings can. I also want to cache HTTP headers and append them to the file pipes for even lower latency, sending everything in one chunk, which should reduce CPU time even further while handling the request. That should be easy inside Monkey itself, but I need to find out how it can be done from a plugin.
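
One possible shape for the header-plus-body idea (just a sketch of the direction, with made-up names, not working plugin code) is to gift both buffers into the pipe with a single vmsplice call:

#define _GNU_SOURCE          /* for vmsplice() */
#include <fcntl.h>
#include <sys/uio.h>

/* push the cached response headers and the file body into the cache pipe in
   one go, so one later splice can send the complete response;
   assumes hdr_len + body_len fits within the (possibly grown) pipe */
static ssize_t fill_response_pipe(int pipe_wr, void *hdr, size_t hdr_len,
                                  void *body, size_t body_len)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };

    return vmsplice(pipe_wr, iov, 2, 0);
}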

github link: https://github.com/ziahamza/monkey-cache

blog: https://ziahamza.wordpress.com/

Weekly Progress

This week was again kind of slow. I fixed some bugs in the cache plugin; its performance is now identical to the default static file implementation and it is usable. No perf improvements yet, but they are coming. I read quite a lot of code in mk_http.c and mk_request.c to see how requests are handled, and did some experiments understanding and modifying them.

I submitted a couple of patches to the Monkey list today which would be nice to have while working on the plugin. I wanted to cache files by inode (which makes sense especially when links get involved) and had to lstat each file as it came in, but Monkey was already getting the stat for each file anyway, so I sent a patch to include the inode and device number in the file_info struct. I sent another patch to parse the Range information and make it available, when present, so that plugins can use it.
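
For context, the inode/device pair I want to key the cache on is exactly what a stat call already provides; with the patch the plugin simply reads it out of file_info instead of doing this itself:

#include <sys/stat.h>
#include <sys/types.h>

/* a cache key built from the same fields the file_info patch exposes */
struct cache_key {
    dev_t dev;
    ino_t ino;
};

static int make_key(const char *path, struct cache_key *key)
{
    struct stat st;

    if (lstat(path, &st) < 0)    /* lstat, as described above */
        return -1;

    key->dev = st.st_dev;
    key->ino = st.st_ino;
    return 0;
}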

I think my work has been a bit delayed, so I really hope to have a decent working plugin that improves the performance of Monkey by the end of this week. The transition from being a filesystem to being a plugin took longer than I thought.

I hope to get an implementation running which mmaps file pages into memory and serves requests directly from there when they are available. Right now I am trying to find a way to attach information to each request, like the mmap for the file rather than an fd, in stage30, and then write it out in a non-blocking fashion from stage40.
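
The serving half of that plan is just non-blocking writes out of the mapping; a minimal sketch, assuming the request already carries the mmap pointer and a bytes-sent counter:

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* write as much of the mapped file as the socket accepts right now;
   returns 1 when everything has been sent, 0 to wait for the next writable
   event, -1 on a real error */
static int send_mmap_chunk(int sock, const char *map, size_t size, off_t *sent)
{
    while ((size_t)*sent < size) {
        ssize_t n = write(sock, map + *sent, size - *sent);
        if (n > 0) {
            *sent += n;
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;    /* socket buffer full, resume on the next event */
        return -1;
    }
    return 1;
}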

Github: http://github.com/ziahamza/monkey-cache

Weekly Progress

It was a bit of a slow week with respect to the code that I wrote. The initial benchmarks of the fuse overlay filesystem were very bad, and after discussing the topic with the libfuse project it seemed like the overhead of being a fuse filesystem is very large compared to the perf boost of having a caching read-only filesystem, even when serving from memory. So after discussing it with my mentor, we decided to shift focus and start working on a caching Monkey plugin instead. I learned a lot by writing the fuse filesystem, but unfortunately it never worked out.

So then I started looking into the Monkey sources. Unfortunately I was a bit busy in the middle of the week as I had gone to St. Gallen, but I have read quite a bit of code in the Monkey codebase and the plugin APIs. I created a new github project for my Monkey plugin codebase. The link is https://github.com/ziahamza/monkey-cache.

For now it’s a pretty simple plugin that handles static files and serves them as plain text. It’s a basic structure that I will extend: next I plan to make the plugin process and thread aware (adding hooks for CORE_PRCTX and CORE_THCTX) and then get the caching part done. This time I am running benchmarks from the very beginning to check for any unintended bottlenecks introduced along the way.

For now I am on track and should be able to get an initial usable and performant version out by the coming week. You can check the progress on the github page.

Weekly Progress

I implemented a working low-level fuse proxy fs, which is usable for simple directory traversal. It's a simple single-threaded read-only proxyfs, and it doesn't have the whole suite of optimisations that the fuse folks have implemented.

The low-level implementation is still highly unstable, and the deadlocks are really hard to debug. I had a really tough time finding the problem, as every time the process went into uninterruptible sleep, making it impossible to kill! Even gdb wouldn't respond, and the only option left was to restart the PC. It turned out that the kernel deadlocks in a recursive syscall into the fuse fs, and I still need to find a fix for it.

I do realize that going down this route could take some effort, reimplementing all of the optimisations that the default fuse high-level API already has, so I might end up copying over source files from the libfuse project or statically compiling the project against a modified libfuse. But I have also started to look into maintaining an inode table on top of the high-level API and building caching on top of that, so that the optimisations from fuse come for free, in exchange for some overhead of duplicating the mapping of inodes to paths both at the fuse level and at our fs level. The codebase is pretty modular, so for now I can switch between the two APIs very easily, and I currently maintain implementations over both of them.

At the same time I have also started to benchmark the different approaches. I got Monkey running over the fuse fs and the standard fs and then compared them, but the initial benchmarks weren't very good: the numbers are really bad with proxyfs. The fuse folks have an overlay server in their codebase, and even running the benchmark over that gives some really bad numbers. For concurrent connections in the range of hundreds, it seems like the context-switching overhead of a fuse fs really starts to push the throughput down.

I used wrk as the benchmarking tool (https://github.com/wg/wrk), running 400 concurrent connections for a duration of 30s. Using the standard fs under Monkey we get:

Requests/sec:  29043.93
Transfer/sec:    739.06MB

but over proxyfs we get:

Requests/sec:   1517.46
Transfer/sec:     38.66MB

and with the fusexmp_fh overlay fs from the libfuse sources we get similar results to the proxyfs:

Requests/sec:   1418.41
Transfer/sec:     36.12MB

Something is going terribly wrong in the fuse world. Maybe it's the concurrent io requests, but I don't really know for now; running something like callgrind over the program suggests that most of the time is spent in the libfuse world. I have started a discussion on the libfuse devel mailing list and am looking to get their advice.

I will continue to measure the perf numbers from now on. Anyway, I have started initial work on a caching proxyfs. Currently it keeps file descriptors open over longer periods of time (even after the server has closed them) so that the kernel doesn't drop the caches, but I will soon land an implementation using mmap and continue the experiments. The modular structure of the project turned out to be really useful, as I have multiple approaches and experiments all in the same codebase (using both the high-level fuse API and the low-level one) and they can easily be turned on and off individually in the makefile and compiled into a fuse fs.

Weekly Progress

I have been pretty busy this week. Lots of experiments, and I read tons of code in the libfuse project.

I revamped the code structure of the whole project: everything is turned into modules, and compiling different modules together gives you a completely different filesystem. Currently there is only a proxyfs filesystem, but most of the code is pretty generic now and could be reused for further fs experiments (compiling different types of caching fs sharing a lot of common code). Initial testing has also been added; not a lot for now, but I will try to keep up some high-level tests to make sure there aren't any high-level bugs.

I started to work on the low-level proxy fs. Most of the work is done in terms of mapping inodes to paths. The current implementation is inspired by the libfuse implementation, where they use 2 concurrent hash tables to map names to fs nodes and inodes to fs nodes simultaneously, a pretty neat implementation, and I adopted a similar approach. The code isn't multithreaded yet, but there is no global state and everything operates on a handle, so adding that should be easy.
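
The shape of that structure, roughly (heavily simplified; the libfuse version also tracks generation numbers, lookup counts and locking, and the names here are mine):

#include <stdint.h>

/* each fs node sits in two hash tables at once: one keyed by (parent, name)
   for path-component lookups, one keyed by inode so requests that arrive
   with only an inode number can find the node again */
struct fs_node {
    uint64_t         ino;        /* inode handed out to the kernel */
    struct fs_node  *parent;     /* parent directory node */
    char            *name;       /* name within the parent */
    struct fs_node  *name_next;  /* chaining in the (parent, name) table */
    struct fs_node  *ino_next;   /* chaining in the inode table */
};

/* the handle everything operates on, so there is no global state */
struct proxyfs {
    struct fs_node *name_table[1024];   /* hash(parent ino, name) -> node */
    struct fs_node *ino_table[1024];    /* hash(ino) -> node */
};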

Some optimisations have also been added, for which I had to look into the libfuse sources to see how they work; the high-level proxy fs now transfers data with zero copies, and the fuse folks have a pretty neat buffer abstraction which makes that super easy. But most of my time has actually gone into the low-level fuse implementation of the proxyfs. It took me longer than expected, as it turned out to be a lot of work. Mapping inodes to fs paths turned out to be a pretty complicated problem, especially once multithreading is in place, but most of the work is done and an implementation should be out in a couple of days.

I planned on adding benchmarks after finishing the low-level fuse proxyfs implementation, but it is still pending and benchmarks will come soon after that.

Again, the github project is always up to date. I push frequently, around 3 times every day, so feel free to follow the progress at https://github.com/ziahamza/fucache