New core (revision 3)


darkone
08-12-2004, 11:52 PM
Ok, as most of you know, I've been working on a non-ioFTPD related project for the past 'few' months. While I've been busy with that project, I've also done some planning (notepad drafts & quick performance tests) for the new io (input/output) core.

It took 3 rewrites to get it all working, but now I'm confident that I've overcome all the theoretical problems I had with the fully asynchronous processing line.

The old core used to process everything in a certain order. When reading data from an ssl-encrypted socket, it first called the 'read' function. Once the read completed, it called the 'decryption' function. Finally, when decryption completed, it called some other function (to look for linefeeds or so). Because everything is done in linear order, it is very easy to synchronize: 'read' -> 'decrypt' -> 'do something with buffer'

The new core incorporates a completely new ideology. The order of events is no longer predetermined. The 'read' function may be called at the same time as the 'decryption' function is processing the buffer returned by a previous read call. It's also possible that the function that does something with the decrypted buffer is running while decryption is in progress...

As you can imagine, it is no longer a trivial task to get this synchronized properly. By properly I mean that it is not acceptable for one thread to block another for longer than a few quantums (quantum = the slice of cpu time a thread gets from the kernel to execute), because whenever the pipeline stalls, efficiency drops. Therefore I have tried to figure out a way to make stalls as short as possible, and in many ways I think I have succeeded in this. One way to do it is sketched below.
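
Roughly the idea (a simplified sketch, not the actual core code - the helper names are made up for the example):

#include <windows.h>

typedef struct _CONN {
    volatile LONG lBusy;    /* 0 = free, 1 = owned by some thread */
} CONN;

/* Hypothetical helpers: a buffer queue guarded by a very short lock */
void EnqueueBuffer(CONN *lpConn, void *lpBuffer);
void *DequeueBuffer(CONN *lpConn);
void DecryptAndParse(CONN *lpConn, void *lpBuffer);

void ProcessBuffer(CONN *lpConn, void *lpBuffer)
{
    if (InterlockedExchange(&lpConn->lBusy, 1)) {
        /* Another thread owns the connection: queue the buffer and go
           back to polling, instead of blocking for several quantums. */
        EnqueueBuffer(lpConn, lpBuffer);
        return;
    }
    do {
        DecryptAndParse(lpConn, lpBuffer);
    } while ((lpBuffer = DequeueBuffer(lpConn)) != NULL);
    InterlockedExchange(&lpConn->lBusy, 0);
    /* A real implementation must re-check the queue here, in case a
       buffer was enqueued between the last dequeue and the release. */
}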

I have placed high hopes on the new core, and when it is done, I hope I can say it was worth it. There are no examples of anything alike available - and AFAIK this method has never been done before. When I have the performance figures - and if they are what I expect them to be - I can be sure that it's a unique solution.


- Old core: ~160mb/sec cached disk read (to socket)
- New core (rev 2): ~220mb/sec cached disk read (to socket)
- New core (rev 3): Faster than rev 2.

darkone
08-13-2004, 12:04 AM
Notepad draft from yesterday:

http://www.ioftpd.com/~darkone/tmp/stupid.txt

As usual, the example does not compile, nor has any real functionality. The idea of the example is to show the complexity (and/or simplicity) of the algorithms in use. For legal reasons I have to mention that one is not allowed to use this code or its direct or indirect derivatives without my permission.

peep
08-13-2004, 12:57 AM
Oh man, oh man. Have I been waiting for this post to come :)

Great going d1. Gonna go read your post more thoroughly now!
Keep doing what you do best!

Let's show them Linux servers who's got the fastest daemon/servers around :)

Microsoft Windows Server 2003 vs. Linux Competitive File Server (www.veritest.com/clients/reports/microsoft/ms_netbench.pdf)

peep

darkone
08-16-2004, 03:21 AM
I wrapped up a few tests for the new core...

ftp buffered file -> socket -> socket [ftp.exe] -> nul
- Parameters: FILE_FLAG_SEQUENTIAL_SCAN, SO_SNDBUF = 0, 3 * 32kb application write buffer, ftp.exe
- Result: ~180mb/sec
- Conclusion: No significant speed gains over the current core. Reduced memory and cpu (~25% in user mode) usage.

ftp unbuffered file -> socket -> socket [ftp.exe] -> null
- Parameters: FILE_FLAG_NO_BUFFERING, SO_SNDBUF = 0, 3 * 8kb application write buffer, ftp.exe
- Result: ~60mb/sec
- Conclusion: User mode cpu usage reduced significantly (to 1-2%), while kernel mode cpu usage increased somewhat. Greatest benefits with asynchronous devices (scsi, network shares).

buffered loopback transfer file -> socket -> socket -> null
- Parameters: FILE_FLAG_SEQUENTIAL_SCAN, no socket write buffers, 3 * 32kb application read buffer + 3 * 8kb application write buffer, within single daemon thread
- Result: ~120mb/sec transfer (240mb/sec throughput)
- Conclusion: Result is a bit lower than expected. But because the test was limited to a single thread, it only utilized the capacity of one processor.


It currently seems that there is very little left to optimize in the main code - nearly 100% of cpu time is now used by mandatory calls: GetQueuedCompletionStatus(), WSASend(), WSARecv(), ReadFile() & WriteFile(). For those curious, the parameters used above map to ordinary Win32/winsock calls, as sketched below.
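
A minimal sketch of those parameters (my reconstruction, not the actual test code):

#include <winsock2.h>
#include <windows.h>

/* FILE_FLAG_SEQUENTIAL_SCAN hints the cache manager to read ahead;
   FILE_FLAG_OVERLAPPED enables asynchronous ReadFile(). */
HANDLE OpenSequential(const char *szPath)
{
    return CreateFileA(szPath, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING,
                       FILE_FLAG_SEQUENTIAL_SCAN | FILE_FLAG_OVERLAPPED,
                       NULL);
}

/* SO_SNDBUF = 0 removes winsock's internal send buffer, so overlapped
   WSASend() transmits straight from the application's own buffers. */
void DisableSendBuffer(SOCKET s)
{
    int lZero = 0;
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, (const char *)&lZero, sizeof(lZero));
}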

Mr_X
08-16-2004, 01:07 PM
I saw an article on Onversity a long time ago about optimizing loops. This may help you. Here's a link to the PDF file:
d-loop (http://www.onversity.com/load/d-loop.pdf)

darkone
08-19-2004, 12:35 PM
I actually use similar optimizations already...


void DoSomething1(void);
void DoSomething2(void);

void foo(int lOperation)
{
    switch (lOperation) {
    case 0:
        DoSomething1();
        break;
    case 1:
        DoSomething2();
        break;
    }
}


could be written as:


typedef void (*LPPROC)(void);

LPPROC lpProc[2] = { DoSomething1, DoSomething2 };

void foo(int lOperation)
{
    lpProc[lOperation]();
}

darkone
08-20-2004, 04:15 AM
Finished the first test round with 2048 test connections; total transfer speed was ~180mb/sec. In the test, the daemon was acting as both client and server, so there were 1024 downloading client connections and 1024 uploading server connections. Cpu usage was at ~80% on both cpus. Memory usage, using 3 read & 3 write buffers of 16kb, was ~160mb.

=> There's still room for optimization, but the core now seems to handle heavy io loads with ease.

ADDiCT
08-20-2004, 05:34 AM
(stupid question, but who am i :)) do these optimizations for extreme situations (1000+ transfers) produce any overhead for normal situations (~20 transfers)?

darkone
08-20-2004, 07:00 AM
Of course not :) The resources freed by the reduction in memory copying negate the overhead that the required thread synchronization adds.

For uploads, I expect a tremendous performance improvement; the separate encryption thread pool is now gone - io threads are able to transform themselves into encryption threads when required. This means that a thread is in many situations able to decrypt/encrypt/hash/whatever the buffer immediately after processing it. Just like with the old core, the number of threads performing such operations simultaneously is a definable parameter. A rough sketch of the idea is below.
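
Something like this, in simplified form (names are made up for the example, not the actual core code):

#include <windows.h>

#define MAX_CRYPTO_THREADS 2               /* the definable parameter */
static volatile LONG lCryptoThreads = 0;

typedef struct _BUFFER BUFFER;
void EncryptBuffer(BUFFER *lpBuffer);      /* hypothetical helpers */
void QueueBufferForCrypto(BUFFER *lpBuffer);
void QueueSend(BUFFER *lpBuffer);

void OnReadCompleted(BUFFER *lpBuffer)
{
    if (InterlockedIncrement(&lCryptoThreads) <= MAX_CRYPTO_THREADS) {
        /* The io thread transforms into an encryption thread and
           processes the buffer immediately - no handoff, no copy. */
        EncryptBuffer(lpBuffer);
        InterlockedDecrement(&lCryptoThreads);
        QueueSend(lpBuffer);
    } else {
        /* Cap reached: defer, a later thread will pick it up. */
        InterlockedDecrement(&lCryptoThreads);
        QueueBufferForCrypto(lpBuffer);
    }
}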

darkone
08-20-2004, 08:19 PM
It seems that I'm getting close to the performance requirements I set for the core.

Transfer times for an 800mb file, when the daemon is working as both client and server:
1 connection: 6.3 seconds (253mb/sec)
2 connections: 12.6 seconds
10 connections: 62.6 seconds
100 connections: 619.8 seconds
1000 connections: 6285.2 seconds

The performance of a single cached transfer seems to be constant.

Just one odd thing.. I noticed I had pulled the wrong figure for the old core's performance: http://www.ioftpd.com/board/showthread.php?threadid=3174

... and the odd thing is that now when I try to transfer the same file, I get lower performance (even with the old io). And I can't remember any changes since (other than adding 1Gb of memory).

Syrinx
08-21-2004, 02:09 AM
"Transfer times for 800mb file, when daemon is working as both client and server:
1 connection: 6.3 seconds (253mb/sec)
"

The above reslt confuses me.How Can transfering a 800 mb file with 6.3 seconds,and the transfer speed is 253 mb/sec.

800 mb / 6.3 seconds = 253 mb/sec?

ADDiCT
08-21-2004, 03:32 AM
800 MB / 6.3s = 126,98 MB/s

"when daemon is working as both client and server"

in those 6.3s, it has both uploaded and downloaded 800 MB, so it's 126,98 MB/s "full duplex", or a total of 253 MB/s :)

iXi
08-21-2004, 04:54 AM
ftp unbuffered file -> socket -> socket [ftp.exe] -> null
- Parameters: FILE_FLAG_NO_BUFFERING, SO_SNDBUF = 0, 3 * 8kb application write buffer, ftp.exe
- Result: ~60mb/sec
- Conclusion: User mode cpu usage reduced significantly (to 1-2%), while kernel mode cpu usage increased somewhat. Greatest benefits with asynchronous devices (scsi, network shares).


d1, does io need special buffer settings for asynchronous devices? Because I'm using some scsi raids..

cya

darkone
08-21-2004, 09:23 AM
Not likely that I'll support FILE_FLAG_NO_BUFFERING; the implementation seems too costly. (amount of code compared to the performance delta)

iXi
08-21-2004, 09:39 AM
sounds quite good ;)

Syrinx
08-21-2004, 07:37 PM
I still have some questions about that result, based on my knowledge of hardware.
1. What's the speed of the NIC? And does 253 mb/sec mean 253 megabytes per second or 253 megabits per second?
2. The latest (fastest) harddisks can barely reach 80MB/sec read and write speeds in a very ideal environment (mostly 20 MB/s ~ 70 MB/s). Even if the daemon can reach that transfer speed, I don't think there is a harddisk that can continuously write or read data at 127 mb/s.

ADDiCT
08-22-2004, 03:24 AM
When u read a file for the first time, windows keeps it cached in ram. When u read it again, it comes from the cache. On my system, i can read a 100MB cached file at about 1 GB/s.

When u want to test the maximum throughput of ioFTPD, u don't use harddisks, because those are the slowest component. U transfer a cached file; then the limit is the cache read speed (~1 GB/s on my system).

The client in this case doesn't write the file to the harddisk; it just downloads it and then discards it.
ftp.exe: get file > nul

The speed of the NIC is unimportant here, because the tests are local transfers. Localhost -> localhost doesn't go through any network adaptor. (This way, u can develop and test networking applications without being connected to a network.) The speeds darkone got were most likely 253 MB/s and not 253 Mbit/s.

Please correct me if i'm mistaken :)

Zer0Racer
08-22-2004, 01:02 PM
:drool:

darkone
08-22-2004, 01:26 PM
Remember that I'm not developing this daemon (core) only with ide/sata disks and dsl connections in mind ;)

And yes, the figure was 253mb/sec... and I've managed to get a few more mbytes, so it's now closer to 270 than 250 :) The major problem with these tests is that winsock is not able to use the NIC's hardware acceleration to create tcp packets (crc-16 etc). Also, the test includes buffer copying that would not exist in a real world situation, as the core is now able to assign transfer buffers directly to hardware without going through winsock buffers.

darkone
08-26-2004, 10:17 AM
Huhuu... I've broken the 300mb/sec barrier. All I had to do is either disable hyperthreading or set the program's thread affinity to represent the best case scenario. Exact transfer times for an 800mb file were:

1 transfer: 5.1 seconds (313mb/sec)
2 transfers: 10.13 seconds (315mb/sec)

I'm currently finalizing the base api, and moving on to add ssl within a few days.

peep
08-27-2004, 05:13 AM
You're mad :-)
That's all I've got to say. Keep up the amazing work d1. Sad to say I won't be around when the new core goes public (going away for about a year). Good luck with all this and I'll be dying to try it out when I get back!

peep

richto
08-27-2004, 07:37 AM
315 MB a sec? So that's roughly a 2.5 GBit connection - not bad!

darkone
08-27-2004, 07:56 AM
Yep, and because crc calculations, buffer splitting etc. are currently done by the cpu, it should actually be faster when it goes through a NIC. (depending on driver performance & the hardware acceleration capabilities of the card)

My integrated NIC (Intel PRO/1000 CT) seems to have the following hardware capabilities:
- QoS management
- Hardware flow control
- Up to 1024 receive buffers
- Up to 1024 send buffers
- TCP checksum (receive & send)
- IP checksum (receive & send)
- And some more.. (stupid thing shows them in finnish :))

Which means that using the TCP protocol shouldn't add any CPU overhead compared to sending raw packets.

Mr_X
08-27-2004, 01:04 PM
I have a few questions:

How is it possible to improve performance so much? I don't think there are many ways to read a file and transfer it to a socket (perhaps I'm wrong).
What hardware do you use to get such fast transfers (MB, CPU, hard disk, raid controller, RAM, network cards, ...)?
What windows version do you recommend to get the best speed?
Could you explain how to code such a function, or do you have a website that explains it:

typedef void (*LPPROC)(void);

LPPROC lpProc[2] = { DoSomething1, DoSomething2 };

void foo(int lOperation)
{
    lpProc[lOperation]();
}

neoxed
08-27-2004, 03:21 PM
Originally posted by Mr_X
What hardware do you use to get such fast transfers (MB, CPU, hard disk, raid controller, RAM, network cards, ...)?
If you read the rest of the thread, you'd see that both darkone and ADDiCT mentioned these tests are done with the file cached in memory. See ADDiCT's post above.

Originally posted by Mr_X
What windows version do you recommend to get the best speed?
All 5.x versions of Windows (2000/XP/2003) should perform somewhat similarly. Though, I'm sure Windows Server 2003 has a few under-the-hood tweaks that previous versions do not.

darkone
08-27-2004, 03:45 PM
How is it possible to improve performance so much? I don't think there are many ways to read a file and transfer it to a socket (perhaps I'm wrong).

On windows there are several ways of reading from a handle: synchronous ReadFile(), asynchronous ReadFileEx(), overlapped ReadFile() using events, and overlapped ReadFile() using io completion ports.. (not to mention the options that are available for sockets). The new core uses the most advanced method available: a single completion port, several simultaneous reads/writes per socket/file, and several threads polling the completion port. This makes it possible for several cpus to process notifications that arrive at one completion port; a rough sketch of the pattern is below. I can't give out the latest functional sources for you to see, but what I can give is: http://www.ioftpd.com/~darkone/tmp/newcore.txt ... that's the first implementation of the async processing line that I did. (the latest version is much better, and in most cases it can process the operations inside thread-safe areas within one quantum)
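
The basic shape of the pattern is this (heavily simplified; error handling and the actual dispatch are left out):

#include <winsock2.h>
#include <windows.h>

static HANDLE hIocp;   /* one completion port shared by all sockets/files */

DWORD WINAPI IoWorker(LPVOID lpParam)
{
    DWORD dwBytes;
    ULONG_PTR ulKey;
    LPOVERLAPPED lpOverlapped;

    /* Several threads block here; the kernel wakes one per completed
       operation, so notifications are processed on all cpus in parallel. */
    for (;;) {
        if (!GetQueuedCompletionStatus(hIocp, &dwBytes, &ulKey,
                                       &lpOverlapped, INFINITE)
            && lpOverlapped == NULL)
            break;                     /* port closed */
        /* dispatch on (ulKey, lpOverlapped): read done, write done, ... */
    }
    return 0;
}

void InitIo(void)
{
    SYSTEM_INFO si;
    DWORD i;

    GetSystemInfo(&si);
    hIocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    /* Each socket/file handle is later attached to the same port with
       CreateIoCompletionPort((HANDLE)hHandle, hIocp, ulKey, 0). */
    for (i = 0; i < si.dwNumberOfProcessors * 2; i++)
        CreateThread(NULL, 0, IoWorker, NULL, 0, NULL);
}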


What hardware do you use to get such fast transfers (MB, CPU, hard disk, raid controller, RAM, network cards, ...)?

2x 2.66Ghz Xeon, 2048Mb (dual channel), cached file, local transfer (no NIC involved, which gives worse results than you'd get with a transfer through a NIC)


What windows version do you recommend to get the best speed?

Any version of windows is ok; just make sure you set windows to use memory for the system cache rather than for programs. Server editions of windows have a longer socket listen queue, which makes them better suited for server use. However, io rarely needs the listen queue, as it can accept up to 10 connections simultaneously - 10 AcceptEx() requests are pending on each service socket (sketched below). But in case you're going to run it as an httpd server, which gets lots of short-lived connections, you might hit the wall using XP or 2000 Professional.
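
Posting the pending accepts looks roughly like this (simplified; the AcceptEx() pointer must first be fetched with WSAIoctl(SIO_GET_EXTENSION_FUNCTION_POINTER), and error handling is left out):

#include <winsock2.h>
#include <mswsock.h>
#include <stdlib.h>

#define PENDING_ACCEPTS 10

typedef struct _ACCEPT_CONTEXT {
    OVERLAPPED Overlapped;
    SOCKET     sAccept;
    char       pAddrBuf[2 * (sizeof(SOCKADDR_IN) + 16)];
} ACCEPT_CONTEXT;

void PostAccepts(SOCKET sListen, LPFN_ACCEPTEX lpfnAcceptEx)
{
    DWORD dwBytes;
    int i;

    /* Keep 10 AcceptEx() requests pending, so bursts of new connections
       complete through the io completion port without ever waiting in
       the kernel listen queue. */
    for (i = 0; i < PENDING_ACCEPTS; i++) {
        ACCEPT_CONTEXT *lpContext = calloc(1, sizeof(ACCEPT_CONTEXT));
        lpContext->sAccept = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
                                       NULL, 0, WSA_FLAG_OVERLAPPED);
        lpfnAcceptEx(sListen, lpContext->sAccept, lpContext->pAddrBuf, 0,
                     sizeof(SOCKADDR_IN) + 16, sizeof(SOCKADDR_IN) + 16,
                     &dwBytes, &lpContext->Overlapped);
    }
}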


Could you explain how to code such a function, or do you have a website that explains it:

It's a standard function pointer declaration. In some cases it's slower: when the function can be inlined and there are only a few entries in the array. However, e.g. the php source code could be optimized this way - I remember seeing switch() statements with something like 200 cases there... you can just imagine how long it takes to get to the last one.

Mr_X
08-27-2004, 05:46 PM
Thx for the info.
The new io seems promising ;)