From: sirjofri
Date: Mon, 5 Aug 2024 14:26:44 +0200 (GMT+02:00)
To: 9front@9front.org
In-Reply-To: <20240805110501.asula7k52eo5gdld@black>
Subject: Re: [9front] Thoughts on Wayland?

05.08.2024 13:09:41 Shawn Rutledge:

> Dedicated GPUs are like that, so portable APIs need to support working
> that way. But often on integrated graphics, GPU memory is just a chunk
> of system memory, which makes the "upload" trivial in practice. Perhaps
> it can even be zero-copy sometimes, if there's a way to just map a chunk
> into GPU-managed space after it's populated?

I don't know that much about integrated graphics, but in the end they plaster the whole beast with an API middleware like OpenGL or DirectX, which takes care of everything that happens underneath. I assume these also handle copying or mapping with integrated graphics. With dedicated GPUs, they surely upload the data, either directly (blocking) or nonblocking (save it until the GPU is ready, then upload).
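To make the two paths concrete, here is a minimal sketch in Vulkan C. It assumes the device, buffers, memory and a command buffer were created elsewhere, the function names are mine, and synchronization and error handling are left out:

/* Sketch of the two upload paths: mapping host-visible memory
 * (integrated/unified memory) vs. a staging copy into device-local
 * memory (dedicated GPU). Handles are assumed to exist already. */
#include <string.h>
#include <vulkan/vulkan.h>

/* Unified memory case: the buffer's memory is HOST_VISIBLE, so
 * "uploading" is just mapping it and writing into it. */
void
upload_mapped(VkDevice dev, VkDeviceMemory mem, const void *data, VkDeviceSize size)
{
	void *p;

	vkMapMemory(dev, mem, 0, size, 0, &p);
	memcpy(p, data, (size_t)size);
	vkUnmapMemory(dev, mem);	/* plus vkFlushMappedMemoryRanges if not HOST_COHERENT */
}

/* Dedicated GPU case: write into a HOST_VISIBLE staging buffer, then
 * record a copy into the DEVICE_LOCAL buffer; the transfer happens
 * once the command buffer is submitted and the GPU gets to it. */
void
upload_staged(VkDevice dev, VkCommandBuffer cmd,
	VkBuffer staging, VkDeviceMemory stagingmem,
	VkBuffer devicelocal, const void *data, VkDeviceSize size)
{
	void *p;
	VkBufferCopy region = { 0, 0, size };

	vkMapMemory(dev, stagingmem, 0, size, 0, &p);
	memcpy(p, data, (size_t)size);
	vkUnmapMemory(dev, stagingmem);
	vkCmdCopyBuffer(cmd, staging, devicelocal, 1, &region);
}

On unified memory the first path is basically all there is; on a dedicated card the second one is the usual route.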
> It's very useful for AI applications when there's a lot of system memory
> and the boundary for "dedicated" GPU memory is fully flexible, as on
> Apple's unified memory. (For example I found that I can run a fully
> offline LLM with ollama on a modern mac; and after that, I stay away
> from all the online LLMs... I don't trust them, they are ultimately
> greedy, and if something is "free" it must be that we're helping them
> train it better by using their throttled interfaces.) So I'm not sure,
> but maybe we can expect that kind of architecture to be more common in
> the future.

Probably, who knows. The GPU is becoming a core part of any system, and AMD is working on having dedicated GPU power in the same package as the CPU. Modern APUs (as in handheld gaming consoles) are incredibly powerful (still not comparable to true dedicated graphics, but you can run modern games on them), and ARM also follows the market of integrating GPUs (Mali and the like). I assume that with the rise of AI in standard computer systems, the market is pushed towards having dedicated GPU power, and unified memory (Apple) sounds like an interesting way to go forward.

> And it would be nice to have a way to avoid spending system memory at
> all for GPU resources, to be able to stream directly from the original
> source (e.g. a file) to GPU memory. This is an issue in Qt (my day
> job)... we have caching features in multiple places, we try to avoid
> reading files and remote resources more than once; but if you're playing
> an animation, and the desired end result is to have the frames as GPU
> textures, and you have enough GPU memory, then it should be ok to lose
> the CPU-side cache. Especially if the cache was not already in
> GPU-texture form. Decoding simple formats like png and gif is fast
> enough that it doesn't even matter if you need to do it multiple times:
> not worth caching frames from them, IMO, unless the cache is on the GPU
> side. But I think the tradeoffs are different for different sizes. In
> some cases, an animated gif can be very compact and yet a full set of
> decoded frames can be enormous, so it doesn't make sense to cache it
> anywhere. Decoding one frame at a time is the cheapest. Even if you
> had to keep re-reading the file, doesn't the OS cache the file contents
> in RAM anyway? (A controversial position, I'm sure.) So how and whether
> to decide at runtime what to cache how and where, or leave it up to the
> application developer by exposing all the suitable APIs, is Qt's
> problem... sorry for the digression, my point is just that
> upload-and-forget is not the only way that a GPU needs to be used.
> Likewise large games are often streaming assets and geometry to the GPU
> more or less continuously, from what I've heard: depends which assets
> they can reasonably expect to be reused and to have enough GPU memory to
> retain them, I suppose?

Streaming textures (and other data) to the GPU is indeed a complex topic. Large games usually calculate what's needed, then load the data that's not already on the GPU and upload it. They draw a clear distinction between streamed data (e.g. world textures) and non-streamed data (e.g. UI textures).

Microsoft is also working on DirectStorage, which makes decoding/unpacking on the CPU obsolete by just uploading the data to the GPU and unpacking it there. I think they are also working on using other bus systems to transfer the data, but as far as I know, as the hardware is currently built, that's not easily possible and you always need the CPU (and system memory) for streaming. I don't know the exact details though...

Modern game engines (I'm biased towards Unreal) have really perfected streaming, especially considering the incredible speeds of SSDs. Nanite, for example, describes a hierarchy of clusters of polygons, each cluster with its own bounds. The cluster data is uploaded to the GPU, and the GPU (shader) does some fancy culling (frustum, occlusion, and size). This data is then used to "tell" the CPU which polygons (clusters) to load, and only that data is loaded. It is also kept on the GPU for as long as it's needed and reused each frame. So per frame, only new clusters are streamed in, for example when moving around. In practice, each frame is slightly different, so you always stream a few clusters here and there, but compared to full draw calls like in a classic pipeline, that's a huge difference!
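A very rough CPU-side sketch of that kind of cluster selection, heavily simplified (real Nanite does the culling on the GPU and feeds the results back; the structures and the upload() step here are hypothetical):

/* Clusters have bounding spheres; per frame, only clusters that
 * survive culling and are not yet resident on the GPU get streamed. */
typedef struct {
	float x, y, z;
} Vec3;

typedef struct {
	Vec3 normal;	/* plane normal, pointing into the frustum */
	float d;	/* plane: dot(normal, p) + d >= 0 means inside */
} Plane;

typedef struct {
	Vec3 center;	/* bounding sphere of the polygon cluster */
	float radius;
	int resident;	/* already uploaded to GPU memory? */
} Cluster;

static float
dot(Vec3 a, Vec3 b)
{
	return a.x*b.x + a.y*b.y + a.z*b.z;
}

/* a sphere is outside the frustum if it is fully behind any plane */
static int
visible(Cluster *c, Plane frustum[6])
{
	int i;

	for(i = 0; i < 6; i++)
		if(dot(frustum[i].normal, c->center) + frustum[i].d < -c->radius)
			return 0;
	return 1;
}

/* per frame: stream in only visible clusters that are not resident yet */
void
streamclusters(Cluster *clusters, int n, Plane frustum[6])
{
	int i;

	for(i = 0; i < n; i++){
		if(!visible(&clusters[i], frustum))
			continue;
		if(clusters[i].resident)
			continue;	/* reused from GPU memory, nothing to do */
		/* upload(&clusters[i]);  -- hypothetical upload step */
		clusters[i].resident = 1;
	}
}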
> It's also my daydream to get the GPU to take care of UI rendering more
> completely, even for simple 2D stuff, and free up the CPU. It's one
> thing I'm hoping to achieve with my 9p scenegraph project (which is
> a slow-moving side project, not a Qt project). But in general, there
> might also be a backlash against excessive GPU usage coming, if people
> expect to use the GPU mainly for "hard problems" or embarrassingly-
> parallel algorithms like AI and 3D graphics, and not load it down with
> simple stuff that the CPU can just as well do for itself. And battery
> consumption might be a concern sometimes too. My attitude towards old
> CPU-based paint engines like draw and QPainter has been kindof negative
> since I started at Qt, because we've been trying to sell the idea that
> you have a GPU, so you might as well use it to get nice AA on all your
> graphics, animations "for free", alpha blending, and stuff like that. I
> still think AA is really a killer feature though. Just about makes
> 2D-on-the-gpu worthwhile all on its own. But Plan 9's draw could not
> have AA on everything, could it?

In fact, GPUs can still be used as 2d accelerators. In the end, it comes down to how you program them.

In the beginning, the GPU was more like devdraw: it could only draw 2d stuff based on simple draw calls. With shaders, you are free to do what you want. You can, in fact, upload some scene data structure, and your shader does interpolation, rasterization, etc. Sometimes it's even cheaper to do that in a shader than to use the dedicated hardware components.

> So while _I'm_ still interested in 2D on the GPU, I admit that you might
> be onto something with your gpufs proposal, to focus on treating it more
> like a computing resource than a fundamental piece of the graphics
> pipeline. But I think we should have both options.  ;-)

The trend goes towards indirect rendering for good reasons. For example, the standard graphics pipeline is very strict about what it does, what it expects, and how it works. But your specific application maybe doesn't need all the components, and maybe needs some components to work differently.

Nanite is also an example of this: large triangles are faster to rasterize on the hardware, using the dedicated hardware rasterizer, but smaller triangles are faster to render with a custom rasterizer. It becomes even more complex when thinking about other use cases, like particle rendering and displacement.

Additionally, when using the standard pipeline, you have to run a lot of boilerplate code that you don't really need. For Nanite, they even plan to implement compute-based shading (running the pixel shader as a compute shader), and they expect a performance gain.

I expect that at some point we won't have the hardware for this specialized rendering pipeline anymore. It will still exist as a concept, but the API (OpenGL, DirectX, ...) will simulate it.
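For reference, a minimal sketch of what indirect rendering looks like at the API level in Vulkan: the draw parameters live in a GPU buffer (typically filled by a culling compute shader), and the CPU only says "draw whatever ended up in that buffer". The handles are assumed to exist and the buffer is assumed to be filled beforehand; pipeline and index buffer binding are left out:

#include <vulkan/vulkan.h>

/* The GPU-side code writes an array of this (standard Vulkan) struct:
 *	typedef struct VkDrawIndexedIndirectCommand {
 *		uint32_t indexCount;
 *		uint32_t instanceCount;
 *		uint32_t firstIndex;
 *		int32_t  vertexOffset;
 *		uint32_t firstInstance;
 *	} VkDrawIndexedIndirectCommand;
 */

void
drawindirect(VkCommandBuffer cmd, VkBuffer drawbuf, uint32_t ndraws)
{
	/* one CPU-side call, ndraws GPU-decided draws */
	vkCmdDrawIndexedIndirect(cmd, drawbuf, 0,
		ndraws, sizeof(VkDrawIndexedIndirectCommand));
}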
>> With complex applications with hungry code and hungry graphics (many primitive draws)
>
> Many draw calls are the enemy of performance when it comes to GPU
> graphics. This is the main reason for needing a scene graph: first get
> the whole collection of everything that needs to go to the screen
> together into one pile, in an efficient form that is quick to traverse,
> and then traverse it and figure out how to combine the draw calls.
> (There is an impedance mismatch between any turtle-graphics or
> paint-engine API, and the GPU. You can solve it if you can intercept the
> draw calls and have them only populate the scene graph instead. Never do
> any drawing immediately on-demand. Or, give up on the imperative API
> altogether: be declarative.) This is the thing that Qt Quick is good at.
>
> The interesting question for me now is how best to map that idea to a
> filesystem.

It's not uncommon nowadays to have a command list builder that accepts commands from many different CPU threads. Those command lists are then submitted in bundles, which reduces the number of draw calls.

> Are you trying to have gpufs as an endpoint for uploading assets that
> could be invoked via draw calls too? Or just as a way to invoke the GPU
> to do embarrassingly-parallel computation at whatever stage one wants to
> use it for? (Why not make it possible to use it both ways, eventually?)
> But I would expect that this fs will have a close-to-the-hardware
> design, right?

Currently, gpufs is designed after Vulkan, which is much closer to the hardware than OpenGL. It focuses on interacting with the hardware instead of drawing graphics.

I'd like to treat assets and graphics as "just data". Your application (including shaders) defines what the data is. It could be a 3d model, a texture, a frame buffer, an animation; it fully depends on the application.

That way, you can upload your assets and your programs, and in the end you get the final frame as data, which can be interpreted as an image, for example. You could, however, also upload assets and programs and get back final data that is an animation. As a game developer, I have to build graphics though...

sirjofri