c++ - Using vector types vs custom structures for 256-bit numbers in CUDA - Stack Overflow


I’m working on implementing 256-bit number arithmetic in CUDA for operations like addition with carry propagation, subtraction with borrow, and comparison. Initially, I defined a custom structure to represent the 256-bit number:

struct big_number_t {
    unsigned long long s0; // Least significant limb
    unsigned long long s1;
    unsigned long long s2;
    unsigned long long s3; // Most significant limb
};

However, I came across references to vector types in CUDA and found there are built-in types like ulong4 and ulonglong4, so now I'm wondering whether it's better to use ulonglong4 or to stick with my custom structure for representing 256-bit numbers. My primary concerns are:

  1. ulong4 vs. ulonglong4: I looked at the CUDA 12.8 documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types-alignment-requirements-in-device-code) and I am unable to determine which of these types is the right one to use.
  2. Memory access efficiency: Will using vector types lead to better coalesced memory access or more optimized loads/stores compared to the custom structure?
  3. Computational performance: For operations that require carry propagation (such as addition and subtraction across the four limbs), is there any advantage to using a vector type over the custom structure? Since carry propagation is sequential, I'm not sure vector types offer any benefit here.

Below is an example of how I might utilize the custom structure in code:

struct big_number_t {
    unsigned long long s0;
    unsigned long long s1;
    unsigned long long s2;
    unsigned long long s3;
};

__device__ big_number_t bn_add(const big_number_t& a, const big_number_t& b, unsigned long long& carry_out) {
    big_number_t result;
    unsigned long long carry = 0;

    result.s0 = a.s0 + b.s0;
    carry = (result.s0 < a.s0) ? 1 : 0;

    // Carry out of this limb occurred iff the wrapped sum is below a's limb, or the
    // carry-in was 1 and the sum equals a's limb exactly (b's limb was all ones).
    unsigned long long sum = a.s1 + b.s1 + carry;
    carry = (sum < a.s1 || (carry && sum == a.s1)) ? 1 : 0;
    result.s1 = sum;

    sum = a.s2 + b.s2 + carry;
    carry = (sum < a.s2 || (carry && sum == a.s2)) ? 1 : 0;
    result.s2 = sum;

    sum = a.s3 + b.s3 + carry;
    carry = (sum < a.s3 || (carry && sum == a.s3)) ? 1 : 0;
    result.s3 = sum;

    carry_out = carry;
    return result;
}

Would replacing big_number_t with ulonglong4 or ulong4 offer any performance benefits, or is it mostly a matter of preference?

I appreciate any insights or experiences from those who have worked with large integers in CUDA. Thank you!

asked Mar 11 at 13:37 by Mehdi Yeganeh
  • 1 Oh wait, you said CUDA. Sorry, I've missed that part. – Intelligent Shade of Blue Commented Mar 11 at 13:57
  • 2 AFAIK ulonglong4 should be faster since it can perform wider loads, but you should really profile this with Nsight (maybe both are already memory bound). In the current version, if the compiler can reorder loads so as to avoid dependencies resulting in stalls, it might be close performance-wise. In practice, it seems to reorder stores but not loads (see on Godbolt -- IDK why SASS code cannot be generated anymore). – Jérôme Richard Commented Mar 11 at 14:22
  • 2 If you define the struct as alignas(ulonglong4), the compiler will use the same vectorized load instructions, see here for a modified godbolt link. Note that the largest load instruction is 128 bit, so you always get 2 instructions. This will impact coalesced memory access. Maybe storing in two arrays, one for the low half, one for the high half is better since it allows fully coalesced access – Homer512 Commented Mar 11 at 17:01
  • 1 Side note if you want to implement multiplication: The hardware of many (all current?) CUDA platforms can only do 16-bit integer multiplication. I vaguely remember someone mentioned that, knowing this, they can save instructions building their arbitrary-length multiplication out of 16-bit multiplications – Homer512 Commented Mar 11 at 17:06
  • 3 These may be of interest: 1 2. Just to concur with what @paleonix said, a simple test on godbolt seems to show 3 XMAD (16-bit multiply) instructions for sm_52 and sm_60, and a single IMAD for sm_70, so I think sm_70 and beyond have a full 32-bit integer multiplier in the integer multiply functional unit. – Robert Crovella Commented Mar 11 at 21:06

1 Answer


Some of this is already covered in comments below the question.

ulong4 or ulonglong4

The first thing I would suggest is to use your C++ knowledge and note that these vector types are not entirely opaque. On a standard Linux CUDA install I find the file "vector_types.h" in /usr/local/cuda/include. It has the following definitions, for example:

struct __device_builtin__ __builtin_align__(16) ulonglong4
{
    unsigned long long int x, y, z, w;
};
struct __device_builtin__ __builtin_align__(16) ulong4
{
    unsigned long int x, y, z, w;
};

They are basically quite similar except for the underlying base type: unsigned long long int vs. unsigned long int. On the Linux platforms supported by CUDA there should be no difference between these types, but on Windows the first type is a 64-bit unsigned integer whereas the second is a 32-bit unsigned integer. This is a peculiarity of Windows (LLP64) compared to Linux (LP64); it is not unique or specific to CUDA, but CUDA doesn't deviate from it. So my guess would be that you want to use unsigned long long int or ulonglong4.
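
If you want to verify which type actually gives you 64-bit limbs on your platform, a quick check along these lines (my own illustration, compiled as a .cu file with nvcc so the vector types are visible) prints the sizes directly:

#include <vector_types.h>  // ships with CUDA; defines ulong4 and ulonglong4
#include <cstdio>

int main() {
    // On 64-bit Linux both base types are 8 bytes; on Windows unsigned long is
    // only 4 bytes, so ulong4 would give 128 bits of storage rather than 256.
    printf("unsigned long      : %zu bytes\n", sizeof(unsigned long));
    printf("unsigned long long : %zu bytes\n", sizeof(unsigned long long));
    printf("ulong4             : %zu bytes, alignment %zu\n", sizeof(ulong4), alignof(ulong4));
    printf("ulonglong4         : %zu bytes, alignment %zu\n", sizeof(ulonglong4), alignof(ulonglong4));
    return 0;
}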

Memory access efficiency

I suggest reading this section of the programming guide. It covers several key concepts. The important thing to note about the above struct definitions is the __builtin_align__(16) directive. If you build your custom structs with a similar directive e.g.:

struct __align__(16) {...

there should be no functional or behavioral difference in terms of memory efficiency. I'm not going to cover all the concepts of the CUDA requirement for natural alignment; it's covered in the section I linked. But from a memory access efficiency perspective, the align directive allows a properly written load of the structure to take place as two 128-bit loads rather than a larger number of 64-bit or 32-bit loads.

A bigger memory efficiency issue with the structure load is that loading 256 bits per thread (broken into two adjacent 128-bit loads) will possibly be less efficient than loading adjacent 128-bit quantities in adjacent threads. I have no doubt that this sort of efficiency is one of the things the CGBN designers were shooting for, and one of the reasons they chose to do the work in multiple threads per 256-bit element rather than a single thread per 256-bit element. However, on a modern GPU I think the efficiency difference here may be small.

Another reason to prefer multiple threads is if the scope of the work is too small. A GPU works best when many threads are engaged. If you worked on, say, 1024 of your big integers at once, in 1024 threads, that is only a small amount of parallelism exposed to the GPU. But if you can instead effectively have, say, 8 threads work on each number (doing the work in 32-bit chunks), that is going to work out much better from an exposed-parallelism perspective.

That isn't so much of a concern if you have a large amount of exposed parallelism (say, 1 million numbers to work on at once), so if that is the case I would say proceed with whatever path seems best, and only consider CGBN if performance is still a concern.
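
To make the alignment point concrete, here is a minimal sketch (my own, not taken from the comments or the linked godbolt) of the aligned struct and a trivial kernel; with the __align__(16) qualifier, the compiler can typically service each per-thread copy with two 128-bit loads and two 128-bit stores:

struct __align__(16) big_number_t {
    unsigned long long s0, s1, s2, s3;
};

__global__ void copy_numbers(const big_number_t* __restrict__ in,
                             big_number_t* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // candidate for two 128-bit load/store pairs per thread
}

Whether this per-thread layout or a split low-half/high-half layout (as suggested in the comments) coalesces better for your access pattern is something the profiler can tell you quickly.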

Computational performance

As discussed in the comments, there isn't any GPU that does integer work wider than 32 bits per SASS instruction, so I would suggest just proceeding with what you have in mind. The structure design isn't going to affect the math, because NVIDIA GPUs compute integer arithmetic in 32-bit quantities, so larger integers end up being handled as sequences of 32-bit operations anyway. By the time you actually get to doing arithmetic, you will not be operating on the structure directly, but on its components.

If you want examples of well-crafted routines, these 1 2 may be of interest, but they deal with 128-bit integers, not 256-bit integers, so they would require some extension. If that seems complicated, then if it were me, I would simply start out with a C++ realization of the arithmetic that makes sense to you, and only consider more exotic things if performance is a concern and the profiler steers you in a particular direction.
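
For illustration only, here is a sketch of what a 256-bit add might look like using the PTX extended-precision add instructions (add.cc.u64 / addc.cc.u64) via inline assembly, in the spirit of the 128-bit routines linked above; this is my own extension to four limbs, assuming the s0..s3 layout from the question, and is untested:

__device__ big_number_t bn_add_cc(const big_number_t& a, const big_number_t& b,
                                  unsigned long long& carry_out)
{
    big_number_t r;
    unsigned long long c;
    // add.cc sets the carry flag; addc.cc consumes it and sets it again.
    // The compiler still lowers each 64-bit add to a pair of 32-bit SASS adds.
    asm("add.cc.u64   %0, %5,  %9;\n\t"
        "addc.cc.u64  %1, %6, %10;\n\t"
        "addc.cc.u64  %2, %7, %11;\n\t"
        "addc.cc.u64  %3, %8, %12;\n\t"
        "addc.u64     %4,  0,   0;"     // capture the final carry-out as 0 or 1
        : "=l"(r.s0), "=l"(r.s1), "=l"(r.s2), "=l"(r.s3), "=l"(c)
        : "l"(a.s0), "l"(a.s1), "l"(a.s2), "l"(a.s3),
          "l"(b.s0), "l"(b.s1), "l"(b.s2), "l"(b.s3));
    carry_out = c;
    return r;
}

Whether this actually beats the plain C++ version in the question is exactly the kind of thing to settle with the profiler rather than by inspection.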
