
2 + 2 = 4, er, 4.1, no, 4.3... Nvidia's Titan V GPUs spit out 'wrong answers' in simulations


posted on Mar, 22 2018 @ 01:06 PM
Nvidia seems to be caught up in some interesting issues lately.

First there was the issue with their partner program, which had partners remove the "gaming" branding from products using competitors' chips.

Now, it seems there is an issue with scientific modelling on their Titan V cards, where some of the cards are spitting out wrong (or at least different) answers in modelling simulations at least 10% of the time.

Linky (Register)

These cards are 3 grand US a pop (so at least a bajillion Canadian Rupees), and are geared towards the scientific community rather than the average consumer/gamer.

I wonder what effect this might have on similar uses for these cards, like bitcoin mining.

Could this have a negative effect on the blockchain?



posted on Mar, 22 2018 @ 01:28 PM
a reply to: gspat

Or, Nvidia chips create new maths!


I would like to see some confirmation of this story other than an unnamed industry veteran and an anonymous engineer.



posted on Mar, 22 2018 @ 01:28 PM
a reply to: gspat

Bitcoin mining involves integer and bitwise calculations, so it isn't a problem there. When you are doing scientific calculations like fluid dynamics and finite-element analysis, the state of each cell is used to calculate the state of adjacent cells in the next generation, and that is done with floating-point calculations.

I've encountered that problem with multi-threaded calculations. I was trying to calculate the volume of a 3D blob-thing, so I took the volumes of all the itty-bitty tetrahedrons, added them up in separate threads, and then added those partial results together. Somewhere along the line the arithmetic errors cancel out in some places and accumulate in others, so the total error grows. Depending on the order of addition, the results would be different.

I really didn't want to investigate any further, so the quick fix was to use double precision and pretend it never happened.
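
For anyone curious, here's a minimal sketch of that order-dependence (my own toy example, not the actual volume code): summing the same single-precision numbers in two different orders gives two different totals, which is exactly what happens when threads hand back their partial sums in whatever order they finish.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <numeric>
#include <vector>

int main() {
    // One large value plus many tiny ones. Each tiny term is below the
    // rounding threshold of the running total once the large value is in it.
    std::vector<float> v(10000, 0.00001f);
    v.push_back(1000.0f);

    // Ascending order: the tiny terms accumulate before the large one arrives.
    std::sort(v.begin(), v.end());
    float ascending = std::accumulate(v.begin(), v.end(), 0.0f);

    // Descending order: the large value swamps every tiny term that follows.
    std::sort(v.begin(), v.end(), std::greater<float>());
    float descending = std::accumulate(v.begin(), v.end(), 0.0f);

    std::printf("ascending:  %.6f\n", ascending);   // close to 1000.1
    std::printf("descending: %.6f\n", descending);  // stays at 1000.0
    // Identical inputs, different totals: float addition is not associative.
    return 0;
}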



posted on Mar, 22 2018 @ 03:45 PM

originally posted by: stormcell
a reply to: gspat
...and pretend it never happened....


I KNEW IT! You mad scientists are creating a doomsday device after all! Covering up the data! Ripping holes in the fabric of reality, then sweeping them under the rug! For shame!



posted on Mar, 22 2018 @ 04:06 PM

originally posted by: stormcell
a reply to: gspat

Bitcoin mining involves integer and bitwise calculations, so it isn't a problem there. When you are doing scientific calculations like fluid dynamics and finite-element analysis, the state of each cell is used to calculate the state of adjacent cells in the next generation, and that is done with floating-point calculations.

I've encountered that problem with multi-threaded calculations. I was trying to calculate the volume of a 3D blob-thing, so I took the volumes of all the itty-bitty tetrahedrons, added them up in separate threads, and then added those partial results together. Somewhere along the line the arithmetic errors cancel out in some places and accumulate in others, so the total error grows. Depending on the order of addition, the results would be different.

I really didn't want to investigate any further, so the quick fix was to use double precision and pretend it never happened.


This. Your post is correct.


en.m.wikipedia.org...


The primary sources of floating point errors are alignment and normalization. Alignment is the shifting operation that must occur when adding or subtracting numbers with differing exponents. Normally, the fractional value (significand plus the hidden bit) is greater than or equal to 1 and less than 2. The process of shifting the significand to create this situation is called "normalization". Right shifting operations will produce digits in the result that no longer fit in the specified format. These bits can be dropped to round the result down,[34]:6 or can be used to round the result up. Some numbers cannot be accurately depicted in a fixed-format computer representation and must be rounded. Either case introduces a "rounding error", resulting in a slightly smaller or slightly larger representation of the real number represented. In a single operation, rounding may not introduce a significant error, but collectively, over a large number of operations, the result could be so erroneous as to be useless.

Similarly, when subtracting similar numbers, the value must be normalized by shifting the significand left. This introduces useless bits into the right-hand side of the significand, called cancellation.[34]:20 This may remove so many significant digits that the represented value is worthless, which is called "catastrophic cancellation". Cancellation otherwise still introduces error, called "benign cancellation".

Neither rounding error nor cancellation error is recognized or recorded by standard floating point operations, and therefore the results of floating point operations are equivocal.
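
To make the cancellation part concrete, here's a tiny sketch of my own (not from the Wikipedia article): subtract two nearly equal single-precision values and most of the "significant" digits turn out to be rounding noise.

#include <cstdio>

int main() {
    // Two nearly equal values; only the last couple of bits differ.
    float a = 1.0000001f;
    float b = 1.0000000f;

    // The true difference is 1e-7, but a float only carries about 7
    // significant decimal digits, so the trailing digits of a and b
    // were already rounding noise before the subtraction happened.
    float diff = a - b;

    std::printf("a - b      = %.10e\n", diff);
    std::printf("true value = %.10e\n", 1e-7);
    // The computed difference is off by roughly 20%: catastrophic cancellation.
    return 0;
}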


edit on 22/3/2018 by dug88 because: (no reason given)



posted on Mar, 22 2018 @ 04:35 PM
a reply to: NobodiesNormal

I've just sent this post to my mate from uni with a PhD. He'll have a handle on it in no time.



posted on Mar, 22 2018 @ 04:38 PM
It's always been a problem with accuracy, especially when you get into very large or small numbers, and especially over multiple iterations where small errors can soon multiply up.

It's a fun thing when you convert real numbers to integers: adding 2.3 to 2.2 gives 4.5, and rounding up makes that 5, but if you truncate to pure integers first it's 2 + 2 = 4.
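
A quick sketch of that truncate-versus-round difference (my own toy example):

#include <cmath>
#include <cstdio>

int main() {
    double a = 2.3, b = 2.2;

    // Truncate each value to an integer first: 2 + 2 = 4.
    int truncated = static_cast<int>(a) + static_cast<int>(b);

    // Add first, then round: 4.5 rounds away from zero to 5.
    long rounded = std::lround(a + b);

    std::printf("truncate-then-add: %d\n", truncated);
    std::printf("add-then-round:    %ld\n", rounded);
    return 0;
}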



posted on Mar, 22 2018 @ 04:52 PM
a reply to: dug88

Rounding error should not differ when performing exactly the same sequence of calculations, as per the issue with the nVidia cards.



posted on Mar, 22 2018 @ 06:26 PM

originally posted by: GetHyped
a reply to: dug88

Rounding error should not differ when performing exactly the same sequence of calculations, as per the issue with the nVidia cards.


Floating point errors are handled differently by different CPUs/GPUs... nvidia's are obviously just kinda #ty at it. The ARM processor on my phone will give me different floating point errors than the x86_64 one on my computer... different programming languages will give different and sometimes inconsistent results depending on the compiler or the machine they're compiled on. And if sufficiently large numbers are overflowing, the results will be inconsistent even on the same card, as values in memory change when they're overwritten by the overflowed values. Floating point arithmetic is just not reliable, even on $3000 GPUs.
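
One concrete way the "same" expression can come out differently on different hardware or compilers is fused multiply-add. This is just my own illustration of that general point, not a claim about what the Titan V is doing:

#include <cmath>
#include <cstdio>

int main() {
    // a * b is mathematically 1 - 2^-58, which doesn't fit in a double
    // and rounds to exactly 1.0.
    double a = 1.0 + std::ldexp(1.0, -29);   // 1 + 2^-29
    double b = 1.0 - std::ldexp(1.0, -29);   // 1 - 2^-29

    // Two rounding steps: the product rounds to 1.0, so the result is 0.
    double separate = a * b - 1.0;

    // One rounding step: fma keeps the full product, so the tiny
    // residual -2^-58 survives.
    double fused = std::fma(a, b, -1.0);

    std::printf("a*b - 1.0     = %.17g\n", separate);
    std::printf("fma(a,b,-1.0) = %.17g\n", fused);
    // Depending on compiler flags (e.g. -ffp-contract), the first line may
    // itself be computed with an fma, in which case the two match. Either
    // way, the answer depends on the toolchain and the hardware path taken.
    return 0;
}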



posted on Mar, 22 2018 @ 09:18 PM
Let's just blame it on the Mandela Effect and call it a night. Math works differently in this timeline.



posted on Mar, 22 2018 @ 09:55 PM
Every floating-point unit has a variety of rounding modes. The IEEE 754 specification provides four different settings. Just for the Intel CPU, there is a 16-bit control register that governs a wide combination of settings.

www.website.masmforum.com...

This Nvidia CUDA programming guide explains how the order of arithmetic calculations can affect the final result:

docs.nvidia.com...

CUDA intrinsics let the programmer set the rounding mode per instruction.
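
Here's a small host-side (CPU) sketch of those four IEEE 754 rounding modes; the per-instruction CUDA intrinsics mentioned above achieve the same effect on the GPU. This is my own example, not taken from the linked guides:

#include <cfenv>
#include <cstdio>

// Strictly, ISO C++ wants "#pragma STDC FENV_ACCESS ON" when changing the
// rounding mode; compilers vary in how well they honour it under optimisation.

static void divide_with(int mode, const char* name) {
    std::fesetround(mode);
    volatile double a = 1.0, b = 3.0;   // volatile keeps the divide at run time
    double q = a / b;
    std::printf("%-12s 1/3 = %.20f\n", name, q);
    std::fesetround(FE_TONEAREST);      // restore the default mode
}

int main() {
    divide_with(FE_TONEAREST,  "to-nearest");
    divide_with(FE_DOWNWARD,   "downward");
    divide_with(FE_UPWARD,     "upward");
    divide_with(FE_TOWARDZERO, "toward-zero");
    return 0;
}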



posted on Mar, 23 2018 @ 11:17 AM

originally posted by: GetHyped
a reply to: dug88

Rounding error should not differ when performing exactly the same sequence of calculations, as per the issue with the nVidia cards.


Not true. Floating point numbers are handled in an interesting way on computer chips. Binary can't actually represent 0.78, for example. You're probably familiar with binary for integers, so the number 10 would be 1010; it works similarly on the fractional side, except the first digit after the binary point represents 1/2, the next digit represents 1/4, and so on down the line. So with just a few bits, the closest it can get is to add 1/2 (.5) + 1/4 (.25) + 1/32 (.03125) and represent the number as 0.78125. Usually this will then be compensated for by another term that subtracts .00125, but the closest to that is 1/512, which is .001953125, so now your floating point number to represent 0.78 is 0.779296875. (A real single- or double-precision value carries many more bits and gets much closer, but still never exactly 0.78.) Floating point arithmetic is always tricky because it doesn't work with exact numbers. What's displayed to the user and what it's working with aren't the same thing.
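
An easy way to see the gap between what's displayed and what's stored (my own quick sketch):

#include <cstdio>

int main() {
    double x = 0.78;

    // The usual 2-digit display hides the error...
    std::printf("what you normally see: %.2f\n", x);

    // ...but the stored double is the nearest sum of powers of two,
    // which is not exactly 0.78.
    std::printf("what is stored:        %.20f\n", x);

    // A classic consequence of two inexact values being added:
    std::printf("0.1 + 0.2 = %.20f\n", 0.1 + 0.2);
    return 0;
}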



posted on Mar, 23 2018 @ 02:31 PM
a reply to: Aazadan

You are confusing things here. While floating point numbers have limited precision, you will still always get the same limited precision result from the same arithmetic operation.

This bug doesn't have anything to do with rounding. There is some speculation about memory being the issue. My guess is it's simply a hardware bug.

www.reddit.com...

... a simpler repeated loop shows errors with matrix multiplication and division, but only without the newer matrix math functions afaik.
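
For anyone who wants to try the idea behind that test, here's a CPU-side sketch of the methodology (not the code from the Reddit thread): run the identical calculation many times and flag any run whose result isn't bit-identical to the first. On a healthy chip the mismatch count should stay at zero; the Titan V reports say it doesn't.

#include <cstdio>
#include <cstring>
#include <vector>

// A small fixed matrix multiply, repeated many times with identical inputs.
static std::vector<double> multiply(const std::vector<double>& a,
                                    const std::vector<double>& b, int n) {
    std::vector<double> c(n * n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

int main() {
    const int n = 64, runs = 1000;
    std::vector<double> a(n * n), b(n * n);
    for (int i = 0; i < n * n; ++i) {
        a[i] = 1.0 / (i + 1);
        b[i] = 1.0 / (i + 2);
    }

    std::vector<double> reference = multiply(a, b, n);
    int mismatches = 0;
    for (int r = 0; r < runs; ++r) {
        std::vector<double> c = multiply(a, b, n);
        // Compare raw bytes: any difference at all counts as a failure.
        if (std::memcmp(c.data(), reference.data(), c.size() * sizeof(double)) != 0)
            ++mismatches;
    }
    std::printf("mismatching runs: %d of %d\n", mismatches, runs);
    return 0;
}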



posted on Mar, 23 2018 @ 06:08 PM
[embedded video]

posted on Mar, 23 2018 @ 06:30 PM
a reply to: Aazadan

This is not how rounding error works. It's not stochastic, it's very much predictable, which is why running the same sequence *should* deliver the same result, rounding error included. This is evidently not the case with these chipsets.



posted on Mar, 23 2018 @ 06:31 PM
a reply to: moebius

LOL!!!

I wish I had read your post before adding my video.

I was actually learning about this in class today and thought I would share it with the thread. Should have checked first.



posted on Mar, 23 2018 @ 06:32 PM

originally posted by: dug88

originally posted by: GetHyped
a reply to: dug88

Rounding error should not differ when performing exactly the same sequence of calculations, as per the issue with the nVidia cards.


Floating point errors are handled differently by different CPUs/GPUs... nvidia's are obviously just kinda #ty at it.


The key issue is that rounding error is not stochastic for a given chipset/architecture, even when it's "wrong". This is why the unpredictably erroneous results are a problem with these particular boards.
edit on 23-3-2018 by GetHyped because: (no reason given)



posted on Mar, 23 2018 @ 06:51 PM
There was the FDIV error in the early Pentiums, due to a chip design error (missing entries in the divider's lookup table).

edit:

FDIV being the floating-point divide instruction
edit on 3/23/2018 by roadgravel because: (no reason given)



posted on Mar, 23 2018 @ 07:02 PM

originally posted by: moebius
a reply to: Aazadan

You are confusing things here. While floating point numbers have limited precision, you will still always get the same limited precision result from the same arithmetic operation.

This bug doesn't have anything to do with rounding. There is some speculation about memory being the issue. My guess is it's simply a hardware bug.

www.reddit.com...

... a simpler repeated loop shows errors with matrix multiplication and division, but only without the newer matrix math functions afaik.


Different chipsets can use different rounding behaviour. Getting inconsistent results from the exact same hardware is something else, though. In that case there's something in the hardware that's either faulty or purposely non-deterministic.

Without knowing how cryptomining works (the actual algorithms, I mean), my guess is it's an attempt to stop that, considering NVidia is set to release some crypto-specific cards soon.



posted on Mar, 26 2018 @ 02:06 PM
After a good think, I would say that we could be hitting a bottleneck in chip performance. It could be the actual chip(s), the memory, or even the circuitry between the chips. Normally this would be a set of chips overclocked to the point where errors start to creep in due to poor heat dissipation, especially as it's a premium product already running at the max; running it somewhere in Texas may make it fail a lot more easily than somewhere in Canada.


