c# - Immutability and side effects with dataflow TransformBlocks - Stack Overflow


I'm currently reading about TPL dataflow pipelines and there's something I'm curious about when it comes to TransformBlocks. As I understand it, TransformBlock<TInput, TOutput> accepts an input of type TInput, transforms it, then returns the result as type TOutput, which can then be passed on to another block.

I think this works fine when the transformation changes the type itself, but what if the input and output types are the same, and the output is a reference to the input? Suppose the object is so large that copying or cloning it would be inefficient. Consider this example:

Suppose I have a string to which I need to apply a series of heavy concatenations sequentially. To save memory, I would use a StringBuilder. Here are the blocks:

var sbBlock = new TransformBlock<string, StringBuilder>(str => new StringBuilder(str));
var op1Block = new TransformBlock<StringBuilder, StringBuilder>(sb =>
{
    // call API
    // concat to sb
    return sb;
});
var op2Block = new TransformBlock<StringBuilder, StringBuilder>(sb =>
{
    // call API
    // concat to sb
    return sb;
});

sbBlock.LinkTo(op1Block, blockOptions);
op1Block.LinkTo(op2Block, blockOptions);

So it's really just a pipeline of TransformBlocks, but most of them just modify sb in place. When I thought about this, it looked concerning. Viewed as individual blocks, op1Block and op2Block have side effects yet return a value, which seems dangerous. Viewed as a whole pipeline, there should be no issue, since the state is never shared and the messages are passed in sequence, so the next block always gets the most up-to-date value. However, I could be wrong about this and would like clarification.

My questions:

  • Am I right with my observations?
  • Is this good practice? I am not sure whether the processing of sb can still be considered immutable across all blocks, or whether it might introduce issues down the line.
  • Does TPL dataflow have other ways to handle cases like this?

asked Mar 12 at 3:16 by lightning_missile; edited Mar 12 at 6:38 by Mark Seemann

2 Answers


I believe that in all block types, messages/objects are removed from the internal buffers once they have been propagated successfully to the next block. This means that changes made to the same object are isolated to the current step in the pipeline - no race conditions. Some example code to corroborate this (Release mode is necessary, so the JIT doesn't extend the lifetime of the foo local for debugging):

var foo = new object();
var wr = new WeakReference(foo);
var transformBlock = new TransformBlock<object, object>(o => o);
var transformBlock2 = new TransformBlock<object, object>(o => new object());
var actionBlock = new ActionBlock<object>(a => {
    Console.WriteLine(wr.IsAlive); // True
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    Console.WriteLine(wr.IsAlive); // False 
});

transformBlock.LinkTo(transformBlock2);
transformBlock2.LinkTo(actionBlock);
transformBlock.Post(foo);
Thread.Sleep(10000);

However, one problematic scenario I can think of is using a BroadcastBlock with a cloning function that just returns the same object:

public static void Main() {
    var mres = new ManualResetEventSlim(false);
    
    var broadCastBlock = new BroadcastBlock<A>(a => a);
    var transformBlock = new TransformBlock<A, A>(a => {
        a.SomeProperty = 4;
        mres.Set();
        Thread.Sleep(100);
        return a;
    });

    var transformBlock2 = new TransformBlock<A, A>(a => {
        mres.Wait();
        a.SomeProperty = 4444;
        return a;
    });

    var actionBlock = new ActionBlock<A>(a => {
        // here we have side effects from another transform block
        // than the one we linked to
        Console.WriteLine(a.SomeProperty);
    });

    broadCastBlock.LinkTo(transformBlock);
    broadCastBlock.LinkTo(transformBlock2);
    transformBlock.LinkTo(actionBlock);


    broadCastBlock.Post(new A());
    // 4444 will be printed
    Thread.Sleep(10000);

}

class A {
    public int SomeProperty { get; set; }
}

In this contrived example, the changes made by transformBlock (which actionBlock is linked to) are overwritten by the changes made by transformBlock2 (which actionBlock is NOT linked to).

A bit of speculation, but the presence of a cloning function in the BroadcastBlock API, and its absence from the constructors of the other block types, hints to me that your scenario would be fine, so long as the object is not changed sneakily by code that is not part of the pipeline (or of the methods called as part of its execution).
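For completeness, the fix for the contrived example above would be to make the BroadcastBlock's cloning function produce an actual copy, so each linked target gets its own instance. A minimal sketch, reusing the A class from the example:

```csharp
// Sketch: give BroadcastBlock a real cloning function so each linked
// target receives its own copy of the message, instead of a shared
// reference. Assumes the A class defined in the example above.
var broadCastBlock = new BroadcastBlock<A>(
    a => new A { SomeProperty = a.SomeProperty });
```

With this in place, transformBlock and transformBlock2 each mutate their own copy, and actionBlock only ever sees the changes made by the block it is linked to.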

TPL dataflow... Now, that's a name I haven't heard in a long time... A long time.

The concern expressed in the OP is understandable, and the example with a StringBuilder object is well-chosen. You can, however, extend that concern to most objects on .NET, since most objects in the .NET base class library are mutable, and passed around as objects.

Thus, as a completely general observation: in all mainstream programming languages, including C# and F#, if you're concerned about functional programming and referential transparency, the responsibility lies entirely with you. You'll have to maintain the discipline to keep things functional, including making sure that you only pass around immutable objects.

That's a general observation about doing any kind of functional programming on most languages. The only languages I know of that explicitly model the distinction between pure functions and impure actions are Haskell, Idris, and PureScript.

Is it going to be a problem with the TPL dataflow library?

I don't know. It's been more than ten years since I even looked at it, but in general, when you're dealing with Pipes and Filters architectures, you may still be okay passing mutable objects around like that. The most important rule to follow, as far as I can tell, is that you shouldn't keep references to mutable objects around, because that's likely to lead to aliasing problems.

On the other hand, if you only receive a mutable object as input, do something to it, and then pass it on to the next filter in the pipeline, you'll effectively have 'serialized writes', which isn't much different from a normal program that first does one thing, then another, and so on, to a mutable object.
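That 'serialized writes' pattern can be sketched concretely with the OP's StringBuilder pipeline (the step names and appended text are illustrative, not from the original): each block mutates the object it received and then hands it on, so no two blocks ever touch the same instance concurrently.

```csharp
using System;
using System.Text;
using System.Threading.Tasks.Dataflow;

// Sketch of 'serialized writes': each block mutates the StringBuilder
// it received, then passes it to the next block. A message is owned by
// one block at a time, so the writes happen strictly one after another.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

var sbBlock = new TransformBlock<string, StringBuilder>(str => new StringBuilder(str));
var op1Block = new TransformBlock<StringBuilder, StringBuilder>(sb => sb.Append(" step1"));
var op2Block = new TransformBlock<StringBuilder, StringBuilder>(sb => sb.Append(" step2"));
var sink = new ActionBlock<StringBuilder>(sb => Console.WriteLine(sb));

sbBlock.LinkTo(op1Block, linkOptions);
op1Block.LinkTo(op2Block, linkOptions);
op2Block.LinkTo(sink, linkOptions);

sbBlock.Post("start");
sbBlock.Complete();
await sink.Completion; // prints "start step1 step2"
```

This works because StringBuilder.Append returns the same instance, so the lambda both mutates and forwards the object in one expression, just like the OP's blocks.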

All that said, if you want to be on the safe side, favour immutable objects. In C#, favour records over classes. And while it's fine to worry about performance, why not start with immutable data, and only change to mutable objects if you've measured and found that this significantly improves performance?
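A minimal sketch of what that record-based alternative might look like (the Payload type is hypothetical): each block returns a new instance via a with-expression instead of mutating the message, so aliasing between pipeline stages becomes harmless by construction.

```csharp
using System.Threading.Tasks.Dataflow;

// Sketch: immutable messages via a record. Each block produces a
// modified copy with a with-expression; the original is never mutated.
var op1 = new TransformBlock<Payload, Payload>(p => p with { Text = p.Text + " step1" });
var op2 = new TransformBlock<Payload, Payload>(p => p with { Text = p.Text + " step2" });
op1.LinkTo(op2, new DataflowLinkOptions { PropagateCompletion = true });

// Hypothetical immutable message type; 'with' creates a modified copy.
record Payload(string Text);
```

Whether the extra allocations matter is exactly the kind of thing to measure before abandoning this design.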

Usually, you'll find that immutable data structures can be quite efficient, too. In my experience, they perform just fine, and significant performance improvements are usually to be found in choosing an appropriate algorithm or architecture, rather than worrying about small-scale data structures.
