jsburbidge: (Default)
 There's an old trick for searching a C-style string that you actually own (you couldn't do this inside an implementation of strchr(), for instance, which takes a const pointer and so must not modify the string): you can use a sentinel to speed up searching.
 
If you have a 300-character C-style string that is not in read-only memory and is safe to modify, and you want to find the first instance of a character in it, you can use std::strchr(), but then at every point in the search the function has to check two conditions: first, is this the character you are searching for, and second, is it the null/end-of-string character? Or you can very temporarily write the character you are looking for over that terminal null and use memchr() instead, restoring the null when the call is over. If the call returns the address of the artificial sentinel, there were no real instances in the string.
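
A minimal sketch of the trick, assuming a writable buffer whose length is already known (the function name is mine):

#include <cstddef>
#include <cstring>

// Find the first occurrence of 'ch' in the writable, null-terminated buffer
// 'str' of length 'len' (not counting the terminator), or return nullptr
// if it is absent.
char* find_with_sentinel(char* str, std::size_t len, char ch)
{
    str[len] = ch;                       // overwrite the terminating null
    char* hit = static_cast<char*>(std::memchr(str, ch, len + 1));
    str[len] = '\0';                     // restore the terminator
    return (hit == str + len) ? nullptr : hit;
}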
 
Note that in this context this is a pure, old-fashioned optimization, of the sort that you don't do unless it's genuinely useful, as it complicates the code and makes it more brittle. That being said, it's a very long-established trick which shouldn't be particularly confusing.
 
However, there are other domains where using a sentinel can be a win from the design and maintenance perspectives as well as from the efficiency one.
 
I had a set of conditions, settable at run time by command-line options, which set filters on a set of records. These were implicitly anded together - if a record didn't match a criterion, it was out. (There was special handling to allow two or more conditions of certain types to be ored together.)
 
Writing this as a hardcoded set of if/else choices would have been horrendous. So I implemented them as a family of function objects stored in a vector. The conditions could be set up initially in a fairly straightforward manner, and if you put the most common reasons for filtering out up front, you could reduce the expected number of comparisons. The test itself then becomes a simple std::any_of call.
 
However, if a record was filtered out, there was still some work to do; this wasn't a simple case where something happens only when an item is found. So it looked like
 
if (std::any_of(a.begin(), a.end(), [&](const auto& arg) {... condition ...}))
{
// Do some stuff
// Return X
}
else
{
// Do some other stuff
// Return Y
}
 
This is ugly. Its maintainability isn't awful, but it's not great, either. And every run has an extra branch after the STL algorithm finishes.
 
(Branching can mess with pipelining and slow down performance, in addition to being, frequently, a maintenance problem point. In many cases both clarity and efficiency argue in favour of replacing branching with polymorphism.)
 
But I had created these function objects. I could do anything I wanted with their interfaces. (If I hadn't, I could have used a wrapper.) So I added an additional execAfterFind() function to the function object, and all of the real criteria for exclusion had a (common, inherited) implementation corresponding to the if part of the test. I then created a new type which *always* matched and placed it at the end of the vector of tests in every case. It, and it alone, had an implementation of the new function corresponding to the else branch.
 
Now the call looked roughly like this:
 
auto foo = std::find_if(a.begin(), a.end(), [&](const auto& arg) {... condition ...});
foo->execAfterFind();
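
Here is a minimal sketch of the shape of this; apart from execAfterFind(), every name (and the toy Record type and criterion) is mine rather than from the actual code:

#include <algorithm>
#include <memory>
#include <vector>

struct Record { int value; };   // stand-in for the real record type

class ICriterion
{
public:
    virtual ~ICriterion() { }
    virtual bool matches(const Record& inRec) const = 0;   // the filter test
    virtual void execAfterFind() const = 0;                // the follow-up work
};

// One of the real exclusion criteria; its execAfterFind() carries the
// old "if" branch.
class NegativeValueCriterion : public ICriterion
{
public:
    virtual bool matches(const Record& inRec) const { return inRec.value < 0; }
    virtual void execAfterFind() const { /* do the "filtered out" work */ }
};

// The sentinel: always matches, sits at the end of the vector, and its
// execAfterFind() carries the old "else" branch.
class SentinelCriterion : public ICriterion
{
public:
    virtual bool matches(const Record&) const { return true; }
    virtual void execAfterFind() const { /* do the "kept" work */ }
};

void process(const Record& rec,
             const std::vector<std::unique_ptr<ICriterion> >& tests)
{
    // The sentinel guarantees a match, so find_if never returns end().
    auto it = std::find_if(tests.begin(), tests.end(),
        [&](const std::unique_ptr<ICriterion>& c) { return c->matches(rec); });
    (*it)->execAfterFind();
}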
 
This is cleaner overall, not only at this site. What about performance?
 
For the case where the record ends up not being filtered out, there's probably no gain: unless a really good optimizing compiler optimizes the test away on the "always matches" object through a polymorphic call (unlikely) the new object just moves an if/else test around. There might be a small cache benefit because all the object tests with their functions were allocated one after another and we just *might* have improved some cache locality, but I wouldn't count on that, either.
 
However, most records are expected to be filtered out. Consider a record that gets booted by the first, most likely, test. In the old implementation there were two successive branches, one for the test, one for the branch after the STL algorithm has run its course. Now there is only the one branch. So we probably gained overall; we're certainly not likely to have made anything worse. So we have an improvement in readability / maintainability and efficiency, both at once.
 
If your concern were strictly time optimization, and you have the space and a known, small enough set of criteria, this is, by the way, not the way to go about it. For that, you give every condition its own return value as a distinct power of 2 and use std::accumulate rather than anything with a test. After running std::accumulate you can use if/else if all you care about is whether anything matched at all; otherwise, use an array of 256 (or 128, or whatever best suits your use case; table sizes corresponding to larger sets are probably not ideal unless you really need the speed over the space[1]) function objects addressed from a known point, and just invoke them with the returned value as the array index. I do not recommend jump tables of this sort as an approach supporting maintainability, though: they are tremendously fragile in the face of code changes. They can be extremely fast.
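
A sketch of that bitmask-plus-jump-table variant; all the names and toy types here are mine, not from the actual code:

#include <array>
#include <memory>
#include <numeric>
#include <vector>

struct Record { int value; };

class IWeightedCriterion
{
public:
    virtual ~IWeightedCriterion() { }
    // Returns this criterion's own power of two on a match, 0 otherwise.
    virtual unsigned weight(const Record& inRec) const = 0;
};

class IAction
{
public:
    virtual ~IAction() { }
    virtual void execAfterFind() const = 0;
};

void process(const Record& rec,
             const std::vector<std::unique_ptr<IWeightedCriterion> >& tests,
             const std::array<std::unique_ptr<IAction>, 256>& actions)
{
    // Every test is visited; no short-circuiting and no branching here.
    unsigned val = std::accumulate(tests.begin(), tests.end(), 0u,
        [&](unsigned acc, const std::unique_ptr<IWeightedCriterion>& c)
        { return acc | c->weight(rec); });
    actions[val]->execAfterFind();   // jump table: the index selects the handler
}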
 
Even a simple array of two functions can be used if the conditions accumulate down to a single 0-or-1 value:
 
std::array<std::unique_ptr<IFoo>, 2> funcs;
//Set up array
...
int val = std::accumulate(... parameters including initial value of 0...);
funcs[val]->execAfterFind();
 
The drawback is that you always visit every test; any_of and find_if truncate your search. You'd have to give very, very careful thought to whether this would actually be a benefit or a cost, and you probably want to do careful profiling over a range of cases. (In the case I had, the majority of records would be screened out quickly; this would not have been an appropriate solution. If most had been retained, that would be another question.) The other drawback is that the table setup is rather more complex and uglier than the preparation for the sentinel approach.
 
[1] If you have more than 8 tests, then the gains from not branching are going to be counterbalanced by the need to process all of the tests rather than short-circuiting as find_if does.

Filtering

Dec. 21st, 2022 09:55 am
jsburbidge: (Default)
 If you (for C++-developer values of "you") happen to be in the happy possession of a C++20 compiler, one facility it provides is the ranges-based filter view, which allows iterating over a range while filtering out certain elements.

If you don't have one, there are several options.

At a simple level, for use with for_each(), there's simple composition. If you have a filter predicate Pred for Foo:

class Pred
{
public:
bool operator()(const Foo& inVal) const;
};

And a functor that does something with them, Op:

class Op
{
public:
void operator()(const Foo& inVal);
};

you can always create a composite:

class FilteredOp
{
public:

FilteredOp(Op& inOp, const Pred& inPred);

void operator()(const Foo& inVal)
{
if (m_pred(inVal))
m_op(inVal);
}
private:
Op& m_op;
const Pred& m_pred;
};

(We will refer to this as the naive version. This could easily be turned into a template to do more general composition.)
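
A sketch of that more general, templated composition might look like this (none of it is drawn from the original code):

template<typename OpT, typename PredT>
class FilteredOpT
{
public:
    FilteredOpT(OpT& inOp, const PredT& inPred): m_op(inOp), m_pred(inPred) { }

    template<typename T>
    void operator()(const T& inVal)
    {
        if (m_pred(inVal))
            m_op(inVal);
    }
private:
    OpT& m_op;
    const PredT& m_pred;
};

// Helper so the template arguments can be deduced at the call site.
template<typename OpT, typename PredT>
FilteredOpT<OpT, PredT> makeFilteredOp(OpT& inOp, const PredT& inPred)
{
    return FilteredOpT<OpT, PredT>(inOp, inPred);
}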

This works just fine - if all you want to invoke is for_each(). But if you want to use, e.g. transform() or rotate_copy(), it won't work. (Some operations provide a filtered option with their _if variants. Many do not. Many of those operate in such a way that a valid return value is expected for every application. In other cases, e.g. sample(), there is no predicate functor to be extended in this way.)

It is also very slightly more elaborate to write

Op op;
FilteredOp fop(op, Pred());

std::for_each(seq.begin(), seq.end(), fop);

than, say,

Op op;

filtered_for_each(seq.begin(), seq.end(), op, Pred());

even if you legitimately want to use for_each(), but only very slightly.

(The same applies a fortiori if FilteredOp is a closure; the difference lies in how closely you have to look at what is happening to discern intent; a closure has no name to assist the maintainer.)

The next alternative, if you have C++11, is to create a temporary filtered copy using copy_if:

typedef std::vector<Foo> Seq;

Op op;
{
Seq tempSeq;
std::copy_if(seq.begin(), seq.end(), std::back_inserter(tempSeq), Pred());

std::for_each(tempSeq.begin(), tempSeq.end(), op);
}

This is not a great improvement on the naive version, and it costs more, although it does avoid multiplying entities. The big downside is that if you are processing the filtered data only once, the copying costs you in both time and memory.

It has the advantage of being idiomatic. And, of course, it works for a use of std::sample().

It is a better choice if you want to process the filtered data in any way more than once - by far the best choice, as the costs of subsequent iterations will be cut by the initial filtering, unless you have memory constraints. (Also note that you can separate the filtered and unfiltered elements in a single pass with std::partition_copy, available since C++11, and have two sequences ready for subsequent operations.)

One other advantage of the copy_if approach is that you can change the nature of the collection - you can, for example, iterate over a vector and insert into a set, effectively carrying out a sort on your filtered items at the same time. This may not be as efficient as copying to a vector and then applying a sorting algorithm - but again, a second stage in processing of this type is the sort of thing the copy_if approach enables.
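
For instance, reusing Seq, Pred, and seq from above, and assuming Foo has an operator< suitable for ordering (needs <set> and <iterator>):

std::set<Foo> sorted;
std::copy_if(seq.begin(), seq.end(),
             std::inserter(sorted, sorted.end()), Pred());
// 'sorted' now holds only the filtered elements, already ordered.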

The other alternative is to turn to boost. Boost has a filter iterator, which skips over the elements that fail Pred without modifying the underlying sequence. Thus:

Op op;

std::for_each(boost::filter_iterator<Pred, Seq::iterator>(seq.begin(), seq.end()),
              boost::filter_iterator<Pred, Seq::iterator>(seq.end(), seq.end()),
              op);

This works generally.

If you need to filter once only, then this is preferable. (It can also be used to emulate copy_if if you have a C++03 compiler but also have boost, by using it with std::copy.) If you need to operate on the filtered set more than once, it is suboptimal, since every iteration has to visit every element in the full sequence each time - unless you are optimizing memory (large sequences) and care less about time; this is the option using the least memory.

You can compose filters if you need to. This gets confusing unless you use typedefs.
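
For example, with two predicate types Pred1 and Pred2 (hypothetical names, both assumed default-constructible), the nested typedefs might look like this:

typedef boost::filter_iterator<Pred1, Seq::iterator> FirstFiltered;
typedef boost::filter_iterator<Pred2, FirstFiltered> BothFiltered;

std::for_each(BothFiltered(FirstFiltered(seq.begin(), seq.end()),
                           FirstFiltered(seq.end(), seq.end())),
              BothFiltered(FirstFiltered(seq.end(), seq.end()),
                           FirstFiltered(seq.end(), seq.end())),
              op);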

Whether it's idiomatic or not depends on how much you consider boost fundamental. The naming does declare exactly what you are doing, though.

The original use case that got me thinking about this was one with a switch driven by context. If a flag was set, we iterate over just the subset; if not, we iterate over everything. In a for_each example, the naive implementation looks like:

Op op;

if (flag)
{
FilteredOp fop(op, Pred());

std::for_each(seq.begin(), seq.end(), fop);
}
else
std::for_each(seq.begin(), seq.end(), op);

The copy_if example looks like:

Op op;

if (flag)
{
Seq tempSeq;
std::copy_if(seq.begin(), seq.end(), std::back_inserter(tempSeq), Pred());

std::for_each(tempSeq.begin(), tempSeq.end(), op);
}
else
std::for_each(seq.begin(), seq.end(), op);

This can be simplified by filtering in a function

class Filter
{
public:
const Seq& getFilteredRange(const Seq& inSeq, bool inFlag)
{
if (inFlag)
{
std::copy_if(inSeq.begin(), inSeq.end(), std::back_inserter(m_temp), Pred());
return m_temp;
}
return inSeq;
}

private:
Seq m_temp;
};

Filter f;
Op op;
const Seq& toProcess = f.getFilteredRange(seq, flag);

std::for_each(toProcess.begin(), toProcess.end(), op);

Effectively the generator replaces the FilteredOp class, so it's a tradeoff in complexity but clearer at the call site.

The functor can avoid using if/else if implemented as a strategy. This is useful if it will be used multiple times, always with the same value of flag (e.g. passed at the command line).
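
A sketch of that strategy shape, reusing Seq, Op, Pred, and FilteredOp from above (the traversal class names are mine): the flag is consulted once, when the strategy is chosen, and never again at the call sites.

class ITraversal
{
public:
    virtual ~ITraversal() { }
    virtual void apply(const Seq& inSeq, Op& inOp) const = 0;
};

class PlainTraversal : public ITraversal
{
public:
    virtual void apply(const Seq& inSeq, Op& inOp) const
    {
        std::for_each(inSeq.begin(), inSeq.end(), inOp);
    }
};

class FilteredTraversal : public ITraversal
{
public:
    virtual void apply(const Seq& inSeq, Op& inOp) const
    {
        Pred pred;
        std::for_each(inSeq.begin(), inSeq.end(), FilteredOp(inOp, pred));
    }
};

// The only place the flag is consulted:
std::unique_ptr<ITraversal> traversal;
if (flag)
    traversal.reset(new FilteredTraversal());
else
    traversal.reset(new PlainTraversal());

// Later, possibly many times, with no branching at the call site:
traversal->apply(seq, op);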

The boost example looks like:

Op op;

if (flag)
    std::for_each(boost::filter_iterator<Pred, Seq::iterator>(seq.begin(), seq.end()),
                  boost::filter_iterator<Pred, Seq::iterator>(seq.end(), seq.end()),
                  op);
else
    std::for_each(seq.begin(), seq.end(), op);

What about that notional filtered_for_each I threw in at the beginning?

Well, it can just be implemented by wrapping the boost version in a template function call. I'm not sure that the syntactic cleanup is better than a typedef, though. Once you have

typedef boost::filter_iterator<Pred, Seq::iterator> PredFilteredIterator;

instead of

template<typename T, typename U, typename V>
void filtered_for_each(V begin, V end, T& inOp, const U& inPred)
{
    std::for_each(boost::filter_iterator<U, V>(inPred, begin, end),
                  boost::filter_iterator<U, V>(inPred, end, end), inOp);
}

(declaration more complex than that, and needing more policies, but you get the picture...)

then

filtered_for_each(seq.begin(), seq.end(), op, Pred());

versus

std::for_each(PredFilteredIterator(seq.begin(), seq.end()), PredFilteredIterator(seq.end(), seq.end()), op);

isn't a big improvement in clarity, and involves a lot more finicky work getting the function definition both correct and general.

It's generally a good idea to avoid for_each() as less expressive (and often less efficient) than the more specific algorithms in the STL. And anything that is complex enough that it doesn't fit any more specific algorithms is frequently complex enough that adding the filtering logic internally on a custom basis may make sense. (An example would be something aimed at generating elements in a collection based on another collection, but with a variable number based on the characteristics of the input parameter. This will not work with std::transform or std::generate_n. If you already have selection logic, integrating filtering logic may very well be more efficient on a custom basis inside your functor than doing so via any form, direct or indirect, of composition. Likewise, if you are processing a set of inputs and converting them into database insertions but skipping some, the field access you are doing to build the database insertions can double for filtering.)

In general, too, using a more precisely targeted algorithm instead of for_each() tends to move complexity out of the functor you have to write. In some cases it can move a lot of complexity into library code. (Using remove_if() plus erase() is much, much simpler than implementing the behaviour in a general for loop of any sort.) But even using std::transform plus an inserter to fill a container means that you have separated out the code dealing with the target container from the code doing the element generation, even though the container logic remains in the application space.

For all these reasons putting effort into writing an extended for_each is probably always using energy and attention which can be better expended elsewhere.

Matt Wilson's Extended STL has a chapter on the implementation of filtering iterators. It may be worth emphasizing one thing that he notes: a filtered iterator cannot have the semantics of a random access iterator. Not only will indexing be an issue, but so will the concept of distance between two iterators; both can in theory be supported, but only at considerable expense, and they may give unexpected values. (If we apply a filter F to a sequence of length 10, the effective length of the filtered sequence can't be determined without traversing the entire sequence, and an indexing operation might, at one extreme, land on nothing but the end of the sequence, depending on how many elements were filtered out.) If you need random access semantics, using copy_if to generate a standard sequence is by far the preferable option.

Editing

Dec. 4th, 2022 05:29 pm
jsburbidge: (Default)
When I was a student, back in the day, I had an Elite portable typewriter which had belonged to my grandfather. (The fact that it was Elite meant I got an extra couple of lines' worth of words per page: N-page essay requirements always assumed Pica, which fit fewer words to the page.)

My practice throughout my university days was, invariably, to write out every paper in longhand, edit the draft heavily, and then transfer the edited draft to the typewritten form for handing in, doing a second edit as I went.

A period in publishing as an editor taught me the standard markup formats, which I had not bothered with up until that point. But by that time computers were coming in and editing tended to become a continuous process onscreen; I rarely had the opportunity to use paper editing on my own texts.

(I may add, in passing, and as qualification in what follows, that I still think that printing out and editing a text is the only really effective way to end up with a good text. I have occasionally edited source code in this way. It is the best way, bar none, to attain to brevity.)

These days all my work is onscreen; I don't even have a printer. I do, however, find that the discipline of separating writing and editing remains critical.

I will regularly make a first draft of a class, sleep on it, and decide, on sleeping on it, that the design needs significant changing. This is too close to the original composition to be refactoring; it is, fundamentally, part of the original design process, with no re- about it.

Much of the time this leads to simplification; when it does not, it is because it leads to generalization, more complex in one place, less complex overall.

(I might as well throw in a note about "emergent design". I tend to agree with James Coplien that design is something which has to happen as its own discipline, and can't just emerge from work with concrete classes plus some general pattern principles. When I work on a feature, or on a bug once it has been analyzed, I never work without hewing to an explicit design, even when that design is not written down. But from that perspective design is not so much part of composition as its prerequisite: I couldn't have done that longhand writing without a good sense of what my overall structure was to begin with.)

A lot of the code that I see in production looks as though it was produced by developers who left off as soon as they got the first draft that actually worked. It's verbose and full of copy-and-paste antipatterns. Boolean flags are used instead of strategies and in some extreme cases independent access to global variables is used instead of parameter passing.

It uses idioms which were learned early but are not optimal. For example, most developers started out writing loops using for and while; but in C++ maintainability, clarity, concision, and speed of execution are better served by using the STL algorithms. One might draft out one's thoughts using a for loop, but finished code should have the additional thought put into it of using an appropriate algorithm.

In all cases these are less clear and less maintainable and in most cases also less efficient at runtime. But it's a first cut that's left in that state because it works and people won't edit.
jsburbidge: (Default)
 A few days ago, one of my co-workers contacted me about a possible bug in some code he had been going over. Part of the code went back to the original check-in of the code when it was migrated from a system called Harvest about four years ago (and lost all of its history in the process), and a number of lines had my name in them in git blame.

A little bit of checking showed that the algorithm was essentially as it had been four years ago. Several lines were marked as mine because I had converted the macro TRUE to the boolean value true on a couple of lines, and one because I had taken a freestanding C function and turned it into a member of the class in which it operated.  For practical purposes, the code was the same as it had always been - but my name was all over it. In addition, the problem would be expected to take the form of a line having dropped out, and there is no blame tracking attached to deletions.

In actual fact, the conclusion to be drawn was that the code was legacy code. Minor tweaks obscured that fact.

Git blame operates on a per-line basis. But any change to the line - tweaking parentheses, for example, or converting a C-style cast to a C++ cast - makes you the owner of the line.

On a greenfield project where responsibility is doled out in blocks it might be useful, but on a legacy project it's worse than useless.

By coincidence, I had been looking at the blame record for a makefile the day before. The makefile had an if-else block where both branches had the same statements. (In other words, there should have been no if-else block, but just the list of statements.) Blame shows five different names associated with that block of code (all of whom, except (I think) the oldest one, have some passive responsibility for the poor structure), but not in any coherent manner.

When I look at a block of code and want to see its history, I want to see how the algorithm evolved, not how different lines were tweaked while retaining the same algorithm.

You can't avoid this problem as long as your algorithms are line-based. It's a whole different level of difficulty to provide a program which divides code into logical chunks and applies that analysis to the raw record; or (worse) one which can determine when apparently minor changes create new logic, while not flagging a better implementation of the same logic as new. (A for loop and a find_if statement may be exactly formally equivalent, but don't expect any automated help to know that.)

So I will continue to avoid git blame. If I really need to look at the history of code .... I'll look up diffs from the history of the codebase and look at them as integral wholes.

Remove

May. 21st, 2022 10:36 am
jsburbidge: (Default)
 A few days ago I was doing a code review in which (in essence) the following loop occurred:

class Foo;

bool matches(const Foo& inVal);

std::vector<Foo> x;

std::vector<Foo>::iterator i = x.begin();
while (i != x.end())
{
    if (matches(*i))
        i = x.erase(i);
    else
        ++i;
}

I raised two issues with it: one, that there was a bug in it (as originally written it left out the else, meaning that where consecutive members of the vector matched the condition the second would be skipped).  The second was that the cost was fairly high; at the time I raised this by asking whether it made sense to replace the vector by a deque (reduces the cost for large sets of data) or a list (reduces the cost for all sets of data, but can raise the cost of other operations significantly), and was told that it would be a short vector and that the overall expense would therefore be low. (The cost is high because every time an element is deleted all the following elements have to be shifted left by one.)

About 24 hours later I asked about the use of remove_if and erase, shown below:

class Matcher {
public:
     bool operator()(const Foo& inVal) const {
         return matches(inVal);
     }
 };

x.erase(std::remove_if(x.begin(), x.end(), Matcher()), x.end());

I received the answer that the use of remove_if required extra work for a minor bit of code, and it was left there, as far as that issue went.

But I remained curious.  How much extra work was it? So I actually drew up the code bits above and counted lines. If we leave out the setup (first three lines) the STL code snippet is actually one line shorter.

(This had to be workable in a C++03 compiler. A C++11 implementation with the STL algorithm using lambdas would be shorter still.)
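
For reference, the C++11 form with a lambda (a sketch, assuming the same matches() function as above):

x.erase(std::remove_if(x.begin(), x.end(),
                       [](const Foo& inVal) { return matches(inVal); }),
        x.end());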

What are the benefits of the STL approach? Well, first, it's not fragile; the bug caught on the code review is impossible in the equivalent STL code. The bug is directly related to another minor weakness of the iterator-based code: because of the variable incrementing logic it can't use an idiomatic for loop, but has to use while().

Secondly, it's faster; remove_if is an O(N) operation and the final erase, which only resets the end of the vector, is very cheap indeed.

Third, it is immediately clear what the code is doing. Because of some additional verbiage in the actual code (plus the missing else, which obscured the form of the loop) I had to look twice to be sure of what was going on.

This is, in a nutshell, an illustration of why, in general, STL algorithms are preferable to hand-rolled loops: more reliable and generally faster. There are some subtle ways of using find_if and transform in unusual ways[1] but remove_if is clear about what it does.

But it's also an illustration of the implicit prejudice I find in developers against using the STL algorithms. The combined use of remove_if and erase is a very well-known idiom; there is nothing difficult about it. But the perceived - but not actually significant - overhead of having to provide a functor to feed to the algorithm [2] seems to constitute a mental barrier.


[1] For example, using find_if for "process every item in this ordered vector below a given value", for which it is admirably suited. Using for_each is less efficient, as find_if will cut off processing once the relevant value is found or surpassed if your test is written correctly. But you throw away the result of the search, which is not normally expected.

[2] Of course, even that overhead is not real. If your test is in a function, as above, you can just pass a function pointer, although it is likely to be slightly less efficient. If it's not in a function, it's a logical unit and should be in either a function or a functor in any case.[3] (In the actual code the test was not a function but a comparison expression. This did mean that the number of lines needed for the STL approach would be a little greater, as the test needed to be encapsulated for the STL where it had been merely dropped into the loop.)

[3] The marginal case being where this test is used only here. If it expresses a meaningful concept in the problem domain, though, it's a good bet that it is not.
jsburbidge: (Default)
 There's a not uncommon use case in programming where you want to do something once in a specific context: the typical example being where you want to log an event the first time, and only the first time, it happens (and it's not part of the program's startup, and typically may never happen at all, but if it happens at all may occur many, many times, which is why you want to limit the logging).

There's a common idiom for this:

static bool logged = false;
if (!logged)
{
    DoLogging(message);
    logged = true;
}

We will assume for the sake of discussion that this is not a routine which can be called by multiple threads at once, which would complicate the issue somewhat.

There are two small problems with this. The first is that it's an ugly little block, especially if it's used multiple times. (You can't extract it into a function called in multiple contexts because that static variable has to be unique for each context.) The second is that it's inefficient: we have to check the flag every time through. That means branching (always inefficient) and, worse, because it's a static variable it will almost certainly not be in the memory cache, making for a slow load.

So what can we do? We need a mechanism which chooses one path once, another path after that, and which neither branches nor uses static data on subsequent runs.

If we think of it in terms of typical one-time activities, we might think of a constructor. We can do the following:

class OneTimeLogger
{
     public:
     OneTimeLogger(const std::string& inMessage)
     {
         DoLogging (inMessage);
     }
 };

In context:

//...do stuff
static OneTimeLogger logger(message);
//...do other stuff

This looks attractive, but actually it solves nothing. First, because it's not a standard idiom it's going to confuse the hell out of a maintenance programmer. Any line which requires a detailed comment saying "do not delete this apparent no-op" is a bad thing. Secondly, it actually hides an expensive if/else switch. The compiler has to emit code checking for a first time on initializing a static object, and, worse, at least post-C++11 it has to make that check, and the initialization, thread-safe. (If this *is* a multi-threaded situation with potential contention, this might mean that the maintenance cost is worth it; it's the simplest thread-safe way of doing this I know. In that case, you might want to add a no-op log() function to the class and call it in every pass, so that it looks normal to a maintainer, although then you have to explain the no-op function where it's defined. The next alternative involves replacing the unique_ptr in the solution below with a shared_ptr or putting a mutex around the point where the smart pointer is updated.)

The expensive part of the test is one-time, but the test is still there, and it's going to be on static data. All that we've done is hidden the if/else test.

The other way out is polymorphism. Assuming that the calling context is in a class and not a freestanding function, we can do the following:

class MyClass
{
    public:

    class IThisEventLogger
    {
        public:
        virtual ~IThisEventLogger() { }
        virtual void log() = 0;
    };

    class NullThisEventLogger : public IThisEventLogger
    {
        public:
        virtual void log() { }
    };

    class OneTimeEventLogger : public IThisEventLogger
    {
        public:
        OneTimeEventLogger(std::unique_ptr<IThisEventLogger>& inParent, const std::string& inMessage):
            m_parent(inParent), m_message(inMessage)
        { }

        virtual void log()
        {
            DoLogging(m_message);
            m_parent.reset(new NullThisEventLogger());
        }
        private:
        std::unique_ptr<IThisEventLogger>& m_parent;
        std::string m_message;
    };

    MyClass(): m_logger(new OneTimeEventLogger(m_logger, "Message to be logged"))
    { }

    void doSomething()
    {
        //...stuff
        m_logger->log();
        //... more stuff
    }

    private:
    std::unique_ptr<IThisEventLogger> m_logger;
};

The trick is that the reset call in the logger is a fancy way of doing "delete this" (i.e. committing suicide), which is entirely legal in C++. (Also, passing in a reference to the parent's smart pointer while the parent is being constructed is fine, because nothing is done with it until the object is fully constructed.) We choose to pass the message in the constructor because it reduces the cost of the no-op call to a bare minimum.

The first time log() is called, the message gets printed, and then the execution path for all future calls is changed to a no-op function. We still have a virtual call, but that should be on an object in the cache, and virtual calls are typically cheaper than branching. The call site is simplified; the only time complexity appears is when looking into the very detailed implementation, well away from the business logic in doSomething().

If the logging logic is sufficiently generic, the machinery for managing this can be extracted and reused so that it doesn't clog up the interface of the calling class. If the logic is complicated internally then the logger and the interface will have to be created locally. (If we have to log four variables, two of them integers and one floating-point as well as a string, we want to put the expense of generating the message to log into the one-time call as well, so a generic "pass a string to log as a parameter" may not be a good match, and pre-generating the message in the constructor, as above, may be impossible -- though usually something logged once and for all will have a very generic message).

The downside is that you need a separate logger for every message you want to print once - a classic trade of space for time. Of course, if your parent class is well-designed, it will have limited responsibilities, and the number of instances it will need will be correspondingly small. And the logger itself is cheap to construct - not much larger or costlier than the text of the message it logs; if it builds a custom message its content model is even simpler and its associated costs are even smaller.

ETA: you can make this more generic and more encapsulated by making the smart pointer a custom class which hides the delegation entirely and implements the logging interface (and, naturally, holds a smart pointer as a delegate).
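
A minimal sketch of that wrapper, assuming the IThisEventLogger hierarchy above has been hoisted out of MyClass (the name OneTimeLog is mine):

class OneTimeLog
{
public:
    explicit OneTimeLog(const std::string& inMessage):
        m_impl(new OneTimeEventLogger(m_impl, inMessage))
    { }

    // Delegates to the current implementation; after the first call this
    // dispatches to the no-op logger.
    void log() { m_impl->log(); }

private:
    std::unique_ptr<IThisEventLogger> m_impl;
};

The calling class then just holds a OneTimeLog member and calls log() on it.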
 
jsburbidge: (Default)
 1) On checking my spam folder, I see that I have received two invitations to join the Illuminati, one in Italian.
 
If the AISB is going to contact anyone it will not be by cleartext e-mails. They will use proper tradecraft.
 
2) It is remarkable just how awful protesters' historical education is. The position of the Prime Minister has always, since the time of Walpole, been determined by the House of Commons. The Crown has no power to dismiss the PM and only very limited powers to prorogue Parliament (essentially, when the PM has lost the confidence of the house, or at the request of the PM). This was firmly established in 1649 and 1688, with tweaks in the 18th Century as the office of the Prime Minister developed.
 
3) I am getting tired of public health officials who are reported in the media as talking about masks and vaccines as though they were purely about individual risk rather than looking at the impact in populations of general adoption/dropping of particular activities. A 30 year old with two doses of vaccine who goes out without a mask is at a low risk of contracting symptomatic Covid and at very low risk of serious disease. But if 30-year olds in general do that, there will be a calculable increase in the spread of COVID to other parts of the population. Wearing a mask or being vaccinated is not principally about personal risk, in many cases; it is about being a responsible member of the body politic and of society.
 
4) Seen in real code, names slightly adjusted: 
 
class XKey
{
    public:
    XKey(const int inIdA, const int inIdB, const std::string& inName):
        m_idA(inIdA), m_idB(inIdB), m_name(inName)
    { }
 
    bool operator<(const XKey& inOther) const
    {
         return (m_idA < inOther.m_idA) && 
             (m_idB < inOther.m_idB) &&
             (m_name < inOther.m_name);
     }
     private:
     const int m_idA;
     const int m_idB;
     const std::string m_name;
 };
 
Surprising things will happen when you try to use a map with that as a key.
 
Don't do this.
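
For the record, a correct strict weak ordering compares the fields lexicographically; with C++11, std::tie (from <tuple>) makes that almost mechanical:

bool operator<(const XKey& inOther) const
{
    // Compare m_idA first, then m_idB, then m_name.
    return std::tie(m_idA, m_idB, m_name) <
           std::tie(inOther.m_idA, inOther.m_idB, inOther.m_name);
}
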
jsburbidge: (Chester)

In my last job, there was an internal site on which employees could create their own personal work-related blogs. I had one that dealt with C++ and OO topics that I'd been irritated into writing about in production code: I had to leave it behind when I left, but transferred the site ownership to my team leader so that it at least wouldn't be cleaned out automatically. My current workplace has various ways for teams to share data but nothing comparable for individuals.

This means that anything I'm irritated into will end up being posted here. Any code examples will be generalized enough to be generic and no details of proprietary IP will be shared, although various issues of actual production code quality  may be referenced, from this or from prior work contexts, without any specification of where.

For anyone who is not a programmer, these are likely to be TL;DR.

In the nearly thirty years since Design Patterns came out, the Singleton has generally been recognized as an anti-pattern, encouraging the use of global data with a mask on and hiding the flow of control, to say nothing of complicating unit tests.

However, in the context of trying to convert legacy code into well-designed and tested code, the Singleton has a role to play as a Band-Aid (sticking plaster to right-pondians) as a temporary measure, even though its ultimate fate will be un-Singletoning itself.

Consider the case where we are converting a legacy method with some associated data into a proper class. It uses ambient data from the huge source file it is being pulled out of freely, but the data it uses is tightly coupled in a logical sense, and if you group it together it naturally attracts a few existing functions which use that data and little else, to make a nicely encapsulated class. The data tracks state of one sort or another and notionally has to be unique.

// in file RecordTracker.h
class Record;

class RecordTracker
{
public:
    void addRecord(const TransactionId& inId, const Record& inNewRecord);

    const Record& getRecord(const TransactionId& inId) const;
};

// In file f.cpp
RecordTracker theRecordTracker;

class RecordUser
{
public:
   void processTransactionUpdate(const Transaction& inTransaction)
   {
       //...
       const Record& rec = theRecordTracker.getRecord(inTransaction.getId());
       //...
   }
};

The data is used in several other compilation units, where it was declared extern. We update all of those locations to use the new class.

Now our legacy method uses one object, which it is still reading from global data. If we extract an interface from the new object, we can pass it in to the object we are now building on construction, as an instance of simple dependency injection. So we do that.

// in file IRecordTracker.h
class Record;

class IRecordTracker
{
public:

    virtual ~IRecordTracker();
    
    virtual void addRecord(const TransactionId& inId, const Record& inNewRecord) = 0;

    virtual const Record& getRecord(const TransactionId& inId) const = 0;
};

// in file RecordTracker.h
#include "IRecordTracker.h"

class RecordTracker: public IRecordTracker
{
public:

    virtual ~RecordTracker();
    
    virtual void addRecord(const TransactionId& inId, const Record& inNewRecord);

    virtual const Record& getRecord(const TransactionId& inId) const;
};

// in file RecordUser.h
class RecordUser
{
public:

   RecordUser(const IRecordTracker& inTracker): m_tracker(inTracker) { }

   void processTransaction(const Transaction& inTransaction)
   {
       //...
       const Record& rec = m_tracker.getRecord(inTransaction.getId());
       //...
   }
private:
   const IRecordTracker& m_tracker;
};

There is a remaining problem: the object in question is still global data, used in multiple places. In fact, the second class we extracted the function into turns out to be useful in some (but not all) of those cases, so it's getting the value passed in there via interface as well.

The original location might be tolerable - the data is static lifetime data in that place. It could even in theory be moved into the compilation unit for the second class as an implementation detail - but we still have all the other access points to look at.

Some of those access points have very long call chains before they go back to a common source, frequently main(), with narrow interfaces[1]. Many of them have little to do with the area we're working on aside from incidental use of the data in question, and it's not in our scope to do far-flung imperial refactoring beyond the area we're having to touch. So we have two options:

  1. Leave it as it is; at least the class we're worried about is now functioning in a testable manner.
  2. Convert the instance of the data into a Singleton and have all the places which currently use the concrete class use the Singleton Instance() method. (Don't actually change the data class: implement the Singleton to use the interface and delegate operations to the original substantive class. This doesn't mess with our testing and prepares for the end of the process when the Singleton goes away.)

#include "RecordTracker.h"

class SingletonRecordTracker: public IRecordTracker
{
public:

    static IRecordTracker& Instance();
    
    virtual ~SingletonRecordTracker();
    
    virtual void addRecord(const TransactionId& inId, const Record& inNewRecord)
    {
        m_delegate.addRecord(inId, inNewRecord);
    }

    virtual const Record& getRecord(const TransactionId& inId) const
    {
        return m_delegate.getRecord(inId);
    }

private:
   SingletonRecordTracker();
   RecordTracker m_delegate;
};
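
The Instance() function itself can be as simple as a Meyers singleton (its initialization is thread-safe from C++11 on); a sketch:

IRecordTracker& SingletonRecordTracker::Instance()
{
    static SingletonRecordTracker instance;   // constructed on first use
    return instance;
}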

If we do the second, we have a clear marker for the issue. We can also push the actual invocation up our call stack for the immediate case we're dealing with, widening interfaces to accept the interface, until we get to the point where the instance "naturally" lives (frequently in main() if it's dependent in any way on program parameters).

In the future, refactorings can push up the calling level from other locations using the interface higher up the call tree. As we go, we may (probably will) end up with more classes which hold a reference to the interface, as well. Eventually, all the calls invoking the Singleton will be in the same context.

At that point, we get rid of the Singleton and replace it with a simple instance of the class. (If the program is resolutely single-threaded, it may even be possible to allocate it on the stack; otherwise it will have to be static or allocated via operator new.)

So the Singleton acts as a temporary way of patching the problem for a period of time, better than the original state, and also as a marker that the problem continues to exist. (The analogy here is not flesh-coloured or clear Band-Aids, but the ones with pictures of Dora the Explorer I used to get for my daughter.) The existence of those Instance() calls is an explicit and deliberate marker of what has to be fixed, and where.

[1]One application I know, originally C but mixed C/C++ for over 20 years, heavily used, passes all execution of transactions through a single function which takes one argument - an enum indicating the type of operation to take.  This logic is in shared code and is called in several different applications. Someday this will presumably be refactored, but until then it's a barrier to any attempts to get rid of global data in some form.

jsburbidge: (Default)

I have seen a number of recent articles about projects which aim at encouraging more people to code; the aim being not one of, as it were, providing the means of acceleration for potentially good coders, nor, again, of providing the skills needed to do personal customization of editors and spreadsheets but getting more people involved in actual programming as a career.

This is, I am sure, well-intentioned, but it is probably not a good idea. If you go into software development not because you find it interesting, but because it pays reasonably (and at the upper end, better than reasonably) well, then unless you are buoyed by an unusual degree of intelligence you are potentially going to be a problem.

Consider the following profile. It's a merged profile based on some people I've known over the years (and on code I've encountered from many more). We'll call this developer Fred, because I do not know and have never known a developer named Fred.

Fred has decent credentials: at least a C.S. B.Sc. from a mid-range university. He - it has, so far, always been a he - works in the financial sector or another sector which makes significant use of custom software; he's not in the software industry itself, nor a worker at a startup.

Fred's work has always been on large and medium-sized systems. He may have been involved in some systems when they were new, but mainly he works in maintenance and extension. He's a diligent worker, and addresses the tasks given to him with reasonable speed.

Fred's real interest, though, isn't software in itself; it's a means of supporting himself and his family. When handed a task he does the minimum necessary to address the problem. Fred invariably leaves longer functions, larger files, and cyclical dependencies behind him. Unless he's carefully monitored, he doesn't concern himself with levelization or insulation.

If he were working in an environment which required adherence to best practices and enforced it, and if he were always to be assigned to implement a team leader's design, he probably wouldn't do too much harm. But in most cases I know of in environments like his, you're lucky if lip-service is even paid to the more onerous best practices, and nobody does code reviews or pair programming. The only people to see Fred's code up close are he himself and his successors when he moves on.

Fred is entropy at work on code. Even good programmers generate bugs; this sort of just-good-enough programming generates incomprehensibility over the medium to long term.

The blunt fact is that not only is good programming difficult - and if you don't believe that, try reading John Lakos' Large-Scale C++ or Alexandrescu's Modern C++ Design[1] - but it requires both an unusual degree of analytic ability and an enthusiasm for the ding an sich which is difficult to maintain at a high level.

My advice to a reasonably bright person with some competence at math but no love for it would be to gravitate to some less temperamental branch of engineering. If you have the sort of mind which revels in abstractions - if you feel the call of really pure math, or certain sorts of philosophy - then programming might be a good fit (given that positions teaching pure math are as rare as hen's teeth).

[1]And don't believe that moving to a higher language will save you. There's a ton of bad code in Java that I've seen. And scripting languages like perl and python degenerate into spaghetti code at an astonishing rate.

jsburbidge: (Default)

The second of a matched pair. That one was about requirements; this is about implementation.

This can be taken as a parable about development in a big organization.

A previous employer had developed a compliance process for reporting user application privileges on production applications. Reports went in XML to a mainframe. The general expectation was that these applications would usually be either Windows (.NET) or Java, for which easily-called libraries were available for ftp.

(The developers' manual was written in a way which suggested they assumed that the reader knew nothing about XML, and that Java development tools would be in use.)

They did not have anything readily available for Unix/C. My original encounter with it was for an application which I supported where generating the information was straightforward, but three months were spent before anyone could deliver a scriptable ftp (beyond "dump stdin into the default ftp process") for the Solaris environment involved. (I don't think I ever did get nftp installed.)

Some time later I was conscripted to finish up work on a reporting script for another Unix-based application. This one had a Perl component.

It amounted to: read a config file with user ids, join with the results of a database query, write out in XML, ftp to a given location on the mainframe (you could tell it was a mainframe because of some sector-related commands in the FTP).

I was dragged in because the original developer had used Perl, and I had a working knowledge of Perl. Nobody in the group that usually handled reports knew Perl and the actual business group didn't have anyone left with the skills to complete it. If it hadn't been a compliance request it would never have happened.

Development time, at worst, could be measured in a couple of hours. There were all sorts of delays resulting from the runtime environments; access (over two months for me to get access to the relevant environment); the fact that some columns in the production database, required for the report, were not populated (and when they did get populated, for the sole purpose of downloading to XML, some idiot put an ampersand into one field).

Actual elapsed time, first stage: three months, with some trailing activity.

Two months later, I got dragged in again: they had split the one application into two domains running on different servers with differing but overlapping user lists. They had to either merge these into one report or set it up with the other end as two reports. They did not seem to have anticipated this; it was an afterthought.

It might have gone the two-reports way - I thought it best, it reflected the business model, and the Unix sysadmin thought it best - but a development manager to whom I did not report, and who was essentially a waste of air (I had dealt with him in other circumstances), was looped in to help deal with compliance because he had had previous experience with it. Because dealing with the other end would have been an effort for him, he instead inserted himself into the process and insisted on going the one-merged-report route.

This meant extra time setting up secure FTP between the servers, and writing a Perl script variant on the principal server to pull over the file from the other server and merge the two files, made trickier by the fact that some duplicated users had different privileges on the two systems but could have only one record in the report.

Again, maybe a day's worth of development, in total: actual elapsed time, second stage: one month.

This was a high-priority process, and the requirements were simple and clear. What was essentially red tape filled up my mailbox and stretched the time to delivery out by a factor of, oh, twelve: assuming that a reasonable time on top of one day's development would be four days of testing and promotion, that is about a week for one day's worth of actual development work.

Reports

Jul. 19th, 2019 04:18 pm
jsburbidge: (Default)

For half a year or so between working on more major projects, several years ago, I was stuck^H^H^H^H^H charged with implementing a number of reports on the back end. These had broad similarities in that they were written in Java, but they had different internal clients and ran in slightly different environments.

As separate projects, they had to have formal Requirements documents (System Requirements Specifications) - it was maybe five years before my employer decided to formally, if ineffectively, embrace Agile - and if they took more than 80 hours of work they also had to have a formal Design document and go through a much more formal approval process. The aim, accordingly, was to keep projects under 80 hours, if necessary by lying...

I'll take a representative one. It was a report in Excel format mandated by Compliance, which meant it was going to happen no matter what. The original requirements document had some vaguenesses and lacunae, and there were privileges for database access I needed to get, so it took about a week to get a first pass at the report back to the Business Analyst. This exposed some more vaguenesses which needed to be specified (i.e. differences from the implementation) and it took a few more days of tweaking code and running tests - one nice thing about reports is that the developer can easily do his own testing - to get it back to him...

...only to be told that the client had new requirements. Additional columns, and some expansions to the records to be displayed.

This cycle was repeated three or four times more.

Deployment was a headache; the testing and production systems used different names for stored procedures, used different versions of some packages, and ran at a different OS patchlevel.

It took four months, elapsed, to reach production deployment of the application. At some point we undoubtedly went over the 80 hour mark, so reporting to management of time allocated was blurred into inaccuracy. (It was precisely the desire to discourage spending developers' cycles on casual changes, with no chance to set priorities between projects, which had led to the 80-hour cutoff for minor projects in the first place. As far as I could tell it was completely ineffective.)

I'm sure this took longer still; I had been looped in as an available resource only after the SRS had been written and approved in its first version.

Part of this was the fault of the Business Analyst. My own experience in an earlier context had indicated that in a waterfall-style world you have to be quite aggressive in finding out what users really want. And he had to be pushed into updating the requirements document to reflect the revised requirements, which is necessary to provide a basis for future maintenance. A bigger part of the fault was the business's, for not thinking through what they wanted in the first place. A final part was that documentation of environments, credentials, and permissions was very poor and had to be extracted by specific e-mails passed up the chain.

This is a reason why larger businesses have jumped onto a pseudo-Agile bandwagon: the metrics look better for piecewise delivery which is happening anyway. On this model, "delivery" took four-plus months. In an Agile model, the first cut of the program would have been a sprint in itself, and each iteration would have been a distinct thing in itself. The Compliance department wouldn't have received their finished report any faster, but in the reporting that went up through development management it would have shown as a sequence of short successful sprints, mixed in with a couple that were blocked by external forces (external from the point of view of immediate management, though internal to the organization).

Under Agile, the e-mail record from which I was able to reconstruct this would have been reduced, which is a good in itself: daily scrums would have eliminated the need for e-mail in many cases (though I'm not sure they would have sped up e.g. getting usernames and passwords assigned). However, only e-mails between myself and the business analyst would have been affected. I doubt that the compliance department, in a different building, would have sent a knowledgeable representative over to be involved in a daily scrum; and the same for the key person in the database group, on a different floor. Actually implementing a true daily scrum with all stakeholders would have been difficult.

The dragged-out requirements were a waste of time and effort. If the client had been clear up front about what they wanted, with about the same amount of work they could have had a final report in a couple of weeks, and with no need to munge the hours to keep it below the radar, with a bit of trailing work getting it into production. An incompletely carried-through Agile approach would just have formally institutionalized a wasteful and pointless misuse of resources.

There are places where iterative development makes sense - for example, one with a large base of target users and a limited ability to get general feedback out of them except by pushing out better approximations on a regular basis. Or where a complex interactive application is the deliverable. But an internal client should be able to put together an exact model of a report right out of the gate; you can do it by messing around in Excel with sample data.

In a true Agile environment, of course, the client would be directly involved in daily meetings, face-to-face, and the tightening of the feedback loop would probably act to bring the time for development down sharply. But I have a hard time imagining that degree of involvement from a distant department in a large organization for this sort of project.

jsburbidge: (Lea)

Sometimes the oddest things show bitrot as sfnal futures diverge from our actual future.

This is from Dorothy Heydt's A Point of Honor (now available at her website in electronic format):

..."the scrolling slowed and stopped, displaying a pair of lines surrounded by asterisks: "Following 1175 lines make the tree growth respond to the wind direction. G.H."

"All the code we wrote is commented," Greg said. "Good code should be...."

Heydt generally gets UNIXy geeky things right, but between the high days of hacker and USENET culture and today some underlying details have changed so as to become a little jarring.

The idea that you mark a functional unit of over a thousand lines in a highly optimized program by a comment was once arguably good practice, but emphasis is now put on getting the clarity by a greater number of small functions. Optimizing compilers that do inlining render the efficiency gains of avoiding multiple function calls, doing loop unrolling, and so forth, less important. No normal function should be even in the small hundreds of lines. (This is still frequently not actually true. I've seen lots of production code with thousand-line functions, but that has been invariably the result of laziness and sloppiness and not because someone was intent on saving cycles.)

Explicative clarity is gained by providing expressive function names (linkers capable of handling long symbols helped there as well):

void take_wind_input_and_generate_tree_motion()

(or, if you prefer,

void TakeWindInputAndGenerateTreeMotion())

with appropriate sub-functions.

The days of Henry Spencer's "Thy external identifiers shall be unique in the first six characters, though this harsh discipline be irksome and the years of its necessity stretch before thee seemingly without end, lest thou tear thy hair out and go mad on that fateful day when thou desirest to make thy program run on an old system" are now past: it would now be a bizarre system where the linker was so strictly constrained. (This led to function names like lstwgm() for landscape (package prefix to help insure uniqueness) tree wind generate motion.)

Frequent advice is to avoid comments where the symbol naming conventions render them redundant. (Comments are still encouraged for documentation purposes in headers which will be run through doxygen - if discussion is necessary.)
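
Where discussion is genuinely needed, the documentation lives in the header as a doxygen comment rather than interleaved with the implementation; something like this (again my own illustrative example, reusing the invented names from the sketch above):

/// Makes the tree's motion respond to the sampled wind.
///
/// \param tree  the tree to animate (modified in place)
/// \param wind  the current wind sample at the tree's position
/// \note  The motion is purely visual and does not affect collision geometry.
void take_wind_input_and_generate_tree_motion(Tree& tree, const WindSample& wind);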

The thing making this slightly jarring is not its applicability to the programming practices for optimized code from the 1970s and 80s, but simply the fact that the story takes place in the future, at a date which is still in our future.

Capability

May. 30th, 2017 09:12 pm
jsburbidge: (Default)
A few days ago I was looking into some roughly seven- or eight-year-old production C++ code, and saw, basically, the following (the only thing I have changed is the function name):

try {
    doSomething();
} catch (...) {
    throw;
}

It is enough to make one despair, because although I have seen worse code - this at least introduces no bugs - I have never seen anything that more clearly shows that the author had no idea what the hell he was doing. (The problem being that the effect is exactly the same as not putting in a try/catch block at all; it shows that the writer failed to understand, in any real sense, either the mechanisms or the rationale for the use of exceptions.)
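
To spell that out (my own sketch, not anything from that codebase): the block above behaves exactly like the bare call, whereas a catch clause earns its keep only if it actually does something with the exception - translating it into something meaningful at the caller's level, for instance. The exception types and function names here are invented for illustration:

#include <stdexcept>
#include <string>

// Invented exception types, purely for illustration.
struct ParseError : std::runtime_error { using std::runtime_error::runtime_error; };
struct ConfigError : std::runtime_error { using std::runtime_error::runtime_error; };

void doSomething();   // stands in for the original call

void loadConfig()
{
    // A handler that justifies its existence: translate the low-level
    // failure into something meaningful to this function's callers,
    // rather than rethrowing it unchanged.
    try {
        doSomething();
    } catch (const ParseError& e) {
        throw ConfigError(std::string("configuration rejected: ") + e.what());
    }
}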

As a piece of cargo-cult programming it belongs with this, also from the same codebase:

~Foo() {
    try {
        delete m_bar;
    } catch (...) {
    }
}

Here, the author has been told that exceptions should never be thrown from inside a destructor, and has possibly been exposed to the idiom of surrounding generalised cleanup functions (like close()) with try/catch blocks in destructors to avoid it. He - it was a he, there were no women on the team that produced this - does not seem to have internalised the judgement that, when you are calling delete, if the class is yours you should be hewing to the standard stated above, and if it is not you should drop it like a hot potato if it throws from a destructor. (The member pointer was to another class in the same library.) There is no need to guard a call to delete with a try/catch block; if it throws an exception you are probably already screwed (a corrupted heap, or something equally bad).
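
For contrast, the idiom he had presumably half-absorbed looks roughly like this (a sketch with invented names): the guarded call is a cleanup routine that can plausibly fail, not a delete, and in modern code the delete itself vanishes behind a smart pointer:

#include <memory>

// Invented classes, purely for illustration.
class Connection {
public:
    void close();   // may throw if the connection is already broken
};

class Buffer { /* ... */ };

class Session {
public:
    ~Session() {
        // The legitimate idiom: swallow a failure from cleanup so that
        // the destructor itself never throws.
        try {
            m_conn.close();
        } catch (...) {
            // deliberately ignored; there is nothing sensible to do here
        }
        // No guard is needed for owned memory: m_data is released by
        // std::unique_ptr, and no raw delete appears at all.
    }
private:
    Connection m_conn;
    std::unique_ptr<Buffer> m_data;
};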

The same codebase betrays a failure to grasp polymorphism, const-correctness, the use of initializer lists, the Law of Demeter, design patterns, data hiding, package dependencies, global data use, and so forth. The code base dates from no later than 2008: it might be permissible in code from 1990.

This is massively depressing.

I am no great defender of the modern rush towards credentialism and the move of the universities to become glorified trade schools for their students - my undergraduate university's syllabus is now full of departments of humane letters trying to appeal to undergraduates based on usefulness in the world of employment - and my own background has scrupulously avoided any such thing (I write software with an M.A. in English and an LL.B., which I took out of an interest in law as such). But the one thing which having Computer Science courses is supposed to give us is properly trained and capable graduates. And from what I know organizationally these developers must have been at least CS grads, and very possibly Software Engineering grads.

These are not isolated observations. The majority of the "professional" code I see reflects sheer incapacity. The flaws it has are not those of carelessness, which can happen to anyone - nobody writes bug-free code, and even extensive testing and many eyes cannot be guaranteed to flush out all bugs - but of a sheer unwillingness to think about design or structure. Maintainability as an end is tossed to one side (even otherwise "good" developers seem to feel that there is nothing wrong in adding another 25 lines to a function which is already 500 lines long rather than breaking out, at a minimum, the new code and refactoring towards better modularization). Nobody seems to perform dependency analysis. In 25 years I have from time to time seen lip service paid to code reviews but never seen them actually implemented as general practice.

There are good developers out there; I've worked with some and read books by others. But whatever standards general training is supposed to impose are either too low or breached more often than observed.

This sort of complaint can be focussed just on the local area of software development, in which case it tends to become part of the standard debate about software engineering standards in academy and workplace, usually drawing distinctions between the generation of code and, say, building bridges, with the difference in length of existence of disciplines being adduced. However, I suspect that it is one aspect of a more general pattern, a kind of milder Sturgeon's Law. Certainly I have not known a surfeit of good managers, wide variations in teaching skill are prevalent in the schools, and competent politicians seem to be a vanishing breed (probably always vanishing, as the past is seen by the light of nostalgia: I'm not suggesting that things are getting worse, just that they've always been pretty dispiriting).

Some of this is institutionally driven: it really does make more pragmatic sense in many contexts to fill your required roles with what you can reasonably easily get rather than spending seemingly inordinate amounts of time hunting down significantly more able candidates, after a certain threshold has been passed, and to make the roles to be filled have responsibilities and expectations which can be met by the run-of-the-mill rather than to craft roles around the specific skills of the very good. Interchangeable parts apply to human roles as well as physical components. Likewise, society expects universities to turn out reasonably full classes of graduates rather than putting standards high enough that a significant number of students fail, drop out, or take six years to obtain a bachelor's degree. The intersection of the production rules for human capital and the onboarding rules which govern its corporate deployment can quickly lead to positions being filled with less than ideal employees.

That's not to say that it's right, either. I've seen a number of expensive projects which would have been much cheaper, and faster, if those in charge had been willing to pay to get them right the first time rather than be cheap the first time and have to pay more for a rebuild on the second - or third - try. (These have frequently been cases where an "essential" project has budgetary constraints, so a budget is deliberately confected which comes in just under the limits in time and money, and the rest of the time and money predictably become available once the project is further advanced, due to the sunk cost fallacy's domination over the minds of senior managers.)

Some of it is socially driven. Just as our Deweyan schools are built around social promotion for social at least as much as administrative reasons, so also much of our working world is socially built around the idea that most people are reasonably competent. There's a general attitude that encourages a culture of "good enough" when applied to domains which benefit from expertise.

But we, collectively, pay for it. Shoddy work - in infrastructure, education, management, politics - takes a toll on the lives of all of us, frequently indirectly, occasionally directly.
