maintenance_work_mem controls the amount of memory that the system will allocate in two different cases which are basically unrelated to each other. First, it controls the maximum amount of memory that the system will use when building an index. In order to build a btree index, the input data must be sorted, and if the data to be sorted does not fit into maintenance_work_mem, it will be spilled to disk. At least in my experience, this does not tend to be a big problem. Index builds are rare in many installations, are generally initiated manually, and you can always raise the value just for your local session by issuing the SET command before CREATE INDEX if you have determined that it helps performance in your environment. The second thing maintenance_work_mem does is control the amount of memory that vacuum will allocate to store the TIDs of dead tuples, so that the corresponding entries can be removed from the indexes. It can be a little bit difficult to understand how to tune the amount of memory that is needed for this purpose.
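For example, a session-local override for a one-off index build might look like this (the table and index names here are hypothetical):

```sql
-- Raise the limit only for this session; other sessions are unaffected.
SET maintenance_work_mem = '2GB';
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
RESET maintenance_work_mem;
```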
In brief, vacuum scans the table and collects the TIDs of all the dead tuples. Then it scans the indexes to remove all entries for those TIDs from the indexes. If it runs out of memory to store the TIDs of dead tuples before it scans the whole table, it will stop the table scan, scan the indexes, discard the accumulated list of TIDs, and then resume scanning the table from the point where it left off. If there is even one more dead tuple encountered after that point, it will end up scanning all of the indexes a second time. On a large table, scanning the indexes more than once is really expensive, especially if the table has many of them. If your maintenance_work_mem setting is drastically too low, you could even get more than two index scans. That would stink. So, you want to set maintenance_work_mem high enough that you never get multiple index scans.
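As a rough sketch (this is my own illustration, not PostgreSQL's actual code), the number of index-scan passes is driven by how many dead TIDs fit in memory at once; I'm assuming 6 bytes per TID here, which is what vacuum uses for its dead-tuple array:

```python
import math

TID_BYTES = 6  # vacuum stores each dead-tuple pointer in 6 bytes

def index_scan_passes(dead_tuples: int, maintenance_work_mem_bytes: int) -> int:
    """Estimate how many times vacuum must scan each index:
    one pass per memory-full batch of dead TIDs."""
    capacity = maintenance_work_mem_bytes // TID_BYTES
    if dead_tuples == 0:
        return 0
    return math.ceil(dead_tuples / capacity)

# 16.8M dead tuples: 64MB forces a second pass over every index,
# while 128MB gets the job done in one.
print(index_scan_passes(16_800_000, 64 * 1024**2))   # -> 2
print(index_scan_passes(16_800_000, 128 * 1024**2))  # -> 1
```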
On the other hand, you also don't want to run out of memory. Note that the system will run as many autovacuum workers at a time as specified by autovacuum_max_workers, and each one will allocate an amount of memory equal to maintenance_work_mem. Currently, it always allocates the full amount; hopefully, that will be fixed in the future. So, if you have the default setting of autovacuum_max_workers = 3, and you set a value like maintenance_work_mem = 10GB, you can and very often will have 30GB of RAM dedicated just to autovacuum, not counting any foreground VACUUM or CREATE INDEX operations that you may run. You can very easily run a small system out of memory this way, and even a large system may have some performance problems.
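Concretely, a postgresql.conf fragment like the following (values chosen only for illustration) commits up to 3 * 10GB = 30GB to autovacuum alone:

```
autovacuum_max_workers = 3    # each worker allocates maintenance_work_mem
maintenance_work_mem = 10GB   # worst case: 3 workers * 10GB = 30GB for autovacuum
```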
Sometimes it's easier to estimate based on the on-disk size of your largest relation rather than the number of rows it contains. To do this, you'll have to make some estimate of how many tuples you have per 8kB page. For example, if you estimate that your average row takes up about 128 bytes on disk, which is not an unreasonable estimate in general but will obviously vary based on what data you have stored in your table, then you have about 64 tuples per 8kB block, which works out to about 8192 tuples per 1MB of table size. At 6 bytes per dead-tuple TID and the default autovacuum_vacuum_scale_factor of 20%, that's 1.2 bytes of maintenance_work_mem per tuple. So if your largest table is 10GB, that would be about 10240 megabytes * 8192 tuples/MB * 1.2 bytes/tuple = ~96 MB of maintenance_work_mem; perhaps configure 128MB or so to be conservative. You can see that you need a fairly big table to require more than the 1GB I recommended in the first paragraph, and hopefully this explanation makes it clear why this is so.
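The arithmetic above can be checked with a short script; the 1.2 bytes/tuple figure comes from 6 bytes per dead-tuple TID times the default autovacuum_vacuum_scale_factor of 0.2, and the 8192 tuples/MB figure assumes roughly 128-byte rows:

```python
TID_BYTES = 6          # bytes per dead-tuple TID in vacuum's array
SCALE_FACTOR = 0.2     # default autovacuum_vacuum_scale_factor
TUPLES_PER_MB = 8192   # assumes ~128 bytes per row, 64 rows per 8kB page

table_mb = 10 * 1024   # a 10GB table
tuples = table_mb * TUPLES_PER_MB
bytes_needed = tuples * TID_BYTES * SCALE_FACTOR
print(f"{bytes_needed / 1024**2:.0f} MB")  # -> 96 MB
```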
If you do have very large tables, one thing you can do to reduce the amount of maintenance_work_mem that will be required is to reduce autovacuum_vacuum_scale_factor to a smaller value. Instead of vacuuming the table when the number of dead rows is 20% of the table size, you might want to vacuum the table when the number of dead rows is 1% of the table size, or even 0.1% of the table size. This can be configured on a per-table basis, which is good, because you likely only want to reduce it for your very large tables. This might seem like a strange recommendation, since vacuuming the table more frequently and scanning the indexes once each time doesn't necessarily save any I/O vs. scanning it less frequently and scanning the indexes multiple times each time. However, that ignores the impact on your foreground tasks. Especially on very large tables, more frequent vacuuming tends to result in substantial improvements to query performance. If you only vacuum a 20MB table when it has 20MB * 20% = 4MB of dead tuples, that's OK, because 4MB isn't that much. If you only vacuum a 200GB table when it has 200GB * 20% = 40GB of dead tuples, that tends to be fairly painful, because every sequential scan of the table has to read and skip 40GB of extra stuff.
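A per-table override might look like this (the table name is hypothetical):

```sql
-- Vacuum this table when ~1% of its rows are dead, instead of the default 20%.
ALTER TABLE big_events SET (autovacuum_vacuum_scale_factor = 0.01);
```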
I have a vague feeling that PostgreSQL could be smarter than it is in this area. In the example calculation above, the amount of maintenance_work_mem that is required is about 1/100th of the size of the largest table. For a significant percentage of users, though certainly not all, the size of the active portion of the database is no more than a few times larger than physical memory. Such users would be quite happy if autovacuum just allocated a small amount of memory initially and grew that amount as necessary to avoid multiple index scans, with no upper limit at all. Despite the lack of an upper limit on memory consumption, they would never actually run out of memory, because only the amount of memory actually required would get allocated, and that's probably not too large. In the current system, by contrast, because the configured amount is always allocated, they can quite easily run out of memory by setting the value higher than what is really required. On the other hand, a strategy of allocating with no upper limit would be terrible for people running really large PostgreSQL databases on modest hardware. For those people, the current system is superior. So, it's not entirely clear how to improve on what we have today.