Ethereum 1 Research topics

Migration from hexary to binary tries

If it is found that binary merkle tries are more beneficial than hexary merkle tries (and perhaps that it is also beneficial to change the hash function used to hash the trie), we need to think about a potential migration path for existing implementations. The current proposal is described here: https://ethresear.ch/t/optimizing-sparse-merkle-trees/3751

Theoretically, we can do a minimally disruptive migration by nominally keeping the patricia hex tree structure, but changing the hash function from

sha(rlp(x1 .... x16))

to

merkle_tree(x1 ... x16)

That way we would get most of the witness size reduction benefits, but at the DB level nothing would change. Even for incremental updates, we could make the root deterministic if we put information about whether or not an account was updated into the leaf.
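
As an illustration of the idea (not the exact construction proposed on ethresear.ch), here is a minimal Go sketch that folds the 16 children of a hexary branch node into a binary Merkle root instead of hashing rlp(x1 ... x16). It assumes keccak256 via golang.org/x/crypto/sha3, and it does not special-case empty children.

package main

import (
    "fmt"

    "golang.org/x/crypto/sha3"
)

// keccak256 is the hash function the hexary trie already uses today.
func keccak256(data ...[]byte) []byte {
    h := sha3.NewLegacyKeccak256()
    for _, d := range data {
        h.Write(d)
    }
    return h.Sum(nil)
}

// binaryMerkleRoot replaces sha(rlp(x1 ... x16)) with a 4-level binary
// Merkle tree over the same 16 children of a branch node.
func binaryMerkleRoot(children [16][]byte) []byte {
    level := make([][]byte, 16)
    for i, c := range children {
        level[i] = keccak256(c) // leaf hash; empty children could be special-cased
    }
    for len(level) > 1 {
        next := make([][]byte, 0, len(level)/2)
        for i := 0; i < len(level); i += 2 {
            next = append(next, keccak256(level[i], level[i+1]))
        }
        level = next
    }
    return level[0]
}

func main() {
    var children [16][]byte
    for i := range children {
        children[i] = []byte{byte(i)} // placeholder child references
    }
    fmt.Printf("%x\n", binaryMerkleRoot(children))
}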

Flat data layout for storing the state

For decoupling the merkle tree from the data storage. The result of this research is benchmarks and documentation on how to deal with various issues when changing from a trie-based storage model to a flat storage model. Examples of such issues:

This documentation should serve as a guide for any implementation team who wishes to migrate their implementation to the flat storage model.
The beginning of this documentation is here: https://github.com/ledgerwatch/turbo-geth/blob/master/docs/programmers_guide/db_walkthrough.MD

Research phases:
THIN HISTORY. Indexing of historical data - we are researching and implementing an approach in which most of the data about historical changes of accounts and contract storage is kept in the form of “change sets” (one change set is written per block), and the “history” is simply the index mapping account address hashes or storage slot hashes to the “timeseries” of block numbers where these accounts and storage slots have been modified. This is mostly implemented, and the next phase is comparative testing against the previous model, where the “history” contained most of the data. We currently expect a further reduction in the disk space required for storing the history, and an insignificant slowdown of large historical queries (because the data will be queried via the index, adding an extra round trip to the database). Results of this phase are comparative benchmarks of THIN HISTORY vs THICK HISTORY.
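
The following is a minimal in-memory sketch of the THIN HISTORY layout described above; the type and function names are hypothetical and do not reflect turbo-geth’s actual database buckets.

package main

import "fmt"

type AddrHash [32]byte

// ChangeSet records, for one block, the previous value of everything that
// block modified.
type ChangeSet map[AddrHash][]byte

type ThinHistory struct {
    changeSets map[uint64]ChangeSet  // block number -> change set
    index      map[AddrHash][]uint64 // address hash -> blocks where it changed
}

// Record stores the pre-modification value of an account for a given block
// and appends the block number to the account's timeseries.
func (h *ThinHistory) Record(block uint64, addr AddrHash, prevValue []byte) {
    if h.changeSets[block] == nil {
        h.changeSets[block] = ChangeSet{}
    }
    h.changeSets[block][addr] = prevValue
    h.index[addr] = append(h.index[addr], block)
}

// AsOf answers a historical query via the index, then does one more lookup
// into the change set - the extra round trip mentioned above.
func (h *ThinHistory) AsOf(addr AddrHash, block uint64, current []byte) []byte {
    for _, b := range h.index[addr] {
        if b > block {
            return h.changeSets[b][addr] // first change after `block` holds the old value
        }
    }
    return current // not changed after `block`, so the current value applies
}

func main() {
    h := &ThinHistory{changeSets: map[uint64]ChangeSet{}, index: map[AddrHash][]uint64{}}
    var acc AddrHash
    h.Record(100, acc, []byte("balance=1")) // before block 100 the balance was 1
    fmt.Printf("%s\n", h.AsOf(acc, 99, []byte("balance=2"))) // balance=1
}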

BOUNDARY HASHES. This aims to solve the biggest drawbacks of the current implementation of the flat data layout - slow startup of the node (because it has to read all the accounts and build the top of the Patricia Merkle tree in memory), and the difficulty of becoming a light server, as well as a “Red Queen”/“Firehose” seeder (Red Queen and Firehose are snapshot sync protocols, superior to “fast sync” and “warp sync”, that we are planning to implement). The main idea is to add an extra bucket to the database, recording a subset of intermediate hashes (as a mapping prefix => node hash, which is still compatible with the flat layout). Instead of maintaining such intermediate hashes for all the intermediate nodes, the plan is to maintain them only for the nodes that are not currently in memory. This should reduce unnecessary database churn while still delivering the benefits. We currently expect a slight increase in the database size, as well as some extra database write I/O per block. We expect this to be a trade-off for much quicker startup time, as well as unblocking further development of the Red Queen/Firehose snapshot sync protocols. Results of this phase are comparative benchmarks of startup times at various block heights; performance of the “eth_getProof” RPC method for recent block heights, compared with go-ethereum; and comparison of the extra database size and write I/O with and without the intermediate hashes.
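
A minimal in-memory sketch of the BOUNDARY HASHES idea follows; the names are hypothetical, and a real implementation would keep the prefix => hash mapping in a database bucket rather than a Go map.

package main

import "fmt"

// IntermediateHashes keeps a flat mapping from key prefix to intermediate
// node hash, so the top of the trie can be restored at startup without
// re-reading every account.
type IntermediateHashes struct {
    hashes map[string][32]byte // nibble prefix -> node hash
}

// Store records a freshly computed intermediate hash for a prefix whose
// node is about to be evicted from memory.
func (ih *IntermediateHashes) Store(prefix string, hash [32]byte) {
    ih.hashes[prefix] = hash
}

// Invalidate removes all stored hashes on the path to a modified key; they
// will be recomputed and written back when the block is processed.
func (ih *IntermediateHashes) Invalidate(keyNibbles string) {
    for i := range keyNibbles {
        delete(ih.hashes, keyNibbles[:i+1])
    }
}

func main() {
    ih := &IntermediateHashes{hashes: map[string][32]byte{}}
    ih.Store("0a", [32]byte{1})
    ih.Invalidate("0a3f")       // an account under prefix "0a" changed
    fmt.Println(len(ih.hashes)) // 0: the stale hash was dropped
}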

RED QUEEN. This aims to develop a snapshot sync method that is more efficient than fast sync (supported by most implementations), and more secure than warp sync. The flat data layout makes it impossible to efficiently support fast sync, because fast sync’s primitive, called GetNodeData, queries trie nodes based solely on their hashes, which implies the existence of an index from hashes to node data. This index is missing in the flat data model. Warp sync is more efficient, but it has two main downsides. Firstly, the warp snapshots need to be pre-generated before being served. Secondly, the recipient of a warp can only verify the warp’s correctness after it has been downloaded in its entirety. Red Queen sync has been designed from the start as a combination of a protocol and an algorithm. The protocol is a set of primitives that request data to be transmitted from one peer (seeder) to another (leecher). The algorithm is the strategy that uses the primitives of the protocol to synchronise the current or recent state. The main idea of the Red Queen sync algorithm is to not assume that seeders are able to serve any historical views of the state; they always serve what they deem to be the current state. It is the job of the leecher (and the Red Queen algorithm) to combine data from potentially different block heights and eventually arrive at a consistent and complete view of the state.
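
Below is a heavily simplified sketch of the leecher-side bookkeeping this implies; the types and the staleness rule are illustrative assumptions, not the actual Red Queen protocol.

package main

import "fmt"

// Seeders always answer with their current state, so the leecher tags each
// fetched prefix with the block height it was served at, and re-requests
// prefixes that fall too far behind the best observed height.
type leecher struct {
    fetchedAt map[string]uint64 // state prefix -> block height it was served at
    best      uint64            // highest block height seen from any seeder
}

func (l *leecher) onResponse(prefix string, height uint64) {
    l.fetchedAt[prefix] = height
    if height > l.best {
        l.best = height
    }
}

// stale returns the prefixes that must be requested again before the view
// of the state can become consistent and complete.
func (l *leecher) stale(maxLag uint64) []string {
    var out []string
    for p, h := range l.fetchedAt {
        if l.best-h > maxLag {
            out = append(out, p)
        }
    }
    return out
}

func main() {
    l := &leecher{fetchedAt: map[string]uint64{}}
    l.onResponse("00", 9_000_000)
    l.onResponse("01", 9_000_128)
    fmt.Println(l.stale(64)) // [00]: served too long ago, re-request it
}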

The State Analytics

This project aims at making the charts and tables that have appeared in various blog posts, such as:
https://medium.com/@akhounov/ethereum-block-gas-limit-increase-and-state-growth-b95353153179

https://medium.com/@akhounov/data-from-the-ethereum-stateless-prototype-8c69479c8abc

https://medium.com/@akhounov/looking-back-at-the-ethereum-1x-workshop-26-28-01-2019-part-4-a69b48e14309

https://medium.com/@akhounov/looking-back-at-the-ethereum-1x-workshop-26-28-01-2019-part-2-d3d8fdcede10

more accessible on a regular basis (aim 1), and then reproducible by anyone in possession of a turbo-geth archive node (aim 2).
For aim 1, there would be remote access to the turbo-geth database, which will allow extracting data without stopping the turbo-geth node. That way, stats can be gathered, and charts and tables produced and published automatically at regular intervals.
For aim 2, the tools used for extracting data from a working turbo-geth node would be extracted into a separate command line interface, with some user documentation on how to use and extend it.

Rough plan

  1. Server side of the remote read-only interface to BoltDB while running turbo-geth node. Almost done.
  2. Client side of the remote interface to BoltDB. Partially done.
  3. Implement “state growth vs block gas limit” charts from https://medium.com/@akhounov/ethereum-block-gas-limit-increase-and-state-growth-b95353153179
  4. Compare performance of remote extraction (with on-line node) with local extraction (with off-line node).
  5. Create a static web-site and regularly publish fresh charts for “state growth vs block gas limit”.
  6. Implement the generation of “state trie depth” table from https://medium.com/@akhounov/looking-back-at-the-ethereum-1x-workshop-26-28-01-2019-part-2-d3d8fdcede10
  7. Add “state trie depth” table to the Web-site.
  8. Create a better abstraction of the server side to accommodate both BoltDB and BadgerDB.
  9. Extract and document the tool for gathering data and extending the analytics (so that people can come up with their own charts and tables).

Stateless Ethereum

For letting most nodes in the network use fewer resources to process blocks. The results of this research are prototypes, benchmarks, documentation and a specification describing how to implement stateless processing. The implementation would include three main parts:

  1. Block witness producer
  2. Block witness relayer
  3. Block witness consumer

Stateless clients are one approach to improving the performance of Ethereum client implementations while processing blocks of transactions. Specifically, it seeks to ease the increasing burden that is the state size. It does so by removing the need to download, or implicitly construct and maintain, the state for most of the participants in the Ethereum network. The requirement to access the state is removed by introducing another type of data packet (existing data packet types are, for example, blocks and transactions) to be gossipped around the p2p network. We call these data packets “block witnesses”. For each block there is one corresponding block witness. Block witnesses have two main properties (illustrated by the sketch after this list):

  1. It is possible to verify efficiently that the block witness is indeed constructed from the correct version of the Ethereum state.
  2. The block witness has all the information required to execute the corresponding block.
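
The following is a minimal Go sketch of how a stateless client would use these two properties: check the witness against the pre-state root from the parent header (property 1), then execute the block using only the witness (property 2). Witness, Block and the function names are illustrative placeholders, not an actual turbo-geth or protocol API.

package main

import (
    "bytes"
    "errors"
    "fmt"
)

// Witness carries the trie nodes, contract code and structure information
// needed to execute one block without any locally maintained state.
type Witness struct {
    nodes [][]byte // partial state trie nodes, code chunks, ...
}

// Root folds the witness back into a single state root hash; property 1
// says this can be done efficiently. The real folding follows the witness
// structure and is left abstract here.
func (w *Witness) Root() []byte {
    // ... rebuild the partial trie from w.nodes and hash it up to the root ...
    return nil
}

type Block struct {
    ParentStateRoot []byte // state root from the parent header
    // transactions, uncles, ... omitted
}

// processStatelessly verifies property 1, then relies on property 2: every
// account, storage slot and code chunk the block touches must be in w.
func processStatelessly(b Block, w *Witness, execute func(Block, *Witness) error) error {
    if !bytes.Equal(w.Root(), b.ParentStateRoot) {
        return errors.New("witness was not built from the expected pre-state")
    }
    return execute(b, w)
}

func main() {
    err := processStatelessly(Block{}, &Witness{}, func(Block, *Witness) error {
        return nil // a real stateless EVM run would go here
    })
    fmt.Println(err) // <nil>
}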

More details can be found here: https://medium.com/@akhounov/data-from-the-ethereum-stateless-prototype-8c69479c8abc

Research roadmap

To make stateless clients a reality and get them implemented in all Ethereum 1 clients, the following large steps seem to be necessary, up to the research test-net launch:

  1. Revisiting of the prototype (currently based on Turbo-Geth). The previous version of the prototype, which was used to produce data for the Medium article shown above, had a different encoding for the structure than what is currently proposed (based on the “multiproofs” section here: https://github.com/ledgerwatch/turbo-geth/blob/master/docs/programmers_guide/guide.md#multiproofs).
  2. Refresh the data. Using the updated prototype, produce new data, and add non-smoothed charts (previously published charts were moving averages over 1024 blocks to make them smoother, but they potentially concealed some extremes).
  3. Estimate savings in the block witness size from switching to binary tries. It is a conjecture that switching to binary tries would reduce the size of the block witnesses by a factor of 3-4. Also, estimate whether merkle-hashing of contract byte codes would reduce the size of the block witnesses.
  4. Witnesses for genesis sync. Estimate the potential benefit of using partial witnesses for speeding up sync from the Genesis block (where sync proceeds with minimal disk I/O).
  5. Visualisation in the prototype. Some initial steps have been taken (e.g. visualisations for the presentation at the STARKWare sessions). This needs to be extended to provide interactive visualisations of state and block witnesses, to help extend the circle of researchers and developers dealing with stateless clients.
  6. Prototype to support semi-stateless operation. The semi-stateless approach is described here: https://medium.com/@akhounov/the-shades-of-statefulness-in-ethereum-nodes-697b0f88cd04, but the data gathering was still computationally hard. With the code and data from item (4) on this roadmap (persisting partial witnesses), it should be much more efficient to collect and compare the data. The threshold would not be expressed in the number of blocks for which the “aggregated” witness is kept by a semi-stateless client. Instead, the threshold would be expressed in the number of state trie nodes kept in memory.
  7. Analysis of potential adversarial behaviour. Using the prototype developed, make an assessment of how much gas an adversary would need to pay to inflate the size of the block witness. This analysis will be the input to the gas cost adjustment that introducing stateless clients would require.
  8. Incentivisation of block witness production and relay. Using information gathered in the previous steps, reason about what incentives exist for any nodes to produce and to relay block witnesses, and, for example, not relay blocks without witnesses.
    It is a bit circular, but this is my current thinking. Block witnesses can actually benefit full nodes (keeping the entire state) as well as partial nodes, because they can reduce their memory expenditure on keeping the cache for their database, and instead simply consume the witnesses. If they do want to use their memory, they can “dial up their statefulness and dial down the witness sizes”, and still benefit by not having any read I/O from the database, only write I/O (which is much easier to do concurrently). So, in order to reinforce the incentive for miners to generate witnesses and prefer smaller witnesses to bigger ones, we would most likely need to change the “default” propagation rules for blocks. Currently, blocks are propagated once their PoW is found to be correct. In the future, we might need to change this to “blocks are propagated once there is a valid witness, which can be propagated alongside the block; if there is no witness, do not propagate the block”. For the nodes that benefit from the existence of the witnesses, it should be easy to convince the operators to switch to the new propagation rules. Once this propagation rule becomes the norm, miners have a clear incentive to produce witnesses for their blocks, because otherwise their blocks simply won’t propagate, or will propagate slowly. Note that even miners themselves can benefit from witnesses, because they can process blocks and start mining on top of them without any database I/O on the critical path.
  9. Block witness specification. A formal (implementation-independent) description of how block witnesses can be produced, how to encode/decode them into/from wire packets, and how to execute transactions on a block witness. This also covers the case of partial witnesses, which would simply be a concatenation of smaller witnesses rooted at different points in the state tree.
  10. New sub-protocol specification. Specify a new, optional sub-protocol for gossiping block witnesses (full, partial, and for range of blocks) around the network.
  11. Gas cost specification. Based on the analysis of adversarial behaviour, specify how transaction senders will be charged for the production of block witnesses (see the sketch after this list). If we take 16 gas per byte as the baseline established in Istanbul (in EIP-2028), and we manage to somehow remove or minimise all intermediate hashes from the block witness (or just ignore them), then the cost of SLOAD should be (32+32+32) * 16 = 1536 gas. The first 32 is the address hash of the contract, the second 32 is the hash of the storage key, and the third 32 is the value. If we apply the same crude logic to contract codes, we should charge 24 * 1024 * 16 = 393216 gas for CALL, where 24 * 1024 is the current limit on contract code size. If, however, we figure out from the data that it would be beneficial to split contract code into pieces and merklize them (item 2 in this roadmap), then obviously we would only need to charge gas whenever the instructions from a piece are actually “loaded” into the EVM. But even just making the CALL charge proportional to the actual contract code’s size would be better. This, however, still requires something like UNGAS, otherwise there will be quite a lot of broken contracts and other concerns about backwards incompatibility. This would encourage people to split their large contracts into smaller pieces, working via DELEGATECALL proxies.
  12. Research test-net. Construct a test-net supporting the block witnesses. Specification from previous steps would allow any client implementation to catch up and join.
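
As referenced from item 11, here is a small sketch of that crude witness-byte accounting, using the 16 gas per byte figure from EIP-2028; it simply reproduces the 1536 and 393216 values derived above.

package main

import "fmt"

const gasPerWitnessByte = 16 // EIP-2028 calldata cost, used as the baseline

func main() {
    // SLOAD: contract address hash + storage key hash + value, 32 bytes each.
    sload := (32 + 32 + 32) * gasPerWitnessByte
    // CALL, worst case: the whole contract code at the 24*1024-byte limit.
    call := 24 * 1024 * gasPerWitnessByte
    fmt.Println(sload, call) // 1536 393216
}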

Implementation roadmap

Since the research is in the early stages, this is not very fleshed out yet (therefore no expected dates), but this is the current thinking.

  1. Formal semantics for EVM, with reference implementation.
  2. Introduction of “Versionless EVM” via Account versioning, together with unobservable gas change (UNGAS), most probably dependent on Project “Gas repricing vs backwards compatibility”.
  3. Potentially changing state tree from hexary to binary.
  4. Potentially introducing merkle tree hashing of the contract byte codes.
  5. Repricing of operations influencing the size of the block witness.

Gas repricing vs backwards compatibility

The aim of this project is to achieve less painful (or ideally painless) repricing of opcodes. We expect these repricings to be necessary to keep gas costs balanced, as well as for the implementation of the Stateless Ethereum project. This project only has a research roadmap at the moment, and the dates are very uncertain. It also has a few branches (alternative proposals), covering either the main mechanism with which backwards-compatible repricing is achieved, or some secondary issues. Currently, there are 2 alternative proposals for the main mechanism:

Legacy repricing

https://ep.corepaper.org/compatibility/legacy-repricing/. The main idea is to introduce account versioning, but with the goal of only ever having two versions, “legacy” and “forward compatible”. The legacy version experiences no repricing, but its usage gets disincentivised economically, by making it progressively more expensive than the forward compatible version. In the forward compatible version, the main innovation is making gas unobservable from the smart contracts (codename UNGAS). It is currently believed that this can be achieved by making 3 changes to the semantics of opcodes. Since gas is then unobservable, any repricing does not affect the logic of the contracts’ execution, only the total required gas (which can be easily changed when creating a transaction). In order to economically disincentivise the usage of the legacy version, repricings are performed not by increasing the gas expenditure of select operations, but rather by proportionally decreasing the gas expenditure of all the rest of the operations. It is believed that it will be difficult for miners to distinguish between transactions containing only legacy code, only forward compatible code, or a mixture, so they won’t be able to apply different gas prices accordingly, and the legacy code will effectively become more expensive to execute. When decreasing the gas expenditure of some operations, we will face the fact that some of them are already quite cheap (3 gas, for example). There are currently two proposed ways of dealing with this:

  1. Changing the semantics of entering a call frame, so that CALL, CREATE (and similar) instructions scale the remaining gas up and down by a certain factor (see the sketch after this list)
  2. Particle gas cost, introducing the units of “milligas”, or even perhaps “microgas”: https://eips.ethereum.org/EIPS/eip-2045
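
A minimal sketch of option 1 follows: scale the remaining gas when entering and leaving a call frame, so that operations can effectively cost a fraction of an old gas unit without sub-gas denominations. The scaling factor and rounding rule here are illustrative assumptions, not part of any concrete proposal.

package main

import "fmt"

const frameScale = 64 // illustrative scaling factor

// enterFrame converts the caller's remaining gas into the callee's units.
func enterFrame(callerGas uint64) uint64 { return callerGas * frameScale }

// leaveFrame converts unspent callee gas back into the caller's units,
// rounding down.
func leaveFrame(calleeGas uint64) uint64 { return calleeGas / frameScale }

func main() {
    remaining := enterFrame(10_000)    // caller passes 10,000 gas
    remaining -= 3                     // an operation costing 3/64 of an old gas unit
    fmt.Println(leaveFrame(remaining)) // 9999 returned to the caller
}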

Decoupling the two functions of gas

Gas is split into two components - one for metering, another for restricting recursion and callbacks.

Formal system to specify semantics of Ethereum state transition

The aim of this project is to agree on the most suitable way to specify semantics and use it to propose changes to Ethereum more formally. It can also mean that such a suitable semantics has to be created. Some of the very ambitious projects (like Project “Gas repricing vs backwards compatibility”) that involve very significant changes to the EVM would be hard to deliver successfully without the ability to specify the changes in an unambiguous way. From our previous experience, the real scrutiny of such proposals usually starts relatively late in the process, once the code is written. This is because the specification given in prose within the EIPs is not rigorous and formal enough for the proper analysis to start (and, more importantly, for using automated analysis tools like model checkers, symbolic execution engines, etc.). There exists at least one known formal semantics for EVM already, called KEVM. It is also part of the Jello Paper: https://jellopaper.org/. This semantics is complete and executable. However, its major drawback is that the language it uses (K) is quite complex and inaccessible even for most core developers. Also, the process of installing all the software required to execute the semantics makes the barrier to entry higher still. So far, KEVM has not been used to formalise EVM change proposals ahead of their implementation. Our assumption is that this is a usability issue. If there were an alternative semantics that used simpler language (like the one used in Web Assembly), there would be much better chances for it to be adopted as a de-facto specification language for EVM changes.
In this project, we will attempt to construct a new formal semantics for EVM, based on a single type of substitution rule. Every rule will have 3 elements:

  1. A pattern (with variable bindings) to be matched against the configuration.
  2. An optional guard that must hold for the rule to apply.
  3. A substitution that produces the new configuration.

The matching that the pattern performs is done against the sequence of objects that represents the current state of the EVM (also known as the “configuration”). As in the Web Assembly specification, we treat items on the EVM stack and EVM instructions as the same type of objects, so that for most operations, opcode and operands can be matched by a simple pattern without the use of extraction functions.
This project has both research and implementation roadmaps, but they are closely intertwined and are likely to proceed in lock-step. That is why we combine them in one list for the time being. At any point in the roadmap we may encounter unsolvable problems, and the project may need to be abandoned or rescoped. That is why the roadmap attempts to probe the more difficult areas of EVM functionality first, because they are more likely to be unsolvable.

Research/implementation roadmap:

  1. Skeleton of the reference implementation for pattern matching (and variable bindings). We will start with the C programming language, which gives a good balance between being understandable to a lot of developers and having a lot of code analysis tools developed for it.
  2. Skeleton of reference implementation for evaluation of the guards. Guards will also be written in C, with some conventions about what is permissible and what not.
  3. Skeleton of reference implementation for generating the substitutions.
  4. Rules for jump instructions, unconditional and conditional.
  5. Rules for call, revert, and return instructions (dealing with spawning and destroying execution frames).
  6. Rules for contract storage access (SLOAD, SSTORE). Potential introduction of non-determinism into the semantics, and therefore the need to also create the contract storage oracle.
  7. Rules for memory operations.
  8. Rules for all other opcodes.

Previous write up of ideas

The current idea is to use notation similar to what is used in Web Assembly, which is a very clever system, and works as follows:

  1. The state of the system, or configuration, is described as a tuple of functions (for accounts, storage, etc.) and an instruction sequence.
  2. Interestingly, the stack does not have a separate representation and is instead represented as a part of the instruction sequence. To achieve that, a special const instruction carrying a value is used.
  3. Most operations will be represented as a reduction rule that involves a pattern to be matched inside the instruction sequence, and an optional guard, which is formulated as a logical formula. Guards let us choose between multiple rules applicable to the same configuration, and allow us to compute the effect of the operation on parts of the state (a minimal sketch of such a rule is given after this list).
  4. For our purposes, we can try to represent a gas counter (or two) as part of the instruction sequence too (with a special instruction gas_counter), so we can have different reduction rules that apply depending on whether there is sufficient gas, as well as properly decrease the counter on usage.
  5. Things like execution frames are trickier. We need the ability to scrap them completely when the frame is aborted or we perform RETURN. For that reason we might need to introduce frame boundary “instructions”: frame_begin, frame_id_1 … , frame_begin, frame_id_2 …, frame_end, frame_id_2, …, frame_end, frame_id_1. In this way, pattern matching can recognise where frames begin and end so they can be eliminated. Frame ids are required to ensure that any reduction rule that removes the entire frame can specify in the guard that the frame id at the start of the frame matches the frame id at the end of the frame. Also, the frame needs to have its own frame id present locally.
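
To make the shape of such rules concrete, here is a minimal sketch of a single reduction rule in this style (written in Go for brevity, even though the planned reference implementation is in C): stack values live in the instruction sequence as const objects, and the ADD rule rewrites “const a, const b, ADD” into “const (a+b)”. Guards, gas and the rest of the configuration are omitted.

package main

import "fmt"

type object struct {
    opcode string
    value  uint64 // only meaningful when opcode == "const"
}

// reduceAdd applies the ADD reduction rule once, at the first matching
// position; a real engine would also evaluate a guard and thread the
// configuration (gas, storage, ...) through the rule.
func reduceAdd(seq []object) ([]object, bool) {
    for i := 0; i+2 < len(seq); i++ {
        a, b, op := seq[i], seq[i+1], seq[i+2]
        if a.opcode == "const" && b.opcode == "const" && op.opcode == "ADD" {
            out := append([]object{}, seq[:i]...)
            out = append(out, object{"const", a.value + b.value})
            out = append(out, seq[i+3:]...)
            return out, true
        }
    }
    return seq, false
}

func main() {
    seq := []object{{"const", 2}, {"const", 3}, {"ADD", 0}, {"STOP", 0}}
    seq, _ = reduceAdd(seq)
    fmt.Println(seq) // [{const 5} {STOP 0}]
}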

One of the challenges is how to model the sub-states. The complexity with substates is that the state modifications made in an execution frame (substate) can be reverted or “committed” at the end of the frame. However, the opcodes are allowed to reach beyond the substate, and that can potentially trigger cascading reads.
To solve this, we can employ a certain level of non-determinism. When an instruction like CALL executes and spawns a new frame between the instruction sequence and the stack, it is also allowed to copy an arbitrary part of the enclosing frame’s state into its local readset. When the execution proceeds, reading from beyond the readset results in a failure of evaluation. However, with a sufficiently correct readset the evaluation will proceed as required. This is where the non-determinism lies. Symbolic execution engines will be able to deal with it pretty easily by actually implementing cascading reads, but modelling cascading reads as a reduction rule would require some advanced higher-level functions.
Developing this idea further, each frame would also have a local writeset, which gets populated during the frame’s execution, and then is either merged into the enclosing frame’s writeset, or discarded.

When jump instructions are executed, the entire code in the current frame is replaced with the slice starting from the jump location.

In the preamble, the following types would need to be defined:

  1. Boolean
  2. Integer (up to 2^256)
  3. Lists
  4. Mapping

The List type will need a “drop” operation defined, like l[index:], which is the part of l with the first index items dropped, or invalid_list if there aren’t enough items. It will be used for modelling jump instructions.

The Mapping type will need a boolean function “submap” that tells whether one mapping is a sub-mapping of another, m1 <= m2. It will be used to check that the (non-deterministic) readset of a newly spawned frame is correct. Also, mappings should allow “holes”, i.e. values that are intended to be deleted. Using them, we can define “merge”, which applies the writeset generated by an execution frame to the writeset of the enclosing frame.
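
A minimal Go sketch of these two mapping operations follows, with a nil value playing the role of a “hole”; the representation is illustrative only.

package main

import (
    "bytes"
    "fmt"
)

type mapping map[string][]byte

// submap reports whether every key of m1 is present in m2 with the same value,
// i.e. m1 <= m2.
func submap(m1, m2 mapping) bool {
    for k, v := range m1 {
        v2, ok := m2[k]
        if !ok || !bytes.Equal(v, v2) {
            return false
        }
    }
    return true
}

// merge applies an execution frame's writeset to the enclosing frame's
// writeset; nil values ("holes") delete the key instead of setting it.
func merge(outer, writeset mapping) {
    for k, v := range writeset {
        if v == nil {
            delete(outer, k)
        } else {
            outer[k] = v
        }
    }
}

func main() {
    readset := mapping{"a": []byte{1}}
    state := mapping{"a": []byte{1}, "b": []byte{2}}
    fmt.Println(submap(readset, state)) // true: the non-deterministic readset guess was correct

    merge(state, mapping{"b": nil, "c": []byte{3}}) // delete b, write c
    fmt.Println(len(state))                         // 2: keys a and c remain
}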

The general composition of the string that is getting reduced during the execution is this:

(1 stack1 [configuration1] instructions1 )1

When the CALL instruction (or similar) is encountered, it spawns another execution frame within the first one:

(1 stack1 c1 c2 c3 c4 c5 c6 [configuration1] CALL instructions1 )1
=>
(1 stack1 [configuration1 - gas]
   (2 stack2-empty [configuration2 + readset]
   instructions of the called contract
   )2
instructions1 )1

Until this new execution frame completes, it will prevent any reduction rules in the first frame from proceeding, because most reduction rules will have this kind of left-hand side:

operand1 operand2 .. operand_n [configuration] OPCODE

Some reduction rules will include frame boundaries like (1 and )1 and some wildcards (*, matching any number of objects). Note that (1 will need to be two separate objects, so that we can have a guard to make sure that the frame id we are matching corresponds to the innermost frame. For example, REVERT will be something like this:

[configuration]( frame_id * mem_start mem_end [frame_id, mem] REVERT * frame_id )
==>
[configuration'],
if
configuration' = configuration + mem[mem_start:mem_end]

It is not very formal yet, but just to give an idea.

The next two challenges are:

Memory of an execution frame is modelled as a list of bytes plus an integer specifying the starting address of the memory. The ending address can always be computed as the starting address plus the length of the list. When memory is expanded, we might need multiple guards for the cases where it is expanded upwards or downwards, with overlapping and non-overlapping “allocation”. When memory is read onto the stack or written from the stack, we need to have functions defined that turn a certain number of bytes into an integer in the range 0…2^256-1.
In order to support the limit on stack depth (1024), there needs to be a short notation for matching a large number of objects, so that we can do something like this:

( frame_id *{1024} [frame_id] PUSH1 x * ) frame_id ==>
OutOfStackDepthException

In the above, the entire execution frame gets removed by the reduction rule. Unfortunately, there is nowhere to put the exception, but it can still be part of the configuration for the purposes of semantics and symbolic execution tools.

State synchronisation (snapshot, moving snapshot)

Build a model of what state synchronisation is, and describe existing and proposed methods using this model. Ideally, such a model would allow describing fast sync, beam sync, firehose, etc., illuminate their differences, and perhaps derive the optimal sync algorithm. Such a model would also use a specification of state handling that abstracts away the EVM and transactions, but requires a notion of blocks as a form of temporality.

Storing state outside of Ethereum network

This project aims at exploring various ways of storing relatively “immobile” parts of the Ethereum state in external systems, like Swarm or Cloudflare. There are preliminary results indicating that it is possible to partition the Ethereum state into fragments in such a way that there is not too much churn (i.e. the number of fragments updated per block is not very high). What are these partitioning techniques? Which external systems should be used?
This project does not have roadmaps yet, because there are no dedicated people working on it.

WASM for implementing Generalised Elliptic Curve precompiles

This project is related to https://eips.ethereum.org/EIPS/eip-1962, which was proposed to make it easier for users of Ethereum to utilise elliptic curve cryptography without being tied to a specific curve, as was done previously with the bn128 curve precompiles introduced in the Byzantium upgrade in October 2017.
Despite being related to EIP-1962, it does not lose relevance even if and when EIP-1962 is rolled out on the Ethereum mainnet.
The main conjecture that this project is attempting to verify is that, given a suitable set of primitives for performing arithmetic on large integers (for example, up to 2^1024), an interpreted version of Web Assembly is sufficient for most practical use cases of elliptic curves.
The roadmaps are currently unknown, but may be provided at a later date.

STARK proofs for Stateless Ethereum

Is it viable to use STARK proofs to reduce the bandwidth requirements of Stateless Ethereum? Intermediate hashes and structure would be removed from the block witnesses and replaced by 2 STARK proofs per block - one for the pre-witness, another for the post-witness.

Useful link for implementing Keccak: https://pdfs.semanticscholar.org/f55d/5e7826b6e59aa2e8bffedec115519fbf071d.pdf (“Architectures for Arithmetic Operations in GF(2^m) Using Polynomial and Normal Basis for Elliptic Curve Cryptosystems”). It looks like we would need the normal basis representation, because cyclic shift is then simply squaring.
But there are still non-linear operations in Keccak256, of the form X and (not Y) for some X and Y. These would still need the bit-extraction techniques used in the 2018 STARK paper for the SHA2 arithmetisation.
Further research is to be done on how compatible the usage of the normal basis is with other elements of STARKs, like FFT and FRI.
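
For reference, the non-linear step in question is Keccak’s chi transformation; a minimal illustration over one 5-lane row of the state (not a full Keccak-f implementation) follows.

package main

import "fmt"

// chi combines each lane with "x XOR ((NOT y) AND z)", the only non-linear
// step of Keccak-f.
func chi(row [5]uint64) [5]uint64 {
    var out [5]uint64
    for x := 0; x < 5; x++ {
        out[x] = row[x] ^ (^row[(x+1)%5] & row[(x+2)%5])
    }
    return out
}

func main() {
    fmt.Println(chi([5]uint64{1, 2, 3, 4, 5})) // [0 6 2 4 7]
}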

Transaction pool improvements to accommodate larger transactions

This project aims at developing techniques to address two main challenges that arise when the sizes of transactions travelling over the network increase:

  1. When Ethereum nodes restart, they need to reconcile their transaction pool with their peers. Currently, reconciliation is done by simply downloading all the transactions from a peer. When transactions get larger, this method will require more bandwidth and time.
  2. Most of the transaction pool implementations attempt to limit the amount of memory they utilise. For mining nodes, it is also important to be able to efficiently execute strategies for the inclusion of transactions into a newly mined block. Larger transactions mean that more sophisticated data structures need to be employed to find a good trade-off between memory consumption and strategic efficiency for miners.

A more straightforward solution, currently pursued by a couple of implementations (go-ethereum and Trinity), is to use transaction hashes (32 bytes) as transaction IDs. The transaction bodies would then be persisted by the nodes, which would reconcile their transaction sets using transaction IDs rather than the entire bodies.
Another, more opportunistic solution, could be the use of the minisketch library: https://github.com/sipa/minisketch
The capacity of the sketch, C, is the maximum number of differences that the sketch can reconcile, and the size of the sketch is B * C bits, where B is the size of each element (in our case B would be 256 bits for a transaction hash). So, if you only expect up to 1024 differences, and you use 256 bits of the transaction hash for reconciliation, your sketch size would be 256 * 1024 = 262144 bits, i.e. 256 Kbit or 32 KB.
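
A small sketch of that size arithmetic, under the stated assumptions (1024 expected differences, 256-bit elements):

package main

import "fmt"

// sketchSizeBits returns B * C: element size in bits times sketch capacity.
func sketchSizeBits(elementBits, capacity uint64) uint64 {
    return elementBits * capacity
}

func main() {
    bits := sketchSizeBits(256, 1024)
    fmt.Println(bits, "bits =", bits/8/1024, "KB") // 262144 bits = 32 KB
}
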
The roadmaps for this project are currently unknown, but may be provided at a later date.

Chain pruning

This project aims at looking at the techniques that would be required to deal with the ever-growing size of the blockchain, in terms of block headers, block bodies, and transaction receipts. Unlike the state, these data are almost append-only (the only time they get rewritten is during chain reorganisations, which are relatively rare).
The chain of block headers grows the slowest (in terms of its size), but it is necessary to download it even for light clients. There are two potential solutions to that:

  1. Regular checkpointing. Once in a while, let’s say once a year, a new checkpoint hash is publicly agreed and hard-coded into the Ethereum implementation code. That way, light clients and others do not need to download all the headers.
  2. There could be a potential to generate a compact proof of header chain ancestry. It would make it possible to avoid downloading the entire chain of headers and verifying PoW on every single header, and instead use proof systems like SNARKs or STARKs to obtain a compact proof that a certain header is correct, in the sense that it has ancestry all the way to the genesis block (or some other “checkpoint” block), with all PoW correct along the way.

The sequence of block bodies grows fast, and it is expected to grow even faster after the “Istanbul” upgrade, which lowers the cost of including data in a block from 68 gas per byte to 16 gas per byte.
The sequence of receipts grows the fastest, but most users don’t require the entire sequence. Also, any receipt can be regenerated by re-executing the corresponding transaction in the context of the historical state. The history of state tends to be more compact to store than receipts, though re-generating a receipt tends to be 10 times slower than simply fetching it from the database (data from tests on turbo-geth from September 2018).
This project does not have roadmaps yet, because there are no dedicated people working on it.

Lowering Ethereum client implementation requirements for black-box testing

This project aims at improving the state of conformance testing for Ethereum implementations, so that less effort is required on the side of a client implementation to execute or generate conformance tests. Currently, the conformance tests (https://github.com/ethereum/tests) can be executed via an implementation-specific driver (code that parses the tests and translates them into EVM/state actions). The creation of new tests can be done either via testeth (part of the Aleth C++ implementation) or via retesteth (part of the Aleth C++ and go-ethereum implementations).
There are currently two main proposals on how to move ahead:

  1. Drive black-box testing via the p2p network interface instead of RPC methods. This eliminates the need to have an RPC module in an Ethereum client implementation.
  2. Something similar in functionality to RPC but without inter-process communication, i.e. an API/FFI allowing an Ethereum client to be used as a shared library.

This project does not yet have a roadmap, because there are no dedicated people working on it.

ReadSet and WriteSet pre-declarations in transactions

This may allow transactions to bear their own witnesses. However, it needs to be researched whether such pre-declaration makes it too easy to censor transactions that access specific accounts or contracts.