
OneFlow v0.8.0 Release Note

OneFlow v0.8.0 is now available. We welcome you to install the new version for a better experience.

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Performance
  • Improvements
  • Bug fixes
  • Documentation

Highlights

This update contains 523 commits and the following highlights:

  • PyTorch-compatible APIs have been further improved: 68 new APIs aligned with PyTorch have been added, and 84 operator and interface compatibility bugs have been fixed. More PyTorch models can now be transferred to OneFlow with one click.

  • All operators support Global Tensor more completely and efficiently: 28 Global Tensor-related bugs have been fixed, and 180 new operator unit tests have been added.

  • Graph's advanced features have been further optimized:

  • In addition to the existing ZeRO-DP, the Zero Redundancy Optimizer (ZeRO) can now be combined with MP parallelism, 2D parallelism, and 3D parallelism, further reducing memory overhead.

  • Graph provides a new pipeline parallelism API that not only simplifies pipeline parallelism configuration but also improves the performance of pipeline parallelism and 3D parallelism.

  • Multi-dimensional debugging functionality has been added for the logical graph, the light plan physical graph, memory analysis, Python stack information, and more, making Graph.debug more efficient.

  • Empowered by OneFlow v0.8.0 and LiBai v0.2.0, 3D parallel training of GPT and BERT is notably faster, and its training speed exceeds Megatron-LM under the same configuration in multiple dimensions. For more details, please click here.

  • OneEmbedding has been released recently. It is an extension component designed for large-scale recommendation systems, boasting high efficiency, extensibility, flexibility, and other advantages.

  • Multi-Device adaptation: OneFlow v0.8.0 provides a neat, efficient, and easily extensible hardware abstraction layer called EP (Execution Provider) and defines a collection of basic computing interfaces called Primitive, allowing kernels to be re-implemented on top of the Primitive interface.

  • Added new debugging tools: OneFlow-Profiler and AutoProf

  • OneFlow-Profiler is a tool designed to collect performance information during framework execution. It can record the execution time of operators and system components, the allocation of memory and DRAM, and the inputs and parameters corresponding to each operator. This information helps developers find the main sources of overhead in framework execution and implement targeted optimizations.

  • AutoProf is a framework designed to efficiently detect the alignment between OneFlow APIs and PyTorch APIs. Besides, it can automatically compare the performance results of OneFlow APIs and PyTorch APIs.

  • Significantly optimized exception handling in the OneFlow API and improved the error messages raised when APIs encounter exceptions.

  • Significantly optimized the OneFlow API documentation: the API documentation has been restructured based on functionality. In addition to general operator APIs, oneflow.nn.graph, oneflow.embedding, oneflow.autograd and other modules in OneFlow and their environment variables have also been explained in detail.

Backwards Incompatible Change

Outdated configuration method in OneFlow v0.7.0:

:::python
import oneflow as flow

# The legacy configuration below reads a user-defined stage flag directly; define it here so the example runs.
zero_stage = 2

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.set_zero_redundancy_optimizer_mode("distributed_split")
        if zero_stage > 1:
            # stage 2
            flow.boxing.nccl.enable_use_compute_stream(True)
            if zero_stage > 2:
                # stage 3
                flow.boxing.nccl.disable_group_boxing_by_dst_parallel(True)
    def build(self, x):
        return self.linear(x)

graph = Graph()

New interface in OneFlow v0.8.0:

:::python
import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.enable_zero(stage=2)
    def build(self, x):
        return self.linear(x)

graph = Graph()

Deprecations

Python API

v0.7.0

:::python
oneflow.sbp.split(axis=0)

v0.8.0

:::python
oneflow.sbp.split(dim=0)

  • To replace the outdated pipeline parallelism configuration method self.module_layer_0.config.stage_id = 0 (which is no longer recommended), we have added a new pipeline parallelism API, config.set_stage, which improves pipeline parallelism performance and removes the need to call input_tensor.to_global(placement=this_stage_placement) for all module input tensors at every stage. (https://github.com/Oneflow-Inc/oneflow/pull/8442)

v0.7.0

:::python
import oneflow as flow

B = [flow.sbp.broadcast]
P_0 = flow.placement(type = "cuda", ranks = [0, 1])
P_1 = flow.placement(type = "cuda", ranks = [2, 3])

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # Set each module's stage id to hint the graph to prepare the right number of buffers for the pipeline.
        self.m_stage0.config.stage_id = 0 
        self.m_stage1.config.stage_id = 1
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        x = x.to_global(placement=P_0, sbp=B)
        y = self.m_stage0(x)
        # Move tensor between different pipeline stages.
        y = y.to_global(placement=P_1, sbp=B)
        z = self.m_stage1(y)
        return z

v0.8.0

:::python
class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # set_stage(stage_id, placement)
        # Stage ids are numbered from 0 and increase by 1.
        # The placement covers all tensors of this module.
        self.m_stage0.config.set_stage(stage_id=0, placement=P_0)
        self.m_stage1.config.set_stage(stage_id=1, placement=P_1)
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        # tensor.to_global(placement) is applied automatically to every input tensor of this module,
        # so there is no need to call to_global() inside or outside the module's forward function.
        y = self.m_stage0(x)
        z = self.m_stage1(y)
        return z

New Features

Graph

Debug

  • Graph.debug offers a new parameter, max_stack_depth (default = 2), to set the maximum depth of the Python stack recorded for each op in Graph, making it convenient to locate the Python context of every op (a usage sketch follows this list). (https://github.com/Oneflow-Inc/oneflow/pull/8028)

  • In addition to printing the input/output/variable info of modules in Graph, printing info about the operators constructed in module forward is now also supported. (https://github.com/Oneflow-Inc/oneflow/pull/8135)

  • Enabled export ONEFLOW_DEBUG_MODE=true and export GLOG_v=3 to print the full memory log, which contains multi-level MemBlock info on each device (Total Memory -> Chunk -> MemBlock), Blocks that have exclusive memory, Eager Variables, and other information. In addition, a lifecycle label was added to Regst to analyze each tensor's memory lifecycle.

  • LightPlan provides a simplified way to display the Actor Graph, cutting down the cost of debugging based on Plan. When ONEFLOW_DEBUG_MODE=true, a series of light plan files, one for each rank of the Graph, is generated under the log/local_rank_0/machine/ directory, each containing the simplified actor sub-graph of that rank; the filename is GraphName_rank_i_light_plan. (https://github.com/Oneflow-Inc/oneflow/pull/8396)

  • Printing a graph can now display the logical graph by Module, making debugging more efficient when constructing graphs. (https://github.com/Oneflow-Inc/oneflow/pull/8131)
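
A minimal sketch of how these debugging features fit together. The graph below is hypothetical, and the call assumes the max_stack_depth parameter name used in the note above:

:::python
import oneflow as flow

class MyGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)

    def build(self, x):
        return self.linear(x)

graph = MyGraph()
# Record up to 3 Python stack frames for each op so it can be traced back to its Python context.
graph.debug(max_stack_depth=3)
out = graph(flow.randn(4, 3))
# Display the logical graph by Module, including the operators constructed in each module's forward.
print(graph)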

Eager

Tensor

Global Boxing

OneEmbedding

Modern recommendation systems rely on huge Embedding tables to deliver better recommendations, and frequent iteration on user data requires model training to be fast enough.

OneEmbedding is a component designed for large-scale recommendation systems, and it's efficient, extensible, and highly flexible. The following are its advantages:

  1. Hierarchical storage and dynamic capacity expansion: users can expand the capacity of the Embedding at much lower cost.

  2. Mixed parallelism strategy: the model can easily be extended to train on multiple machines and multiple GPUs.

  3. Embedding quantization for better communication: in the parallel scenario, communication data can be quantized to reduce the communication amount, thus accelerating the training.

  4. Efficient data pipeline: the model parts that have no data dependency can be executed in advance, thus overlapping with other operations in time.

  5. Automatic mixed precision training: data can be computed in FP16 to reduce the occupied memory, thus accelerating the training speed and ensuring high model convergence precision.

  6. A collection of efficient CUDA ops for common operations in recommendation systems is available.

  7. Flexible model building is supported.

See the OneEmbedding API documentation here.
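
As a hedged sketch of what configuring an embedding table might look like, the snippet below follows the pattern of OneFlow's public DLRM examples; the helper names (MultiTableEmbedding, make_table_options, make_uniform_initializer, make_cached_ssd_store_options) and their parameters should be verified against the OneEmbedding API documentation, and the path and sizes are placeholders:

:::python
import oneflow as flow

# One embedding table with a uniform initializer; add one entry per feature table.
tables = [
    flow.one_embedding.make_table_options(
        flow.one_embedding.make_uniform_initializer(low=-0.05, high=0.05)
    )
]
# Hierarchical storage: a GPU cache in front of an SSD-backed persistent store.
store_options = flow.one_embedding.make_cached_ssd_store_options(
    cache_budget_mb=4096,
    persistent_path="/your/persistent/path",  # placeholder path
    capacity=40_000_000,                      # placeholder vocabulary size
)
embedding = flow.one_embedding.MultiTableEmbedding(
    name="sparse_embedding",
    embedding_dim=128,
    dtype=flow.float,
    key_type=flow.int64,
    tables=tables,
    store_options=store_options,
)
# Lookup: an integer id tensor of shape (batch_size, num_tables) maps to
# embeddings of shape (batch_size, num_tables, embedding_dim).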

PyTorch Compatibility

A collection of new functionalities and interfaces compatible with PyTorch 1.10.0 has been added.

Tensor

Operators

Random

  • Added new interfaces: oneflow.cuda.manual_seed, oneflow.cuda.manual_seed_all, oneflow.seed, oneflow.manual_seed, oneflow.initial_seed, oneflow.get_rng_state, oneflow.set_rng_state and improved the configuration of OneFlow random seed initialization. (https://github.com/Oneflow-Inc/oneflow/pull/7957)
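
A small sketch of how these seeding interfaces are typically used, mirroring their PyTorch counterparts:

:::python
import oneflow as flow

flow.manual_seed(42)           # seed the default generator
flow.cuda.manual_seed_all(42)  # seed the generators of all visible GPUs
print(flow.initial_seed())     # seed used to initialize the default generator

state = flow.get_rng_state()   # snapshot the RNG state ...
a = flow.rand(2, 3)
flow.set_rng_state(state)      # ... and restore it to reproduce the same numbers
b = flow.rand(2, 3)            # a and b are expected to be identical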

AutoGrad

CUDA

RNN

  • Refactored the RNN modules and migrated the Python-layer splicing implementation to C++, which greatly improved performance. Added RNNCell-related modules and modules functionally aligned with torch.nn.utils.rnn (see the sketch after this list):

  • Refactored modules: RNN, LSTM, and GRU

  • Added modules: RNNCell, LSTMCell, GRUCell, and oneflow.nn.utils.rnn
  • Added and fixed local and global RNN unit tests, and completed the documentation.
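
A minimal sketch of the refactored modules, assuming they follow the call conventions of their torch.nn counterparts (input of shape (seq_len, batch, input_size)):

:::python
import oneflow as flow
import oneflow.nn as nn

# Multi-layer LSTM aligned with torch.nn.LSTM.
rnn = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
x = flow.randn(5, 3, 10)    # (seq_len, batch, input_size)
h0 = flow.randn(2, 3, 20)   # (num_layers, batch, hidden_size)
c0 = flow.randn(2, 3, 20)
output, (hn, cn) = rnn(x, (h0, c0))

# Single-step cell aligned with torch.nn.LSTMCell.
cell = nn.LSTMCell(input_size=10, hidden_size=20)
hx, cx = cell(x[0], (h0[0], c0[0]))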

Device

Supported heterogeneous device types: To cope with the complexity of different hardware, OneFlow, following the dependency inversion principle in software engineering, has introduced a hardware abstraction layer called Execution Provider (EP). The hardware abstraction layer consists of a series of interfaces abstracted from the capabilities that hardware devices must provide while the framework is running. Once the hardware abstraction layer is introduced, each module uses the underlying hardware through the interfaces it provides rather than through the original hardware interfaces, so modules no longer need to be concerned with the specific details of the underlying hardware. When a new hardware device is introduced, because the hardware abstraction interfaces remain unchanged, all modules can work with the new device without any modification. At the same time, when adapting new hardware to the framework, developers do not need to pay attention to the framework's implementation details; they only need to implement the series of interfaces according to the contract of the hardware abstraction interfaces and the actual characteristics of the hardware device, and the hardware adaptation is complete.

Execution Provider has defined a collection of runtime interfaces: device registration interface, device management interface, queue management interface, event management interface, and memory management interface.

Primitive

In addition to the runtime interfaces, the Execution Provider also defines a set of computing interfaces called Primitive, which describe the computations commonly used in a deep learning framework and thus simplify operator development during hardware adaptation. Compared with the runtime interfaces provided by the Execution Provider, the Primitive interfaces are looser and more flexible: all interfaces are mutually independent, and each represents a specific computing capability provided by a particular hardware device. Like the runtime interfaces, the Primitive interfaces are abstracted close to the device side, so developers can carry out adaptation work without an in-depth understanding of OneFlow's internal mechanisms. Developers must implement all runtime interfaces of the Execution Provider when adapting them, but when adapting Primitive they can implement interfaces selectively according to the actual needs of the project.

Debug tools

OneFlow-Profiler

OneFlow-Profiler is designed to collect various performance-related information during the execution of the framework. It can record the execution time of operators and system components, the allocation of memory and DRAM, and the inputs and parameters corresponding to each operator. Developers can use this information to analyze which parts bring the most overhead and to implement targeted optimizations.

Auto-Test

AutoProf

AutoProf is a framework designed to test the performance of OneFlow operators against their PyTorch counterparts. It can automatically test operator performance and print a comparison table across different numbers of CPU threads and GPUs. At present, it has been applied to the development of some existing operators and all new operators. Its effect is shown below:

[Image: AutoProf operator performance comparison table]

IR

Performance

Graph

Eager

  • Enabled export ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE=true to accelerate Eager Global execution by skipping the synchronization of meta information across the ranks of a Global Tensor (for use when users are confident that their code executes symmetrically, i.e., SPMD). (https://github.com/Oneflow-Inc/oneflow/pull/7981)

This environment variable indicates whether the shape of the input data is the same on every rank when local-to-global conversion is executed. If it is set to true, there is no need to synchronize the shapes across ranks, and the logical shape is calculated locally.
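
A sketch of how this flag might be used when every rank feeds inputs of identical shape (SPMD); as a conservative choice, the variable is set before importing oneflow so it is in place before the first local-to-global conversion:

:::python
import os
os.environ["ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE"] = "true"

import oneflow as flow

placement = flow.placement(type="cuda", ranks=[0, 1])
x = flow.randn(4, 8)  # identical shape on every rank, so no meta-info synchronization is needed
x_global = x.to_global(placement=placement, sbp=flow.sbp.split(dim=0))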

Operators & Tensor

Primitive

  • Lowered the elementwise.cuh template's requirement for pointer alignment.

Improvements

Graph

Eager

Operators & Tensor

Device

Tests

Eager Global Module Tests:

In 0.8.0, we completed support for all kernels to handle global tensors in distributed settings and fixed many known SBP-related bugs. Global tensors now work efficiently and correctly at the kernel level. No matter how the distributed topology changes, the same algorithm logic efficiently yields mathematically consistent results, which greatly reduces the effort of verifying correctness in complex, diverse, and asymmetric distributed parallel training.

module/functional op | PR
abs [#7540]
0_dim_tensor [#7540]
activation [#7540]
adaptive_pool [#7563]
addmm [#7565]
add [#7204]
affine_grid [#7578]
arange [#7576]
argmax [#7579]
argmin [#7581]
argsort [#7582]
argwhere [#7584]
avgpool [#7585]
batch_gather [#7590]
bernoulli [#7732]
bmm [#7741]
broadcast_like [#7742]
cast [#7773]
ceil [#7744]
chunk [#7750]
clamp [#7752]
clip_grad [#7757]
concat [#7204]
conv1d [#7769]
conv2d [#7771]
conv3d [#7771]
cumsum [#7772]
deconv2d [#7772]
diagonal [#7772]
diag [#7421]
div [#7421]
dot [#7421]
dropout [#7772]
empty [#7508]
eq [#7421]
erfc [#7421]
erf [#7421]
expand [#7772]
expm1 [#7421]
eye [#7421]
flatten [#7421]
flip [#7496]
floor [#7421]
fmod [#7421]
fold [#7772]
greater_equal [#7421]
greater [#7366]
fused_bias_add_dropout [#7867]
fused_bias_add_gelu [#7867]
fused_scale_mask_softmax_dropout [#7867]
fused_scale_mask_softmax [#7867]
fused_scale_tril [#7867]
fused_self_attention [#7867]
fused_tril_softmax_mask_scale [#7867]
gather_nd [#7880]
gather [#7880]
glu [#7880]
grid_sample [#7881]
groupnorm [#7885]
masked_fill [#7457]
masked_select [#7492]
math_ops [#7461]
matmul [#7465]
maxpool [#7683]
max [#7450]
mean [#7650]
meshgrid [#7533]
min_max_observer [#7725]
min [#7450]
movedim [#7679]
moving_average_min_max_observer [#7726]
mul [#7717]
narrow [#7647]
negative [#7644]
ne [#7642]
nms [#7536]
nonzero [#7645]
normalize [#7635]
ones_like [#7635]
parital_fc [#7534]
permute [#7635]
prod [#7635]
randint [#7508]
rand [#7508]
reshape [#7472]
roi_align [#7794]
scatter_nd [#7807]
scatter_ops [#7807]
sign [#7818]
slice [#7818]
softplus [#7818]
sparse_softmax_cross_entr [#7298]
split [#7277]
sqrt_square_sum [#7277]
squeeze [#7289]
stack [#7289]
stateful_kernel_with_cache [#7289]
std [#7303]
sub [#7303]
sum [#7303]
tensor_ops [#7307]
tensor_scatter_nd_update [#7308]
tile [#7322]
transpose [#7332]
tril [#7322]
TripletMarginLoss [#7332]
triu [#7882]
unfold [#7883]
unfold_tensor [#7883]
unsqueeze [#7882]
upsample [#7884]
var [#7891]
view [#7886]
weight_norm [#7886]
where [#7886]
zeropad2d [#7886]

EP::Primitive

Completed unit tests for the Primitive log_softmax, softmax, copynd, Memset, Memcpy, matmul, batch_matmul, add, binary, unary, fill, etc. (https://github.com/Oneflow-Inc/oneflow/pull/8132, https://github.com/Oneflow-Inc/oneflow/pull/8139, https://github.com/Oneflow-Inc/oneflow/pull/8137, https://github.com/Oneflow-Inc/oneflow/pull/8109, https://github.com/Oneflow-Inc/oneflow/pull/8143, https://github.com/Oneflow-Inc/oneflow/pull/8108, https://github.com/Oneflow-Inc/oneflow/pull/8154, https://github.com/Oneflow-Inc/oneflow/pull/8118, https://github.com/Oneflow-Inc/oneflow/pull/8291)

Exception

Improved exception and error handling.

Build

CI

Improved the running speed and stability of CI.

Models

Bug fixes

Graph

Eager

Operators & Tensor

Global Tensor

Tensor

Scalar Tensor

Fixed the failure of gather to support Scalar Tensors (https://github.com/Oneflow-Inc/oneflow/pull/8376)

0-Size Tensor

Operators

Device

Higher order derivative

Build

CI

Module

Documentation
