
Introduction

The TVM community has worked since the v0.15.0 release to deliver the following exciting improvements! Highlights of this release:

  • First support of Relax, with dynamic shape and pipeline
  • Dlight module for optimizing LLM TIR workloads on GPU
  • Disco module for initial SPMD multi-GPU support
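
As a brief illustration of the first highlight, the sketch below builds a Relax function whose batch dimension is symbolic. It is a minimal example assuming a standard TVM v0.16 build with Relax enabled; the module name, shapes, and pipeline choices are illustrative, not taken from the release notes.

```python
# Minimal sketch: a Relax function with a dynamic batch dimension "n".
import tvm
from tvm import relax
from tvm.script import ir_module
from tvm.script import relax as R


@ir_module
class DynMatmul:
    @R.function
    def main(
        x: R.Tensor(("n", 784), "float32"),  # "n" is a symbolic dimension
        w: R.Tensor((784, 128), "float32"),
    ) -> R.Tensor(("n", 128), "float32"):
        with R.dataflow():
            y = R.matmul(x, w)
            R.output(y)
        return y


mod = relax.transform.LegalizeOps()(DynMatmul)  # lower Relax ops to TIR
ex = relax.build(mod, target="llvm")
vm = relax.VirtualMachine(ex, tvm.cpu())  # the built module accepts any n
```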

The main tags are below (bold text marks areas with lots of progress):

  • Community, RFCs
  • Adreno, ArmComputeLibrary, Metal, cuda & cutlass & tensorrt, microNPU, Runtime
  • Relax, Dlight, Disco
  • Arith, TIR, TVMScript
  • Docs, CI, Misc, BugFix

Please visit the full listing of commits for a complete view: v0.16.dev0...v0.16.0.rc0.

Community

  • #16695 - Add new key for release signing
  • #16419 - Add new key for release signing

RFCs

This new RFC explores how TVM can be used to generate code for the Scalable Matrix Extension (SME) ISA, improving inference performance on supported Arm®-based hardware that implements the extension.

  • #107 - [RFC] Scalable Matrix Extension enablement
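
For illustration only, an SME-capable target might be declared as in the sketch below; the exact `-mattr` flags and the `has_sme` feature key are assumptions made for the example, not details fixed by the RFC.

```python
# Hypothetical sketch: declaring an AArch64 target with SME enabled and
# querying the features recovered by TVM's target parser.
import tvm

target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+v9.2a,+sme")
print(target.features.has_sme)  # assumed feature key; True if SME was parsed
```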

Arith

  • #16735 - [Fixup] Require feature flag for tighter inequality bounds
  • #16588 - Provide tighter ConstIntBounds for special cases
  • #16704 - [Fix] Fix canonical simplification of LE

BYOC

  • #16567 - Skip processed functions in FuseOpsByPattern and RunCodegen

BugFix

  • #16766 - [Target] Added null check to fix segfault at ->defined() in cpu.cc DetectSystemTriple()
  • #16739 - [Ansor] Fixing Ansor Gradient Bug
  • #16820 - [Fix] PAPI docs
  • #16793 - [Fix] fix for numpy 2.0 compatibility
  • #16790 - [Fix] Fix build errors with VS2022
  • #16780 - [Fix] Fix numpy dtype map
  • #16773 - [Fix] Fix the purity flag of "vm.call_tir_dyn" and "kill" ops
  • #16770 - [Hotfix] Revert driver API pass ordering that breaks MLC, mark failing test
  • #16771 - [Fix] Remove redundant "remove_all_unused" in IPC memory lowering
  • #16746 - [Fix][Builtin] Fix "GetQueryPosition" of PagedKVCache
  • #16728 - [Fix] Introduce TVM_DEBUG_WITH_ABI_CHANGE to warn about ABI changes in debug mode
  • #16714 - [Fix] PagedKVCache fetching compute stream when copy stream is needed
  • #16684 - [SLM] Produce well-formed Relax for nn.modules.KVCache
  • #16659 - add the default value for DFT in ONNX frontend
  • #16637 - [Transform] Preserve symbolic variables in FuseOps
  • #16649 - [FFI] Add a missing default for datatype lanes
  • #16492 - [Executor] fix debug_executor function debug_get_output
  • #16598 - [Transform] Handle non-composite lambda functions in FuseOps
  • #16565 - [Transform] Keep private non-primitive functions in FuseTIR
  • #16518 - Use x*x*x instead of pow(x,3)
  • #16436 - Ensure that bf16 arrays are created as expected
  • #16361 - Disable SingleEnvThreadVerifier
  • #16289 - [AUTOTVM][FIX] Typo fixes and add a warning in the Droplet Search

CI

  • #16837 - Disable flaky unit test
  • #16765 - [AOT][Testing] Improve output mismatch information on test failure
  • #16661 - add merge_with_main in unity
  • #16611 - [AOT][Testing] Print output values on test failure
  • #16546 - Disable testing that downloads from mxnet
  • #16521 - Fix CI Script and Broken Tests
  • #16502 - Support tvm-bot rerun for tvm-unity task
  • #16435 - Update image tag to 20240126-070121-8ade9c30e
  • #16420 - [WASM] Update emsdk and nodejs version
  • #16384 - Remove NVIDIA_DISABLE_REQUIRE
  • #16382 - In jenkins.cmd_utils.Sh.tee, check for failing subprocess
  • #16366 - Upgrade sccache version to 0.7.*
  • #16369 - Upgrade Unity ci images
  • #16344 - Update docker images tag to 20240105-165030-51bdaec6
  • #16340 - [Unity][UnitTest] Increase atol to resolve flaky CI failure
  • #16337 - [Hexagon][UnitTest] Disable flaky quantization test
  • #16336 - Upgrade cmake version to 3.24.0

Docker

  • #16755 - [SME] Add Fixed Virtual Platform (FVP) and toolchain install
  • #16348 - Upgrade pip in i386 container

Disco

  • #16618 - [Disco] Propagate structlog configuration to disco workers
  • #16639 - [Disco] Expose functions to query the per-worker device/rank
  • #16617 - [Disco] Implement Session.import_python_module method
  • #16715 - [Disco] Propagate structlog/logging config to workers
  • #16845 - [Debug][Disco] Check if a PackedFunc exists before calling it
  • #16817 - [Disco] Reduce Process/ThreadSession message queue reads and writes
  • #16807 - [Disco] Support setting workers' CPU affinity
  • #16375 - [Unity] Fix creation of disco ProcessSession
  • #16821 - [Fix] Add TVM_DLL to Disco session
  • #16752 - [Fix] Lazy import of "psutil" in disco process pool

Dlight

  • #16775 - [Fix][Dlight] (Low-batched-)GeMV on small spatial loops
  • #16429 - [Unity][Dlight][Fix] Reduction rule support dyn-shape epilogue
  • #16351 - [Unity] Add dlight.gpu.Fallback in DispatchSortScan, add argsort, topk, and cumprod
  • #16338 - [Unity][DLight] Introduce Specific Rule for RMSNorm
  • #16251 - [Unity][Dlight] Support dlight gemv rule on nested inner block
  • #16878 - [Dlight] Enhance vectorization loading weight for gemv
  • #16848 - [DLight] Fix a corner case for reduction rule
  • #16701 - [Dlight] Add fallback for low batch gemv with outer reduction
  • #16678 - [Dlight] LowBatchGemv rule only apply to function with spatial symbolic var
  • #16665 - [Dlight] Skip GeMV when normalization fails
  • #16579 - [Dlight] Scheduling Low batch GEMM using GEMV-like rule
  • #16321 - [DLight] Skip rule if target is not suitable
  • #16731 - [Dlight] Fix GeMV shared memory estimation

Docs

  • #16792 - [Doc] Fix set_axis_separator example
  • #16610 - [Doc] Fixed Docstring usage example in tvm.ir.make_node
  • #16572 - [Doc] Remove MxNet related tutorials
  • #16514 - [Unity][Doc] Document passes that depend on DataflowBlocks and encourage using ConvertToDataflow
  • #16482 - [Doc] Fix Docstring in extern.py for Sphinx
  • #16346 - [Doc] Fix minor error in "Expressions in Relay"

Frontend

  • #16001 - [ONNX] Fix interpreting auto_pad parameters in ConvTranspose operator
  • #16651 - [PaddlePaddle] PaddlePaddle model with NCHW data format that supports quantization
  • #16616 - [PaddlePaddle] Support conv2d when data_format is NHWC
  • #16526 - [Keras] Enable Dense operator for any input dims
  • #16478 - [PaddlePaddle] Fixed the bug that prevented the model from being successfully converted to microTVM on MacOS

Hexagon

  • #16762 - [VM] Cache operations when bypass mode is enabled
  • #16706 - [VM] Add buffers to dma_wait builtin
  • #16448 - [VM] Implement dma_copy and dma_wait builtin for hexagon

LLVM

  • #16782 - [SVE] Support scalable vectors in LoopVectorizer
  • #16812 - Fix compilation failure due to minor change
  • #16808 - [Runtime] Fix errors during loading of target tags
  • #16748 - Lack of DWARF type is not an error
  • #16696 - [SVE] Add codegen support for scalable buffer accesses
  • #15964 - [RUNTIME] Add optional LLVM ORCJIT runtime executor
  • #16612 - [SVE] Add support for scalable data type strings
  • #16523 - [SVE] Change the dtype of Ramp and Broadcast lanes to PrimExpr
  • #16484 - [SVE] Add vscale builtin
  • #16373 - Update Host.h path

MetaSchedule

  • #16725 - Make the opt_level of tune_relay() adjustable

Metal

  • #16713 - [RUNTIME] Provide richer runtime error information when errors happen
  • #16605 - [RUNTIME] Fix multithreading access of metal runtime
  • #16438 - Dispatch numerically stable tanh for metal

OpenCL & CLML

  • #16854 - [OpenCL] Add OpenCL device for automatic target detection
  • #16846 - [Meta-Schedule][OpenCL] Enable MS tuning for Android OpenCL
  • #16768 - [RUNTIME][OPENCL] Bugfix for clImage create with host ptr
  • #16672 - [CLML] Fix build TVM with CLML on MacOS
  • #16328 - [RUNTIME][CLML] Fix for Softmax op for 4D tensors
  • #16394 - [OpenCL][CMake] Fix OpenCL tests compilation

ROCm

  • #16441 - [WebGPU] Intrin Dispatch: tanh, erf, log
  • #16404 - Some fixes of ROCm codegen

Relax

  • #16872 - Enhance symbolic expr estimation in memory planning
  • #16867 - Dispatch sort/scan for non-cuda gpu backends
  • #16852 - Fix EliminateCommonSubexpr removing alloc tensor
  • #16851 - [Relax,Topi] Allow passing workspace to thrust to avoid allocations
  • #16841 - Provide well-formed output in transform.LazyGetInput
  • #16798 - [Transform] Provide callback versions of LazyTransformParams
  • #16801 - Allow DeadCodeElimination within ApplyPassToFunction
  • #16834 - Capture symbolic vars in struct info of weights
  • #16830 - Share storage allocs among functions after cuda graph rewriting
  • #16823 - [VM] Refactor CUDA graph builtins as VM extension
  • #16828 - [Bugfix] Provide the full Expr to pattern-match rewriter
  • #16805 - [Bugfix] BlockBuilder may not assume unique input functions
  • #16815 - Enable capturing symbolic shapes in cuda graph
  • #16642 - Allow R.Prim('bool') in relax::If and assert_op
  • #16796 - Unit-test for structural equal of recursive function
  • #16732 - Allow composition of DFPattern replacements
  • #16783 - Improve CanonicalizeBindings in DataflowVar edge case
  • #16721 - Implement operators to inspect DLTensor::strides and offset
  • #16730 - Refactor PatternRewriter into separate Block/Expr mutators
  • #16756 - [IR] Improve highlighting in assert_structural_equal
  • #16779 - Improve error message for malformed IR
  • #16569 - [Unity][Parser] Check well-formedness in the parser
  • #16759 - [Pass] Lowering passes for GPU IPC memory and allreduce
  • #16697 - Implement relax.transform.TopologicalSort
  • #16658 - Normalize use of void-type variable to inline R.tuple()
  • #16711 - [Frontend] Add op tanh, exp, negative, and permute
  • #16703 - [Fix] Fix top-p/top-k sampling kernel
  • #16669 - [Frontend][Onnx] add sum and globalavgpool 1d/3d op
  • #16691 - CUDA graph rewrite treating StringImm as static
  • #16685 - Implement StructInfoPattern for dataflow pattern matching
  • #16681 - [Frontend][Onnx] support MaxPool1/2/3D and AveragePool1/2/3D
  • #16584 - [Unity][TIR] Clear struct info when specializing PrimFunc
  • #16676 - Remove the legalization of cumsum/cumprod
  • #16654 - [Frontend][NN] Add support for Conv3D
  • #16674 - Eager free original weights in transform_params
  • #16675 - add sample_indices in sampling
  • #16648 - [Runtime] Support Unpack API for NDArrayCache
  • #16591 - [Unity][Transform] Handle dynamic shapes in CombineParallelMatmul
  • #16594 - [Transform] Preserve param names in LiftTransformParams
  • #16575 - [Unity] GPU sampling
  • #16574 - Additional unit tests for RemoveUnusedParameters
  • #16585 - [Unity][Analysis] Include impure call in VerifyWellFormed errors
  • #16421 - [Unity][Transform] Raise error in FuseOpsByPattern for SSA violation
  • #16629 - Fix error message in BlockBuilder
  • #16592 - Handle dynamic arguments in legalization of nn.attention
  • #16590 - [Unity][Transform] Check for permute_dims in ExpandMatmulOfSum
  • #16604 - [Frontend][Onnx] Fix Clip/Unsqueeze opset implementation
  • #16568 - [Runtime] RNNState for State Space Models
  • #16563 - Implement operators to read runtime DLTensor* information
  • #16581 - [Unity][MSC][M4.2][Step2] Enable plugin with manager, test plugins in compile pipeline
  • #16600 - Expose name_hint field for BlockBuilder.match_cast
  • #16601 - [Transform] Canonicalize let var = R.const bindings
  • #16583 - [Unity][VM] Recursively visit match bindings in VMShapeLowerMutator
  • #16586 - Ignore non-relax functions in relax.transform.RunCodegen
  • #16573 - [VM] Re-implementation of callback functions
  • #16561 - [Bugfix] Remove call to tvm.build for empty TIR module
  • #16564 - [Unity] Check for symbolic vars in PrimValue when lowering to TIR
  • #16558 - Minor updates for NN frontend
  • #16542 - Support callback as argument
  • #16487 - [Unity][Transform] Handle call_tir_inplace in FuseTIR and FuseOps
  • #16355 - [Unity] Infer struct info for relax.op.split on dynamic-sized index
  • #16465 - [Redo][Unity] Split DecomposeOpsForTraining into two steps
  • #16495 - [Unity][MSC][M4.2][Step1] Enable plugin with manager, test plugins in compile pipeline
  • #16498 - [Frontend] "tensor_ir_inplace" op
  • #16500 - [Unity] Support storage reuse for dynamic shapes
  • #16493 - [Pass] Skip data type node for CSE pass
  • #16467 - [Unity][MSC][Refactor] Reconstruct BYOC and runner
  • #16422 - [Unity][CodeGen] RunCodegen based on externally-exposed functions
  • #16483 - [Unity][Frontend] Add Sigmoid and Square Op
  • #16472 - [Unity] Improved error message in tvm::relax::UpdateStructInfo
  • #16473 - [Unity] Improve error message in tensor_to_shape struct inference
  • #16466 - Memory planning for "partially dynamic" shapes
  • #16464 - NDArray Cache Update with DLTensor Support
  • #16315 - [Unity][Transform] Implement relax.transform.ReorderTakeAfterMatmul
  • #16313 - [Unity][Transform] Implement relax.transform.ExpandMatmulOfSum
  • #16411 - [Unity][Transform] Handle symbolic variables in LambdaLift
  • #16443 - [Unity][FIX] fix thread dtype mismatch
  • #16442 - Revert "[Unity] Split DecomposeOpsForTraining into two steps"
  • #16437 - [Unity] Improve buffer allocation for handling duplicated buffer names.
  • #16439 - [Unity] Support cumsum with pure int32
  • #16432 - [Unity] downgrade cmake version requirement
  • #16427 - [Unity][Frontend][NN] Better support for dynamic convolutions
  • #16418 - [Unity][Fix] Fix mismatched intrinsic name
  • #16129 - [Unity][Transform] Replace eligible operators with in-place versions in dataflow blocks
  • #16414 - [Bugfix][Unity] Recover MSVC/NVCC/ROCm/Vulkan
  • #15954 - [Unity] Split DecomposeOpsForTraining into two steps
  • #16111 - [Unity][Transform] Memory planning for dynamic-shape func return
  • #16396 - [Unity] PagedKVCache supporting on-the-fly RoPE calculation
  • #16395 - [Frontend][ONNX] Fix ONNX frontend parsing
  • #16385 - [Unity][Op] Add Conv3D Operator
  • #16284 - [Unity][nnModule] Dynamic shape support in nn Module
  • #16378 - [Unity][BlockBuilder] Restore bb.get()
  • #16374 - [Unity] Support TIR kernel for PagedKVCache
  • #16314 - [Unity][Transform] Implement relax.transform.AdjustMatmulOrder
  • #16349 - [Unity][MSC] Avoid depending on trivial bindings in Relax intermediate
  • #16376 - [Unity][Contrib] Fix a bug due to typo in vllm reconstruct_from_cache kernel and add test
  • #16388 - [Unity] Update dispatch test cases following the merge from main
  • #16335 - [Unity] Set CMAKE_CUDA_ARCHITECTURES default to native
  • #16306 - [Unity][Transform] Update LambdaLift to use name of lifted lambda
  • #16310 - [Unity][Analysis] Show objects instead of names in WellFormedChecker
  • #16362 - [Unity][Fix] Memory planning check value type of 'tir_var_upper_bound'
  • #16367 - [Unity][Transform] Handle replacement at both var binding and usage
  • #16309 - [Unity][Transform] Use parameter name in BundleModelParams
  • #16307 - [Unity] Improved error message in ExprMutator::ReEmitBinding
  • #16308 - [Unity] Improved error message for matmul shape mismatch
  • #16360 - [Unity] Enhance Torch-consistency in reshape
  • #16350 - [Unity][Contrib] Add vLLM paged attention kernel
  • #16303 - [Unity][NN] Use Linear name for nn.op.permute_dims
  • #16325 - [Unity][MSC][Legalize] legalize codes and mute logging
  • #16312 - [Unity][Analysis] Add utility for collecting compile-time bindings
  • #16330 - [Unity][WEBGPU] Enable wasm exception propagation
  • #16304 - [Unity][Analysis] Handle PrimStructInfo in EraseToWellDefined
  • #16305 - [Unity][Transform] Implement UpdateParamStructInfo
  • #16331 - [Unity] Alter op impl handling empty transform for output
  • #16254 - [Unity] Dispatch cumsum and sort
  • #16120 - [Unity][Transform] Extract partial-tuple-usage from FuseTIR
  • #16311 - [Unity] Validate struct info in relax::Call constructor
  • #16333 - [Unity] Fix nn.op.tensor_ir_op signature
  • #16302 - [Unity] Cutlass kernel compatibility with cmake 3.18+

Relay

  • #16622 - [ONNX] Fix the attribute mode parse of operator Upsample
  • #16626 - [ONNX] Fix the Resize operator in ONNX frontend
  • #16624 - [ONNX] fix the wrong default value about dtype in Multinomial converter
  • #16417 - [Frontend][Torch] Fix PyTorch frontend linspace op
  • #16400 - [Frontend][Torch] Fix PyTorch frontend not supporting logical or
  • #16390 - [Frontend][Torch] Fix a typo in nonzero_numpy
  • #16324 - make "ToScalar" support directly obtaining "int64_t"

Runtime

  • #16804 - Introduce MSCCLPP with NCCL equivalent interface
  • #16809 - Add "TVM_DLL" to NVTX header
  • #16750 - CUDA IPC Memory support and custom allreduce kernels
  • #16738 - [Refactor] Always specify device in allocator interface
  • #16716 - Ensure NDArray.CopyTo(Device) always sync
  • #16705 - Add TVM_DLL to memory manager functions
  • #16692 - PagedKVCache execute data copy on a separate stream
  • #16647 - [RPC] Fix FreeObject in minrpc server
  • #16667 - [Builtin] Using float32 accumulation in attention kernel
  • #16635 - [RPC] Enable RPCObjectRef over multi-hop RPC
  • #16630 - Add TVM_DLL to threading backend funcs
  • #16541 - Add "TVM_DLL" to NDArray cache load func
  • #16550 - [ROCM] Properly align rocm parameter buffer
  • #16545 - Fix dtype conversion for bf16 and fp8
  • #16508 - ParallelFor skipping thread backend for unit extent
  • #16486 - KV cache providing workspace for attn kernel
  • #16456 - [KVCache] AttentionWithFusedQKV and RoPE mode
  • #16415 - [Memory] Implement support for non-zero offset within a storage object in AllocNDArr…
  • #16387 - [RPC] Enable RPCObjectRef return in RPC
  • #16377 - Use cudaGetDeviceCount to check if device exists

TIR

  • #16832 - Use constructor for new PrimFunc in TransformLayout
  • #16543 - Fix segfaults from ordering of Let/Assert in MakePackedAPI
  • #16795 - Ramp and Broadcast lanes fixed to int32 dtype
  • #16767 - [Driver] Use BindTarget to specify target for FP8 legalization
  • #16742 - [Bugfix] Fix cache_read update buffer region
  • #16726 - [Bugfix] Avoid overwrite of unmanaged buffer allocations
  • #16548 - [CUDA] Add native FP8 support to codegen
  • #16723 - Implement max/min_value for fp8 data types
  • #16655 - Improve well-formed check's handling of match buffer
  • #16673 - Support Vector Reinterpret Calls
  • #16682 - [Bugfix] Handle AttrStmt of upcoming tir.Var in ConvertSSA
  • #16560 - Enhance and fix tensorize schedule for some cases
  • #16660 - [Bugfix] Fix duplicate AllocateConst in CacheReadWrite schedule primitive
  • #16544 - Expand debug symbol output for CodeGenLLVM
  • #16553 - Fix get_block_access_region for let bindings
  • #16515 - Require exactly same-dtype matching for Vulkan smem reuse
  • #16406 - Fix of inter thread reduction with shared memory prefetch
  • #16293 - Extend DP4A tensor intrin
  • #16345 - Allow sync threads inside condition
  • #16250 - In SplitHostDevice, check for variables in thread extents
  • #16184 - [Transform] Implement InlinePrivateFunctions

TOPI

  • #16652 - improve inclusive_scan for thrust
  • #16383 - [Target] Add fp16 SIMD support for conv2d on arm_cpu targets

TVMC

  • #16261 - Add tvmc flag to print ir before and print ir after named pass

TVMScript

  • #16864 - Add parser and printer support for e4m3/e5m2 fp8
  • #16844 - Produce empty DictAttrs when R.func_attrs is absent
  • #16811 - Do not throw error for duplicate definitions
  • #16641 - Allow use of relax.Expr with void type as a statement
  • #16663 - Infer T.reads() for DeclBuffer nodes
  • #16640 - Represent tir::builtin::ret() using python "return"
  • #16562 - [Bugfix] Handle R.match_cast as last binding in if/else
  • #16593 - [Unity] Parse R.Object return type from call_pure_packed
  • #16356 - [Unity] Optionally hide StructInfo that can be inferred
  • #16379 - [Unity] Update call_packed semantics to support empty sinfo_args

Vulkan

  • #16858 - Fix CLZ support for Vulkan

cuda & cutlass & tensorrt

  • #16865 - [Codegen, CUDA] Add handling of fp8 broadcast / const
  • #16818 - [Cutlass] Fix usage of cuda stream for group gemm
  • #16788 - [Cutlass] Add check for group gemm param shapes
  • #16789 - [Bugfix][Cutlass] Remove a typo in cutlass build
  • #16787 - [Codegen, Cuda] Add overload for fp8x4 e5m2 <-> half4 conversion
  • #16751 - [Cutlass] Add group gemm kernels
  • #16736 - [Target][CUDA] Allow non-numeric arch as needed for latest gpu
  • #16619 - [Bugfix][Cutlass] Check if function attributes is None
  • #16342 - [CUDA] Simple extend to optimize reuse for static shared memory.

microNPU

  • #16266 - [microNPU][ETHOSU] Add fixed point for tanh
  • #16680 - [microNPU][ETHOSU] Fix LUT size for int16 activations
  • #16401 - [microNPU][ETHOSU] Add fixed point for matmul

web

  • #16733 - Support web IndexedDB cache for larger model storage
  • #16810 - Support building tvm/web on Windows
  • #16825 - Allow custom bc files when building with emcc
  • #16791 - Add kv_state and rnn_state to wasm_runtime
  • #16722 - Implement linear congruential generator, make runtime seedable
  • #16650 - Separate parallel shard download and iterative shard loading
  • #16694 - Initial support for asyncify
  • #16631 - Fix NDArrayCache loading report callback
  • #16525 - Move ArtifactCache to Interface, Support Cache delete and Batch Delete, Remove typo
  • #16554 - Compatibility with PagedKVCache in WebGPU
  • #16527 - Revert "[Unity]Temp disable wasm exception (#16444)"
  • #16504 - [Relax] Add ApplyPresenceAndFrequencyPenalty
  • #16485 - [wasm] Enlarge initial memory for emcc
  • #16444 - [Unity]Temp disable wasm exception

Misc

  • #16873 - [Thrust] Fix thrust workspace allocation
  • #16868 - [3rdparty] Bump flashinfer
  • #16871 - [PageKV] allow PopN to pop all the tokens in last block
  • #16866 - [3rdparty] Bump FlashInfer
  • #16863 - [Picojson] Let the keys of objects in JSON be ordered by default
  • #16856 - [Thrust] Use pointer to tls pool to prevent creating new pool
  • #16850 - Fixing probability comment
  • #16849 - [KVCache] Initialize one extra page than specified
  • #16843 - [IR] Provide well-formed intermediate in ApplyPassToFunction
  • #16772 - [MSC][M5.3] Support torch.dynamo for dynamic models
  • #16839 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/cmsisnn
  • #16838 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/ethosu
  • #16831 - [KVCache] Reducing CacheAuxDataManager copy size
  • #16794 - [SME] Target parser support for SME
  • #16824 - [KVCache] Introducing auxiliary data manager
  • #16800 - [BugTIR] Fix error merging shared memory for ptx_cp_async
  • #16822 - [VM] Recycle VMFrame
  • #16813 - [KVCache] Support forking sequence at specific position
  • #16786 - [Codegen] Add check to disable invalid reinterpret
  • #16816 - [Cmake] Allow using custom CCCL path for thrust
  • #16784 - [SLM] Add unit tests for SLM to Relax exporter
  • #16814 - Fix includes of custom allreduce kernel
  • #16806 - [Debug] Improve error message in VMShapeLower
  • #16802 - [Debug] Improve error messages in LiftTransformParams
  • #16425 - [Target] Use LLVM target parser for determining Arm(R) A-Profile Architecture features
  • #16797 - [3rdparty] AUTO mode for custom all-reduce strategy
  • #16761 - [SME] Add support for inserting processor state annotations
  • #16778 - [Analysis] Allow calls to GlobalVar in @R.function
  • #16745 - [IR] Default to empty attributes, instead of NULL
  • #16777 - Revert "[SLM] Allow modules to define pre-processing of weights"
  • #16776 - [Contrib] Remove thrust "built but not used" warning
  • #16757 - [SLM] Allow modules to define pre-processing of weights
  • #16763 - [CONTRIB] Add nm symbol dump
  • #16717 - Enable Shared Function in LiftTransformParam Pass
  • #16729 - [Builtin] Sliding window and sink support for PagedKVCache
  • #16724 - Fix cpp_rtvm cmake build on Windows
  • #16513 - [Target] Automatically detect system triple when not specified by the user
  • #16710 - [CMake] Add "USE_FLASHINFER" to libinfo
  • #16702 - [MSC][M5.2] Enable quantize && prune with gym by wrapper
  • #16699 - [Transform] Remove R.Object parameters after LazyTransformParams
  • #16668 - [MSC][M5.1] Build wrapper to support compression
  • #16693 - [Contrib] Support NDArray cache taking generator
  • #16412 - [Lint] Add check to prevent usage of #include <regex>
  • #16689 - [DeviceAPI] Support "GetCurrentStream"
  • #16690 - Use target name instead of node name as function name
  • #16683 - [skip ci] Fix wasm exception flag
  • #16609 - Minor update to docs instructions
  • #16656 - Simplify Windows CMake Command
  • #16666 - [KVCache] Fix the reference counter in sequence fork
  • #16662 - Fixing workload comment
  • #16595 - [Transform] Check for zero-param operators in LiftTransformParams
  • #16599 - [Transform] De-duplicate MatchCast nodes in EliminateCommonSubexpr
  • #16596 - [Transform] Implement relax.transform.ReorderPermuteDimsAfterConcat
  • #16597 - [Transform] Allow explicit name of bundled model parameters
  • #16602 - [Transform] Improvements to LazyTransformParams
  • #16606 - [KVCache] Support passing in attn_score_scaling_factor into KV cache
  • #16608 - Extend gpu memory bandwidth test to work through RPC
  • #16587 - [Debug] Improve error message for codegen pattern mismatches
  • #16570 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 1
  • #16576 - Update the 3rdparty/libflash_attn submodule
  • #16580 - [KVCache] Support mode "None" for Rotary Embedding
  • #16578 - [KVCache] Support returning query positions
  • #16571 - Fix compile warnings
  • #16540 - [Upd] Enable lld search to include /opt/rocm/llvm/bin for rocm
  • #16539 - Improve error message in NDArray::CopyFromTo
  • #16524 - [Build] Improving debug and build-dir options
  • #16551 - [KVCache] Fix attention kernel for ROCm
  • #16512 - Cut pytest-lazy-fixture
  • #16506 - Bump 3rdparty/cutlass_fpA_intB_gemm version
  • #16511 - [Minor] Fix Clang compilation warning in fuse_tir.cc and codegen_c_host.cc
  • #16516 - Add Relax, Unity Tags in make_notes.py
  • #16497 - [Instrument] Add default instrument to print all passes
  • #16494 - [DPL] Support tir_vars field in is_call_tir pattern
  • #16453 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm
  • #16454 - [BugTIR] Fix thread_sync occurring in LetStmt
  • #16468 - [LINT] Fix pylint issues in test_dma_builtin.py
  • #16413 - [Contrib] Workspace for cuBLAS backend
  • #16460 - [Cherry-pick][MSC][M4.1] Add plugin && plugin_builder, enable build and test in different frameworks (#16397)
  • #16461 - [Minor] Fix Docstring for sphinx-build
  • #16431 - [Schedule] Loop-Partition Scheduling Primitive
  • #16451 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/ethosu
  • #16452 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/cmsisnn
  • #16445 - [skip ci] update branch rule to prepare for unity transition
  • #16426 - [CMake] Enable cuda lang if USE_CUDA is on
  • #16407 - Add NVIDIA Hopper H100 target tag
  • #16398 - [DeviceAPI] Support querying total global memory
  • #16357 - [RPC] Fix tuning on macOS and Windows (#15771)
  • #16386 - [Thrust] Use no sync exec policy and caching allocator
  • #16343 - [CMake][MSVC] Disable permissive mode for MSVC builds
  • #16242 - [Codegen] Fix if_then_else codegen
  • #16341 - [CMake] Use ccache as CMAKE_CUDA_COMPILER_LAUNCHER
  • #16332 - Change metal dtype of ceil_log2 to fp32