Learning the underlying geometry and view-dependent appearance of a real-world 3D scene enables better novel-view synthesis. Recently, implicit neural representations have shown superior performance on this task. In this paper, we explore the challenge of decomposing the implicit representation of a 3D scene, addressing occluded regions and facilitating the manipulation of decomposed objects for editing tasks. Existing state-of-the-art methods often employ separate object codes to model the radiance field of each object, or predict per-object densities along a ray, resulting in unnecessary complexity when modeling the composited scene. To streamline this process, we propose a straightforward pipeline named OM-NeRF, which frames the problem as joint learning of the standard scene and object removal. This disentanglement facilitates modeling the scene both with and without foreground objects. Unlike alternative methods, we avoid dividing the optimization into stages and instead learn both representations jointly. Our approach demonstrates compelling qualitative and quantitative results on standard novel-view synthesis datasets for object manipulation, DMSR and ToyDesk, as well as on the forward-facing LLFF dataset. The OM-NeRF pipeline effectively decomposes objects within the 3D scene.
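To make the single-stage joint-learning idea concrete, the sketch below shows one way such an objective could be optimized: a photometric loss on the full (composited) rendering combined with a loss on the object-removed rendering in a single optimization step. This is a minimal illustration only, not the paper's implementation; the model interface (`model(rays, remove_foreground=...)`), the batch fields, and the background-mask supervision are all assumptions introduced here for exposition.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, optimizer, batch, w_removed=1.0):
    """One single-stage step over both renderings (illustrative sketch).

    Hypothetical interface: `model(rays, remove_foreground=...)` is assumed
    to return per-ray RGB for either the full scene or the object-removed
    scene; `batch` is assumed to carry ground-truth colors and a background
    mask used to supervise the removed rendering.
    """
    rays = batch["rays"]

    # Render the standard (composited) scene and the object-removed scene.
    rgb_full = model(rays, remove_foreground=False)
    rgb_bg = model(rays, remove_foreground=True)

    # Photometric loss on the full scene.
    loss_full = F.mse_loss(rgb_full, batch["rgb_gt"])

    # Loss on the object-removed rendering, restricted to background pixels.
    # (Mask convention is an assumption; truly occluded regions would need
    # additional priors or inpainted supervision.)
    mask = batch["bg_mask"].unsqueeze(-1)
    loss_removed = F.mse_loss(rgb_bg * mask, batch["rgb_gt"] * mask)

    # Jointly optimize both terms rather than training in separate stages.
    loss = loss_full + w_removed * loss_removed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under these assumptions, both renderings share one optimization loop, which is the essence of avoiding a staged pipeline; how OM-NeRF actually parameterizes the two renderings is detailed in the method section.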