High-resolution gridded population data are crucial for various fields. Estimating heterogeneous urban populations is challenging because the relationships between influential factors and population density are nonlinear and vary spatially across grids located in different land-use parcels. This study developed a Contextualized Geographically Weighted Neural Network (CGWNN) model to estimate population density on 100 m × 100 m grid cells in Beijing, China, using multi-source data. The model integrates an artificial neural network with geographically weighted regression to account for nonlinear associations that are similar across proximate grids. By incorporating parcel-level land uses as variable weights, it also accounts for contextually varying associations across proximate grids located in different land-use parcels. The CGWNN model achieved higher accuracy (R² = 0.85) than both widely used population datasets and other models that ignored these associations. The three most important variables were the distances to the nearest school, restaurant, and auto service, all negatively associated with population density. Additionally, the intensity of artificial light at night (ALAN) exhibited both positive and negative associations with population density in different regions, suggesting that increased ALAN does not necessarily indicate higher population density in urban areas. Our modeling approach shows promise for accurate population estimation and could be extended to larger areas, benefiting various fields.
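The idea of combining geographic proximity with land-use context can be illustrated with a minimal sketch. It is not the CGWNN formulation itself, which the abstract does not specify. It assumes a Gaussian distance-decay kernel of the kind commonly used in geographically weighted regression, multiplied by a hypothetical context factor that down-weights grids in a different land-use class. The function name `contextual_geo_weights`, the parameter `context_penalty`, and the toy coordinates are all illustrative assumptions.

```python
import numpy as np

def contextual_geo_weights(coords, land_use, target_idx, bandwidth, context_penalty=0.5):
    """Weight every grid cell relative to one target cell (illustrative only)."""
    # Euclidean distance from each cell to the target cell, in the units of coords
    d = np.linalg.norm(coords - coords[target_idx], axis=1)
    # Gaussian distance-decay kernel: nearby cells receive weights close to 1
    w_geo = np.exp(-0.5 * (d / bandwidth) ** 2)
    # Assumed context factor: 1 for the same land-use class, context_penalty otherwise
    w_ctx = np.where(land_use == land_use[target_idx], 1.0, context_penalty)
    return w_geo * w_ctx

# Toy example: four grid cells (coordinates in metres) with parcel-level land-use labels
coords = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 0.0], [1000.0, 0.0]])
land_use = np.array(["residential", "residential", "commercial", "residential"])
weights = contextual_geo_weights(coords, land_use, target_idx=0, bandwidth=300.0)
```

In this sketch, two cells at the same distance from the target receive different weights when their land-use classes differ, which is one plausible way to express "contextually varying associations across proximate grids". Such weights could then scale each sample's contribution to a locally fitted model.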