Cross-modal Correlation Learning by Adaptive Hierarchical Semantic Aggregation